{ "cells": [ { "cell_type": "markdown", "id": "75adde04", "metadata": {}, "source": [ "# Python Programming Language for Data Analysis\n", "\n", "In our earlier workshop we learned how to import third-party libraries such as `pandas` and use them to analyze data. In the process, we learned many fundamental aspects of programming such as:\n", "\n", " - Variables and Data Types\n", " - Operators\n", " - Functions (User-defined functions, built-in functions, methods and third-party functions)\n", " - Indexing and Extracting elements from a sequence\n", " \n", "We also learned how to use many core aspects of the `pandas` library:\n", "\n", " - How to import data from a CSV file and get summary statistics for numeric and non-numeric columns\n", " - How to list the functions available in pandas modules and review their use by consulting the online documentation\n", " - How to filter rows and columns to get a desired subset of data\n", " - How to create new columns with desired values\n", " - How to group data based on one or multiple columns and get group-wise summary statistics\n", " - How to plot data to visualize trends over time\n", " \n", "In this workshop, we will use this knowledge to perform an end-to-end data analysis. First, we will answer the questions we have already solved, so that we can practice what we know. Then we will focus on how to use this data to create a simple model that can predict the ridership of the 79th route. For the latter we will use the `sklearn` package for machine learning in Python.\n", "\n", "Let's begin by importing the libraries and data." ] }, { "cell_type": "code", "execution_count": 1, "id": "2ba3ce53", "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np\n", "\n", "import matplotlib.pyplot as plt\n", "from matplotlib.ticker import MaxNLocator " ] }, { "cell_type": "code", "execution_count": 2, "id": "08f33900", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
routeroutenameMonth_BeginningAvg_Weekday_RidesAvg_Saturday_RidesAvg_Sunday-Holiday_RidesMonthTotal
01Indiana/Hyde Park01/01/20016982.60.00.0153617
12Hyde Park Express01/01/20011000.00.00.022001
23King Drive01/01/200121406.513210.78725.3567413
34Cottage Grove01/01/200122432.217994.010662.2618796
46Jackson Park Express01/01/200118443.013088.27165.6493926
\n", "
" ], "text/plain": [ " route routename Month_Beginning Avg_Weekday_Rides \\\n", "0 1 Indiana/Hyde Park 01/01/2001 6982.6 \n", "1 2 Hyde Park Express 01/01/2001 1000.0 \n", "2 3 King Drive 01/01/2001 21406.5 \n", "3 4 Cottage Grove 01/01/2001 22432.2 \n", "4 6 Jackson Park Express 01/01/2001 18443.0 \n", "\n", " Avg_Saturday_Rides Avg_Sunday-Holiday_Rides MonthTotal \n", "0 0.0 0.0 153617 \n", "1 0.0 0.0 22001 \n", "2 13210.7 8725.3 567413 \n", "3 17994.0 10662.2 618796 \n", "4 13088.2 7165.6 493926 " ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data = pd.read_csv(filepath_or_buffer=\"cta-ridership-original.csv\")\n", "data.head()" ] }, { "cell_type": "markdown", "id": "ae3d72a1", "metadata": {}, "source": [ "Let's answer the following questions first:\n", "1. Identify the 10 routes with the highest total ridership. Create a bar plot of the total ridership of these top 10 routes. To create a bar plot, simply pass the argument `bar` to the `kind` parameter of the `plot` method. \n", "2. Which route has the highest average ridership? Is it also the most popular route on Saturdays or on Sundays and Holidays? Why is the route so popular?\n", "3. Group the data by year to find the yearly average ridership trend. Plot the yearly average of the monthly total ridership.\n", "4. Now use the above grouped data to plot the average ridership on weekdays, Saturdays and Sundays/Holidays by year.\n", "5. Which routes have the highest difference in average ridership between weekdays and Saturdays?\n", "6. Which routes have the highest difference in average ridership between weekdays and Sundays/Holidays?\n", "7. Which routes have the most consistent average ridership between weekdays, Saturdays and Sundays/Holidays? That is, are there any routes that are not affected by the day of the week?" 
] }, { "cell_type": "code", "execution_count": 3, "id": "267e63a9", "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "# 1. Identify the 10 routes with the highest total ridership. \n", "# Create a bar plot of the total ridership of these top 10 routes. \n", "\n", "routes_grouped = data[['routename','MonthTotal']].groupby('routename') # create groupby object\n", "monthtotal_byroutes = routes_grouped.sum() # get sum for each group in groupby object \n", "top10routes = monthtotal_byroutes.sort_values(by='MonthTotal', ascending=False)[:10] # sort and get the first 10 rows\n", "\n", "top10routes.plot(kind='bar') # create a bar plot of the top 10 values, labelled by the index (route name) \n", "plt.show()" ] }, { "cell_type": "code", "execution_count": 4, "id": "2e143b31", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Avg_Weekday_Rides\n", "Avg_Saturday_Rides\n", "Avg_Sunday-Holiday_Rides\n" ] } ], "source": [ "# 2. Which route has the highest average ridership? \n", "# Is it also the most popular route on Saturdays or on Sundays and Holidays? \n", "# Why is the route so popular?\n", "\n", "\n", "cols = [ 'Avg_Weekday_Rides','Avg_Saturday_Rides', 'Avg_Sunday-Holiday_Rides'] # create a list of columns to use\n", "\n", "top10routes_byday = [] # initialize empty list to store results\n", "for i in cols: # iterate over each item of the list\n", "    print(i) # print the item\n", "    routes_grouped = data[['routename', i]].groupby('routename') # create groupby object with routename and the current column\n", "    total_byroutes = routes_grouped.sum() # get sum of each group\n", "    top10routes = total_byroutes.sort_values(by=i, ascending=False)[:10] # sort and extract the first 10 items\n", "    top10routes_byday.append(top10routes) # append the above result to the list" ] }, { "cell_type": "code", "execution_count": 5, "id": "9ca047b8", "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "for i in top10routes_byday: # for each item in the list that now has result\n", " i.plot(kind='bar') # create a bar plot " ] }, { "cell_type": "code", "execution_count": 6, "id": "4470c08b", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Avg_Weekday_RidesAvg_Saturday_RidesAvg_Sunday-Holiday_RidesMonthTotal
Month_Beginning_year
20017289.0619174573.7561163036.292182189227.727617
20027206.8502794548.0555492994.784191187113.878487
20036787.3618414260.9481532798.121381175961.016959
20046539.2253504221.4717872779.293575170907.525701
20056677.6925824212.1672802876.988787173637.042553
\n", "
" ], "text/plain": [ " Avg_Weekday_Rides Avg_Saturday_Rides \\\n", "Month_Beginning_year \n", "2001 7289.061917 4573.756116 \n", "2002 7206.850279 4548.055549 \n", "2003 6787.361841 4260.948153 \n", "2004 6539.225350 4221.471787 \n", "2005 6677.692582 4212.167280 \n", "\n", " Avg_Sunday-Holiday_Rides MonthTotal \n", "Month_Beginning_year \n", "2001 3036.292182 189227.727617 \n", "2002 2994.784191 187113.878487 \n", "2003 2798.121381 175961.016959 \n", "2004 2779.293575 170907.525701 \n", "2005 2876.988787 173637.042553 " ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# 4. Group the data by year to figure out the yearly average trend of ridership over the years. \n", "# Plot the yearly average of the average monthly total ridership value.\n", "\n", "\n", "data['Month_Beginning'] = pd.to_datetime(data['Month_Beginning'], format='%m/%d/%Y') # convert to datetime object\n", "data['Month_Beginning_year'] = data['Month_Beginning'].dt.year # create new column with just year info\n", "\n", "yearly_groups = data.iloc[:,3:8].groupby('Month_Beginning_year').mean() # select a subset of data and create groupby object and calculate mean of each group\n", "yearly_groups.head() # show first five lines of the above dataframe" ] }, { "cell_type": "code", "execution_count": 7, "id": "dc97a032", "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "fig, ax = plt.subplots() # create figure and axes objects\n", "yearly_groups['MonthTotal'].plot(ax=ax) # plot the MonthTotal column on the axes created above\n", "ax.xaxis.set_major_locator(MaxNLocator(integer=True)) # place x-axis ticks at integer values only\n", "ax.set_title('Monthly Total Ridership on CTA buses from 2001 to 2018', # set title for the plot\n", "             fontsize=14) \n", "plt.show() # display the plot" ] }, { "cell_type": "code", "execution_count": 8, "id": "d1645cc4", "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "# 5. Now use the above grouped data to plot the average ridership on weekdays, Saturdays and \n", "# Sundays/Holidays by year.\n", "\n", "\n", "fig, ax = plt.subplots() # create figure and axes objects\n", "yearly_groups.iloc[:,:-1].plot(ax=ax) # plot every column except the last one (MonthTotal)\n", "ax.xaxis.set_major_locator(MaxNLocator(integer=True)) # place x-axis ticks at integer values only\n", "plt.show()" ] }, { "cell_type": "code", "execution_count": 9, "id": "21772ec7", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
routeroutenameMonth_BeginningAvg_Weekday_RidesAvg_Saturday_RidesAvg_Sunday-Holiday_RidesMonthTotalMonth_Beginning_yeardiff_week_saturdaydiff_week_sundaydiff_sat_sunday
01Indiana/Hyde Park2001-01-016982.60.00.015361720016982.66982.60.0
12Hyde Park Express2001-01-011000.00.00.02200120011000.01000.00.0
23King Drive2001-01-0121406.513210.78725.356741320018195.812681.24485.4
34Cottage Grove2001-01-0122432.217994.010662.261879620014438.211770.07331.8
46Jackson Park Express2001-01-0118443.013088.27165.649392620015354.811277.45922.6
\n", "
" ], "text/plain": [ " route routename Month_Beginning Avg_Weekday_Rides \\\n", "0 1 Indiana/Hyde Park 2001-01-01 6982.6 \n", "1 2 Hyde Park Express 2001-01-01 1000.0 \n", "2 3 King Drive 2001-01-01 21406.5 \n", "3 4 Cottage Grove 2001-01-01 22432.2 \n", "4 6 Jackson Park Express 2001-01-01 18443.0 \n", "\n", " Avg_Saturday_Rides Avg_Sunday-Holiday_Rides MonthTotal \\\n", "0 0.0 0.0 153617 \n", "1 0.0 0.0 22001 \n", "2 13210.7 8725.3 567413 \n", "3 17994.0 10662.2 618796 \n", "4 13088.2 7165.6 493926 \n", "\n", " Month_Beginning_year diff_week_saturday diff_week_sunday diff_sat_sunday \n", "0 2001 6982.6 6982.6 0.0 \n", "1 2001 1000.0 1000.0 0.0 \n", "2 2001 8195.8 12681.2 4485.4 \n", "3 2001 4438.2 11770.0 7331.8 \n", "4 2001 5354.8 11277.4 5922.6 " ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# 6. Which routes have the highest difference in average ridership between weekdays and Saturdays? \n", "# 7. Which routes have the highest difference in average ridership between weekdays and Sundays/Holidays?\n", "\n", "\n", "# create new columns that store differences in ridership among day types\n", "data['diff_week_saturday'] = data['Avg_Weekday_Rides'] - data['Avg_Saturday_Rides'] # get difference of two columns of a dataframe\n", "data['diff_week_sunday'] = data['Avg_Weekday_Rides'] - data['Avg_Sunday-Holiday_Rides']\n", "data['diff_sat_sunday'] = data['Avg_Saturday_Rides'] - data['Avg_Sunday-Holiday_Rides']\n", "data.head()" ] }, { "cell_type": "code", "execution_count": 10, "id": "4384b022", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "diff_week_saturday\n", "diff_week_sunday\n", "diff_sat_sunday\n" ] }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
diff_week_saturdaydiff_week_sundaydiff_sat_sunday
routename
111th/King Drive104286.6142535.038248.4
16th/18th205568.8269088.863520.0
31st12376.012376.00.0
31st/35th159550.3213529.353979.0
35th340682.7520450.8179768.1
\n", "
" ], "text/plain": [ " diff_week_saturday diff_week_sunday diff_sat_sunday\n", "routename \n", "111th/King Drive 104286.6 142535.0 38248.4\n", "16th/18th 205568.8 269088.8 63520.0\n", "31st 12376.0 12376.0 0.0\n", "31st/35th 159550.3 213529.3 53979.0\n", "35th 340682.7 520450.8 179768.1" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# select columns to work with\n", "cols = [ 'diff_week_saturday','diff_week_sunday', 'diff_sat_sunday']\n", "\n", "# create grouped dataframes for each day type and store results in a list\n", "total_byroutes = []\n", "for i in cols:\n", " print(i)\n", " routes_grouped = data[['routename', i]].groupby('routename')\n", " total_byroutes.append(routes_grouped.sum())\n", " \n", "totaldiff_byroutes = pd.concat(total_byroutes, axis=1) # convert a list of dataframe to single dataframe\n", "totaldiff_byroutes.head()" ] }, { "cell_type": "code", "execution_count": 11, "id": "7146fdf1", "metadata": {}, "outputs": [], "source": [ "# create a function to get the top or bottom N items of a column with its index\n", "\n", "def get_n_items(diff_col, N=10, sort_asc=True):\n", " if sort_asc:\n", " get_n = diff_col.abs().sort_values()[:N] # get absolute value and then sort in ascending order and get first N items\n", " else:\n", " get_n = diff_col.abs().sort_values(ascending=False)[:N] # get absolute value and then sort in descending order and get first N items\n", " \n", " return get_n" ] }, { "cell_type": "code", "execution_count": 12, "id": "465c1638", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
diff_week_saturdaydiff_week_sundaydiff_sat_sunday
79th1548851.23008064.01459212.8
AshlandNaN2246509.61295046.1
Belmont1462054.12406277.3944223.2
Chicago1634425.22666510.91032085.7
ClarkNaNNaN874908.1
Cottage Grove1346034.32344181.8998147.5
Halsted1654565.82457469.9NaN
Kimball-Homan1500209.42267138.1NaN
King DriveNaN2392445.71122547.2
LaSalle1799238.0NaNNaN
Madison1632773.12491969.9859196.8
Pulaski1369288.92296854.7927565.8
WesternNaNNaN1237324.4
Western Express1361169.4NaNNaN
\n", "
" ], "text/plain": [ " diff_week_saturday diff_week_sunday diff_sat_sunday\n", "79th 1548851.2 3008064.0 1459212.8\n", "Ashland NaN 2246509.6 1295046.1\n", "Belmont 1462054.1 2406277.3 944223.2\n", "Chicago 1634425.2 2666510.9 1032085.7\n", "Clark NaN NaN 874908.1\n", "Cottage Grove 1346034.3 2344181.8 998147.5\n", "Halsted 1654565.8 2457469.9 NaN\n", "Kimball-Homan 1500209.4 2267138.1 NaN\n", "King Drive NaN 2392445.7 1122547.2\n", "LaSalle 1799238.0 NaN NaN\n", "Madison 1632773.1 2491969.9 859196.8\n", "Pulaski 1369288.9 2296854.7 927565.8\n", "Western NaN NaN 1237324.4\n", "Western Express 1361169.4 NaN NaN" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# apply the function to all columns of the dataframe with difference information and \n", "# get top 10 routes\n", "top10route_bydiff = totaldiff_byroutes.apply(get_n_items, sort_asc=False)\n", "top10route_bydiff" ] }, { "cell_type": "code", "execution_count": 13, "id": "288c364e", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
diff_week_saturdaydiff_week_sundaydiff_sat_sunday
69th-Garfield Express Shuttle2148.64346.6NaN
Central/Sherman926.51260.8NaN
Cermak-Roosevelt ExpressNaN1457.2NaN
Chicago Manufacturing Campus2594.12651.5NaN
Chicago/GolfNaNNaN0.0
Cicero ExpressNaNNaN0.0
Clarendon/LaSalle ExpressNaNNaN0.0
Clarendon/Michigan ExpressNaNNaN0.0
Dan Ryan OWL Shuttle1265.4162.0NaN
King Drive ExpressNaNNaN0.0
LaSalleNaNNaN0.0
Pershing Shuttle393.0350.0NaN
Pullman Shuttle1090.93754.5NaN
ROAD CALL0.01.0NaN
Ridge/GrantNaNNaN0.0
SedgwickNaNNaN0.0
Sheridan/LaSalle ExpressNaNNaN0.0
South Pulaski LimitedNaNNaN0.0
Special Dest Signs3.13.1NaN
Touhy Supplement8.788.1NaN
U. of Chicago/Lakeview Express2326.4NaNNaN
\n", "
" ], "text/plain": [ " diff_week_saturday diff_week_sunday \\\n", "69th-Garfield Express Shuttle 2148.6 4346.6 \n", "Central/Sherman 926.5 1260.8 \n", "Cermak-Roosevelt Express NaN 1457.2 \n", "Chicago Manufacturing Campus 2594.1 2651.5 \n", "Chicago/Golf NaN NaN \n", "Cicero Express NaN NaN \n", "Clarendon/LaSalle Express NaN NaN \n", "Clarendon/Michigan Express NaN NaN \n", "Dan Ryan OWL Shuttle 1265.4 162.0 \n", "King Drive Express NaN NaN \n", "LaSalle NaN NaN \n", "Pershing Shuttle 393.0 350.0 \n", "Pullman Shuttle 1090.9 3754.5 \n", "ROAD CALL 0.0 1.0 \n", "Ridge/Grant NaN NaN \n", "Sedgwick NaN NaN \n", "Sheridan/LaSalle Express NaN NaN \n", "South Pulaski Limited NaN NaN \n", "Special Dest Signs 3.1 3.1 \n", "Touhy Supplement 8.7 88.1 \n", "U. of Chicago/Lakeview Express 2326.4 NaN \n", "\n", " diff_sat_sunday \n", "69th-Garfield Express Shuttle NaN \n", "Central/Sherman NaN \n", "Cermak-Roosevelt Express NaN \n", "Chicago Manufacturing Campus NaN \n", "Chicago/Golf 0.0 \n", "Cicero Express 0.0 \n", "Clarendon/LaSalle Express 0.0 \n", "Clarendon/Michigan Express 0.0 \n", "Dan Ryan OWL Shuttle NaN \n", "King Drive Express 0.0 \n", "LaSalle 0.0 \n", "Pershing Shuttle NaN \n", "Pullman Shuttle NaN \n", "ROAD CALL NaN \n", "Ridge/Grant 0.0 \n", "Sedgwick 0.0 \n", "Sheridan/LaSalle Express 0.0 \n", "South Pulaski Limited 0.0 \n", "Special Dest Signs NaN \n", "Touhy Supplement NaN \n", "U. of Chicago/Lakeview Express NaN " ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# apply the function to all columns of the dataframe with difference information and \n", "# get bottom 10 routes\n", "\n", "bottom10route_bydiff = totaldiff_byroutes.apply(get_n_items)\n", "bottom10route_bydiff " ] }, { "cell_type": "code", "execution_count": 14, "id": "57531e72", "metadata": {}, "outputs": [], "source": [ "# 8. Which routes have the most consistent average ridership between weekdays, Saturdays and Sundays/Holidays? \n", "# i.e. 
are there any routes that are not affected by the day of the week?" ] }, { "cell_type": "code", "execution_count": 15, "id": "86cfd81c", "metadata": {}, "outputs": [], "source": [ "# Note that this question relies on an assumption about what counts as \"consistent\"\n", "# For our demonstration, any absolute difference of at most 5000 is considered consistent" ] }, { "cell_type": "code", "execution_count": 16, "id": "f448cac3", "metadata": {}, "outputs": [], "source": [ "# create a function that returns the index labels of the rows in a given column whose absolute value is at most N\n", "def consistency(col, N=5000):\n", "    consistent_rows = col[col.abs()<=N].index\n", "    return consistent_rows" ] }, { "cell_type": "code", "execution_count": 17, "id": "f70d205c", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
diff_week_saturdaydiff_week_sundaydiff_sat_sunday
69th-Garfield Express Shuttle2148.64346.62198.0
Central/Sherman-926.5-1260.8-334.3
Cermak-Roosevelt Express-4163.2-1457.22706.0
Chicago Manufacturing Campus2594.12651.557.4
Chinatown/Pilsen Shuttle-4229.5NaN-1251.2
............
West 65thNaNNaN0.0
West CermakNaNNaN1.5
West Loop/South LoopNaNNaN0.0
WestchesterNaNNaN0.0
Western ExpressNaNNaN-203.8
\n", "

76 rows × 3 columns

\n", "
" ], "text/plain": [ " diff_week_saturday diff_week_sunday \\\n", "69th-Garfield Express Shuttle 2148.6 4346.6 \n", "Central/Sherman -926.5 -1260.8 \n", "Cermak-Roosevelt Express -4163.2 -1457.2 \n", "Chicago Manufacturing Campus 2594.1 2651.5 \n", "Chinatown/Pilsen Shuttle -4229.5 NaN \n", "... ... ... \n", "West 65th NaN NaN \n", "West Cermak NaN NaN \n", "West Loop/South Loop NaN NaN \n", "Westchester NaN NaN \n", "Western Express NaN NaN \n", "\n", " diff_sat_sunday \n", "69th-Garfield Express Shuttle 2198.0 \n", "Central/Sherman -334.3 \n", "Cermak-Roosevelt Express 2706.0 \n", "Chicago Manufacturing Campus 57.4 \n", "Chinatown/Pilsen Shuttle -1251.2 \n", "... ... \n", "West 65th 0.0 \n", "West Cermak 1.5 \n", "West Loop/South Loop 0.0 \n", "Westchester 0.0 \n", "Western Express -203.8 \n", "\n", "[76 rows x 3 columns]" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# initialize list to store results\n", "consistent_routes = [] \n", "\n", "# no. of columns \n", "ncol = totaldiff_byroutes.shape[1] \n", "\n", "# iterate over no. of columns and columns names\n", "for i,j in zip(range(0,ncol) , totaldiff_byroutes.columns):\n", " \n", " # apply the above consistency function on all columns of the dataframe with difference values\n", " consistent_index = totaldiff_byroutes.apply(consistency)\n", " \n", " # take result for one column at a time and append to the initialized list\n", " consistent_routes.append(totaldiff_byroutes[j].loc[consistent_index[i]])\n", "\n", "# convert a list of series into dataframe \n", "consistent_routes = pd.concat(consistent_routes, axis=1)\n", "consistent_routes" ] }, { "cell_type": "markdown", "id": "c20ba562", "metadata": {}, "source": [ "### Pivot Table\n", "\n", "The popular Pivot Table feature of Excel can also be replicated in `pandas` using the `pivot_table` method. 
It takes the data to pivot (`data`), the column(s) whose values become the rows of the pivot table (`index`), the column(s) whose values become its columns (`columns`), the column(s) to aggregate (`values`) and the function to use for aggregation (`aggfunc`). " ] }, { "cell_type": "code", "execution_count": 18, "id": "b013eb67", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " 
\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
MonthTotal
Month_Beginning_month123456789101112
Month_Beginning_year
2001191682.569231180008.114504202056.333333189006.106061200182.601504187035.924812185020.507576184324.962406184282.171642201919.833333188562.340909176664.303030
2002186997.113636178375.167939187451.431818190126.901515195393.708955180937.801471184205.235294185049.191176193105.808824206207.705882181213.029412176199.044118
2003179165.701493168556.231343181676.970370179275.511111184536.043796171917.528571177907.066667169338.891304180159.774648191909.283688162295.535714164920.928571
2004163615.028777162704.900000181756.892857171816.617021171949.531469171766.465278166555.993007162190.280822179752.808219186020.576389171247.461538161263.629371
2005161472.569444165635.729167178743.951389178096.659722176994.794521172665.896552166595.317241172662.727891186046.816327190225.736111171809.524138162478.250000
2006159650.793103158469.262069181230.143836162986.267123177722.168919163903.105960158669.520270161644.835526169699.493421175796.307190167131.980000156727.589404
2007158595.183007142559.398693175356.618421162704.558442177779.316129164675.909091163282.642857165791.522581170422.019231187522.670968167717.136364155644.850649
2008161955.807947161156.598684172276.427632184973.980392187152.207792178615.116883187269.171053182051.230263189574.123377202679.627451171604.526316159266.828947
2009164051.013158164146.125000182435.556291174261.921053179606.483660171605.857143174963.254902167717.474026180726.227273190903.726667175525.610738164446.718121
2010168908.081081161130.597315192687.568345185029.442857181874.723404180938.428571178714.550000182715.707143189212.482270196662.543478179465.800000163862.827338
2011174595.215827163577.877698198571.268116182876.115108188321.935714187221.913669178593.742857189062.758865193413.929078198528.647482185121.028777178975.848921
2012178384.782609185859.107914201808.231884185649.057971193856.321429184229.050000180039.690647189745.794326186888.659574205264.492857186221.323741168258.085714
2013192483.614173184558.724409200014.779528202087.653543197548.592593183551.007407183128.437037184120.588235187021.789855199629.625000178684.570312167303.421875
2014160137.598425168318.484375190071.007812183523.700787190023.343750172813.085938173572.410853170717.868217188014.341085198291.804688164755.759690169979.132812
2015167105.401575162483.218750191411.456693183556.196850182984.210938179720.359375177210.328125172042.445312182426.038462192683.992248167170.775194163185.961538
2016158483.715385165692.449612178756.192308165915.323077170133.030769166340.553846157339.449612164210.046154171288.712121176018.630769162720.286822150495.625000
2017153340.905512154858.796875173420.598425157605.834646169127.325581162655.279070152702.403101159874.184615167603.223077175312.687500160717.687500146000.421875
2018150500.055118144171.562500164127.173228159991.511811165849.007752153758.015504152571.186047154002.290076159575.713178NaNNaNNaN
\n", "
" ], "text/plain": [ " MonthTotal \\\n", "Month_Beginning_month 1 2 3 \n", "Month_Beginning_year \n", "2001 191682.569231 180008.114504 202056.333333 \n", "2002 186997.113636 178375.167939 187451.431818 \n", "2003 179165.701493 168556.231343 181676.970370 \n", "2004 163615.028777 162704.900000 181756.892857 \n", "2005 161472.569444 165635.729167 178743.951389 \n", "2006 159650.793103 158469.262069 181230.143836 \n", "2007 158595.183007 142559.398693 175356.618421 \n", "2008 161955.807947 161156.598684 172276.427632 \n", "2009 164051.013158 164146.125000 182435.556291 \n", "2010 168908.081081 161130.597315 192687.568345 \n", "2011 174595.215827 163577.877698 198571.268116 \n", "2012 178384.782609 185859.107914 201808.231884 \n", "2013 192483.614173 184558.724409 200014.779528 \n", "2014 160137.598425 168318.484375 190071.007812 \n", "2015 167105.401575 162483.218750 191411.456693 \n", "2016 158483.715385 165692.449612 178756.192308 \n", "2017 153340.905512 154858.796875 173420.598425 \n", "2018 150500.055118 144171.562500 164127.173228 \n", "\n", " \\\n", "Month_Beginning_month 4 5 6 \n", "Month_Beginning_year \n", "2001 189006.106061 200182.601504 187035.924812 \n", "2002 190126.901515 195393.708955 180937.801471 \n", "2003 179275.511111 184536.043796 171917.528571 \n", "2004 171816.617021 171949.531469 171766.465278 \n", "2005 178096.659722 176994.794521 172665.896552 \n", "2006 162986.267123 177722.168919 163903.105960 \n", "2007 162704.558442 177779.316129 164675.909091 \n", "2008 184973.980392 187152.207792 178615.116883 \n", "2009 174261.921053 179606.483660 171605.857143 \n", "2010 185029.442857 181874.723404 180938.428571 \n", "2011 182876.115108 188321.935714 187221.913669 \n", "2012 185649.057971 193856.321429 184229.050000 \n", "2013 202087.653543 197548.592593 183551.007407 \n", "2014 183523.700787 190023.343750 172813.085938 \n", "2015 183556.196850 182984.210938 179720.359375 \n", "2016 165915.323077 170133.030769 166340.553846 \n", "2017 157605.834646 
169127.325581 162655.279070 \n", "2018 159991.511811 165849.007752 153758.015504 \n", "\n", " \\\n", "Month_Beginning_month 7 8 9 \n", "Month_Beginning_year \n", "2001 185020.507576 184324.962406 184282.171642 \n", "2002 184205.235294 185049.191176 193105.808824 \n", "2003 177907.066667 169338.891304 180159.774648 \n", "2004 166555.993007 162190.280822 179752.808219 \n", "2005 166595.317241 172662.727891 186046.816327 \n", "2006 158669.520270 161644.835526 169699.493421 \n", "2007 163282.642857 165791.522581 170422.019231 \n", "2008 187269.171053 182051.230263 189574.123377 \n", "2009 174963.254902 167717.474026 180726.227273 \n", "2010 178714.550000 182715.707143 189212.482270 \n", "2011 178593.742857 189062.758865 193413.929078 \n", "2012 180039.690647 189745.794326 186888.659574 \n", "2013 183128.437037 184120.588235 187021.789855 \n", "2014 173572.410853 170717.868217 188014.341085 \n", "2015 177210.328125 172042.445312 182426.038462 \n", "2016 157339.449612 164210.046154 171288.712121 \n", "2017 152702.403101 159874.184615 167603.223077 \n", "2018 152571.186047 154002.290076 159575.713178 \n", "\n", " \n", "Month_Beginning_month 10 11 12 \n", "Month_Beginning_year \n", "2001 201919.833333 188562.340909 176664.303030 \n", "2002 206207.705882 181213.029412 176199.044118 \n", "2003 191909.283688 162295.535714 164920.928571 \n", "2004 186020.576389 171247.461538 161263.629371 \n", "2005 190225.736111 171809.524138 162478.250000 \n", "2006 175796.307190 167131.980000 156727.589404 \n", "2007 187522.670968 167717.136364 155644.850649 \n", "2008 202679.627451 171604.526316 159266.828947 \n", "2009 190903.726667 175525.610738 164446.718121 \n", "2010 196662.543478 179465.800000 163862.827338 \n", "2011 198528.647482 185121.028777 178975.848921 \n", "2012 205264.492857 186221.323741 168258.085714 \n", "2013 199629.625000 178684.570312 167303.421875 \n", "2014 198291.804688 164755.759690 169979.132812 \n", "2015 192683.992248 167170.775194 163185.961538 \n", "2016 
176018.630769 162720.286822 150495.625000 \n", "2017 175312.687500 160717.687500 146000.421875 \n", "2018 NaN NaN NaN " ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data['Month_Beginning_month'] = data['Month_Beginning'].dt.month\n", "ridership_overtime = pd.pivot_table(data=data.iloc[:,1:], \n", " index=['Month_Beginning_year'], \n", " columns=['Month_Beginning_month'],\n", " values=['MonthTotal'],\n", " aggfunc=np.mean)\n", "ridership_overtime" ] }, { "cell_type": "markdown", "id": "27f70fe9", "metadata": {}, "source": [ "The `pivot_table` function offers additional features, such as filling in missing values, which are shown as NaN by default. Refer to the [official documentation](https://pandas.pydata.org/docs/reference/api/pandas.pivot_table.html) for usage. \n", "\n", "Now, with the above data, we can plot the ridership trend of each month over the years. To generate the names of the months automatically, we can use the `datetime` module." ] }, { "cell_type": "code", "execution_count": 19, "id": "220c7069", "metadata": {}, "outputs": [], "source": [ "import datetime as dt  # used below to generate the month names" ] }, { "cell_type": "code", "execution_count": 20, "id": "c5a2ee60", "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "months = [dt.date(2022, m, 1).strftime('%B') for m in range(1, 13)] # generate names of the months of a year\n", "fig, ax = plt.subplots(figsize=(15,8))\n", "ridership_overtime.plot(kind='line', style=['r*-','bo-'], ax=ax) # set styles for the first two lines so that they are easier to tell apart\n", "plt.legend(months, ncol=3, loc='lower left', title='Month of the year',) # show legend in 3 columns\n", "ax.xaxis.set_major_locator(MaxNLocator(integer=True))\n", "plt.show()" ] }, { "cell_type": "markdown", "id": "196f65cd", "metadata": {}, "source": [ "### Modelling the Data\n", "\n", "Now that we have explored the data, we can focus on modelling it. There are many ways to model data, and the best one depends on the problem you are trying to solve. Let's say, for example, that we want to know whether the ridership on Weekdays and Saturdays can be used to predict the ridership on Sundays and Holidays.\n", "\n", "**Note:** Obviously, ridership on Sundays and Holidays depends on many other factors besides ridership on Weekdays and Saturdays. The modelling here is for the purpose of demonstration only. \n", "\n", "The first requirement for modelling is that the data must be clean and every item in it must be numeric. This is because machine learning models, and statistical models in general, do not accept textual or missing values. Such data must be processed and transformed into a reasonable numeric representation before it is used in modelling. \n", "\n", "In our example, we will use two numeric columns that have no missing data, as identified above, so we are ready to use this data. We will begin by separating the variable that will be predicted from the variables that will be used to predict it. The former is commonly referred to as the target (y) and the latter as the features (X)."
] }, { "cell_type": "code", "execution_count": 21, "id": "0a74ef72", "metadata": {}, "outputs": [], "source": [ "X = data[['Avg_Weekday_Rides','Avg_Saturday_Rides']]\n", "y = data['Avg_Sunday-Holiday_Rides']" ] }, { "cell_type": "code", "execution_count": 22, "id": "484be1e5", "metadata": {}, "outputs": [ { "data": { "text/plain": [ " Avg_Weekday_Rides Avg_Saturday_Rides\n", "0 6982.6 0.0\n", "1 1000.0 0.0\n", "2 21406.5 13210.7\n", "3 22432.2 17994.0\n", "4 18443.0 13088.2" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "X.head()" ] }, { "cell_type": "code", "execution_count": 23, "id": "8404dae1", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 0.0\n", "1 0.0\n", "2 8725.3\n", "3 10662.2\n", "4 7165.6\n", "Name: Avg_Sunday-Holiday_Rides, dtype: float64" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "y[:5]" ] }, { "cell_type": "markdown", "id": "bde25300", "metadata": {}, "source": [ "Next, we will split the data into a training set and a test set. Although we use data to train a machine learning model, its performance should be reported on data that the model has never seen before. This ensures that the model is able to generalize, i.e. it has not just memorized the training data but has learned patterns that help predict future data points. This is very important if we want to use the model in the real world.\n", "\n", "Usually a 70-30 or 80-20 split is recommended. In our case we will keep 3/4 of the data for training and 1/4 for testing. Let's import the `train_test_split` function, available via the `model_selection` submodule of the `sklearn` library. This function takes the feature and target data and gives us the desired splits of the data. " ] }, { "cell_type": "code", "execution_count": 24, "id": "cdefc0a2", "metadata": {}, "outputs": [], "source": [ "from sklearn.model_selection import train_test_split as tts" ] }, { "cell_type": "code", "execution_count": 25, "id": "bff12f60", "metadata": {}, "outputs": [], "source": [ "X_train, X_test, y_train, y_test = tts(X, # feature data\n", " y, # target data \n", " test_size=.25, # size of the test set\n", " random_state=42) # fix the random seed so that the split is reproducible " ] }, { "cell_type": "code", "execution_count": 26, "id": "6d1012f9", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "(22155, 2) (7386, 2) (22155,) (7386,)\n" ] } ], "source": [ "print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)" ] }, { "cell_type": "markdown", "id": "3fa6df89", "metadata": {}, "source": [ "You can see that 1/4 of the data for both features and target is now separated as the test set. \n", "\n", "Let's now import the `LinearRegression` model from the `linear_model` submodule of `sklearn`, which we will fit to our data." ] }, { "cell_type": "code", "execution_count": 27, "id": "fa2b0d5e", "metadata": {}, "outputs": [], "source": [ "from sklearn.linear_model import LinearRegression" ] }, { "cell_type": "markdown", "id": "2f6b0233", "metadata": {}, "source": [ "Most of the models and transformers in sklearn are used in the same way:\n", "\n", "1. Create an instance of the object in use.\n", "2. Use the `fit` or `fit_transform` method to fit or transform the data as needed. " ] }, { "cell_type": "code", "execution_count": 28, "id": "5f9a6ade", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "LinearRegression()" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "model = LinearRegression()\n", "model.fit(X_train,y_train)" ] }, { "cell_type": "markdown", "id": "b55db929", "metadata": {}, "source": [ "Now that the linear model has been fit to the data, information about the fitted model is available in the `model` object. One way to inspect it is the `score` method, which returns the coefficient of determination of the prediction, also known as R squared. It is the proportion of the variation in the dependent variable that is predictable from the independent variables. We can pass in the training data to get this score.\n", "\n", "Other properties of interest are the intercept and the coefficients of the linear regression model. We can get this information from the `intercept_` and `coef_` attributes respectively."
] }, { "cell_type": "code", "execution_count": 29, "id": "60703497", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.9806878464719238" ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# return the coefficient of determination of the prediction\n", "model.score(X_train, y_train)" ] }, { "cell_type": "code", "execution_count": 30, "id": "64b2e7ff", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "-50.08987832067169" ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": [ "model.intercept_" ] }, { "cell_type": "code", "execution_count": 31, "id": "b3c4420a", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([-0.00808142, 0.71893902])" ] }, "execution_count": 31, "metadata": {}, "output_type": "execute_result" } ], "source": [ "model.coef_" ] }, { "cell_type": "markdown", "id": "03a4b0cf", "metadata": {}, "source": [ "While the model's R squared value looks good, it only measures how well the model fits the training data. The next step of evaluation is to see how well the model performs on unseen test data. Regression models are often evaluated on unseen data with the mean squared error metric. To calculate it, we can use the `mean_squared_error` function available through the `metrics` submodule of `sklearn`. The function takes the model's predictions on a given dataset and the actual target values for that dataset. Therefore, we first need to generate predictions from our model on the test set using the `predict` method."
] }, { "cell_type": "code", "execution_count": 32, "id": "7caeb4e8", "metadata": {}, "outputs": [], "source": [ "from sklearn.metrics import mean_squared_error as mse" ] }, { "cell_type": "code", "execution_count": 33, "id": "a85abde9", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "258826.95917043873\n" ] } ], "source": [ "y_pred_test = model.predict(X_test)\n", "mse_lrmodel = mse(y_pred_test, y_test)\n", "print(mse_lrmodel)" ] }, { "cell_type": "markdown", "id": "f180feb0", "metadata": {}, "source": [ "How do we know this is a good enough value?\n", "\n", "In machine learning, we usually have a benchmark model against which we can compare performance. In this case we only have one model, so we can create another model and see which one performs better. \n", "\n", "We can repeat what we did earlier with another model, or we can write a for-loop so that the exact same operations are applied to every model in a list. The latter is better because we do not have to write the same code over and over again, and it is also easier to read."
] }, { "cell_type": "code", "execution_count": 34, "id": "7e76b0eb", "metadata": {}, "outputs": [], "source": [ "from sklearn.tree import DecisionTreeRegressor" ] }, { "cell_type": "code", "execution_count": 35, "id": "7340a2c6", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[('LR', LinearRegression()), ('DT', DecisionTreeRegressor())]" ] }, "execution_count": 35, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# initialize an empty list to add sklearn model objects\n", "models = []\n", "\n", "# add the sklearn model objects to the list one by one\n", "# give each model a name by storing the name and the model together in a tuple\n", "models.append(('LR', LinearRegression())) \n", "models.append(('DT', DecisionTreeRegressor())) # a single decision tree, a non-linear model\n", "models" ] }, { "cell_type": "code", "execution_count": 36, "id": "bf8ae28c", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "LinearRegression()\n", "DecisionTreeRegressor()\n" ] }, { "data": { "text/plain": [ "{'LR': 258826.95917043873, 'DT': 491242.7129623926}" ] }, "execution_count": 36, "metadata": {}, "output_type": "execute_result" } ], "source": [ "scores = {}\n", "for name, model in models:\n", "    print(model)\n", "    model.fit(X_train, y_train)\n", "    y_pred_test = model.predict(X_test)\n", "    mse_score = mse(y_pred_test, y_test)\n", "    scores[name] = mse_score\n", "scores" ] }, { "cell_type": "markdown", "id": "4509c58e", "metadata": {}, "source": [ "Note that the Linear Regression model has a lower mean squared error than the Decision Tree Regression model and is therefore the better of the two here.\n", "\n", "It might also be a good idea to store the fitted models, so that you can explore more details of each model rather than just the scores. 
See [here](https://scikit-learn.org/stable/modules/tree.html#tree-regression) to learn more about decision tree model properties and the sklearn features available to explore the model details." ] }, { "cell_type": "markdown", "id": "9b15c83d", "metadata": {}, "source": [ "### Adding Categorical Features to our Model\n", "\n", "The above model has no information about the ridership pattern specific to a route. So, adding this information might help predict the Sunday-Holiday ridership behavior better. \n", "\n", "There are two ways we can continue further: \n", "1. Isolate data for each route and repeat the above modelling process on that subset.\n", "2. Add the route names to the features and create a single model, in which case we must convert the text to numeric values.\n", "\n", "Let's try the first option on one route." ] }, { "cell_type": "code", "execution_count": 37, "id": "3b03f00c", "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/plain": [ " route routename Month_Beginning Avg_Weekday_Rides Avg_Saturday_Rides \\\n", "14 18 16th/18th 2001-01-01 1923.8 898.5 \n", "144 18 16th/18th 2001-02-01 2075.4 896.8 \n", "275 18 16th/18th 2001-03-01 1717.8 979.8 \n", "407 18 16th/18th 2001-04-01 1685.9 1003.6 \n", "539 18 16th/18th 2001-05-01 1662.6 957.6 \n", "\n", " Avg_Sunday-Holiday_Rides MonthTotal Month_Beginning_year \\\n", "14 642.3 49130 2001 \n", "144 618.7 47570 2001 \n", "275 710.8 45533 2001 \n", "407 732.2 43079 2001 \n", "539 757.3 44194 2001 \n", "\n", " diff_week_saturday diff_week_sunday diff_sat_sunday \\\n", "14 1025.3 1281.5 256.2 \n", "144 1178.6 1456.7 278.1 \n", "275 738.0 1007.0 269.0 \n", "407 682.3 953.7 271.4 \n", "539 705.0 905.3 200.3 \n", "\n", " Month_Beginning_month \n", "14 1 \n", "144 2 \n", "275 3 \n", "407 4 \n", "539 5 " ] }, "execution_count": 37, "metadata": {}, "output_type": "execute_result" } ], "source": [ "oneroute = data[data['routename']=='16th/18th']\n", "oneroute.head()" ] }, { "cell_type": "code", "execution_count": 38, "id": "5573f40f", "metadata": {}, "outputs": [], "source": [ "def create_model(target_features_data, targetname, models):\n", "    \n", "    y = target_features_data[targetname]\n", "    X = target_features_data.drop(targetname, axis=1)\n", "    print(X.shape, y.shape)\n", "    \n", "    X_train, X_test, y_train, y_test = tts(X, y, test_size=1/4, random_state=42)\n", "\n", "    scores = {}\n", "    for name, model in models:\n", "        print(f'fitting model: {model}')\n", "        model.fit(X_train, y_train)\n", "        y_pred_test = model.predict(X_test)\n", "        mse_score = mse(y_pred_test, y_test)\n", "        scores[name] = mse_score\n", "    \n", "    return scores" ] }, { "cell_type": "code", "execution_count": 39, "id": "2204849d", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "(213, 2) (213,)\n", "fitting model: LinearRegression()\n", "fitting model: DecisionTreeRegressor()\n" ] }, { "data": { "text/plain": [ "{'LR': 27684.616857695066, 
'DT': 54773.207407407404}" ] }, "execution_count": 39, "metadata": {}, "output_type": "execute_result" } ], "source": [ "cols = ['Avg_Sunday-Holiday_Rides', 'Avg_Weekday_Rides','Avg_Saturday_Rides']\n", "scores_oneroute = create_model(oneroute[cols], 'Avg_Sunday-Holiday_Rides', models)\n", "scores_oneroute" ] }, { "cell_type": "markdown", "id": "40193729", "metadata": {}, "source": [ "Let's now try option 2, where we add the categorical feature routename to our model.\n", "\n", "What are the ways to convert a categorical feature into numeric values? There are many. One popular method is to create a new column for each category and fill in the value 1 in a route's column if the row belongs to that route, and 0 otherwise. This process is also called creating dummy variables.\n", "\n", "Pandas has an easy way to create dummy variables using the `get_dummies` function. Let's apply it to a toy dataset to see what transformation takes place." ] }, { "cell_type": "code", "execution_count": 40, "id": "7ff0939d", "metadata": {}, "outputs": [ { "data": { "text/plain": [ " fruits quantity\n", "0 apple 2\n", "1 orange 5" ] }, "execution_count": 40, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# create a toy data set with categorical and numeric features\n", "test_data = pd.DataFrame( [['apple', 2], ['orange', 5]], columns=['fruits', 'quantity'])\n", "test_data.head()" ] }, { "cell_type": "code", "execution_count": 41, "id": "9fafdc42", "metadata": {}, "outputs": [ { "data": { "text/plain": [ " apple orange\n", "0 1 0\n", "1 0 1" ] }, "execution_count": 41, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# create dummy variables for a categorical column\n", "pd.get_dummies(test_data['fruits'])" ] }, { "cell_type": "code", "execution_count": 42, "id": "f9df3d13", "metadata": {}, "outputs": [ { "data": { "text/plain": [ " apple orange quantity\n", "0 1 0 2\n", "1 0 1 5" ] }, "execution_count": 42, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# combine the dummy variables with the remaining numeric column of the original data\n", "pd.concat([pd.get_dummies(test_data['fruits']), test_data['quantity']], axis=1)" ] }, { "cell_type": "markdown", "id": "74c393d8", "metadata": {}, "source": [ "The above transformation can also be done with the `OneHotEncoder` class available through the `preprocessing` submodule of `sklearn`. This approach is easier to use if you have many categorical columns. Here, we will see an example on our toy data set." ] }, { "cell_type": "code", "execution_count": 43, "id": "94c2cc9a", "metadata": {}, "outputs": [], "source": [ "from sklearn.preprocessing import OneHotEncoder" ] }, { "cell_type": "code", "execution_count": 44, "id": "3104fbf6", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "<2x2 sparse matrix of type '<class 'numpy.float64'>'\n", "\twith 2 stored elements in Compressed Sparse Row format>" ] }, "execution_count": 44, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# 1. create an instance of the one hot encoder object\n", "ohe = OneHotEncoder()\n", "\n", "# 2. fit the one hot encoder instance to the data and transform it\n", "ohe.fit_transform(test_data['fruits'].values.reshape(-1,1)) # the input data must be 2-dimensional" ] }, { "cell_type": "markdown", "id": "0699ac72", "metadata": {}, "source": [ "The resulting object is a sparse matrix, which cannot be combined directly with a `dataframe`. Therefore, we must first convert it to a numpy `array` using the provided `toarray` method." ] }, { "cell_type": "code", "execution_count": 45, "id": "045cbac2", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([[1., 0.],\n", " [0., 1.]])" ] }, "execution_count": 45, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# 3. 
convert the transformed data to a numpy array\n", "dummy_fruits = ohe.fit_transform(test_data['fruits'].values.reshape(-1,1)).toarray()\n", "dummy_fruits" ] }, { "cell_type": "markdown", "id": "d1ce7e28", "metadata": {}, "source": [ "A numpy `array` can be easily converted to a pandas `dataframe`." ] }, { "cell_type": "code", "execution_count": 46, "id": "50e2c584", "metadata": {}, "outputs": [ { "data": { "text/plain": [ " 0 1\n", "0 1.0 0.0\n", "1 0.0 1.0" ] }, "execution_count": 46, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# 4. transform the numpy array to dataframe\n", "pd.DataFrame(dummy_fruits)" ] }, { "cell_type": "markdown", "id": "8360abcd", "metadata": {}, "source": [ "Note that we are missing the column names, which makes it difficult to know which category each column belongs to. They can be easily retrieved from the `categories_` attribute of the `OneHotEncoder` object. " ] }, { "cell_type": "code", "execution_count": 47, "id": "64255a64", "metadata": {}, "outputs": [ { "data": { "text/plain": [ " apple orange\n", "0 1.0 0.0\n", "1 0.0 1.0" ] }, "execution_count": 47, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# while converting to dataframe add column names as well\n", "pd.DataFrame(dummy_fruits, columns=ohe.categories_)" ] }, { "cell_type": "markdown", "id": "51a8b2cb", "metadata": {}, "source": [ "Now, let's try this on the routename column of our dataset and see whether adding this feature helps the model predict better." ] }, { "cell_type": "code", "execution_count": 48, "id": "af6e8d6d", "metadata": {}, "outputs": [ { "data": { "text/plain": [ " 111th/King Drive 16th/18th 31st 31st/35th 35th 43rd 47th 51st 55th/Austin \\\n", "0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 \n", "1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 \n", "2 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 \n", "3 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 \n", "4 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 \n", "\n", " 55th/Narragansett ... West 65th West 95th West Cermak West Lawrence \\\n", "0 0.0 ... 0.0 0.0 0.0 0.0 \n", "1 0.0 ... 0.0 0.0 0.0 0.0 \n", "2 0.0 ... 0.0 0.0 0.0 0.0 \n", "3 0.0 ... 0.0 0.0 0.0 0.0 \n", "4 0.0 ... 0.0 0.0 0.0 0.0 \n", "\n", " West Loop/South Loop Westchester Western Western Express \\\n", "0 0.0 0.0 0.0 0.0 \n", "1 0.0 0.0 0.0 0.0 \n", "2 0.0 0.0 0.0 0.0 \n", "3 0.0 0.0 0.0 0.0 \n", "4 0.0 0.0 0.0 0.0 \n", "\n", " Wilson/Michigan Express Wrigley Field Express \n", "0 0.0 0.0 \n", "1 0.0 0.0 \n", "2 0.0 0.0 \n", "3 0.0 0.0 \n", "4 0.0 0.0 \n", "\n", "[5 rows x 189 columns]" ] }, "execution_count": 48, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ohe = OneHotEncoder()\n", "dummy_routename = ohe.fit_transform(data['routename'].values.reshape(-1,1)).toarray()\n", "dummy_routename = pd.DataFrame(dummy_routename, columns=ohe.categories_)\n", "dummy_routename.head()" ] }, { "cell_type": "code", "execution_count": 49, "id": "0463d031", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "(29541, 191) (29541,)\n", "fitting model: LinearRegression()\n", "fitting model: DecisionTreeRegressor()\n" ] }, { "data": { "text/plain": [ "{'LR': 171772.43965343578, 'DT': 293345.00101623515}" ] }, "execution_count": 49, "metadata": {}, "output_type": "execute_result" } ], "source": [ "cols = ['Avg_Sunday-Holiday_Rides', 'Avg_Weekday_Rides','Avg_Saturday_Rides']\n", "target_features_data = pd.concat([data[cols], dummy_routename], axis=1)\n", "scores_routename = create_model(target_features_data, 'Avg_Sunday-Holiday_Rides' , models)\n", "scores_routename" ] }, { "cell_type": "code", 
"execution_count": 50, "id": "80366e45", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'LR': 258826.95917043873, 'DT': 491242.7129623926}" ] }, "execution_count": 50, "metadata": {}, "output_type": "execute_result" } ], "source": [ "scores" ] }, { "cell_type": "markdown", "id": "bb15e970", "metadata": {}, "source": [ "Adding the routename information lowers the mean squared error of both the linear and the non-linear model. In both cases the linear regression model still has the lower error, so it remains the better fit for this data and problem. \n", "\n", "## Exercise 1:\n", "Create yet another model that has all the features of the best model so far, plus the additional information from the \"Month_Beginning_year\" column. Does adding this information increase predictability?" ] }, { "cell_type": "code", "execution_count": 51, "id": "ccd48203", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "(29541, 209) (29541,)\n", "fitting model: LinearRegression()\n", "fitting model: DecisionTreeRegressor()\n" ] }, { "data": { "text/plain": [ "{'LR': 159713.08129798513, 'DT': 262424.8163688435}" ] }, "execution_count": 51, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ohe = OneHotEncoder()\n", "dummy_year = ohe.fit_transform(data['Month_Beginning_year'].values.reshape(-1,1)).toarray()\n", "dummy_year = pd.DataFrame(dummy_year, columns=ohe.categories_)\n", "\n", "cols = ['Avg_Sunday-Holiday_Rides', 'Avg_Weekday_Rides','Avg_Saturday_Rides']\n", "target_features_data = pd.concat([data[cols], dummy_routename, dummy_year], axis=1)\n", "scores_routename_year = create_model(target_features_data, 'Avg_Sunday-Holiday_Rides' , models)\n", "scores_routename_year" ] }, { "cell_type": "markdown", "id": "fd3c15df", "metadata": {}, "source": [ "## Exercise 2:\n", "Create yet another model that has all the features of the best model so far, plus the additional information from the \"Month_Beginning_month\" column. Does adding this information increase predictability?"
] }, { "cell_type": "code", "execution_count": 52, "id": "8b9c7df6", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "(29541, 221) (29541,)\n", "fitting model: LinearRegression()\n", "fitting model: DecisionTreeRegressor()\n" ] }, { "data": { "text/plain": [ "{'LR': 144173.77290807283, 'DT': 213950.96260628215}" ] }, "execution_count": 52, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ohe_month = OneHotEncoder()\n", "dummy_month = ohe_month.fit_transform(data['Month_Beginning_month'].values.reshape(-1,1)).toarray()\n", "dummy_month = pd.DataFrame(dummy_month, columns=ohe_month.categories_)\n", "\n", "cols = ['Avg_Sunday-Holiday_Rides', 'Avg_Weekday_Rides','Avg_Saturday_Rides']\n", "target_features_data = pd.concat([data[cols], dummy_routename, dummy_year, dummy_month], axis=1)\n", "scores_routename_yymm = create_model(target_features_data, 'Avg_Sunday-Holiday_Rides' , models)\n", "scores_routename_yymm" ] }, { "cell_type": "code", "execution_count": null, "id": "ed45ddfe", "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.3" } }, "nbformat": 4, "nbformat_minor": 5 }