Data Wrangling and Cleaning
This notebook will go through the data cleaning process for Toronto's bike share data from 2017 through 2020. Hourly weather data will also be combined with the bike share data.
# Import 3rd party libraries
import warnings
import os
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
# Configure Notebook
%matplotlib inline
plt.style.use('fivethirtyeight')
sns.set_context("notebook")
warnings.filterwarnings('ignore')
There are four years of Toronto bike share data, from 2017 to 2020. The CSV files from 2017 and 2018 have a different column structure and naming convention than the 2019 and 2020 files, so they will be updated to match the 2019-onward structure.
Columns that aren't in the old 2017/2018 files: 'Subscription Id' and 'Bike Id'.
The 2017 and 2018 files share the same column names but list them in a different order. Order does not matter to pd.concat() as long as the names are the same (illustrated by the toy example below).
The 2017 data also used a different naming convention for user_type, so it will be updated here; the 2018 data will be added after.
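A minimal sketch of this alignment behaviour, using made-up two-column frames:
# pd.concat aligns columns by name, not by position
df_a = pd.DataFrame({'trip_id': [1], 'user_type': ['Member']})
df_b = pd.DataFrame({'user_type': ['Casual'], 'trip_id': [2]})
pd.concat([df_a, df_b], ignore_index=True)
#    trip_id user_type
# 0        1    Member
# 1        2    Casual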
# Load and concat 2017 bike share files
trips_data_17_18 = pd.concat([pd.read_csv('project_data/' + file) for file in os.listdir('project_data') if 'bike_share_2017' in file])
# Update user_types to newer naming convention
trips_data_17_18['user_type'] = ['Annual Member' if row == 'Member' else 'Casual Member' for row in trips_data_17_18['user_type']]
# Concat with 2018 data
trips_data_17_18 = pd.concat([trips_data_17_18, pd.concat([pd.read_csv('project_data/' + file) for file in os.listdir('project_data') if 'bike_share_2018' in file])])
# Preview dataframe
trips_data_17_18.head()
trip_id | trip_start_time | trip_stop_time | trip_duration_seconds | from_station_id | from_station_name | to_station_id | to_station_name | user_type | |
---|---|---|---|---|---|---|---|---|---|
0 | 810581 | 2017-03-01 05:01 (UTC) | 2017-03-01 05:50 (UTC) | 2934 | 7202.0 | Queen St W / York St (City Hall) | 7202.0 | Queen St W / York St (City Hall) | Casual Member |
1 | 810582 | 2017-03-01 05:02 (UTC) | 2017-03-01 05:50 (UTC) | 2903 | 7202.0 | Queen St W / York St (City Hall) | 7202.0 | Queen St W / York St (City Hall) | Casual Member |
2 | 810583 | 2017-03-01 05:03 (UTC) | 2017-03-01 05:17 (UTC) | 865 | 7000.0 | Fort York Blvd / Capreol Crt | 7148.0 | King St W / Joe Shuster Way | Annual Member |
3 | 810584 | 2017-03-01 05:03 (UTC) | 2017-03-01 05:15 (UTC) | 682 | 7188.0 | Exhibition GO (Atlantic Ave) | 7176.0 | Bathurst St / Fort York Blvd | Annual Member |
4 | 810585 | 2017-03-01 05:04 (UTC) | 2017-03-01 05:13 (UTC) | 497 | 7148.0 | King St W / Joe Shuster Way | 7160.0 | King St W / Tecumseth St | Annual Member |
Column names in the 2017 and 2018 files differ from those used from 2019 onward, so rename them to the 2019-onward convention.
Column Mapping:
Old Name | New Name |
---|---|
trip_id | Trip Id |
trip_start_time | Start Time |
trip_stop_time | End Time |
trip_duration_seconds | Trip Duration |
from_station_id | Start Station Id |
from_station_name | Start Station Name |
to_station_id | End Station Id |
to_station_name | End Station Name |
user_type | User Type |
# Rename the columns to the style of the 2019/2020 files
trips_data_17_18 = trips_data_17_18.rename(columns={
'trip_id':'Trip Id',
'trip_start_time':'Start Time',
'trip_stop_time':'End Time',
'trip_duration_seconds':'Trip Duration',
'from_station_id':'Start Station Id',
'from_station_name':'Start Station Name',
'to_station_id':'End Station Id',
'to_station_name':'End Station Name',
'user_type':'User Type'})
# Preview dataframe
trips_data_17_18.head()
Trip Id | Start Time | End Time | Trip Duration | Start Station Id | Start Station Name | End Station Id | End Station Name | User Type | |
---|---|---|---|---|---|---|---|---|---|
0 | 810581 | 2017-03-01 05:01 (UTC) | 2017-03-01 05:50 (UTC) | 2934 | 7202.0 | Queen St W / York St (City Hall) | 7202.0 | Queen St W / York St (City Hall) | Casual Member |
1 | 810582 | 2017-03-01 05:02 (UTC) | 2017-03-01 05:50 (UTC) | 2903 | 7202.0 | Queen St W / York St (City Hall) | 7202.0 | Queen St W / York St (City Hall) | Casual Member |
2 | 810583 | 2017-03-01 05:03 (UTC) | 2017-03-01 05:17 (UTC) | 865 | 7000.0 | Fort York Blvd / Capreol Crt | 7148.0 | King St W / Joe Shuster Way | Annual Member |
3 | 810584 | 2017-03-01 05:03 (UTC) | 2017-03-01 05:15 (UTC) | 682 | 7188.0 | Exhibition GO (Atlantic Ave) | 7176.0 | Bathurst St / Fort York Blvd | Annual Member |
4 | 810585 | 2017-03-01 05:04 (UTC) | 2017-03-01 05:13 (UTC) | 497 | 7148.0 | King St W / Joe Shuster Way | 7160.0 | King St W / Tecumseth St | Annual Member |
Times are formatted as '%Y-%m-%d %H:%M (%Z)' for these two years.
# Convert start and end times to datetime objects and convert from UTC to EST
trips_data_17_18['Start Time'] = pd.to_datetime(trips_data_17_18['Start Time'], format='%Y-%m-%d %H:%M (%Z)').dt.tz_convert('EST')
trips_data_17_18['End Time'] = pd.to_datetime(trips_data_17_18['End Time'], format='%Y-%m-%d %H:%M (%Z)').dt.tz_convert('EST')
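A single-timestamp check of the parse-and-convert step; note that 'EST' is a fixed UTC-5 offset with no daylight saving adjustment.
# One illustrative timestamp: parsed as tz-aware UTC, then shifted to UTC-5
ts = pd.to_datetime('2017-03-01 05:01 (UTC)', format='%Y-%m-%d %H:%M (%Z)')
ts.tz_convert('EST')  # Timestamp('2017-03-01 00:01:00-0500', tz='EST')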
# Preview dataframe
trips_data_17_18.head()
Trip Id | Start Time | End Time | Trip Duration | Start Station Id | Start Station Name | End Station Id | End Station Name | User Type | |
---|---|---|---|---|---|---|---|---|---|
0 | 810581 | 2017-03-01 00:01:00-05:00 | 2017-03-01 00:50:00-05:00 | 2934 | 7202.0 | Queen St W / York St (City Hall) | 7202.0 | Queen St W / York St (City Hall) | Casual Member |
1 | 810582 | 2017-03-01 00:02:00-05:00 | 2017-03-01 00:50:00-05:00 | 2903 | 7202.0 | Queen St W / York St (City Hall) | 7202.0 | Queen St W / York St (City Hall) | Casual Member |
2 | 810583 | 2017-03-01 00:03:00-05:00 | 2017-03-01 00:17:00-05:00 | 865 | 7000.0 | Fort York Blvd / Capreol Crt | 7148.0 | King St W / Joe Shuster Way | Annual Member |
3 | 810584 | 2017-03-01 00:03:00-05:00 | 2017-03-01 00:15:00-05:00 | 682 | 7188.0 | Exhibition GO (Atlantic Ave) | 7176.0 | Bathurst St / Fort York Blvd | Annual Member |
4 | 810585 | 2017-03-01 00:04:00-05:00 | 2017-03-01 00:13:00-05:00 | 497 | 7148.0 | King St W / Joe Shuster Way | 7160.0 | King St W / Tecumseth St | Annual Member |
Station IDs in some files were stored as floats (e.g. 7202.0), so cast the station ID columns to the nullable Int64 integer type.
# Cast station id columns to nullable Int64
trips_data_17_18 = trips_data_17_18.astype({'Start Station Id':'Int64', 'End Station Id':'Int64'})
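A quick illustration (toy series, not from the dataset) of why the nullable dtype is needed: plain int64 cannot hold missing values, while Int64 stores integers alongside <NA>.
# A float series with a missing value, as produced by reading mixed files
s = pd.Series([7202.0, None])
# s.astype('int64') would raise because of the missing value
s.astype('Int64')
# 0    7202
# 1    <NA>
# dtype: Int64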
# Preview dataframe
trips_data_17_18.head()
Trip Id | Start Time | End Time | Trip Duration | Start Station Id | Start Station Name | End Station Id | End Station Name | User Type | |
---|---|---|---|---|---|---|---|---|---|
0 | 810581 | 2017-03-01 00:01:00-05:00 | 2017-03-01 00:50:00-05:00 | 2934 | 7202 | Queen St W / York St (City Hall) | 7202 | Queen St W / York St (City Hall) | Casual Member |
1 | 810582 | 2017-03-01 00:02:00-05:00 | 2017-03-01 00:50:00-05:00 | 2903 | 7202 | Queen St W / York St (City Hall) | 7202 | Queen St W / York St (City Hall) | Casual Member |
2 | 810583 | 2017-03-01 00:03:00-05:00 | 2017-03-01 00:17:00-05:00 | 865 | 7000 | Fort York Blvd / Capreol Crt | 7148 | King St W / Joe Shuster Way | Annual Member |
3 | 810584 | 2017-03-01 00:03:00-05:00 | 2017-03-01 00:15:00-05:00 | 682 | 7188 | Exhibition GO (Atlantic Ave) | 7176 | Bathurst St / Fort York Blvd | Annual Member |
4 | 810585 | 2017-03-01 00:04:00-05:00 | 2017-03-01 00:13:00-05:00 | 497 | 7148 | King St W / Joe Shuster Way | 7160 | King St W / Tecumseth St | Annual Member |
# Load and concat 2019/2020 bike share files
trips_data = pd.concat([pd.read_csv('project_data/' + file) for file in os.listdir('project_data') if 'bike_share_2019' in file or 'bike_share_2020' in file])
# Normalize the duration column name (some quarterly files use 'Trip  Duration' with a double space)
trips_data = trips_data.rename(columns={'Trip  Duration': 'Trip Duration'})
# Preview dataframe
trips_data.head()
Trip Id | Subscription Id | Trip Duration | Start Station Id | Start Time | Start Station Name | End Station Id | End Time | End Station Name | Bike Id | User Type | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 5960472 | 321707 | 915 | 7132 | 01/08/2019 00:00:00 (EST) | Kendal Ave / Spadina Rd | 7245.0 | 01/08/2019 00:15:00 (EST) | Tecumseth St / Queen St W - SMART | 2876 | Annual Member |
1 | 5960475 | 253176 | 331 | 7019 | 01/08/2019 00:01:00 (EST) | Temperance St. Station | 7022.0 | 01/08/2019 00:07:00 (EST) | Simcoe St / Queen St W | 907 | Annual Member |
2 | 5960476 | 415514 | 206 | 7256 | 01/08/2019 00:01:00 (EST) | Vanauley St / Queen St W - SMART | 7260.0 | 01/08/2019 00:05:00 (EST) | Spadina Ave / Adelaide St W | 4442 | Casual Member |
3 | 5960474 | 418474 | 778 | 7382 | 01/08/2019 00:01:00 (EST) | Simcoe St / Adelaide St W | 7110.0 | 01/08/2019 00:14:00 (EST) | Queen St E / Berkeley St | 1646 | Casual Member |
4 | 5960477 | 418253 | 1704 | 7267 | 01/08/2019 00:02:00 (EST) | Dundas St E / Pembroke St - SMART | 7330.0 | 01/08/2019 00:31:00 (EST) | Lee Ave / Queen St E | 2274 | Casual Member |
Times are formatted as '%d/%m/%Y %H:%M:%S (%Z)' for these two years.
# Convert start and end times to datetime objects localized to EST
trips_data['Start Time'] = pd.to_datetime(trips_data['Start Time'], format='%d/%m/%Y %H:%M:%S (%Z)')
trips_data['End Time'] = pd.to_datetime(trips_data['End Time'], format='%d/%m/%Y %H:%M:%S (%Z)')
# Preview dataframe
trips_data.head()
Trip Id | Subscription Id | Trip Duration | Start Station Id | Start Time | Start Station Name | End Station Id | End Time | End Station Name | Bike Id | User Type | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 5960472 | 321707 | 915 | 7132 | 2019-08-01 00:00:00-05:00 | Kendal Ave / Spadina Rd | 7245.0 | 2019-08-01 00:15:00-05:00 | Tecumseth St / Queen St W - SMART | 2876 | Annual Member |
1 | 5960475 | 253176 | 331 | 7019 | 2019-08-01 00:01:00-05:00 | Temperance St. Station | 7022.0 | 2019-08-01 00:07:00-05:00 | Simcoe St / Queen St W | 907 | Annual Member |
2 | 5960476 | 415514 | 206 | 7256 | 2019-08-01 00:01:00-05:00 | Vanauley St / Queen St W - SMART | 7260.0 | 2019-08-01 00:05:00-05:00 | Spadina Ave / Adelaide St W | 4442 | Casual Member |
3 | 5960474 | 418474 | 778 | 7382 | 2019-08-01 00:01:00-05:00 | Simcoe St / Adelaide St W | 7110.0 | 2019-08-01 00:14:00-05:00 | Queen St E / Berkeley St | 1646 | Casual Member |
4 | 5960477 | 418253 | 1704 | 7267 | 2019-08-01 00:02:00-05:00 | Dundas St E / Pembroke St - SMART | 7330.0 | 2019-08-01 00:31:00-05:00 | Lee Ave / Queen St E | 2274 | Casual Member |
As with the 2017/2018 data, station IDs were sometimes stored as floats, so cast the station ID columns to Int64 here as well.
# Cast station id columns to nullable Int64
trips_data = trips_data.astype({'Start Station Id':'Int64', 'End Station Id':'Int64'})
# Preview dataframe
trips_data.head()
Trip Id | Subscription Id | Trip Duration | Start Station Id | Start Time | Start Station Name | End Station Id | End Time | End Station Name | Bike Id | User Type | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 5960472 | 321707 | 915 | 7132 | 2019-08-01 00:00:00-05:00 | Kendal Ave / Spadina Rd | 7245 | 2019-08-01 00:15:00-05:00 | Tecumseth St / Queen St W - SMART | 2876 | Annual Member |
1 | 5960475 | 253176 | 331 | 7019 | 2019-08-01 00:01:00-05:00 | Temperance St. Station | 7022 | 2019-08-01 00:07:00-05:00 | Simcoe St / Queen St W | 907 | Annual Member |
2 | 5960476 | 415514 | 206 | 7256 | 2019-08-01 00:01:00-05:00 | Vanauley St / Queen St W - SMART | 7260 | 2019-08-01 00:05:00-05:00 | Spadina Ave / Adelaide St W | 4442 | Casual Member |
3 | 5960474 | 418474 | 778 | 7382 | 2019-08-01 00:01:00-05:00 | Simcoe St / Adelaide St W | 7110 | 2019-08-01 00:14:00-05:00 | Queen St E / Berkeley St | 1646 | Casual Member |
4 | 5960477 | 418253 | 1704 | 7267 | 2019-08-01 00:02:00-05:00 | Dundas St E / Pembroke St - SMART | 7330 | 2019-08-01 00:31:00-05:00 | Lee Ave / Queen St E | 2274 | Casual Member |
Combine the 2017/2018 dataset with the 2019/2020 dataset. The older data does not have 'Subscription Id' or 'Bike Id', so those columns will contain NaN for the 2017/2018 years.
Delete the 2017/2018 dataframe afterward to save memory.
# Concatenate dataframes into trips_data
trips_data = pd.concat([trips_data_17_18, trips_data]).sort_values(by=['Trip Id'])
del trips_data_17_18
# Preview dataframe
trips_data.head()
Trip Id | Start Time | End Time | Trip Duration | Start Station Id | Start Station Name | End Station Id | End Station Name | User Type | Subscription Id | Bike Id | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 712382 | 2016-12-31 20:00:00-05:00 | 2016-12-31 20:03:00-05:00 | 223 | 7051 | Wellesley St E / Yonge St Green P | 7089 | Church St / Wood St | Annual Member | NaN | NaN |
1 | 712383 | 2016-12-31 20:00:00-05:00 | 2016-12-31 20:05:00-05:00 | 279 | 7143 | Kendal Ave / Bernard Ave | 7154 | Bathurst Subway Station | Annual Member | NaN | NaN |
2 | 712384 | 2016-12-31 20:05:00-05:00 | 2016-12-31 20:29:00-05:00 | 1394 | 7113 | Parliament St / Aberdeen Ave | 7199 | College St W / Markham St | Annual Member | NaN | NaN |
3 | 712385 | 2016-12-31 20:07:00-05:00 | 2016-12-31 20:21:00-05:00 | 826 | 7077 | College Park South | 7010 | King St W / Spadina Ave | Annual Member | NaN | NaN |
4 | 712386 | 2016-12-31 20:08:00-05:00 | 2016-12-31 20:12:00-05:00 | 279 | 7079 | McGill St / Church St | 7047 | University Ave / Gerrard St W | Annual Member | NaN | NaN |
Check for rows with invalid or missing data, and delete any duplicate trips.
# Drop duplicates
trips_data.drop_duplicates(subset=['Trip Id'], keep='last', inplace=True)
# Check for number of missing values
trips_data.isnull().sum(axis=0).to_frame('count')
count | |
---|---|
Trip Id | 0 |
Start Time | 0 |
End Time | 1 |
Trip Duration | 0 |
Start Station Id | 1026893 |
Start Station Name | 164 |
End Station Id | 1028159 |
End Station Name | 1409 |
User Type | 0 |
Subscription Id | 3415324 |
Bike Id | 3415324 |
Rows with missing values in the required columns (everything except 'Subscription Id' and 'Bike Id') are removed, as these are most likely invalid trips. However, some files had 'Station Name' but no 'Station Id', so rows are only removed if they lack both the start and end station, regardless of whether it is given as 'Station Id' or 'Station Name'.
Additionally, any trip shorter than 60 seconds is most likely invalid and is removed as well.
Lastly, outliers are removed: any 'Trip Duration' outside 1.5 × IQR from the 1st and 3rd quartiles is considered an outlier.
# Remove rows with invalid NaN values, ignore Station info, Sub ID, Bike ID
trips_data.dropna(axis=0, how='any', subset=['Trip Id', 'Start Time', 'End Time', 'Trip Duration'], inplace=True)
trips_data.dropna(axis=0, how='all', subset=['Start Station Id', 'Start Station Name'], inplace=True)
trips_data.dropna(axis=0, how='all', subset=['End Station Id', 'End Station Name'], inplace=True)
# Remove trips with <60 second duration
trips_data = trips_data[trips_data['Trip Duration'] >= 60]
# Remove outliers
trip_dur_Q1 = trips_data['Trip Duration'].quantile(0.25)
trip_dur_Q3 = trips_data['Trip Duration'].quantile(0.75)
trip_dur_IQR = trip_dur_Q3 - trip_dur_Q1
trips_data = trips_data[((trips_data['Trip Duration'] >= (trip_dur_Q1 - 1.5 * trip_dur_IQR)))
& ((trips_data['Trip Duration'] <= (trip_dur_Q3 + 1.5 * trip_dur_IQR)))]
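As a quick sanity check, print the trip duration range that survives the IQR fence:
# Print the surviving trip duration range implied by the fence
print(f'Keeping trips between {trip_dur_Q1 - 1.5 * trip_dur_IQR:.0f} and '
      f'{trip_dur_Q3 + 1.5 * trip_dur_IQR:.0f} seconds')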
# Check for number of missing values
trips_data.isnull().sum(axis=0).to_frame('count')
count | |
---|---|
Trip Id | 0 |
Start Time | 0 |
End Time | 0 |
Trip Duration | 0 |
Start Station Id | 944733 |
Start Station Name | 156 |
End Station Id | 944733 |
End Station Name | 133 |
User Type | 0 |
Subscription Id | 3232427 |
Bike Id | 3232427 |
Check the distribution of trip durations.
# Create new column with trip duration in mins
trips_data['Trip Duration (mins)'] = trips_data['Trip Duration'].divide(60)
# Plot histogram of trip durations
fig, ax = plt.subplots(figsize=(10, 6))
sns.histplot(data=trips_data, x='Trip Duration (mins)', kde=True, binwidth=1, stat='count', ax=ax)
ax.set_title('Distribution of Trip Durations')
ax.set_xlabel('Trip Duration, Minutes')
ax.set_ylabel('Count')
plt.show()
# Save dataframe
#trips_data.to_pickle('trips_data_clean_ish.pkl')
As mentioned earlier, some rides are missing 'Station Id' or 'Station Name' values in the Start and End columns.
Using the bikeshare_stations.csv file as the reference, assign the proper 'Station Id' and replace every 'Station Name' with the matching name from the reference.
rapidfuzz (a faster, C++-optimized drop-in replacement for fuzzywuzzy) will be used to fuzzy-match the station names.
# Read data from saved pickle file
#trips_data = pd.read_pickle('trips_data_clean_ish.pkl')
# Import fuzzywuzzy. SLOW!!
# from fuzzywuzzy import process, fuzz
# Import rapidfuzz (C++ optimized fuzzywuzzy)
# See https://github.com/maxbachmann/rapidfuzz
from rapidfuzz import process, fuzz
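A quick illustration of the two rapidfuzz helpers used below, with strings taken from the station data shown earlier (the score values are approximate):
# token_set_ratio is insensitive to token order, which suits cross-street names
fuzz.token_set_ratio('Queen St W / York St', 'York St / Queen St W')  # 100.0
# With a dict of choices, process.extractOne returns (best name, score, key)
process.extractOne('Fort York Blvd / Capreol Crt',
                   {7000: 'Fort York Blvd / Capreol Ct'})
# -> ('Fort York Blvd / Capreol Ct', ~96, 7000)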
Import the bikeshare_stations.csv file to get the correct station names and locations.
# Read stations data from file
stations = pd.read_csv('project_data/' + 'bikeshare_stations.csv')
# Collapse double spaces in station names to a single space
stations['Station Name'] = stations['Station Name'].str.replace('  ', ' ', regex=False)
# Create a dictionary with Station Id as key and Station Name as value
station_names = stations.set_index('Station Id')['Station Name'].to_dict()
# Create the inverse dictionary, station name: station id
station_ids = {name: sid for sid, name in station_names.items()}
stations.head()
Station Id | Station Name | lat | lon | capacity | |
---|---|---|---|---|---|
0 | 7000 | Fort York Blvd / Capreol Ct | 43.639832 | -79.395954 | 35 |
1 | 7001 | Lower Jarvis St / The Esplanade | 43.647830 | -79.370698 | 15 |
2 | 7002 | St. George St / Bloor St W | 43.667333 | -79.399429 | 19 |
3 | 7003 | Madison Ave / Bloor St W | 43.667158 | -79.402761 | 15 |
4 | 7004 | University Ave / Elm St | 43.656518 | -79.389099 | 11 |
Some of the station IDs in trips_data do not exist in bikeshare_stations.csv.
Stations will be matched on 'Station Id' first; if the ID does not exist, rapidfuzz will be used to find the closest name match.
Below, we check whether the Station Id / Station Name pairs in trips_data line up with the correct pairs in bikeshare_stations.csv.
# Fetch the name corresponding to an id; if the id is missing, keep the existing name
def id_to_name(station_id, name, choices):
    if str(station_id) not in choices:
        return name
    return choices[str(station_id)]
no_ids = trips_data[~trips_data['Start Station Id'].isin(station_names)][['Start Station Name']]
has_ids = trips_data[trips_data['Start Station Id'].isin(station_names)][['Start Station Id', 'Start Station Name']]
has_ids['correct_name'] = np.vectorize(id_to_name)(has_ids['Start Station Id'], has_ids['Start Station Name'],
{str(key): value for key, value in station_names.items()})
has_ids.dropna(axis=0, how='any', inplace=True)
has_ids['fuzzy_score'] = np.vectorize(fuzz.token_set_ratio)(has_ids['Start Station Name'], has_ids['correct_name'])
has_ids.describe()
Start Station Id | fuzzy_score | |
---|---|---|
count | 7.095064e+06 | 7.095064e+06 |
mean | 7.184661e+03 | 9.886202e+01 |
std | 1.465230e+02 | 8.162003e+00 |
min | 7.000000e+03 | 2.727273e+01 |
25% | 7.054000e+03 | 1.000000e+02 |
50% | 7.163000e+03 | 1.000000e+02 |
75% | 7.278000e+03 | 1.000000e+02 |
max | 7.658000e+03 | 1.000000e+02 |
We can see above that the minimum fuzzy score is 27 while the maximum is 100, meaning there are rides where the given Station Id does not carry the correct Station Name.
For example, one ride starts at station ID 7051 with the name "Wellesley St E / Yonge St Green P". However, in the bikeshare_stations.csv file, station ID 7051 is at "Mutual St / Shuter St", and manually checking the given lon/lat confirms that the true location is "Mutual St / Shuter St". In fact, there is no "Wellesley St E / Yonge St Green P" station in bikeshare_stations.csv at all. Without further information, we will assume that the station IDs are more trustworthy than the station names given in trips_data.
However, many of the rides had IDs that did not exist in bikeshare_stations.csv, and after manual checking, the names attached to those stations could not be found in bikeshare_stations.csv either. These rows will be dropped, as there is no geographic information for stations missing from the file.
# Drop rows where the Start Station Id and End Station Id are not in the database
trips_data = trips_data[trips_data['Start Station Id'].isin(station_names) | trips_data['Start Station Id'].isna()]
trips_data = trips_data[trips_data['End Station Id'].isin(station_names) | trips_data['End Station Id'].isna()]
# Create boolean selectors for each case
has_id_no_name_start = trips_data['Start Station Name'].isna() & ~trips_data['Start Station Id'].isna()
has_id_no_name_end = trips_data['End Station Name'].isna() & ~trips_data['End Station Id'].isna()
# Fill in station names where the ride has an id, but no name
trips_data.loc[has_id_no_name_start, 'Start Station Name'] = [station_names[key]
for key in trips_data.loc[has_id_no_name_start, 'Start Station Id']]
trips_data.loc[has_id_no_name_end, 'End Station Name'] = [station_names[key]
for key in trips_data.loc[has_id_no_name_end, 'End Station Id']]
trips_data[['Start Station Name', 'End Station Name']].isnull().sum(axis=0).to_frame()
0 | |
---|---|
Start Station Name | 0 |
End Station Name | 0 |
Now there are no more missing values in the name columns of trips_data.
Using the station names, find the corresponding station ID for every row.
# Returns the id corresponding to a name, if available, otherwise placeholder 0000
def name_to_id(name, stations, ids):
    # If the name is already in the bikeshare_stations.csv dict, look it up directly
    if name in ids:
        return ids[name]
    # Otherwise fuzzy-match the name against the reference station names
    match_name, ratio, match_id = process.extractOne(name, stations)
    if ratio > 85:
        return match_id
    # Placeholder id 0000, which will be dropped later
    # (np.vectorize does not like np.nan values)
    return 0000
# May take 500 seconds or more for dataset size of 8 million+ rides
# Note: if using fuzzywuzzy instead of rapidfuzz, may take even longer
trips_data['Start Station Id'] = np.vectorize(name_to_id)(trips_data['Start Station Name'], station_names, station_ids)
# May take 500 seconds or more for dataset size of 8 million+ rides
# Note: if using fuzzywuzzy instead of rapidfuzz, may take even longer
trips_data['End Station Id'] = np.vectorize(name_to_id)(trips_data['End Station Name'], station_names, station_ids)
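Because most station names repeat many times across the 8 million+ rides, an equivalent but faster alternative (a sketch using the same helpers, shown here for the start station) is to resolve each distinct name once and broadcast the result with .map():
# Sketch: fuzzy-match each distinct name once, then map the results back
unique_names = trips_data['Start Station Name'].dropna().unique()
name_id_map = {name: name_to_id(name, station_names, station_ids)
               for name in unique_names}
trips_data['Start Station Id'] = trips_data['Start Station Name'].map(name_id_map)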
# Drop any rides with ID 0000
trips_data = trips_data[(trips_data['Start Station Id'] != 0000) & (trips_data['End Station Id'] != 0000)]
# Replace all names with the name corresponding to the Station ID
trips_data['Start Station Name'] = [station_names[key] for key in trips_data['Start Station Id']]
trips_data['End Station Name'] = [station_names[key] for key in trips_data['End Station Id']]
Checking for null values shows that the station columns are now complete.
trips_data.isnull().sum(axis=0).to_frame()
0 | |
---|---|
Trip Id | 0 |
Start Time | 0 |
End Time | 0 |
Trip Duration | 0 |
Start Station Id | 0 |
Start Station Name | 0 |
End Station Id | 0 |
End Station Name | 0 |
User Type | 0 |
Subscription Id | 3232301 |
Bike Id | 3232301 |
Trip Duration (mins) | 0 |
Save trips_data to pickle format as a save point.
# Save data frame to pickle format
#trips_data.to_pickle('trips_data_names_corrected.pkl')
Now, add the geographic information for each bikeshare station into trips_data.
# Import geopandas
import geopandas as gpd
from shapely.geometry import MultiPoint
# Merge lat and lon from bikeshare_stations.csv to trips_data
stations_loc = stations.set_index('Station Id')[['lon', 'lat']].to_dict('index')
stations_loc = {k: (v['lon'], v['lat']) for k, v in stations_loc.items()}
trips_data['Start Point'] = [stations_loc[x] for x in trips_data['Start Station Id']]
trips_data['End Point'] = [stations_loc[x] for x in trips_data['End Station Id']]
trips_data['geometry'] = trips_data.apply(lambda row: MultiPoint([row['Start Point'], row['End Point']]), axis=1)
trips_data = gpd.GeoDataFrame(trips_data)
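As a usage example (not part of the cleaning pipeline), the stored (lon, lat) tuples can give each trip's straight-line displacement; haversine_km is a hypothetical helper, not from the original notebook:
# Hypothetical helper: great-circle distance in km between two (lon, lat) tuples
def haversine_km(start, end):
    lon1, lat1 = np.radians(start)
    lon2, lat2 = np.radians(end)
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * np.arcsin(np.sqrt(a))
# Straight-line displacement for the first few rides
trips_data.head().apply(lambda row: haversine_km(row['Start Point'], row['End Point']), axis=1)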
# Save to pickle
#trips_data.to_pickle('trips_data_with_geo.pkl')
# Load data from pickle file
#trips_data = pd.read_pickle('trips_data_with_geo.pkl')
Weather data from the provided weather files will be added to trips_data.
# Build a 'YYYYMM' sort key from the month and year in each weather filename
file_list = {file.split('-')[1][:4] + file.split('-')[0][-2:]: file
             for file in os.listdir('project_data') if 'en_climate_hourly' in file}
# Sort files by year and month
file_list = {k: v for k, v in sorted(file_list.items())}
# Read the weather files into a single dataframe
weather_data = pd.concat([pd.read_csv('project_data/' + value)
                          for key, value in file_list.items()])
# Index by timestamp and localize the index to EST
weather_data = weather_data.set_index(pd.DatetimeIndex(
    weather_data['Date/Time'])).drop(columns='Date/Time')
weather_data.index = weather_data.index.tz_localize('EST')
Verify that the weather data and temperatures look valid.
# Plot hourly temperature to sanity-check the weather data
fig, ax = plt.subplots(figsize=(8, 4))
sns.lineplot(x=weather_data.index, y=weather_data['Temp (°C)'], ax=ax)
ax.set_title('Temperature in the Toronto City Centre for 2017-2020')
ax.set_xlabel('Date-Time')
ax.set_ylabel('Temperature °C')
plt.show()
The start time of each ride will be rounded to the nearest hour and then merged with weather_data, since the weather observations are hourly.
trips_data['merge_time'] = trips_data['Start Time'].dt.round(freq='H')
trips_data = trips_data.merge(weather_data, left_on='merge_time', right_on='Date/Time')
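A quick note on the rounding behaviour (toy timestamps):
# dt.round('H') snaps to the nearest hour (00:29 -> 00:00, 00:31 -> 01:00),
# so each ride is paired with the closest hourly weather observation
pd.Series(pd.to_datetime(['2019-08-01 00:29', '2019-08-01 00:31'])).dt.round('H')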
The Weather column in weather_data is left as NaN when conditions are clear, so these NaNs will be replaced with 'Clear'.
trips_data['Weather'] = trips_data['Weather'].fillna('Clear')
trips_data
Trip Id | Start Time | End Time | Trip Duration | Start Station Id | Start Station Name | End Station Id | End Station Name | User Type | Subscription Id | ... | Wind Spd Flag | Visibility (km) | Visibility Flag | Stn Press (kPa) | Stn Press Flag | Hmdx | Hmdx Flag | Wind Chill | Wind Chill Flag | Weather | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 712431 | 2016-12-31 23:43:00-05:00 | 2016-12-31 23:51:00-05:00 | 494 | 7163 | Yonge St / Wood St | 7634 | University Ave / Gerrard St W (West Side) | Annual Member | NaN | ... | NaN | 16.1 | NaN | 99.81 | NaN | NaN | NaN | NaN | NaN | Clear |
1 | 712432 | 2016-12-31 23:43:00-05:00 | 2016-12-31 23:50:00-05:00 | 425 | 7163 | Yonge St / Wood St | 7634 | University Ave / Gerrard St W (West Side) | Annual Member | NaN | ... | NaN | 16.1 | NaN | 99.81 | NaN | NaN | NaN | NaN | NaN | Clear |
2 | 712433 | 2016-12-31 23:44:00-05:00 | 2016-12-31 23:50:00-05:00 | 388 | 7163 | Yonge St / Wood St | 7634 | University Ave / Gerrard St W (West Side) | Annual Member | NaN | ... | NaN | 16.1 | NaN | 99.81 | NaN | NaN | NaN | NaN | NaN | Clear |
3 | 712435 | 2016-12-31 23:48:00-05:00 | 2017-01-01 00:02:00-05:00 | 851 | 7284 | University Ave / King St W - SMART | 7046 | Niagara St / Richmond St W | Annual Member | NaN | ... | NaN | 16.1 | NaN | 99.81 | NaN | NaN | NaN | NaN | NaN | Clear |
4 | 712436 | 2016-12-31 23:48:00-05:00 | 2017-01-01 00:00:00-05:00 | 693 | 7070 | 25 York St – Union Station South | 7172 | Strachan Ave / Princes' Blvd | Annual Member | NaN | ... | NaN | 16.1 | NaN | 99.81 | NaN | NaN | NaN | NaN | NaN | Clear |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
8039361 | 10293869 | 2020-10-31 23:57:00-05:00 | 2020-11-01 00:05:00-05:00 | 497 | 7302 | Winona Dr / Davenport Rd - SMART | 7144 | Palmerston Ave / Vermont Ave - SMART | Annual Member | 534405.0 | ... | NaN | 16.1 | NaN | 100.70 | NaN | NaN | NaN | NaN | NaN | Clear |
8039362 | 10293870 | 2020-10-31 23:57:00-05:00 | 2020-11-01 00:02:00-05:00 | 272 | 7129 | Davenport Rd / Avenue Rd | 7130 | Davenport Rd / Bedford Rd | Annual Member | 630937.0 | ... | NaN | 16.1 | NaN | 100.70 | NaN | NaN | NaN | NaN | NaN | Clear |
8039363 | 10293872 | 2020-10-31 23:58:00-05:00 | 2020-11-01 00:09:00-05:00 | 646 | 7156 | Salem Ave / Bloor St W - SMART | 7232 | Queen St W / Fuller Ave - SMART | Annual Member | 539030.0 | ... | NaN | 16.1 | NaN | 100.70 | NaN | NaN | NaN | NaN | NaN | Clear |
8039364 | 10293874 | 2020-10-31 23:59:00-05:00 | 2020-11-01 00:06:00-05:00 | 462 | 7026 | Bay St / St. Joseph St | 7453 | Jarvis St / Maitland Pl | Casual Member | 863192.0 | ... | NaN | 16.1 | NaN | 100.70 | NaN | NaN | NaN | NaN | NaN | Clear |
8039365 | 10293875 | 2020-10-31 23:59:00-05:00 | 2020-11-01 00:11:00-05:00 | 726 | 7339 | Carlaw Ave / Strathcona Ave | 7311 | Sherbourne St / Isabella St | Annual Member | 524592.0 | ... | NaN | 16.1 | NaN | 100.70 | NaN | NaN | NaN | NaN | NaN | Clear |
8039366 rows × 43 columns
At this point, the dataframe should be ready for exploratory data analysis. Save it in pickle format, which preserves the GeoDataFrame structure and attributes.
# Export file
trips_data.to_pickle('trips_data_final.pkl')