CIV1498 - Introduction to Data Science

Team https://xkcd.com/1838/

Data Wrangling and Cleaning

This notebook will go through the data cleaning process for Toronto's bike share data from 2017 through 2020. Hourly weather data will also be combined with the bike share data.

Bike Share Trips Data

There are four years' worth of Toronto bike share data, from 2017 to 2020. The CSV files from 2017 and 2018 have a different column structure and naming convention from the 2019 and 2020 files.

The 2017 and 2018 CSV files will be updated to match the structure of the 2019-onward files.

Columns missing from the older 2017/2018 files: 'Subscription Id' and 'Bike Id'.

2017 and 2018 data

Loading files to dataframe

The 2017 and 2018 files have the same column names, but list them in a different order. Column order does not matter to pd.concat() as long as the names match.

The 2017 data used a different naming convention for user_type, so it is updated here before the 2018 data is appended.
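A minimal sketch of the load-and-combine step, using toy in-memory frames in place of the real CSVs; the exact user_type value mapping is an assumption, not confirmed by the source.

```python
import pandas as pd

# Toy frames standing in for the loaded 2017 and 2018 CSVs; the real data
# has more columns, but the point is that pd.concat aligns on column names
# even when the column order differs.
trips_2017 = pd.DataFrame({"trip_id": [1, 2], "user_type": ["Member", "Casual"]})
trips_2018 = pd.DataFrame({"user_type": ["Annual Member"], "trip_id": [3]})

# Update the 2017 naming convention before concatenating (assumed mapping).
trips_2017["user_type"] = trips_2017["user_type"].replace(
    {"Member": "Annual Member", "Casual": "Casual Member"}
)
trips_old = pd.concat([trips_2017, trips_2018], ignore_index=True)
```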

Rename Columns

Column names from 2017 and 2018 are different from their names in 2019 onward. Change column names to the 2019-onward convention.

Column Mapping:

| Old Name | New Name |
| --- | --- |
| trip_id | Trip Id |
| trip_start_time | Start Time |
| trip_stop_time | End Time |
| trip_duration_seconds | Trip Duration |
| from_station_id | Start Station Id |
| from_station_name | Start Station Name |
| to_station_id | End Station Id |
| to_station_name | End Station Name |
| user_type | User Type |
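The mapping above translates directly into a rename dictionary; a sketch with a toy frame standing in for the combined 2017/2018 data:

```python
import pandas as pd

# Mapping from the 2017/2018 column names to the 2019-onward convention.
COLUMN_MAP = {
    "trip_id": "Trip Id",
    "trip_start_time": "Start Time",
    "trip_stop_time": "End Time",
    "trip_duration_seconds": "Trip Duration",
    "from_station_id": "Start Station Id",
    "from_station_name": "Start Station Name",
    "to_station_id": "End Station Id",
    "to_station_name": "End Station Name",
    "user_type": "User Type",
}

# Toy frame with a subset of the old columns.
trips_old = pd.DataFrame({"trip_id": [1], "trip_duration_seconds": [300]})
trips_old = trips_old.rename(columns=COLUMN_MAP)
```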

Convert Time Fields to datetime

Times are formatted as '%Y-%m-%d %H:%M (%Z)' for these two years.
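The conversion can be sketched with pd.to_datetime and an explicit format string; the timestamps below are invented examples (the actual timezone token in the files may differ from 'UTC'):

```python
import pandas as pd

# Toy data mimicking the 2017/2018 time format, e.g. "2017-01-01 00:05 (UTC)".
trips_old = pd.DataFrame({"Start Time": ["2017-01-01 00:05 (UTC)"],
                          "End Time": ["2017-01-01 00:20 (UTC)"]})

fmt = "%Y-%m-%d %H:%M (%Z)"  # literal parentheses around the timezone token
for col in ["Start Time", "End Time"]:
    trips_old[col] = pd.to_datetime(trips_old[col], format=fmt)
```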

Fix Data to Match 2019-Onward Data

Station IDs in some files were stored as non-integer, so force station ID columns to integer type.
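One way to force the conversion, assuming the stray values arrive as floats or numeric strings; the nullable Int64 dtype is used here so the columns can still hold NaN after later concatenation:

```python
import pandas as pd

# Toy frame: station IDs arrive as floats (and sometimes strings) in some files.
trips_old = pd.DataFrame({"Start Station Id": [7051.0, "7052"],
                          "End Station Id": [7060.0, 7061.0]})

for col in ["Start Station Id", "End Station Id"]:
    # Round-trip through float first so numeric strings like "7052" also convert,
    # then cast to pandas' nullable integer type.
    trips_old[col] = trips_old[col].astype(float).astype("Int64")
```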

2019 and 2020 data

Loading Files to Dataframe

2019 and 2020 files have the same data structure.

Additionally, the 'Trip Duration' column name contains two spaces between the words. Remove the extra space.
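A sketch of the fix that collapses any run of whitespace in column names to a single space, using a toy frame that reproduces the doubled space:

```python
import pandas as pd

# Toy frame reproducing the doubled space in 'Trip  Duration'.
trips_new = pd.DataFrame({"Trip  Duration": [412], "Trip Id": [10]})

# Collapse any run of whitespace in column names to a single space.
trips_new.columns = [" ".join(c.split()) for c in trips_new.columns]
```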

Convert Time Fields to datetime

Times are formatted as '%d/%m/%Y %H:%M:%S (%Z)' for these two years.
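The same conversion as for the older files, with the day-first format these years use; the timestamp below is an invented example (the timezone token in the real files may differ):

```python
import pandas as pd

# Toy data mimicking the 2019/2020 time format, e.g. "15/06/2019 08:30:00 (UTC)".
trips_new = pd.DataFrame({"Start Time": ["15/06/2019 08:30:00 (UTC)"]})
trips_new["Start Time"] = pd.to_datetime(trips_new["Start Time"],
                                         format="%d/%m/%Y %H:%M:%S (%Z)")
```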

Fix Data Type

Station IDs, like the 2017/2018 data, were sometimes stored as non-integer, so force station ID columns to integer type.

Concatenate Data and Clean

Combine the 2017/2018 dataset with the 2019/2020 dataset. The older data does not have 'Subscription Id' or 'Bike Id', so those columns will contain NaN for the 2017/2018 rows.

Delete the 2017/2018 dataframe to save memory.
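A sketch of the concatenation and cleanup, with toy frames; pd.concat fills the columns missing from the older rows with NaN automatically:

```python
import pandas as pd

# Toy frames: the old data lacks 'Subscription Id' and 'Bike Id'.
trips_old = pd.DataFrame({"Trip Id": [1], "Trip Duration": [300]})
trips_new = pd.DataFrame({"Trip Id": [2], "Trip Duration": [500],
                          "Bike Id": [42], "Subscription Id": [7]})

trips_data = pd.concat([trips_old, trips_new], ignore_index=True, sort=False)
del trips_old  # free memory once the combined frame exists
```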

Invalid Data

Check for rows with invalid or missing data, and delete any duplicate trips.

Rows with missing values in columns other than 'Subscription Id' and 'Bike Id' are to be removed, as these are most likely invalid trips. However, some files did not have 'Station Id' but did have 'Station Name', so rows are removed only if they lack start and end stations in both forms, whether given as 'Station Id' or 'Station Name'.

Additionally, any trips shorter than 60 seconds are most likely invalid as well.

Lastly, outlier trip durations will be removed. Any values more than 1.5×IQR beyond the 1st and 3rd quartiles will be considered outliers and removed.
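The three filters above can be sketched together on a toy frame (column names follow the 2019-onward convention; the data is invented):

```python
import pandas as pd

# Toy trips: a duplicate, a too-short trip, and a row with no start station.
trips_data = pd.DataFrame({
    "Trip Id": [1, 1, 2, 3, 4],
    "Trip Duration": [600, 600, 30, 700, 650],
    "Start Station Id": [7051, 7051, 7052, None, 7054],
    "Start Station Name": [None, None, "B St", None, "D St"],
    "End Station Id": [7060, 7060, 7061, 7062, 7063],
    "End Station Name": ["X", "X", "Y", "Z", "W"],
})

trips_data = trips_data.drop_duplicates()

# Keep a row only if it has a start and an end station, given either
# as an Id or as a Name.
has_start = (trips_data["Start Station Id"].notna()
             | trips_data["Start Station Name"].notna())
has_end = (trips_data["End Station Id"].notna()
           | trips_data["End Station Name"].notna())
trips_data = trips_data[has_start & has_end]

# Trips under 60 seconds are most likely invalid.
trips_data = trips_data[trips_data["Trip Duration"] >= 60]

# Remove durations outside 1.5*IQR of the 1st and 3rd quartiles.
q1, q3 = trips_data["Trip Duration"].quantile([0.25, 0.75])
iqr = q3 - q1
within = trips_data["Trip Duration"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
trips_data = trips_data[within]
```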

Plotting Distribution of Trip Durations

Check the distribution of the 'Trip Duration' values.

Clean Station Id and Station Name Columns

As mentioned earlier, some 'Station Id' values were missing, as were some 'Station Name' values, in both the Start and End columns.

Using the bikeshare_stations.csv file as the reference, assign proper 'Station Id' and replace all 'Station Name' with the correct station name that matches the reference.

fuzzywuzzy will be used to best match the station names.

Import the bikeshare_stations.csv file to get the correct station names and locations.

Some of the station IDs in trips_data do not exist in bikeshare_stations.csv.

Stations will be matched on Station Id first, then if the Station Id does not exist, fuzzywuzzy will be used to find the closest match.
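The name-matching fallback can be sketched as below. The notebook uses fuzzywuzzy; stdlib difflib's similarity ratio plays the same role here so the example stays self-contained, and the station names are a toy subset of the reference file:

```python
import difflib

# Toy subset of reference names from bikeshare_stations.csv.
reference_names = ["Mutual St / Shuter St", "Bay St / College St"]

def best_match(name, choices, cutoff=0.0):
    """Return the reference name most similar to `name`.

    Stand-in for fuzzywuzzy's scorer: difflib.SequenceMatcher computes a
    0-1 similarity ratio; fuzzywuzzy's 0-100 scores behave analogously.
    """
    scored = [(difflib.SequenceMatcher(None, name.lower(), c.lower()).ratio(), c)
              for c in choices]
    score, match = max(scored)
    return match if score >= cutoff else None
```

In the real pipeline this fallback runs only for rows whose Station Id is absent from the reference table.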

Below, we can check if the Station Ids/Station Names in trips_data match up with the correct Station Ids/Station Names in bikeshare_stations.csv.

We can see above that the minimum fuzzy score is 27 and the maximum is 100, meaning there are many rides where the given Station Id does not carry the correct Station Name.

For example, one ride starts at "Station 7051", and has the name as "Wellesley St E / Yonge St Green P." However, in the bikeshare_stations.csv file, station ID 7051 is at "Mutual St / Shuter St" and manually checking the given lon/lat reveals that the true location is at "Mutual St / Shuter St."

In fact, there is no "Wellesley St E / Yonge St Green P" station in bikeshare_stations.csv. Without further information, we will assume that the station IDs are more trustworthy than the station names given in trips_data.

However, many of the rides had IDs that do not exist in the bikeshare_stations.csv file. After manually checking, the names these stations carried could not be found in bikeshare_stations.csv either. These rows will be dropped, as there is no information on the geographic location of stations absent from the given bikeshare_stations.csv file.

Now there are no more missing values in the name columns of trips_data.

Using the station names, find the corresponding station ID for all rows.
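A minimal sketch of the name-to-ID lookup, with a toy reference table standing in for bikeshare_stations.csv (column names assumed):

```python
import pandas as pd

# Toy reference table standing in for bikeshare_stations.csv.
stations = pd.DataFrame({
    "Station Id": [7051, 7052],
    "Station Name": ["Mutual St / Shuter St", "Bay St / College St"],
})
name_to_id = dict(zip(stations["Station Name"], stations["Station Id"]))

trips_data = pd.DataFrame({"Start Station Name": ["Bay St / College St"],
                           "End Station Name": ["Mutual St / Shuter St"]})
trips_data["Start Station Id"] = trips_data["Start Station Name"].map(name_to_id)
trips_data["End Station Id"] = trips_data["End Station Name"].map(name_to_id)
```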

Checking for null values shows that the station information is now complete.

Save trips_data to pickle format as a save point.

Geographic/Location Information

Now, add in the geographic information for each bikeshare station into trips_data
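One way to attach coordinates to both ends of each trip is a pair of left joins against the station reference; the column names ('lat', 'lon') and values here are assumptions for illustration:

```python
import pandas as pd

# Toy station reference with coordinates (columns assumed).
stations = pd.DataFrame({"Station Id": [7051], "lat": [43.655], "lon": [-79.374]})

trips_data = pd.DataFrame({"Start Station Id": [7051], "End Station Id": [7051]})
for side in ["Start", "End"]:
    # Rename the reference columns so each join adds side-specific coordinates.
    side_ref = stations.rename(columns={"Station Id": f"{side} Station Id",
                                        "lat": f"{side} Lat",
                                        "lon": f"{side} Lon"})
    trips_data = trips_data.merge(side_ref, on=f"{side} Station Id", how="left")
```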

Add in Weather Data

Weather data from the provided weather files will be added to trips_data

Load Weather Data

Verify that the weather data and temperatures look valid.

Merge Weather Data

The start time of each ride will be rounded to the nearest hour and then merged with weather_data, since the weather observations are hourly.
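A sketch of the round-and-merge step on toy frames; the weather column names ('Date/Time', 'Temp (C)') are assumptions about the provided weather files:

```python
import pandas as pd

# Toy frames: one trip starting at 08:40, one hourly weather observation.
trips_data = pd.DataFrame({"Start Time": pd.to_datetime(["2019-06-15 08:40:00"])})
weather_data = pd.DataFrame({"Date/Time": pd.to_datetime(["2019-06-15 09:00:00"]),
                             "Temp (C)": [21.5]})

# Round each trip's start to the nearest hour, then left-join the hourly weather.
trips_data["Start Hour"] = trips_data["Start Time"].dt.round("60min")
trips_data = trips_data.merge(weather_data, left_on="Start Hour",
                              right_on="Date/Time", how="left")
```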

The Weather column in weather_data was left as NaN when conditions were clear. These NaN values will be replaced with 'Clear'.
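The replacement is a single fillna; toy data for illustration:

```python
import pandas as pd

# Toy weather frame: NaN in the Weather column means clear conditions.
weather_data = pd.DataFrame({"Weather": ["Rain", None, "Snow"]})
weather_data["Weather"] = weather_data["Weather"].fillna("Clear")
```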

Export dataframe

At this point, the dataframe should be ready for exploratory data analysis. Save the file as pickle format, which maintains the geodataframe structure and attributes.