CIV1498 - Introduction to Data Science

Team https://xkcd.com/1838/

Modelling

This notebook will go through the modelling process for Toronto's bike share data from 2017 through 2020.

Fewer than 500 rows of the weather data were missing temperature and humidity values. A row with a NaN value in these two columns is dropped, as linear regression cannot accept NaN values. This amount of missing data should be negligible, since the dataset covers almost 4 years' worth of hours (roughly 35,000).

Additionally, some rides in the bike_share_2017-12.csv file were from January 2017, even though the file is for rides in December. When converted from UTC to EST, these rides, which took place at about 1 am, rolled back to 2016. Since there are only a few rides from 2016, these are dropped.
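The rollover described above can be reproduced with a minimal sketch. It assumes EST is a fixed UTC-5 offset (no daylight saving handling) and uses made-up timestamps; the helper name `to_est` is hypothetical and is not taken from the notebook.

```python
from datetime import datetime, timedelta, timezone

# Assumption: EST is modelled as a fixed UTC-5 offset.
EST = timezone(timedelta(hours=-5), name="EST")


def to_est(utc_timestamp: datetime) -> datetime:
    """Convert a timezone-aware UTC timestamp to EST."""
    return utc_timestamp.astimezone(EST)


# Hypothetical ride start times recorded in UTC.
rides_utc = [
    datetime(2017, 1, 1, 1, 0, tzinfo=timezone.utc),   # about 1 am UTC
    datetime(2017, 1, 1, 12, 0, tzinfo=timezone.utc),  # midday UTC
]

rides_est = [to_est(t) for t in rides_utc]

# Keep only rides that still fall in 2017 or later after conversion.
kept = [t for t in rides_est if t.year >= 2017]
```

Here the 1 am UTC ride becomes 8 pm on December 31, 2016 in EST, so it is filtered out, while the midday ride is kept.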

Feature Engineering

This section contains a processing pipeline that creates the features that will be input to the model to predict hourly bikeshare demand.

The data will be split into 70% for training, 15% for validation, and 15% for testing. The dataset is large, with just over 8 million rides, so even the smaller validation and test portions contain plenty of data.
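One way such a 70/15/15 split could be made is sketched below. The text does not say whether the split is random or chronological, so a seeded random shuffle is assumed here; `split_indices` is a hypothetical helper.

```python
import random


def split_indices(n_rows, train_frac=0.70, val_frac=0.15, seed=0):
    """Shuffle row indices and cut them into train/validation/test lists."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # seeded for reproducibility

    n_train = round(n_rows * train_frac)
    n_val = round(n_rows * val_frac)

    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]  # whatever remains, about 15%
    return train, val, test


train_idx, val_idx, test_idx = split_indices(1000)
```

Because the three lists are slices of one shuffled list, no row can appear in more than one split.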

Processing

This function will be used to group the rides into hours.

The function above creates new features that may be used in the model. The features included were selected based on analysis in part 2 of this report.

Time features were encoded as cyclical features, as this may better represent the relationship between the hourly/weekly/monthly cycles and bikeshare usage. Cyclical features may perform better because, with a plain numeric encoding, the model would consider the times 00:00 and 23:59 to be 1439 minutes apart, when they are really only 1 minute apart.
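The idea can be illustrated with the common sine/cosine encoding, which places each time on a unit circle. This is a minimal sketch; `encode_cyclical` is a hypothetical helper, and the notebook's own encoding may differ in detail.

```python
import math


def encode_cyclical(value, period):
    """Map a periodic value onto the unit circle as a (sin, cos) pair."""
    angle = 2 * math.pi * value / period
    return math.sin(angle), math.cos(angle)


MINUTES_PER_DAY = 24 * 60

midnight = encode_cyclical(0, MINUTES_PER_DAY)       # 00:00
late_night = encode_cyclical(1439, MINUTES_PER_DAY)  # 23:59
noon = encode_cyclical(720, MINUTES_PER_DAY)         # 12:00

# Euclidean distance between the encoded points.
gap_wraparound = math.dist(midnight, late_night)  # tiny: the times are neighbours
gap_half_day = math.dist(midnight, noon)          # largest possible: opposite sides
```

In the encoded space 23:59 sits right next to 00:00, while noon is as far from midnight as any time can be.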

Below, the data are grouped after being split into train, validation, and test sets to avoid data leakage. Grouping before the split would change some aspects of the data, since the duration, wind speed, etc., are averaged.
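As a rough illustration of hourly grouping applied to a single split, the sketch below counts rides per hour and averages their duration. The function name, the tuple layout, and the sample rides are all assumptions made for the example.

```python
from collections import defaultdict
from datetime import datetime


def group_by_hour(rides):
    """Aggregate (start_time, duration_seconds) rides into hourly counts
    and mean durations."""
    buckets = defaultdict(list)
    for start_time, duration in rides:
        # Truncate the timestamp to the start of its hour.
        hour = start_time.replace(minute=0, second=0, microsecond=0)
        buckets[hour].append(duration)
    return {
        hour: {
            "rides": len(durations),
            "mean_duration": sum(durations) / len(durations),
        }
        for hour, durations in sorted(buckets.items())
    }


# Hypothetical rides from one split (for example, the training portion).
train_rides = [
    (datetime(2018, 6, 1, 8, 5), 600),
    (datetime(2018, 6, 1, 8, 40), 900),
    (datetime(2018, 6, 1, 9, 15), 300),
]
hourly_train = group_by_hour(train_rides)
```

Calling the same function separately on each split keeps the hourly averages of one split from being influenced by rides in another.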

Model Selection

This section will compare various models in order to find the most appropriate type for predicting hourly bikeshare demand.

The first model is a (naive) constant model, and is used as a baseline for comparison.

Root Mean Square Error (RMSE) will be used to score and compare different models.
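For reference, RMSE and a constant baseline can be sketched in a few lines. The sketch assumes the constant model predicts the mean of the training target, which the text does not state explicitly, and the ride counts are invented for illustration.

```python
import math


def rmse(actual, predicted):
    """Root mean square error between two equal-length sequences."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical hourly ride counts.
train_rides = [10, 50, 200, 140]
val_rides = [20, 180, 100]

# Constant model (assumption): always predict the training mean.
baseline = sum(train_rides) / len(train_rides)
baseline_rmse = rmse(val_rides, [baseline] * len(val_rides))
```

RMSE is expressed in the same units as the target (rides per hour), which makes scores from different models directly comparable.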

Multiple Linear Regression Model, Base Case

This simple base case uses the following features: ['month', 'dayofweek', 'hour', 'temp', 'humidity', 'holiday'].

These features were picked because, intuitively, they would affect whether or not someone goes for a bike ride. Features like the duration of the bike ride don't make sense as predictors, because the model is trying to predict whether or not there will be a bike ride in the first place.

The function create_features will be used to convert the data into a dataframe that can be more easily understood by the linear regression model, with scaled numerical features and dummy encoded features.
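The two transformations mentioned here, scaling and dummy encoding, can be sketched as follows. This is not the notebook's create_features; `standardize` and `one_hot` are hypothetical helpers, and standardization to zero mean and unit variance is an assumed choice of scaling.

```python
import statistics


def standardize(values):
    """Scale values to zero mean and unit (population) standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]


def one_hot(values, prefix):
    """Dummy-encode a categorical column into one 0/1 column per category."""
    categories = sorted(set(values))
    return {
        f"{prefix}_{c}": [1 if v == c else 0 for v in values]
        for c in categories
    }


# Hypothetical rows: temperature in degrees Celsius and hour of day.
temps = [-5.0, 10.0, 25.0]
hours = [8, 17, 8]

features = {"temp": standardize(temps), **one_hot(hours, "hour")}
```

The result has one scaled numeric column (temp) and one indicator column per observed hour (hour_8, hour_17).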

The RMSE is quite a bit higher than that of the naive model, indicating that there is something wrong with our assumptions here. It may be more appropriate to not fit an intercept, as hourly rides may be close to 0 during the middle of the night.

The improvement is negligible. Different features may need to be selected.

Cyclical Time

The models below use the cyclical time features rather than the categorical time features.

The model that did not have an intercept performed remarkably better than the one with an intercept. Either way, using cyclical time features improved the RMSE over the base case. The cyclical time features may be better at capturing the fluctuations in bikeshare demand given the hour of day and month.

Period/Yearly Patterns

As seen in the exploratory data analysis part of this report, there is an overall trend in the number of bikeshare users over the years. This trend may not be captured correctly by the model. It may be beneficial to split the data into years before fitting the model.

The resulting RMSE is the second lowest so far. Let us compare the models looked at so far in the chart below.

Other Models

This section will explore other models as well as some optimizations for feature selection.

Cyclical time features were not used here due to their nature: each cyclical time feature is encoded as 2 separate features/columns, which is incompatible with the feature selection/hyperparameter tuning done below.

Lasso Regression

The Lasso model can be used to look at the impact of the features and to see if any of the features should be excluded from the model. Lasso will assign a coefficient of 0 to features if they are deemed unnecessary to the model.

The RMSE is slightly better than the base case, which had an RMSE of about 183.6. Interestingly, the hyperparameters tuned with GridSearchCV resulted in a model with a worse RMSE than the default Lasso parameters.

The next cell will check if the Lasso model set any of the coefficients to 0, essentially removing features from the model.
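A check of this kind could look like the sketch below. The feature names and coefficient values are made up purely for illustration; with scikit-learn, the fitted values would come from the model's `coef_` attribute.

```python
def zeroed_features(feature_names, coefficients, tolerance=1e-12):
    """Return the names of features whose coefficient is effectively zero."""
    return [
        name
        for name, coef in zip(feature_names, coefficients)
        if abs(coef) <= tolerance
    ]


# Hypothetical feature names and coefficients (not results from this report).
names = ["temp", "humidity", "holiday", "hour_8"]
coefs = [35.2, -12.1, 0.0, 88.4]

removed = zeroed_features(names, coefs)
```

An empty list would mean that Lasso kept every feature in the model.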

None of the features have been removed from the model, though some coefficients are significantly higher than others. The coefficients for the dummy-encoded features align with the exploratory data analysis done in part 2. For example, the coefficients for the hour_ features line up with hourly demand throughout the day; higher coefficients are seen for "rush hour" times.

Overall, it seems that Lasso regression had little to no impact on the model.

Ridge Regression

This section tries L2 regularization (Ridge Regression) to see if there is any improvement in the RMSE.

The RMSE from ridge regression was slightly worse than that of the lasso regression model. This time, however, the hyperparameter tuning using GridSearchCV did result in a better RMSE, though the improvement is almost negligible.

The cell below checks the coefficients of the resulting ridge regression model.

The coefficients are very similar to the Lasso regression model's coefficients.

Again, this model showed little to no difference from the base linear model.

Polynomial Regression

A polynomial model with a higher degree may be able to better predict bikeshare demand.

The cell below creates 2nd degree polynomial features and fits them with a Linear Regression model.
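To make clear what 2nd degree polynomial features are, the sketch below expands a single row by hand into a bias term, the original features, their squares, and their pairwise products. The helper `polynomial_features` is hypothetical and only illustrates the expansion; it is not the notebook's code.

```python
from itertools import combinations_with_replacement
from math import prod


def polynomial_features(row, degree=2):
    """Expand one row into all monomials of its features up to `degree`,
    including the constant (bias) term."""
    expanded = []
    for d in range(degree + 1):
        for combo in combinations_with_replacement(row, d):
            expanded.append(prod(combo))  # the empty product is 1 (bias term)
    return expanded


# Hypothetical row with two features, for example [temp, humidity].
expanded = polynomial_features([2.0, 3.0])
# Terms, in order: 1, a, b, a*a, a*b, b*b
```

The number of columns grows quickly with the number of input features, which is worth keeping in mind when many dummy-encoded columns are present.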

The RMSE from this model is the highest so far.

More complex models outside the scope of this analysis may be required.

Testing the Model

Out of the models looked at above, the linear regression model using cyclical time features had the lowest RMSE.

However, because of the nature of linear regression models, it is possible for the model to predict values that are below 0. A prediction that is below 0 can be interpreted as 0 hourly rides, and the more negative the value, the more likely that that hour will have no demand.

The RMSE on the test dataset is close to the validation RMSE of 119.59 for this model. This RMSE includes negative predicted values, however.

If the negative predicted values were taken as 0 instead, what is the resulting RMSE?
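Treating negative predictions as 0 amounts to clipping the predictions before scoring. A minimal sketch with invented numbers:

```python
import math


def rmse(actual, predicted):
    """Root mean square error between two equal-length sequences."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical actual hourly rides and raw linear-model predictions.
actual = [0, 5, 120, 300]
predicted = [-40.0, -10.0, 100.0, 250.0]

# Replace every negative prediction with 0.
clipped = [max(0.0, p) for p in predicted]

raw_rmse = rmse(actual, predicted)
clipped_rmse = rmse(actual, clipped)
```

Since actual ride counts are never negative, moving a negative prediction up to 0 always brings it closer to the true value, so clipping cannot make the RMSE worse.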

The RMSE is significantly lower, at about 60% of what test_rmse was.

It is likely that Linear Regression is not the best model to predict hourly bikeshare demand. As seen in the exploratory data analysis, bikeshare demand has a strong cyclical correlation with month, day of week, and hour. These variables are likely not captured fully by linear regression.

The plot below will look at the residuals between the actual hourly rides in the test data vs the predicted hourly rides.

The plot above treats all negative predictions as 0, which creates the diagonal line in the scatter plot.

Ideally, the line in the plot above would be horizontal, indicating that the predictions and actual hourly rides are the same. The downward slope indicates that the model is consistently under-predicting at higher hourly rides.

This tendency to underpredict may be caused by the lack of an intercept in the model. Another possibility is that some features need to be weighted less and their effect on the number of riders may be overstated in the model.

Conclusions

The function below is an example of something that could be implemented, where a date and time, whether or not the weather is clear, the wind speed, temperature, and humidity are used as inputs and fed into the linear model.

The predicted hourly demand is then returned.

The resulting predictions from the cases above seem reasonable. Not many people would be biking at 4 am on a Sunday in -15 degree weather!

The other case is on a Wednesday at 5 pm, the end of a typical workday, when a peak in demand is usually seen. This value of 245 seems low compared to the hourly demand that was seen in the exploratory data analysis section. The model may be giving the various weather features too much weight, which decreases the number of predicted rides.

Again, a more advanced model, such as tree-based predictors, neural networks, etc., may be more useful for predicting bikeshare demand than a simple linear regression model.