# NYC CitiBike [5] - Modeling

After considering the modeling framework, the next step is to realize the analysis.

As suggested in the previous schemas, in order to create the predictive model to predict daily subscriber/pass user, number of trips, we are going to combine several source of data into one `dataframe`

. Then running simple `randomForest`

algorithm and `GAM`

via `caret`

package (machine learning for `R`

).

Here is the required library:

1 | library(plyr) # require rbind.fill |

## New Subscriber / pass user

### Combining data

The base dataset comes from CitiBike System Data, and for this part, I am using the “Citi Bike Daily Ridership and Membership Data”. The data is aggregated in daily level which should be sufficient. The data retrieving and loading script could be found here.

Let us begin the data wrangling:

1 | # read the aggregated dataset |

### Correlation exploration

Next let’s visualize the correlation between each features:

1 | featurePlot(dt_join[,c("month", "wday", "mday", "Mean_TemperatureC","holiday" ,"season_num", "CloudCover","Max_TemperatureC", "Min_TemperatureC", "Precipitationmm")], dt_join$new_subscriber_per_day, main="New Subscriber") |

1 | featurePlot(dt_join[,c("month", "wday", "mday", "Mean_TemperatureC","holiday" ,"season_num", "CloudCover","Max_TemperatureC", "Min_TemperatureC", "Precipitationmm")], dt_join$new_customer_per_day_pass_1day, main="New 24-hour pass") |

1 | featurePlot(dt_join[,c("month", "wday", "mday", "Mean_TemperatureC","holiday" ,"season_num", "CloudCover","Max_TemperatureC", "Min_TemperatureC", "Precipitationmm")], dt_join$new_customer_per_day_pass_7days, main="New 7-day pass") |

From the plots, we can see some clear correlation between the predictor and outcome variable. As observed in the plot, the following variables will be in the model: month, season, day of the week, Cloud Cover, max temperature, average temperature, holiday, precipitation. As opposed, the day of the month and min temperature are dropped.

### Modeling

Now we are going to model the outcome variable with the selected predictor.

Some tips for modeling with `caret`

package:

- With the model object, we can use
`caret::varImp()`

function to print out the list of predictors’ importance. - Use
`caret::trainControl()`

, we can fine tuning the training part. For example, changing the methode for paramter searching, creating the prediction bounds to avoid unexpected outcome, etc…

First GAM model for Subscriber:

1 | fitControl <- trainControl(method = "timeslice", |

1 | # Generalized Additive Model using Splines |

1 | summary(gamfit$final) |

1 | ## |

1 | pred <- predict(gamfit, newdata=dt_complete) |

1 | # varImp(gamfit$finalModel) |

Next `randomForest`

model for subscriber:

1 | fitControl <- trainControl(method = "oob", |

1 | ## Loading required package: randomForest |

1 | rdForestfit |

1 | # Random Forest |

1 | rdForestfit$finalModel |

1 | ## |

1 | pred <- predict(rdForestfit, newdata=dt_complete) |

1 | # varImp(rdForestfit$finalModel) |

The following codes will model the new 24h pass and new 7-day pass per day. Same as those for subscriber.

1 | # Pass 24H ----------------------------- |

1 | # Generalized Additive Model using Splines |

1 | summary(gamfit$final) |

1 | ## |

1 | pred <- predict(gamfit, newdata=dt_complete) |

1 | # varImp(gamfit$finalModel) |

1 | # Random Forest |

1 | rdForestfit$finalModel |

1 | ## |

1 | pred <- predict(rdForestfit, newdata=dt_complete) |

1 | # varImp(rdForestfit$finalModel) |

1 | # Generalized Additive Model using Splines |

1 | summary(gamfit$final) |

1 | ## |

1 | pred <- predict(gamfit, newdata=dt_complete) |

1 | # varImp(gamfit$finalModel, scale=FALSE) |

1 | # Random Forest |

1 | rdForestfit$finalModel |

1 | ## |

1 | pred <- predict(rdForestfit, newdata=dt_complete) |

1 | # varImp(rdForestfit$finalModel) |

## Number of Trips

In the post, we have created new feature: `active membership`

.

With the information of new subscriber and pass user, we can use the prediction to create the feature `active membership`

for the targeted time.

So we firstly begin data combination:

### Combining data

This part we are going to use the previously created SQLite database to query the data. We only keep the data at daily level.

1 | # get sqlite database |

Now prepare the data frame for modeling.

1 | # complete cases |

### Correlation exploration

Now let’s visualize the correlation between the count of trips and the predictors. Here plots trips carried out by pass users (customer), annual subscriber and sum of them.

1 | pairs(~sum_count+(month)+ |

1 | pairs(~Customer+(month)+ |

1 | pairs(~Subscriber+(month)+ |

From the plots, we can see some clear correlation between the predictor and outcome variable. As observed in the plot, the following variables will be in the model: month, season, day of the week, Cloud Cover, max temperature, average temperature, holiday, precipitation and active membership.

### Modeling

We need to create the model for annual subscriber trips & customer trips, but it is possible to only create the model for the total trips.

1 | # Annual Subscriber ----------------------------------- |

1 | # Generalized Additive Model using Splines |

1 | summary(gamfit$final) |

1 | ## |

1 | # varImp(gamfit) |

1 | # saveRDS(gamfit, "input/mod/MOD_gam_subs_trip.rds") |

1 | # Random Forest |

1 | rdForestfit$finalModel |

1 | ## |

1 | # varImp(rdForestfit) |

1 | # saveRDS(rdForestfit, "input/mod/MOD_rf_subs_trip.rds") |

1 | # Generalized Additive Model using Splines |

1 | summary(gamfit$final) |

1 | ## |

1 | # varImp(gamfit) |

1 | # saveRDS(gamfit, "input/mod/MOD_gam_cust_trip.rds") |

1 | # Random Forest |

1 | rdForestfit$finalModel |

1 | ## |

1 | # varImp(rdForestfit) |

1 | # saveRDS(rdForestfit, "input/mod/MOD_rf_cust_trip.rds") |

## Summary

In this post, we have constructed the model with random forest and general addictive model and done some feature engineering to create some useful features: active membership, holiday, day of week, etc…

With the current model (having around 80% - 90% accuracy), we can begin the construction of shiny application.