NeuralNet R package - Neural network to predict Kaggle Bike Sharing Competition
Right now, Kaggle is hosting a competition to predict the usage of the Capital Bike Sharing system in Washington, DC.
Here’s a very simple model using the “neuralnet” package in R that will put you around 300th at the time of this writing, which is in the top third. With some slight changes, you should be able to move into the top quarter.
#install and load packages
install.packages(“lubridate”)
install.packages(“neuralnet”)
library(lubridate)
library(neuralnet)
Lubridate will help us break the datetime string down into something that the machine likes more. We’ll also use the neuralnet package in R, although in general neural networks are generally done in Python or Torch due to R’s limitations.
# import train and test
train <- read.csv(“~/filepath/train.csv”)
test <- read.csv(“~/filepath/test.csv”)
#make hour a separate variable
train$hour <- hour(train$datetime)
test$hour <- hour(test$datetime)
#make day a separate variable
train$day <- wday(train$datetime)
test$day <- wday(test$datetime)
#make year a separate variable
train$year <- year(train$datetime)
test$year <- year(test$datetime)
#write code to look for weather 4
train$weather[train$weather == 4]
test$weather[test$weather == 4]
Really, I should’ve showed you ways to visualize the data. Instead, I’m just showing you this.
#get rid of weather 4
train$weather[train$weather==4] <- 3
test$weather[test$weather==4] <- 3
After you run this, you’ll find that the train data only has one hour with a label of “4” while the test data has two. This is far from ideal, so let’s just go ahead and assign these to 3.
Now let’s make our variables a factor:
# make factor
train$season <- as.factor(train$season)
test$season <- as.factor(test$season)
train$workingday <- as.factor(train$workingday)
test$workingday <- as.factor(test$workingday)
train$weather <- as.factor(train$weather)
test$weather <- as.factor(test$weather)
train$year <- as.factor(train$year)
test$year <- as.factor(test$year)
train$day <- as.factor(train$day)
test$day <- as.factor(test$day)
train$hour <- as.factor(train$hour)
test$hour <- as.factor(test$hour)
The Neuralnet package is a bit picky, so this was the way that I got the data into something that made it happy, where I could also change my model fairly easily.
# turn train and test into matrices for dummy variables
trainmat <- model.matrix(count~season+workingday+weather+year+hour+day,data=train)
testmat <- model.matrix(~season+workingday+weather+year+hour+day,data=test)
Now lets turn our dummy variable matrices back into dataframes:
#turn matrices back into data frames because that seems to be what the neuralnet package likes
trainmat <- as.data.frame(trainmat)
testmat <- as.data.frame(testmat)
Now we’ll scale count:
#scale count
count <- train$count/1000
#add count to trainmat
trainmat <- cbind(trainmat,count)
#Write formula
formula <- count ~ season2+season3+season4+workingday1+weather2+weather3+year2012+hour1+hour2+hour3+hour4+hour5+hour6+hour7+hour8+hour9+hour10+hour11+hour12+hour13+hour14+hour15+hour16+hour17+hour18+hour19+hour20+hour21+hour22+hour23+day2+day3+day4+day5+day6+day7
#train your data. note that this is a neural network with 5 hidden layers of 7, 8, 9, 8, and 7 respectively.
fit <- neuralnet(formula,data=trainmat,hidden=c(7,8,9,8,7),threshold=.04,stepmax=1e+06,learningrate=.001,algorithm=”rprop+”,lifesign=”full”,likelihood=T)
Threshold controls how close you want to get to convergence. In other words, “early stopping.” Increase this number for faster processing and to decrease overfitting. Decrease this number for more precision but at the risk of possible overfitting. I’m using the rprop+ algorithm. I set the lifesign to “full” because I like to know that the sucker is still running. That’s particularly helpful when your algorithm takes a few hours to run.
Your mileage may vary, but this particular neural network only took a few minutes to run on my MacBook Air.
Now for some quick housecleaning, because the neuralnet package is finicky. Then we’ll use our model to make predictions using the test data, and assign them to a variable and re-scale them.
## Delete the first column. With the neural net package, you have to be careful to always make sure that the covariate matrix matches your test set. Columns must be in the same order.
testmat <- testmat[,2:38]
#Get predictions
predict <- compute(fit,testmat)
#Assign predictions to variable because compute produces more than we need
predict<- predict$net.rsult
#Rescale
predict<- predict*1000
Since this model is a bit overfitted, we’ll check to make sure we don’t have any values that are really low. Since we will, we’ll set the minimum prediction to an arbitrary number, 3.8
#Check for any negative variables
predict[predict<3]
# We’ll set the minimum prediction here to 3.8
predict[predict<3] <- 3.8
submit <- data.frame(datetime = test$datetime, count=predict)
write.csv(submit, file=”thanksEvan.csv”,row.names=FALSE)