Summary
In this chapter, we have demonstrated how to fine-tune a prediction model so that it achieves its best performance and avoids overfitting. We fine-tuned not only the model's parameters but also the other two factors that can cause overfitting: the sampling of the training data and the selection of predictors.
Using the RF model as an example, and starting from the ranking of all attributes by their prediction power, we identified the best collection of predictors, both how many to use and which ones. We concluded that the best predictor list is:
## [1] "Sex, Title, Fare_pp, Ticket_class, Pclass, Ticket, Age, Friend_size, Deck"
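As a reminder of how such a ranking can be produced, the following is a minimal sketch using caret's varImp() on a fitted random forest. It is illustrative only: rf.label and rf.train.8 are the response and predictor objects used later in this chapter, and in practice the ranking would be run on the full attribute set before any predictors are dropped.
library(caret)
library(randomForest)

# Fit a quick RF with a light cross-validation scheme, just to rank attributes
set.seed(1234)
rank_model <- train(y = rf.label, x = rf.train.8,
                    method = "rf", metric = "Accuracy",
                    trControl = trainControl(method = "cv", number = 3))

# varImp() orders the attributes by their contribution to the model,
# which is the ordering the predictor selection starts from
varImp(rank_model)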
We also demonstrated the process of sampling the training dataset. This is a basic technique that matters most when the dataset is small, and the sampling choice has a great impact on the model's performance. We demonstrated the technique using k-fold cross validation (CV) and concluded that the best sampling scheme is 3-fold CV repeated 10 times.
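To make the sampling step concrete, here is a minimal sketch of comparing a few repeated cross-validation schemes with caret's trainControl(). The candidate fold/repeat combinations shown are illustrative, not necessarily the ones explored in the chapter; rf.label and rf.train.8 are the response and predictors used later in this chapter.
library(caret)

# Candidate sampling schemes: (number of folds, number of repeats)
cv_settings <- list(
  c(number = 3,  repeats = 10),
  c(number = 5,  repeats = 10),
  c(number = 10, repeats = 3)
)

# Fit the same RF model under each scheme and record the best resampled accuracy
cv_results <- lapply(cv_settings, function(s) {
  set.seed(1234)
  ctrl <- trainControl(method = "repeatedcv",
                       number = s["number"], repeats = s["repeats"])
  fit <- train(y = rf.label, x = rf.train.8, method = "rf",
               metric = "Accuracy", trControl = ctrl)
  max(fit$results$Accuracy)
})

# The scheme with the highest cross-validated accuracy
# (3 folds repeated 10 times in this chapter) is kept for the final model
cv_results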
And finally, we demonstrated the methods used to fine-tune a model's parameters. With RF, the only two tunable parameters are mtry and ntree. We illustrated the "Random search", "Grid search" and "Manual search" methods and found that, with the predictors and the sampling scheme fixed, the best parameter values are mtry = 3 and ntree = 1500.
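The sketch below shows what such a search can look like in code: a caret grid search over mtry combined with a manual loop over ntree, since caret's "rf" method only tunes mtry. The candidate values are illustrative and not necessarily the grid used in the chapter.
library(caret)

control  <- trainControl(method = "repeatedcv", number = 3, repeats = 10,
                         search = "grid")
tunegrid <- expand.grid(.mtry = c(2, 3, 4, 5))

# Manual search over ntree, grid search over mtry inside each fit
results <- list()
for (nt in c(500, 1000, 1500, 2000)) {
  set.seed(1234)
  fit <- train(y = rf.label, x = rf.train.8, method = "rf",
               metric = "Accuracy", tuneGrid = tunegrid,
               trControl = control, ntree = nt)
  results[[as.character(nt)]] <- fit
}

# Comparing the resampled accuracy across the fits points to the best
# mtry/ntree pair (mtry = 3 and ntree = 1500 in this chapter)
summary(resamples(results))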
Let us use these parameters to produce a model on the training dataset and make a prediction on the test dataset. We can then submit the final result to Kaggle for evaluation.
# caret provides train(), trainControl() and predict() used below;
# it may already be loaded from earlier in the chapter
library(caret)
set.seed(1234)
tunegrid <- expand.grid(.mtry = 3)
control <- trainControl(method="repeatedcv", number=3, repeats=10, search="grid")
# # # The following code has been commented out so that producing the markdown
# # # file does not have to wait for a long training run
# # # set up cluster for parallel computing
# cl <- makeCluster(6, type = "SOCK")
# registerDoSNOW(cl)
#
# Final_model <- train(y = rf.label, x = rf.train.8, method="rf", metric="Accuracy", tuneGrid=tunegrid, trControl= control, ntree=1500)
#
# #Shutdown cluster
# stopCluster(cl)
#
# save(Final_model, file = "./data/Final_model.rda")
# # # The above code is commented out to speed up building the book file
load("./data/Final_model.rda")
# Make predictions
Prediction_Final <- predict(Final_model, test.submit.df)
#table(Prediction_Final)
# Write out a CSV file for submission to Kaggle
submit.df <- data.frame(PassengerId = test$PassengerId, Survived = Prediction_Final)
write.csv(submit.df, file = "./data/Prediction_Final.csv", row.names = FALSE)
We got a score of 0.76076. Recall that our base model, without fine-tuning, scored 0.75598 on Kaggle. Our RF model's accuracy has therefore increased by about 0.5 percentage points. That may not seem like a lot, but the technique and the process are far more important than the increase in accuracy itself.