9.3 Random Forest with More Variables
Now let us see if we can build a better model using more predictors. The predictors we use are identical to those of the decision tree model3.
### RF_model2 with more predictors
set.seed(2222)
RF_model2 <- randomForest(as.factor(Survived) ~ Sex + Fare_pp + Pclass + Title + Age_group + Group_size + Ticket_class + Embarked,
data = train,
importance=TRUE)
# This model will be used in later chapters, so save it in a file so it can be loaded later.
save(RF_model2, file = "./data/RF_model2.rda")
We can assess the new model:
##
## Call:
## randomForest(formula = as.factor(Survived) ~ Sex + Fare_pp + Pclass + Title + Age_group + Group_size + Ticket_class + Embarked, data = train, importance = TRUE)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 2
##
## OOB estimate of error rate: 16.84%
## Confusion matrix:
## 0 1 class.error
## 0 499 50 0.09107468
## 1 100 242 0.29239766
Notice that the default parameters mtry = 2 and ntree = 500 were used. This means two variables are tried at each split and 500 trees are built. The model's estimated OOB error rate is 16.84%, a decrease compared with the first model's 20%. So the overall accuracy of the model has reached 83.16%.
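Because the model was fitted with importance = TRUE, the fitted forest stores further diagnostics worth a look. The following is a minimal sketch, assuming RF_model2 is still in the workspace; importance(), varImpPlot() and the err.rate component are standard parts of the randomForest package.

```r
# Sketch: further diagnostics on the fitted forest (assumes RF_model2 from above)
library(randomForest)

# Importance of each predictor (available because importance = TRUE was set)
importance(RF_model2)
varImpPlot(RF_model2)

# OOB error as more trees are added, to check that ntree = 500 is enough
plot(RF_model2$err.rate[, "OOB"], type = "l",
     xlab = "Number of trees", ylab = "OOB error rate")
```

If the OOB error curve flattens out well before 500 trees, the default ntree is sufficient for this data.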
Let us make a prediction on the train dataset to verify the model's training accuracy.
# RF_model2 Prediction on train
RF_prediction2 <- predict(RF_model2, train)
# check the prediction against the true labels
conMat<- confusionMatrix(RF_prediction2, train$Survived)
conMat$table
## Reference
## Prediction 0 1
## 0 529 55
## 1 20 287
## [1] "Accuracy = 0.92"
## [1] "Error = 0.08"
We can see the model's accuracy on the training dataset has reached 92%. The result shows that the prediction on survival has 287 correct predictions and 55 wrong predictions; the prediction on death has 529 correct predictions and 20 wrong predictions. The overall accuracy reaches 92%, which is again higher than the model's estimated OOB accuracy of 83.16%.
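The accuracy printed above can be reproduced directly from the confusion matrix. A minimal sketch, with the counts copied from the output above:

```r
# Sketch: derive accuracy from the confusion matrix counts shown above
conMat_counts <- matrix(c(529, 20, 55, 287), nrow = 2,
                        dimnames = list(Prediction = c("0", "1"),
                                        Reference  = c("0", "1")))
# correct predictions are on the diagonal
accuracy <- sum(diag(conMat_counts)) / sum(conMat_counts)
round(accuracy, 2)   # 0.92
```

The same value is available directly from the caret object as conMat$overall["Accuracy"].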
It is also a slight increase over the random forest RF_model1, whose estimated accuracy was 80% and whose accuracy on the train dataset was 84%. Compared with the decision tree model3, which has identical predictors, the accuracy was 85% on the training dataset.
Let us make another submission to Kaggle to see if the prediction on unseen data has improved.
# produce a submission and submit to Kaggle
test$Pclass <- as.factor(test$Pclass)
test$Group_size <- as.factor(test$Group_size)
#make prediction
RF_prediction <- predict(RF_model2, test)
submit <- data.frame(PassengerId = test$PassengerId, Survived = RF_prediction)
# Write it into a file "RF_Result2.CSV"
write.csv(submit, file = "./data/RF_Result2.CSV", row.names = FALSE)
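Before uploading, it can be worth a quick sanity check that the file has the shape Kaggle expects: one row per test passenger and exactly the two columns PassengerId and Survived. A minimal sketch, assuming the file was written as above and test is still in scope:

```r
# Sketch: sanity-check the submission file before uploading
check <- read.csv("./data/RF_Result2.CSV")
stopifnot(nrow(check) == nrow(test),   # one row per test passenger
          identical(names(check), c("PassengerId", "Survived")))
```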
The feedback shows a score of 0.78947, an improvement on both RF_model1 (0.76555) and the decision tree model3 (0.77033).
Let us record these various accuracies.