9.3 Random Forest with More Variables
Now let us see if we can build a better model using more predictors. The predictors we use are identical to those of the decision tree model3.
### RF_model2 with more predictors
set.seed(2222)
RF_model2 <- randomForest(as.factor(Survived) ~ Sex + Fare_pp + Pclass + Title + Age_group + Group_size + Ticket_class + Embarked,
data = train,
importance=TRUE)
# This model will be used in later chapters, so save it in a file so it can be loaded later.
save(RF_model2, file = "./data/RF_model2.rda")
We can assess the new model:
##
## Call:
## randomForest(formula = as.factor(Survived) ~ Sex + Fare_pp + Pclass + Title + Age_group + Group_size + Ticket_class + Embarked, data = train, importance = TRUE)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 2
##
## OOB estimate of error rate: 16.84%
## Confusion matrix:
## 0 1 class.error
## 0 499 50 0.09107468
## 1 100 242 0.29239766
Notice that the default parameters mtry = 2 and ntree = 500 were used. This means two variables are tried at each split and 500 trees are built. The model's estimated OOB error rate is 16.84%, a decrease compared with the first model's 20%. So the overall accuracy of the model has reached 83.16%.
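Because the model was fitted with importance = TRUE, the fitted forest stores further diagnostics worth a look. The following is a minimal sketch, assuming RF_model2 is still in the workspace; importance(), varImpPlot() and the err.rate component are standard parts of the randomForest package.

```r
# Sketch: further diagnostics on the fitted forest (assumes RF_model2 from above)
library(randomForest)

# Importance of each predictor (available because importance = TRUE was set)
importance(RF_model2)
varImpPlot(RF_model2)

# OOB error as more trees are added, to check that ntree = 500 is enough
plot(RF_model2$err.rate[, "OOB"], type = "l",
     xlab = "Number of trees", ylab = "OOB error rate")
```

If the OOB error curve flattens out well before 500 trees, the default ntree is sufficient for this data.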
Let us make a prediction on the train dataset to verify the model's training accuracy.
# RF_model2 Prediction on train
RF_prediction2 <- predict(RF_model2, train)
# check the prediction against the true labels
conMat<- confusionMatrix(RF_prediction2, train$Survived)
conMat$table
## Reference
## Prediction 0 1
## 0 529 55
## 1 20 287
## [1] "Accuracy = 0.92"
## [1] "Error = 0.08"
We can see the model's accuracy on the training dataset has reached 92%. The result shows that the prediction on survival has 287 correct predictions and 55 wrong predictions; the prediction on death has 529 correct predictions and 20 wrong predictions. The overall accuracy reaches 92%, which is again higher than the model's estimated OOB accuracy of 83.16%.
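The accuracy printed above can be reproduced directly from the confusion matrix. A minimal sketch, with the counts copied from the output above:

```r
# Sketch: derive accuracy from the confusion matrix counts shown above
conMat_counts <- matrix(c(529, 20, 55, 287), nrow = 2,
                        dimnames = list(Prediction = c("0", "1"),
                                        Reference  = c("0", "1")))
# correct predictions are on the diagonal
accuracy <- sum(diag(conMat_counts)) / sum(conMat_counts)
round(accuracy, 2)   # 0.92
```

The same value is available directly from the caret object as conMat$overall["Accuracy"].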
It is also a slight increase over the random forest RF_model1, whose estimated accuracy was 80% and whose accuracy on the train dataset was 84%. Compared with the decision tree model3, which has identical predictors, the accuracy was 85% on the training dataset.
Let us make another submission to Kaggle to see if the prediction on unseen data has improved.
# produce a submission and submit to Kaggle
test$Pclass <- as.factor(test$Pclass)
test$Group_size <- as.factor(test$Group_size)
#make prediction
RF_prediction <- predict(RF_model2, test)
submit <- data.frame(PassengerId = test$PassengerId, Survived = RF_prediction)
# Write it into a file "RF_Result2.CSV"
write.csv(submit, file = "./data/RF_Result2.CSV", row.names = FALSE)
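Before uploading, it can be worth a quick sanity check that the file has the shape Kaggle expects: one row per test passenger and exactly the two columns PassengerId and Survived. A minimal sketch, assuming the file was written as above and test is still in scope:

```r
# Sketch: sanity-check the submission file before uploading
check <- read.csv("./data/RF_Result2.CSV")
stopifnot(nrow(check) == nrow(test),   # one row per test passenger
          identical(names(check), c("PassengerId", "Survived")))
```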
The feedback shows a score of 0.78947, an improvement on both RF_model1 (0.76555) and the decision tree model3 (0.77033).
Let us record these various accuracies.