9.2 Random Forest with Key Predictors
The process of building an RF model with the randomForest
package is much the same as building a decision tree with rpart
. Note also that randomForest assumes classification if the dependent (response) variable is a factor, and regression otherwise. So to use randomForest
for classification, we need to convert the dependent variable into a factor.
# convert variables into factor
# convert other attributes which really are categorical data but in form of numbers
train$Group_size <- as.factor(train$Group_size)
#confirm types
sapply(train, class)
## PassengerId Survived Pclass Sex Age SibSp
## "integer" "factor" "factor" "factor" "numeric" "integer"
## Parch Ticket Embarked HasCabinNum Friend_size Fare_pp
## "integer" "factor" "factor" "factor" "integer" "numeric"
## Title Deck Ticket_class Family_size Group_size Age_group
## "factor" "factor" "factor" "integer" "factor" "factor"
Let us use the same five most relevant attributes as in decision tree model2: Pclass, Sex, HasCabinNum, Deck and Fare_pp. We use all the default parameters of randomForest
.
# Build the random forest model using Pclass, Sex, HasCabinNum, Deck and Fare_pp
set.seed(1234) #for reproduction
RF_model1 <- randomForest(as.factor(Survived) ~ Sex + Pclass + HasCabinNum + Deck + Fare_pp, data=train, importance=TRUE)
save(RF_model1, file = "./data/RF_model1.rda")
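The call above relies on randomForest's defaults. For classification, ntree defaults to 500 and mtry defaults to floor(sqrt(p)), where p is the number of predictors. A quick sketch of that calculation for our five predictors:

```r
# randomForest's classification default for mtry is floor(sqrt(p)),
# where p is the number of predictors; ntree defaults to 500.
p <- 5                       # Sex, Pclass, HasCabinNum, Deck, Fare_pp
mtry_default <- floor(sqrt(p))
mtry_default                 # 2
```

This matches the "No. of variables tried at each split" reported in the model summary below.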
Let us check model’s prediction accuracy.
##
## Call:
## randomForest(formula = as.factor(Survived) ~ Sex + Pclass + HasCabinNum + Deck + Fare_pp, data = train, importance = TRUE)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 2
##
## OOB estimate of error rate: 19.3%
## Confusion matrix:
## 0 1 class.error
## 0 505 44 0.08014572
## 1 128 214 0.37426901
We can see that the model used the default parameters: ntree = 500
and mtry = 2
. The model's estimated accuracy is 80.7%, that is, 1 - 19.3% (the OOB error rate).
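As a quick sanity check, the OOB accuracy can be reproduced from the printed confusion matrix with base R (a sketch; the matrix values are copied from the output above):

```r
# Rebuild the OOB confusion matrix printed above
# (rows = actual class, columns = predicted class).
oob <- matrix(c(505,  44,
                128, 214),
              nrow = 2, byrow = TRUE,
              dimnames = list(Actual = c("0", "1"),
                              Predicted = c("0", "1")))

# Accuracy = correct predictions / all predictions = 1 - OOB error.
oob_accuracy <- sum(diag(oob)) / sum(oob)
round(oob_accuracy, 3)  # 0.807
```

The per-class errors also agree: 44/(505+44) is about 0.08 for class 0 and 128/(128+214) is about 0.374 for class 1.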
Let us make a prediction on the training dataset and check the accuracy.
# Make a prediction on the training dataset
RF_prediction1 <- predict(RF_model1, train)
#check up
conMat<- confusionMatrix(RF_prediction1, train$Survived)
conMat$table
## Reference
## Prediction 0 1
## 0 521 112
## 1 28 230
## [1] "Accuracy = 0.84"
## [1] "Error = 0.16"
We can see that the prediction on the training dataset has achieved 84% accuracy. Reading the confusion matrix, the model made 521 correct and 28 wrong predictions on death (class 0), and 230 correct and 112 wrong predictions on survival (class 1).
The model's OOB-estimated accuracy after learning was 80.7%, but the evaluation on the training dataset reaches 84%, an increase. Compared with decision tree model2, in which the same attributes were used and the prediction accuracy on the train data was 81%, the accuracy is also increased. Let us make a prediction on the test dataset and submit it to Kaggle to obtain an accuracy score.
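The reported accuracy and error follow directly from conMat$table. A short sketch reproducing them from the printed counts:

```r
# Recompute the training-set accuracy and error from the
# confusion matrix above (rows = prediction, columns = reference).
cm <- matrix(c(521, 112,
                28, 230),
             nrow = 2, byrow = TRUE,
             dimnames = list(Prediction = c("0", "1"),
                             Reference = c("0", "1")))

accuracy <- sum(diag(cm)) / sum(cm)
round(accuracy, 2)       # 0.84
round(1 - accuracy, 2)   # 0.16
```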
# produce a submission file in the Kaggle-required format: only two attributes, PassengerId and Survived
test$Pclass <- as.factor(test$Pclass)
test$Group_size <- as.factor(test$Group_size)
#make prediction
RF_prediction <- predict(RF_model1, test)
submit <- data.frame(PassengerId = test$PassengerId, Survived = RF_prediction)
# Write it into a file "RF_Result.CSV"
write.csv(submit, file = "./data/RF_Result1.CSV", row.names = FALSE)
We can see our random forest model has scored 0.76555 in the Kaggle competition. It is interesting that the random forest model has not improved on the test dataset compared with the decision tree model with the same predictors, whose accuracy was also 0.76555.
Let us record these accuracies,