9.5 Comparison of the Three Random Forest Models
We have produced three random forest models, each with a different prediction accuracy on the test dataset. Let us make a quick comparison among them.
library(tidyr)
Model <- c("RF_Model1","RF_Model2","RF_Model3")
Pre <- c("Sex, Pclass, HasCabinNum, Deck, Fare_pp", "Sex, Fare_pp, Pclass, Title, Age_group, Group_size, Ticket_class, Embarked", "Sex, Pclass, Age, SibSp, Parch, Embarked, HasCabinNum, Friend_size, Fare_pp, Title, Deck, Ticket_class, Family_size, Group_size, Age_group")
Learn <- c(80.0, 83.16, 83.0)
Train <- c(84, 92, 78)
Test <- c(76.555, 78.95, 77.03)
df1 <- data.frame(Model, Pre, Learn, Train, Test)
df2 <- data.frame(Model, Learn, Train, Test)
knitr::kable(df1, longtable = TRUE, booktabs = TRUE, digits = 2, col.names =c("Models", "Predictors", "Accuracy on Learn", "Accuracy on Train", "Accuracy on Test"),
caption = 'The Comparison among the Three Random Forest Models'
)
Models | Predictors | Accuracy on Learn | Accuracy on Train | Accuracy on Test |
---|---|---|---|---|
RF_Model1 | Sex, Pclass, HasCabinNum, Deck, Fare_pp | 80.00 | 84 | 76.56 |
RF_Model2 | Sex, Fare_pp, Pclass, Title, Age_group, Group_size, Ticket_class, Embarked | 83.16 | 92 | 78.95 |
RF_Model3 | Sex, Pclass, Age, SibSp, Parch, Embarked, HasCabinNum, Friend_size, Fare_pp, Title, Deck, Ticket_class, Family_size, Group_size, Age_group | 83.00 | 78 | 77.03 |
library(ggplot2)
# Reshape df2 into long format: one row per Model/Dataset pair
df.long <- gather(df2, Dataset, Accuracy, -Model, factor_key = TRUE)
# Grouped bar chart: accuracy per model, one bar per dataset
ggplot(data = df.long, aes(x = Model, y = Accuracy, fill = Dataset)) +
  geom_col(position = position_dodge())
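The train-to-test drop can also be quantified directly. The following sketch recreates the accuracy figures from the table above and computes the gap for each model:

```r
# Recreate the accuracy figures from the comparison table above
Model <- c("RF_Model1", "RF_Model2", "RF_Model3")
Train <- c(84, 92, 78)
Test  <- c(76.555, 78.95, 77.03)

# Overfitting gap: training accuracy minus test accuracy, per model
gap <- data.frame(Model, Gap = Train - Test)
gap
# RF_Model2 shows the largest gap (13.05 points), RF_Model3 the smallest (0.97)
```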
- It is not true that more predictors produce better performance with random forest models. RF_Model3 uses the most predictors, yet it does not achieve the highest accuracy on the test dataset.
- Model validation results on the training dataset are not reliable. Higher accuracy on the training dataset does not imply higher accuracy on the test dataset.
- All the models show a degree of overfitting. That is, the accuracy on the test dataset is lower than on the training dataset, and even lower than the model's own estimated accuracy during construction.
- The cause of the overfitting is a complicated issue. It may be related to several factors: the number of predictors used to build the model, the dataset used to build the model, and the model's default parameters.
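On the last point, one of the defaults worth probing is mtry, the number of predictors sampled at each split. The sketch below is purely illustrative: it uses a small synthetic data frame (not the Titanic data) to show how the randomForest package's tuneRF helper searches for an mtry value with a lower out-of-bag (OOB) error:

```r
library(randomForest)

# Illustrative only: a synthetic data frame standing in for the training data
set.seed(1234)
demo <- data.frame(
  Survived = factor(sample(0:1, 200, replace = TRUE)),
  x1 = rnorm(200), x2 = rnorm(200), x3 = rnorm(200), x4 = rnorm(200)
)

# tuneRF grows a forest for each candidate mtry and records the OOB error
tune <- tuneRF(x = demo[, -1], y = demo$Survived,
               ntreeTry = 200,    # trees grown per mtry candidate
               stepFactor = 2,    # multiply/divide mtry by this each step
               improve = 0.01,    # minimum OOB gain required to keep searching
               trace = FALSE, plot = FALSE)

# Pick the mtry value with the lowest OOB error
best_mtry <- tune[which.min(tune[, "OOBError"]), "mtry"]
```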
Compared with the four decision tree models built in the previous chapter, the random forest models outperform all of them on the test dataset: even the lowest random forest test accuracy (76.56%) matches the highest accuracy achieved by the decision tree models (76.55%).