8.4 The Decision Tree with More Predictors
We have seen that our decision tree model with 5 key predictors improved on the sex-only prediction model. However, we know that our re-engineered data has more dimensions that contain useful information. Let us see if we can improve the decision tree model with more predictors, guided by the results of the correlation analysis and PCA. This time we add Title, Age_group, Group_size (traveling in groups), Ticket_class, and Embarked (the port of embarkation) to Sex, Pclass, and Fare_pp. Note that the formula below does not include HasCabinNum or Deck.
# tree model3 construction using more predictors
model3 <- rpart(Survived ~ Sex + Fare_pp + Pclass + Title + Age_group + Group_size + Ticket_class + Embarked,
data=train,
method="class")
# Save the model to a file so that later chapters can load it into memory
save(model3, file = "./data/model3.rda")
# Assess prediction accuracy on the train data
Predict_model3_train <- predict(model3, train, type = "class")
conMat <- confusionMatrix(as.factor(Predict_model3_train), as.factor(train$Survived))
conMat$table
##           Reference
## Prediction   0   1
##          0 517 100
##          1  32 242
conMat$overall["Accuracy"]
## Accuracy
##   0.8519
Assessing model3 on the train data shows that its accuracy has increased to 85%, up from 82% for model2. Let us use this model to make another prediction on the test dataset and see whether the accuracy on the test dataset also increases.
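The accuracy figure can be checked by hand from the confusion matrix. The following is a minimal sketch that re-enters the four counts printed above; the names `cm`, `accuracy`, and `recall_by_class` are introduced here only for illustration and are not part of the original code.

```r
# Rebuild the confusion matrix from the printed counts
# (rows = prediction, columns = reference; matrix() fills column by column)
cm <- matrix(c(517, 32, 100, 242), nrow = 2,
             dimnames = list(Prediction = c("0", "1"),
                             Reference  = c("0", "1")))

# Accuracy is the share of cases on the diagonal (correct predictions)
accuracy <- sum(diag(cm)) / sum(cm)
round(accuracy, 4)   # 0.8519

# Share of each true class that the model gets right
recall_by_class <- diag(cm) / colSums(cm)
recall_by_class
```

The per-class figures are worth a glance because overall accuracy can hide the fact that one class is predicted much better than the other.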
Prediction3 <- predict(model3, test, type = "class")
submit3 <- data.frame(PassengerId = test$PassengerId, Survived = Prediction3)
write.csv(submit3, file = "./data/Tree_model3.CSV", row.names = FALSE)
After submitting it to Kaggle, the feedback score was 0.77033. This is a big improvement on the test dataset. Before comparing the last two predictions, let us plot the tree to see its structure.
# plot our full house classifier
prp(model3, type = 0, extra = 1, under = TRUE)
# plot the same classifier with fancyRpartPlot
fancyRpartPlot(model3)
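Besides reading the plots, one can list which predictors a fitted tree actually split on. The sketch below is only an illustration: it uses the `kyphosis` data that ships with rpart as a stand-in (it is not the Titanic data), and `fit`, `split_vars`, and `used_vars` are hypothetical names. The same calls could be tried on model3.

```r
library(rpart)

# Stand-in data: kyphosis is bundled with rpart; it is NOT the Titanic data
fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis, method = "class")

# Predictors that appear in at least one split (leaf rows are marked "<leaf>")
split_vars <- as.character(fit$frame$var)
used_vars <- unique(split_vars[split_vars != "<leaf>"])
used_vars

# Importance scores stored by rpart in the fitted object
fit$variable.importance
```

A predictor supplied in the formula does not have to appear in the final tree, so this list can be shorter than the list of predictors passed to rpart.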
Again, let us look at the differences between the values predicted by the last two models on the test dataset.
# build a comparison data frame to record each prediction result
compare <- data.frame(test$PassengerId, predict2 = Prediction2, predict3 = Prediction3)
# find the rows where the two predictions differ
dif <- compare[compare$predict2 != compare$predict3, ]
# show dif
print.data.frame(dif, row.names = FALSE)
## test.PassengerId predict2 predict3
## 896 0 1
## 913 0 1
## 925 0 1
## 956 0 1
## 972 0 1
## 981 0 1
## 982 0 1
## 996 0 1
## 1009 0 1
## 1051 0 1
## 1053 0 1
## 1084 0 1
## 1086 0 1
## 1088 0 1
## 1093 0 1
## 1098 1 0
## 1106 1 0
## 1117 0 1
## 1136 0 1
## 1141 0 1
## 1155 0 1
## 1173 0 1
## 1175 0 1
## 1176 0 1
## 1183 0 1
## 1199 0 1
## 1205 1 0
## 1225 0 1
## 1231 0 1
## 1236 0 1
## 1239 1 0
## 1246 0 1
## 1284 0 1
## 1301 0 1
## 1309 0 1
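There are 35 differences: model3 changes 31 predictions from 0 to 1 and 4 predictions (passengers 1098, 1106, 1205, and 1239) from 1 to 0.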
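A cross-tabulation gives a compact summary of how two sets of predictions differ, without listing every row. The following is a minimal sketch on hypothetical toy vectors `p2` and `p3`; in this section they would stand for Prediction2 and Prediction3.

```r
# Hypothetical toy predictions (stand-ins for Prediction2 and Prediction3)
p2 <- factor(c(0, 0, 1, 1, 0, 1), levels = c(0, 1))
p3 <- factor(c(1, 0, 1, 0, 1, 1), levels = c(0, 1))

# Rows are the earlier prediction, columns the later one;
# off-diagonal cells count the predictions that changed
change_tab <- table(predict2 = p2, predict3 = p3)
change_tab

# Total number of changed predictions
n_changed <- sum(p2 != p3)
n_changed
```

The off-diagonal cells show the direction of the changes, which the row-by-row listing only reveals after counting.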