12.4 Further Analysis
The previous section reported on the constructed model (e.g. RF_model2) in terms of how it came about and what its limitations were:
The model RF_model2 is not the best one and it is seriously overfitting. Its performance on the test dataset should be improved. If any further work is planned, it should start by considering re-engineering Title and Sex, since they are the most important predictors in model RF_model2.
This section demonstrates how to improve a constructed model’s performance, again using RF_model2 as an example. A good place to start is where the model gets things wrong. Errors are difficult to spot from numbers alone, so a good technique is to use graphs. However, RF_model2 has 500 decision trees, and it is difficult to visualize 500 trees.
Recall that we have a decision tree model, model3, which has the same predictors as RF_model2. We can use this decision tree (see Figure 12.4) to find the place where things may go wrong.
From Figure 12.4, we can see that the main place where we got things wrong is the left branch of the first test condition, where 81 of the adult male passengers (“Title = Mr”) are wrongly predicted as survived. This is consistent with our model, in which the error rate for predicting survival is higher than the error rate for predicting that a passenger perished. So re-engineering the attribute “Title” is a good place to start. This also coincides with the suggestion from the previous section, which was based on the importance order of the predictors used in RF_model2.
Now we will demonstrate how to further re-engineer the Title attribute. The values of Title in the train dataset are as follows:
##
## Master Miss Mr Mrs Other
## 61 260 757 197 34
We can see that 34 records in the training dataset have the value Other in the Title attribute. This is a good place for further refinement.
Let us go back to the raw dataset and extract the title from the name attribute.
##
##         Capt          Col          Don         Dona           Dr     Jonkheer
##            1            4            1            1            8            1
##         Lady        Major       Master         Miss         Mlle          Mme
##            1            2           61          260            2            1
##           Mr          Mrs           Ms          Rev          Sir the Countess
##          757          197            2            8            1            1
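The extraction step can be sketched in base R as follows. This is a minimal sketch, assuming each name follows a "Surname, Title. Given names" layout; the sample names and the helper `extract_title` are hypothetical and are not the code used to produce the table above.

```r
# Hypothetical sample names in the "Surname, Title. Given names" layout
# (illustrative values only, not taken from the chapter's dataset).
passenger_names <- c(
  "Smith, Mr. John",
  "Jones, Mrs. Anna (Mary Brown)",
  "Brown, Miss. Ellen",
  "Black, the Countess. of (Jane Grey)",
  "White, Master. Tom"
)

# extract_title is a hypothetical helper: it removes everything up to and
# including the comma (plus following spaces), then everything from the
# first period onwards, leaving only the title.
extract_title <- function(x) {
  no_surname <- sub("^[^,]*, *", "", x)
  sub("\\..*$", "", no_surname)
}

titles <- extract_title(passenger_names)
print(table(titles))
```

Applying `table()` to the extracted titles gives a frequency table of the same kind as the one shown above.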
It becomes obvious that grouping these values of Title under Other is too coarse. We can extract more information from them, such as gender and age, and that information is useful for the prediction. It is also inappropriate to keep them all as separate categories: some of them have very few instances, and using them could lead to overfitting of the model.
We therefore need to bin (or bucket) them into more appropriate categories. We can do so using knowledge of nobility, locality (country of origin), and the period (the beginning of the 20th century). For example, “Dona” and “the Countess” are female nobility titles equivalent to “Lady”; “Ms” and “Mlle” are essentially the same as “Miss”; “Mme” is the French abbreviation of “Madame”, so it can be categorized as “Mrs”; “Jonkheer” is a Dutch honorific of nobility and “Don” is a Spanish honorific for men, so they can be categorized as “Sir”; “Col”, “Capt”, and “Major” are military ranks and can be replaced with a more general title, “Officer”. With all of these, we can reduce the number of title categories.
##
## Dr Lady Master Miss Mr Mrs Officer Rev Sir
## 8 3 61 264 757 198 7 8 3
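The binning described above can be sketched with a named lookup vector. This is only an illustration: the title vector is rebuilt from the raw-title counts printed earlier in this section rather than read from the dataset, and the names `lookup` and `recode_title` are hypothetical.

```r
# Raw title counts copied from the frequency table of raw titles above.
raw_counts <- c(Capt = 1, Col = 4, Don = 1, Dona = 1, Dr = 8, Jonkheer = 1,
                Lady = 1, Major = 2, Master = 61, Miss = 260, Mlle = 2,
                Mme = 1, Mr = 757, Mrs = 197, Ms = 2, Rev = 8, Sir = 1,
                "the Countess" = 1)

# Rebuild a title vector with one element per record.
Title <- rep(names(raw_counts), times = raw_counts)

# Lookup: raw title -> broader category. Titles not listed are kept as-is.
lookup <- c(Dona = "Lady", "the Countess" = "Lady",
            Ms = "Miss", Mlle = "Miss",
            Mme = "Mrs",
            Jonkheer = "Sir", Don = "Sir",
            Col = "Officer", Capt = "Officer", Major = "Officer")

# recode_title is a hypothetical helper that replaces only the mapped titles.
recode_title <- function(x, map) {
  hit <- x %in% names(map)
  x[hit] <- unname(map[x[hit]])
  x
}

Title_binned <- recode_title(Title, lookup)
print(table(Title_binned))
```

The resulting counts agree with the reduced table above (for example, Miss becomes 260 + 2 + 2 = 264 and Officer becomes 4 + 1 + 2 = 7).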
We can convert Title into a factor and plot its relation with the value of Survived.
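As a rough sketch of how such a relation could be tabulated before plotting, the following computes the survival proportion for each title. The data frame `toy` is made up for illustration and does not reproduce the chapter's figures.

```r
# Made-up toy records for illustration only (not the chapter's data).
toy <- data.frame(
  Title    = c("Mr", "Mr", "Mr", "Mrs", "Mrs", "Miss", "Miss", "Master"),
  Survived = c(0, 0, 1, 1, 1, 1, 0, 1)
)

# Convert Title into a factor, as described in the text.
toy$Title <- factor(toy$Title)

# Cross-tabulate title against survival, then turn each row into proportions.
counts <- table(toy$Title, toy$Survived)
ratios <- prop.table(counts, margin = 1)
print(round(ratios, 2))

# A stacked bar chart of the same proportions could be drawn with:
# barplot(t(ratios), legend.text = TRUE)
```

Each row of `ratios` sums to 1, so the column labelled 1 is the survival ratio of that title.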
We could stop here, since we have replaced the Title value Other with more precise categories in terms of semantic meaning. However, we notice that some values still have very small counts. We should merge the categories with small counts, such as “Lady” and “Sir”, into categories with larger counts while keeping the survival ratio as close as possible. We can categorize “Lady” as “Mrs”, and “Sir” and “Rev” as “Mr”. Gender-neutral titles such as “Dr” and “Officer” can be categorized as “Mr” or “Mrs” according to sex.
##
## Dr Lady Master Miss Mr Mrs Officer Rev Sir
## 0 0 61 264 782 202 0 0 0
We can check the title against gender to see if any mistakes were made.
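The merge and the check against gender can be sketched as follows. The data frame `toy` is a made-up example with one record per title (two for “Dr”), so only the logic, not the counts, corresponds to the chapter's data.

```r
# Made-up toy records for illustration only (not the chapter's data).
toy <- data.frame(
  Title = c("Mr", "Mrs", "Miss", "Master", "Lady", "Sir", "Rev",
            "Dr", "Dr", "Officer"),
  Sex   = c("male", "female", "female", "male", "female", "male", "male",
            "male", "female", "male"),
  stringsAsFactors = FALSE
)

New_Title <- toy$Title

# Merge small categories into the larger ones named in the text.
New_Title[New_Title == "Lady"] <- "Mrs"
New_Title[New_Title %in% c("Sir", "Rev")] <- "Mr"

# Gender-neutral titles are assigned according to sex.
neutral <- New_Title %in% c("Dr", "Officer")
New_Title[neutral] <- ifelse(toy$Sex[neutral] == "male", "Mr", "Mrs")
toy$New_Title <- New_Title

# Sanity check: no "Mr" or "Master" should be female, no "Mrs" or "Miss" male.
check <- table(toy$New_Title, toy$Sex)
print(check)
```

Any non-zero count in an unexpected cell of `check` (for example a male “Mrs”) would point to a mistake in the re-categorization.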
After re-categorizing the titles with small counts, we have only 4 categories of titles. From the plot, we can see that their survival ratios match the survival ratios of the attribute Sex.
We can use this re-engineered title attribute, “New_Title”, to re-build the RF models. The overall accuracy of the new models should increase. The following is an example showing that. The new model has indeed increased the overall prediction accuracy by 0.45%. It is not a lot, but it proves the point that re-engineering features is one way to improve a model’s performance.
##
## Call:
## randomForest(formula = as.factor(Survived) ~ Sex + Fare_pp + Pclass + New_Title + Age_group + Group_size + Ticket_class + Embarked, data = RE_data[1:891, ], importance = TRUE)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 2
##
## OOB estimate of error rate: 16.05%
## Confusion matrix:
## 0 1 class.error
## 0 504 45 0.08196721
## 1 98 244 0.28654971
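To see how the reported error rates follow from the confusion matrix, the arithmetic can be reproduced by hand. This sketch copies the four counts from the output above and assumes that rows are the actual classes and columns the predicted classes.

```r
# Counts copied from the confusion matrix above.
# Assumption: rows are actual classes, columns are predicted classes.
conf <- matrix(c(504,  45,
                  98, 244),
               nrow = 2, byrow = TRUE,
               dimnames = list(actual = c("0", "1"),
                               predicted = c("0", "1")))

# Per-class error: share of each actual class that was misclassified.
class_error <- 1 - diag(conf) / rowSums(conf)

# Overall OOB error: share of all records that were misclassified.
oob_error <- 1 - sum(diag(conf)) / sum(conf)

print(round(class_error, 8))
print(round(100 * oob_error, 2))
```

The overall error is (45 + 98) / 891, which is about 16.05%, and the class error for survivors (98 / 342) is more than three times that for passengers who perished (45 / 549).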
We can further do the same with many other attributes or a combination of multiple attributes.