12.4 Further Analysis

The previous section described how the model RF_model2 was constructed and identified its limitations:

  1. The model RF_model2 is not the best one and it overfits seriously. Its performance on the test dataset should be improved.

  2. Any further work should start by re-engineering Title and Sex, since they are the most important predictors in RF_model2.

This section demonstrates how to improve the performance of a constructed model, again using RF_model2 as the example. A good place to start is where the model gets things wrong. Spotting those places from numbers alone is difficult, so graphs are a useful aid. However, RF_model2 consists of 500 decision trees, and visualizing 500 trees is impractical.

Recall that we have a decision tree model, model3, which uses the same predictors as RF_model2. We can use this decision tree (see Figure 12.4) to find the places where things may go wrong.

Figure 12.4: The simple decision tree of RF_model2

From Figure 12.4, we can see that the main place where we got things wrong is the left branch of the first test condition, where 81 adult male passengers (“Title = Mr”) who actually survived are wrongly predicted as perished. This is consistent with our model, in which the error rate for passengers who survived is higher than the error rate for passengers who perished. So re-engineering the attribute “Title” is a good place to start. This also agrees with the suggestion from the previous section, which was based on the importance order of the predictors used in RF_model2.

We now demonstrate how to re-engineer the Title attribute further. The values of Title in the combined (train and test) dataset of 1,309 records are as follows:

## 
## Master   Miss     Mr    Mrs  Other 
##     61    260    757    197     34

We can see that 34 records have the value Other in the Title attribute. This is a good place for further refinement.

Let us go back to the raw dataset and extract the title from the Name attribute.

## 
##         Capt          Col          Don         Dona           Dr     Jonkheer 
##            1            4            1            1            8            1 
##         Lady        Major       Master         Miss         Mlle          Mme 
##            1            2           61          260            2            1 
##           Mr          Mrs           Ms          Rev          Sir the Countess 
##          757          197            2            8            1            1
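A table like the one above could be produced along the following lines. This is only a minimal sketch in base R: the helper `extract_title` and the small sample of names are illustrative assumptions, and it relies on the raw names following the "Surname, Title. Given names" layout.

```r
# A small illustrative sample of names in the "Surname, Title. Given names"
# layout (an assumption for this sketch, not the full dataset).
names_raw <- c(
  "Braund, Mr. Owen Harris",
  "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
  "Heikkinen, Miss. Laina",
  "Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards)",
  "Palsson, Master. Gosta Leonard"
)

# Hypothetical helper: drop everything up to and including the comma, then
# everything from the first period onwards; what remains is the title.
extract_title <- function(name) {
  title <- sub("^[^,]*,\\s*", "", name)
  title <- sub("\\..*$", "", title)
  trimws(title)
}

raw_title <- extract_title(names_raw)
table(raw_title)
```

Tabulating the extracted values with `table()` gives one count per distinct raw title, which makes rare titles easy to spot.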

It becomes obvious that lumping these titles together as Other is an oversimplification. We can extract more information from them, such as gender and age, and that information is useful for prediction. However, it is also inappropriate to keep every title as a separate category: some of them have very few instances, and using them could lead to overfitting.

We therefore need to bin (bucket) them into more appropriate categories. We can do so using knowledge of nobility, locality (country of origin), and period (the beginning of the 20th century). For example, “Dona” and “the Countess” are female noble titles equivalent to “Lady”; “Ms” and “Mlle” are essentially the same as “Miss”; “Mme” is the French abbreviation of “Madame”, so it can be categorized as “Mrs”; “Jonkheer” is a Dutch honorific of nobility and “Don” is a Spanish honorific, so both can be categorized as “Sir”; “Col”, “Capt”, and “Major” are military ranks and can be replaced with the more general title “Officer”. With all of these, we can reduce the number of Title categories.

## 
##      Dr    Lady  Master    Miss      Mr     Mrs Officer     Rev     Sir 
##       8       3      61     264     757     198       7       8       3
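The merging step can be sketched with a lookup vector. The raw counts below are copied from the earlier table; the names `title_map` and `merge_titles` are hypothetical and chosen only for this illustration, not taken from the original code.

```r
# Raw title counts copied from the table of raw titles above.
raw_counts <- c(Capt = 1, Col = 4, Don = 1, Dona = 1, Dr = 8, Jonkheer = 1,
                Lady = 1, Major = 2, Master = 61, Miss = 260, Mlle = 2,
                Mme = 1, Mr = 757, Mrs = 197, Ms = 2, Rev = 8, Sir = 1,
                "the Countess" = 1)
raw_title <- rep(names(raw_counts), times = raw_counts)

# Lookup from raw title to merged title; titles not listed keep their value.
title_map <- c(Dona = "Lady", "the Countess" = "Lady",
               Ms = "Miss", Mlle = "Miss", Mme = "Mrs",
               Jonkheer = "Sir", Don = "Sir",
               Col = "Officer", Capt = "Officer", Major = "Officer")

# Hypothetical helper: apply the lookup and fall back to the original title.
merge_titles <- function(title, map) {
  merged <- unname(map[title])   # NA where the title is not in the map
  ifelse(is.na(merged), title, merged)
}

new_title <- merge_titles(raw_title, title_map)
table(new_title)
```

Keeping the mapping in one named vector makes each grouping decision explicit and easy to revise if a title is later judged to belong elsewhere.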

We can convert Title into a factor and plot its relationship with Survived.

Figure 12.5: Survival Rates for new.Title
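The quantity behind such a plot is the share of survivors within each title. A minimal sketch, using a hypothetical eight-row toy data frame rather than the real training rows:

```r
# Hypothetical toy data; the column names follow the text, the values do not
# come from the real dataset.
toy <- data.frame(
  new.Title = c("Mr", "Mr", "Mr", "Mrs", "Mrs", "Miss", "Miss", "Master"),
  Survived  = c(0, 0, 1, 1, 1, 1, 0, 1)
)
toy$new.Title <- factor(toy$new.Title)

# Mean of a 0/1 outcome within each level is the survival rate for that title.
survival_rate <- tapply(toy$Survived, toy$new.Title, mean)
survival_rate
```

Passing `survival_rate` to `barplot()` would draw a simple bar chart of these rates.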

We could stop here, since we have replaced the Title value Other with categories that are semantically more precise. However, some categories still have very few instances. We should merge small categories such as “Lady” and “Sir” into larger ones while keeping the survival ratio as close as possible. We can merge “Lady” into “Mrs”, and “Sir” and “Rev” into “Mr”. Gender-neutral titles such as “Dr” and “Officer” can be assigned to “Mr” or “Mrs” according to sex.

## 
##      Dr    Lady  Master    Miss      Mr     Mrs Officer     Rev     Sir 
##       0       0      61     264     782     202       0       0       0
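The regrouping rules can be sketched as follows. The toy rows and the helper name `regroup_title` are assumptions made for illustration; only the rules themselves come from the text.

```r
# Hypothetical toy rows; column names follow the text.
toy <- data.frame(
  new.Title = c("Dr", "Dr", "Officer", "Lady", "Sir", "Rev", "Mr", "Miss"),
  Sex       = c("male", "female", "male", "female", "male", "male", "male", "female"),
  stringsAsFactors = FALSE
)

# Hypothetical helper implementing the rules described above.
regroup_title <- function(title, sex) {
  out <- title
  out[title == "Lady"] <- "Mrs"
  out[title %in% c("Sir", "Rev")] <- "Mr"
  # Gender-neutral titles are assigned according to sex.
  neutral <- title %in% c("Dr", "Officer")
  out[neutral] <- ifelse(sex[neutral] == "male", "Mr", "Mrs")
  out
}

toy$new.Title <- regroup_title(toy$new.Title, toy$Sex)
toy
```

If the column is a factor, as in the output above, the emptied levels remain with a count of 0 until they are dropped, for example with `droplevels()`.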

We can check the title against gender to see whether any mistakes were made.
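One way to make this check is a cross-tabulation of title against sex. The sketch below uses hypothetical toy rows; a title that appears under both sexes would signal a mistake in the regrouping.

```r
# Hypothetical toy rows after regrouping.
toy <- data.frame(
  new.Title = c("Mr", "Mr", "Mrs", "Miss", "Master", "Mrs"),
  Sex       = c("male", "male", "female", "female", "male", "female")
)

# Rows are titles, columns are sexes.
title_by_sex <- table(toy$new.Title, toy$Sex)
title_by_sex

# A title is suspicious if it has non-zero counts under more than one sex.
mixed <- rownames(title_by_sex)[rowSums(title_by_sex > 0) > 1]
mixed
```

An empty `mixed` vector means every title maps to exactly one sex in the data checked.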

Figure 12.6: Survival Rates for re-categorised new.Title

After re-categorizing the titles with small counts, we have only 4 title categories. From the plot, we can see that their survival ratios match the survival ratios of the attribute Sex.

We can use this re-engineered title attribute, “New_Title”, to rebuild the RF models, and the overall accuracy of the new models should increase. The following output shows an example. The new model has indeed increased the overall prediction accuracy by 0.45%. That is not a lot, but it supports the point that feature re-engineering is one way to improve a model’s performance.

## 
## Call:
##  randomForest(formula = as.factor(Survived) ~ Sex + Fare_pp +      Pclass + New_Title + Age_group + Group_size + Ticket_class +      Embarked, data = RE_data[1:891, ], importance = TRUE) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 2
## 
##         OOB estimate of  error rate: 16.05%
## Confusion matrix:
##     0   1 class.error
## 0 504  45  0.08196721
## 1  98 244  0.28654971
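The error rates in this output can be reproduced from the confusion matrix by simple arithmetic. A short sketch, with the matrix values copied from the output above (rows are actual classes, columns are predicted classes):

```r
# Confusion matrix copied from the randomForest output above.
conf <- matrix(c(504,  45,
                  98, 244),
               nrow = 2, byrow = TRUE,
               dimnames = list(actual = c("0", "1"), predicted = c("0", "1")))

# Overall OOB error: share of off-diagonal (misclassified) cases.
oob_error <- 1 - sum(diag(conf)) / sum(conf)

# Per-class error: share of each actual class that was misclassified.
class_error <- 1 - diag(conf) / rowSums(conf)

round(100 * oob_error, 2)   # 16.05, matching the reported OOB estimate
round(class_error, 4)       # 0.0820 for class 0, 0.2865 for class 1
```

The per-class figures show that survivors (class 1) are still misclassified far more often than passengers who perished (class 0).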

We can do the same with many other attributes, or with combinations of multiple attributes.