11.1 Tuning a Model's Predictors
From our Decision tree and Random forest model constructions, we have learned that the choice of predictors can affect a model's performance. The techniques commonly used to select the best predictors are correlation analysis and association analysis: we look for predictors that have little correlation among themselves and a strong association with the response variable, in other words, attributes that have prediction power. Some models can also measure the contribution of each predictor to the response variable. In a Random forest model, for example, the user can set the importance
parameter to TRUE
(see Chapter 9), so that the model records each predictor's importance,
which can then be used for post-model-construction analysis. Once we have the predictors' importance, tuning the number of predictors becomes simple: we can take a "bottom-up" or "top-down" approach and adjust the number of predictors until the best model accuracy has been achieved.
Let us use the Random forest model to demonstrate this process. Recall that we built three random forest models in Chapter 9, each with different predictors and a different accuracy (see Table 9.1). Let us ignore the overfitting issue for the moment and focus on the predictors' impact on the model's accuracy. Among the three models, both model1
and model3
have lower accuracy than model2
. model2
has more predictors than model1
and fewer predictors than model3
. This illustrates a principle we mentioned earlier: more predictors does not necessarily mean higher accuracy.
To fine-tune the predictors, let us use the "top-down" approach. We start with all the attributes and gradually reduce the number of predictors, removing the least important attribute each time, until only one attribute remains. We can then compare the models' accuracy and select the model with the highest accuracy.
# load necessary library
library(randomForest)
library(plyr)
library(caret)
# load our re-engineered data set and separate train and test datasets
RE_data <- read.csv("./data/RE_data.csv", header = TRUE)
train <- RE_data[1:891, ]
test <- RE_data[892:1309, ]
# Train a Random Forest with the default parameters using full attributes
# Survived is our response variable; the rest can be predictors, except PassengerId.
rf.train <- subset(train, select = -c(PassengerId, Survived))
rf.label <- as.factor(train$Survived)
# randomForest cannot handle factors with more than 53 levels, so convert Ticket to numeric
rf.train$Ticket <- as.numeric(train$Ticket)
set.seed(1234) # for reproducibility
# FT_rf.1 <- randomForest(x = rf.train, y = rf.label, importance = TRUE)
# save(FT_rf.1, file = "./data/FT_rf.1.rda")
load("./data/FT_rf.1.rda")
FT_rf.1
##
## Call:
## randomForest(x = rf.train, y = rf.label, importance = TRUE)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 4
##
## OOB estimate of error rate: 14.93%
## Confusion matrix:
## 0 1 class.error
## 0 491 58 0.1056466
## 1 75 267 0.2192982
# The FT_rf.1 model with the full set of predictors has an OOB error rate of 14.93%
# Rank the predictors by prediction power (column 3 of the importance matrix,
# the mean decrease in accuracy)
pre.or <- sort(FT_rf.1$importance[,3], decreasing = TRUE)
pre.or
## Title Sex Fare_pp Pclass Ticket_class Ticket
## 0.084673918 0.083622938 0.034991513 0.031993819 0.031729670 0.026882064
## Age Friend_size Deck Age_group Group_size Family_size
## 0.022106781 0.016718555 0.013445769 0.012993600 0.011544400 0.010971526
## HasCabinNum Embarked SibSp Parch
## 0.008285454 0.005295491 0.004567226 0.002776599
We have obtained the "full-house" model's accuracy of 85.07%, that is, 1 minus the OOB error rate
(14.93%).
We have also obtained the ranking of the predictors' prediction power, from the strongest to the weakest: "Sex, Title, Fare_pp, Ticket_class, Pclass, Ticket, Age, Friend_size, Deck, Age_group, Group_size, Family_size, HasCabinNum, SibSp, Embarked, Parch".
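The same ranking can also be inspected visually. The sketch below uses the randomForest package's varImpPlot() function, which plots the mean decrease in accuracy and the mean decrease in Gini for each predictor of the fitted FT_rf.1 model:
# Visual check of the predictors' importance recorded by FT_rf.1
varImpPlot(FT_rf.1, sort = TRUE, main = "Predictor importance (FT_rf.1)")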
We can now repeat the process by removing one attribute at a time from the end of the list above and training a new Random forest model, giving FT_rf.2
, FT_rf.3
, … up to FT_rf.16
. We can then compare the models' OOB error
or Accuracy
to find out which model has the highest accuracy.
We only show FT_rf.2
as an example here:
# FT_rf.2 as an example: remove the least important predictor, Parch
rf.train.2 <- subset(rf.train, select = -c(Parch))
set.seed(1234)
# FT_rf.2 <- randomForest(x = rf.train.2, y = rf.label, importance = TRUE)
# save(FT_rf.2, file = "./data/FT_rf.2.rda")
load("./data/FT_rf.2.rda")
FT_rf.2
##
## Call:
## randomForest(x = rf.train.2, y = rf.label, importance = TRUE)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 3
##
## OOB estimate of error rate: 15.38%
## Confusion matrix:
## 0 1 class.error
## 0 492 57 0.1038251
## 1 80 262 0.2339181
We have obtained the model FT_rf.2
, in which the attribute "Parch" has been removed, since it is the last attribute in the prediction power list and therefore has the least prediction power.
The FT_rf.2 model's accuracy is 84.62%, that is, 1 minus the OOB error rate
(15.38%).
We can carry on the process until only the last attribute, Sex, remains. We will then have a list of models, each with its estimated accuracy (we will not repeat the process here and leave it as an exercise).
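As a minimal sketch of how the whole loop could be automated (assuming the rf.train, rf.label and pre.or objects created above), one could proceed as follows; the exact error rates will vary with the random seed, and training 16 forests may take a few minutes:
# Automate the "top-down" search: drop the least important predictor each round
ranked <- names(pre.or)                  # predictors, from most to least important
oob <- numeric(length(ranked))
for (k in seq_along(ranked)) {
  keep <- ranked[1:k]                    # keep the k most important predictors
  set.seed(1234)
  fit <- randomForest(x = rf.train[, keep, drop = FALSE], y = rf.label)
  oob[k] <- fit$err.rate[nrow(fit$err.rate), "OOB"] * 100   # final OOB error in %
}
data.frame(Predictors_kept = seq_along(ranked), OOB_error = round(oob, 2))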
Once we have completed the process, the results can be listed and compared as in Table 11.1.
library(tidyr)
Model <- c("rf.1","rf.2","rf.3","rf.4","rf.5","rf.6","rf.7","rf.8","rf.9","rf.10","rf.11","rf.12","rf.13","rf.14","rf.15","rf.16")
Pre <- c("Sex", "Title", "Fare_pp", "Ticket_class", "Pclass", "Ticket", "Age", "Friend_size", "Deck", "Age_group", "Group_size", "Family_size", "HasCabinNum", "SibSp", "Embarked", "Parch")
# Produce each model's predictor list: rf.1 uses all 16 predictors,
# rf.2 drops the least important one, and so on down to rf.16, which keeps only Sex
Pred <- sapply(seq_along(Pre), function(i) paste(Pre[1:i], collapse = " "))
Pred <- rev(Pred)
Error <- c(15.49, 15.15, 14.93, 15.26, 14.7, 14.7, 14.03, 13.58, 14.48, 15.6, 16.27, 16.95, 17.51, 20.31, 20.76, 21.32)
Accuracy <- 100 - Error
df <- data.frame(Model, Pred, Accuracy)
knitr::kable(df, longtable = TRUE, booktabs = TRUE, digits = 2, col.names =c("Models", "Predictors", "Accuracy"),
             caption = 'Model Predictors Comparison'
)
Models | Predictors | Accuracy |
---|---|---|
rf.1 | Sex Title Fare_pp Ticket_class Pclass Ticket Age Friend_size Deck Age_group Group_size Family_size HasCabinNum SibSp Embarked Parch | 84.51 |
rf.2 | Sex Title Fare_pp Ticket_class Pclass Ticket Age Friend_size Deck Age_group Group_size Family_size HasCabinNum SibSp Embarked | 84.85 |
rf.3 | Sex Title Fare_pp Ticket_class Pclass Ticket Age Friend_size Deck Age_group Group_size Family_size HasCabinNum SibSp | 85.07 |
rf.4 | Sex Title Fare_pp Ticket_class Pclass Ticket Age Friend_size Deck Age_group Group_size Family_size HasCabinNum | 84.74 |
rf.5 | Sex Title Fare_pp Ticket_class Pclass Ticket Age Friend_size Deck Age_group Group_size Family_size | 85.30 |
rf.6 | Sex Title Fare_pp Ticket_class Pclass Ticket Age Friend_size Deck Age_group Group_size | 85.30 |
rf.7 | Sex Title Fare_pp Ticket_class Pclass Ticket Age Friend_size Deck Age_group | 85.97 |
rf.8 | Sex Title Fare_pp Ticket_class Pclass Ticket Age Friend_size Deck | 86.42 |
rf.9 | Sex Title Fare_pp Ticket_class Pclass Ticket Age Friend_size | 85.52 |
rf.10 | Sex Title Fare_pp Ticket_class Pclass Ticket Age | 84.40 |
rf.11 | Sex Title Fare_pp Ticket_class Pclass Ticket | 83.73 |
rf.12 | Sex Title Fare_pp Ticket_class Pclass | 83.05 |
rf.13 | Sex Title Fare_pp Ticket_class | 82.49 |
rf.14 | Sex Title Fare_pp | 79.69 |
rf.15 | Sex Title | 79.24 |
rf.16 | Sex | 78.68 |
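A quick plot of the accuracies against the number of predictors kept (using the Accuracy vector computed above) makes the peak easy to spot:
# Visualize accuracy against the number of predictors kept
# (rf.1 keeps all 16 predictors, rf.16 keeps only Sex)
plot(16:1, Accuracy, type = "b",
     xlab = "Number of predictors kept", ylab = "Accuracy (%)")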
From the table, we can see that the best model is FT_rf.8. Its accuracy reaches 86.42% and its predictors are:
# load the best model and record its predictors
# save(FT_rf.8, file = "./data/FT_rf.8.rda")
load("./data/FT_rf.8.rda")
Predictor <- c("Sex, Title, Fare_pp, Ticket_class, Pclass, Ticket, Age, Friend_size, Deck")
Predictor
## [1] "Sex, Title, Fare_pp, Ticket_class, Pclass, Ticket, Age, Friend_size, Deck"
Of course, you can also try other combinations of the predictors; the idea is the same.
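For instance, a minimal sketch of trying one hand-picked combination (the subset below is chosen purely for illustration) could be:
# Try one hand-picked subset of predictors (illustrative only)
alt.pred <- c("Sex", "Title", "Pclass", "Age", "Family_size")
set.seed(1234)
FT_rf.alt <- randomForest(x = rf.train[, alt.pred], y = rf.label)
FT_rf.alt   # compare its OOB error rate with the models in Table 11.1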
Some other models also support predictor fine-tuning. For example, with a Logistic Regression model fitted by glm
, one can analyse the predictors' prediction power with a stepwise procedure: a "backward" stepwise search compares the models' AIC1 to find the best model and its predictors.
The Akaike information criterion (AIC) is an estimator of out-of-sample prediction error and thereby the relative quality of models for a given dataset. Given a collection of models for the data, AIC estimates the quality of each model, relative to each of the other models. Thus, AIC provides a means for model selection.↩︎
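A minimal sketch of such a backward search, using base R's glm() and step() and assuming the predictors listed below exist in the train data frame with no missing values, could look like this:
# Backward stepwise selection by AIC with glm(); the predictor set below
# is an assumption chosen for illustration only
glm.full <- glm(Survived ~ Sex + Pclass + Fare_pp + Age + SibSp + Parch + Embarked,
                data = train, family = binomial)
glm.best <- step(glm.full, direction = "backward", trace = FALSE)
summary(glm.best)   # the predictors retained by the AIC search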