11.1 Tuning a Model's Predictors
From our Decision tree and Random forest model constructions, we have learned that the choice of predictors can affect a model's performance. The techniques commonly used to select the best predictors are correlation analysis and association analysis: we look for predictors that have little correlation among themselves and a strong association with the response variable, in other words, attributes that have prediction power. Some models can also measure the contribution of each predictor to the response variable. In a Random forest model, for example, the user can set the importance
parameter to TRUE
(see Chapter 9), so that the model records each predictor's importance,
which can then be used for post-model-construction analysis. Once we have the predictors' importance, tuning the number of predictors becomes simple: we can take a "bottom-up" or "top-down" approach and adjust the number of predictors until the best model accuracy has been achieved.
Let us use the Random forest model to demonstrate this process. Recall that we built three random forest models in Chapter 9, each with different predictors and a different accuracy (see Table 9.1). Let us ignore the overfitting issue for the moment and focus on the predictors' impact on the model's accuracy. Among the three models, both model1
and model3
have lower accuracy than model2
. model2
has more predictors than model1
and fewer predictors than model3
. This illustrates a principle we mentioned earlier: more predictors does not necessarily mean higher accuracy.
To fine-tune the predictors, let us use the "top-down" approach. We start with all the attributes and gradually reduce the number of predictors, removing the least important attribute each time, until only one attribute remains. We can then compare the models' accuracy and select the model with the highest accuracy.
# load necessary library
library(randomForest)
library(plyr)
library(caret)
# load our re-engineered data set and separate train and test datasets
RE_data <- read.csv("./data/RE_data.csv", header = TRUE)
train <- RE_data[1:891, ]
test <- RE_data[892:1309, ]
# Train a Random Forest with the default parameters using full attributes
# Survived is our response variable; the rest can be predictors, except PassengerId.
rf.train <- subset(train, select = -c(PassengerId, Survived))
rf.label <- as.factor(train$Survived)
# randomForest cannot handle factors with more than 53 levels, so convert Ticket to numeric
rf.train$Ticket <- as.numeric(train$Ticket)
set.seed(1234) # for reproducibility
# FT_rf.1 <- randomForest(x = rf.train, y = rf.label, importance = TRUE)
# save(FT_rf.1, file = "./data/FT_rf.1.rda")
load("./data/FT_rf.1.rda")
FT_rf.1
##
## Call:
## randomForest(x = rf.train, y = rf.label, importance = TRUE)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 4
##
## OOB estimate of error rate: 14.93%
## Confusion matrix:
## 0 1 class.error
## 0 491 58 0.1056466
## 1 75 267 0.2192982
# The FT_rf.1 model with the full set of predictors has an OOB error rate of 14.93%
# Rank the predictors by prediction power (column 3 of the importance matrix,
# the mean decrease in accuracy)
pre.or <- sort(FT_rf.1$importance[,3], decreasing = TRUE)
pre.or
## Title Sex Fare_pp Pclass Ticket_class Ticket
## 0.084673918 0.083622938 0.034991513 0.031993819 0.031729670 0.026882064
## Age Friend_size Deck Age_group Group_size Family_size
## 0.022106781 0.016718555 0.013445769 0.012993600 0.011544400 0.010971526
## HasCabinNum Embarked SibSp Parch
## 0.008285454 0.005295491 0.004567226 0.002776599
We have obtained the "full-house" model's accuracy of 85.07%, that is, 1 minus the OOB error rate
(14.93%).
We have also obtained the ranking of the predictors' prediction power, from the strongest to the weakest: "Sex, Title, Fare_pp, Ticket_class, Pclass, Ticket, Age, Friend_size, Deck, Age_group, Group_size, Family_size, HasCabinNum, SibSp, Embarked, Parch".
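The same ranking can also be inspected visually. The sketch below uses the randomForest package's varImpPlot() function, which plots the mean decrease in accuracy and the mean decrease in Gini for each predictor of the fitted FT_rf.1 model:
# Visual check of the predictors' importance recorded by FT_rf.1
varImpPlot(FT_rf.1, sort = TRUE, main = "Predictor importance (FT_rf.1)")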
We can now repeat the process by removing one attribute at a time from the end of the list above and training a new Random forest model, giving FT_rf.2
, FT_rf.3
, … up to FT_rf.16
. We can then compare the models' OOB error
or Accuracy
to find out which model has the highest accuracy.
We only show FT_rf.2
as an example here:
# FT_rf.2 as an example: remove the least important predictor, Parch
rf.train.2 <- subset(rf.train, select = -c(Parch))
set.seed(1234)
# FT_rf.2 <- randomForest(x = rf.train.2, y = rf.label, importance = TRUE)
# save(FT_rf.2, file = "./data/FT_rf.2.rda")
load("./data/FT_rf.2.rda")
FT_rf.2
##
## Call:
## randomForest(x = rf.train.2, y = rf.label, importance = TRUE)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 3
##
## OOB estimate of error rate: 15.38%
## Confusion matrix:
## 0 1 class.error
## 0 492 57 0.1038251
## 1 80 262 0.2339181
We have obtained the model FT_rf.2
, in which the attribute "Parch" has been removed, since it is the last attribute in the prediction power list and therefore has the least prediction power.
The FT_rf.2 model's accuracy is 84.62%, that is, 1 minus the OOB error rate
(15.38%).
We can carry on the process until only the last attribute, Sex, remains. We will then have a list of models, each with its estimated accuracy (we will not repeat the process here and leave it as an exercise).
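As a minimal sketch of how the whole loop could be automated (assuming the rf.train, rf.label and pre.or objects created above), one could proceed as follows; the exact error rates will vary with the random seed, and training 16 forests may take a few minutes:
# Automate the "top-down" search: drop the least important predictor each round
ranked <- names(pre.or)                  # predictors, from most to least important
oob <- numeric(length(ranked))
for (k in seq_along(ranked)) {
  keep <- ranked[1:k]                    # keep the k most important predictors
  set.seed(1234)
  fit <- randomForest(x = rf.train[, keep, drop = FALSE], y = rf.label)
  oob[k] <- fit$err.rate[nrow(fit$err.rate), "OOB"] * 100   # final OOB error in %
}
data.frame(Predictors_kept = seq_along(ranked), OOB_error = round(oob, 2))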
Once we have completed the process, the results can be listed and compared as in Table 11.1.
library(tidyr)
Model <- c("rf.1","rf.2","rf.3","rf.4","rf.5","rf.6","rf.7","rf.8","rf.9","rf.10","rf.11","rf.12","rf.13","rf.14","rf.15","rf.16")
Pre <- c("Sex", "Title", "Fare_pp", "Ticket_class", "Pclass", "Ticket", "Age", "Friend_size", "Deck", "Age_group", "Group_size", "Family_size", "HasCabinNum", "SibSp", "Embarked", "Parch")
# Produce each model's predictor list: rf.1 uses all 16 predictors,
# rf.2 drops the least important one, and so on down to rf.16, which keeps only Sex
Pred <- sapply(seq_along(Pre), function(i) paste(Pre[1:i], collapse = " "))
Pred <- rev(Pred)
Error <- c(15.49, 15.15, 14.93, 15.26, 14.7, 14.7, 14.03, 13.58, 14.48, 15.6, 16.27, 16.95, 17.51, 20.31, 20.76, 21.32)
Accuracy <- 100 - Error
df <- data.frame(Model, Pred, Accuracy)
knitr::kable(df, longtable = TRUE, booktabs = TRUE, digits = 2, col.names =c("Models", "Predictors", "Accuracy"),
             caption = 'Model Predictors Comparison'
)
Models | Predictors | Accuracy |
---|---|---|
rf.1 | Sex Title Fare_pp Ticket_class Pclass Ticket Age Friend_size Deck Age_group Group_size Family_size HasCabinNum SibSp Embarked Parch | 84.51 |
rf.2 | Sex Title Fare_pp Ticket_class Pclass Ticket Age Friend_size Deck Age_group Group_size Family_size HasCabinNum SibSp Embarked | 84.85 |
rf.3 | Sex Title Fare_pp Ticket_class Pclass Ticket Age Friend_size Deck Age_group Group_size Family_size HasCabinNum SibSp | 85.07 |
rf.4 | Sex Title Fare_pp Ticket_class Pclass Ticket Age Friend_size Deck Age_group Group_size Family_size HasCabinNum | 84.74 |
rf.5 | Sex Title Fare_pp Ticket_class Pclass Ticket Age Friend_size Deck Age_group Group_size Family_size | 85.30 |
rf.6 | Sex Title Fare_pp Ticket_class Pclass Ticket Age Friend_size Deck Age_group Group_size | 85.30 |
rf.7 | Sex Title Fare_pp Ticket_class Pclass Ticket Age Friend_size Deck Age_group | 85.97 |
rf.8 | Sex Title Fare_pp Ticket_class Pclass Ticket Age Friend_size Deck | 86.42 |
rf.9 | Sex Title Fare_pp Ticket_class Pclass Ticket Age Friend_size | 85.52 |
rf.10 | Sex Title Fare_pp Ticket_class Pclass Ticket Age | 84.40 |
rf.11 | Sex Title Fare_pp Ticket_class Pclass Ticket | 83.73 |
rf.12 | Sex Title Fare_pp Ticket_class Pclass | 83.05 |
rf.13 | Sex Title Fare_pp Ticket_class | 82.49 |
rf.14 | Sex Title Fare_pp | 79.69 |
rf.15 | Sex Title | 79.24 |
rf.16 | Sex | 78.68 |
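A quick plot of the accuracies against the number of predictors kept (using the Accuracy vector computed above) makes the peak easy to spot:
# Visualize accuracy against the number of predictors kept
# (rf.1 keeps all 16 predictors, rf.16 keeps only Sex)
plot(16:1, Accuracy, type = "b",
     xlab = "Number of predictors kept", ylab = "Accuracy (%)")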
From the table, we can see that the best model is FT_rf.8. Its accuracy reaches 86.42% and its predictors are:
# load the best model and record its predictors
# save(FT_rf.8, file = "./data/FT_rf.8.rda")
load("./data/FT_rf.8.rda")
Predictor <- c("Sex, Title, Fare_pp, Ticket_class, Pclass, Ticket, Age, Friend_size, Deck")
Predictor
## [1] "Sex, Title, Fare_pp, Ticket_class, Pclass, Ticket, Age, Friend_size, Deck"
Of course, you can also try other combinations of the predictors; the idea is the same.
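For instance, a minimal sketch of trying one hand-picked combination (the subset below is chosen purely for illustration) could be:
# Try one hand-picked subset of predictors (illustrative only)
alt.pred <- c("Sex", "Title", "Pclass", "Age", "Family_size")
set.seed(1234)
FT_rf.alt <- randomForest(x = rf.train[, alt.pred], y = rf.label)
FT_rf.alt   # compare its OOB error rate with the models in Table 11.1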
Some other models also support predictor fine-tuning. For example, with a Logistic Regression model fitted by glm
, one can analyse the predictors' prediction power with a stepwise procedure: a "backward" stepwise search compares the models' AIC1 to find the best model and its predictors.
The Akaike information criterion (AIC) is an estimator of out-of-sample prediction error and thereby the relative quality of models for a given dataset. Given a collection of models for the data, AIC estimates the quality of each model, relative to each of the other models. Thus, AIC provides a means for model selection.↩︎
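A minimal sketch of such a backward search, using base R's glm() and step() and assuming the predictors listed below exist in the train data frame with no missing values, could look like this:
# Backward stepwise selection by AIC with glm(); the predictor set below
# is an assumption chosen for illustration only
glm.full <- glm(Survived ~ Sex + Pclass + Fare_pp + Age + SibSp + Parch + Embarked,
                data = train, family = binomial)
glm.best <- step(glm.full, direction = "backward", trace = FALSE)
summary(glm.best)   # the predictors retained by the AIC search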