9.1 Steps to Build a Random Forest
- Randomly select \(k\) attributes from the total \(m\) attributes, where \(k < m\); the default value of \(k\) is generally \(\sqrt{m}\).
- Among the \(k\) attributes, find the attribute that provides the best split point and use it as the node \(d\).
- Split the node into daughter nodes using the best split method (see Section @ref(best_split)); by default, R's `randomForest` uses Gini impurity to evaluate splits.
- Repeat the previous steps until an individual decision tree is fully grown.
- Build a forest by repeating all of the above steps \(n\) times to create \(n\) trees (a sketch of this loop follows the list).
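The following is a minimal sketch of this building loop, assuming the `rpart` package is available for fitting the individual trees. Note one deliberate simplification: the \(k\) attributes are sampled once per tree, whereas a full random forest implementation re-samples them at every split.

```r
library(rpart)

# Minimal sketch of the forest-building loop (simplified: k attributes
# are sampled per tree, not per split as in a true random forest)
build_forest <- function(data, target, n_trees = 100) {
  predictors <- setdiff(names(data), target)
  m <- length(predictors)          # total number of attributes
  k <- floor(sqrt(m))              # default k = sqrt(m)
  lapply(seq_len(n_trees), function(i) {
    boot  <- data[sample(nrow(data), replace = TRUE), ]  # bootstrap sample
    feats <- sample(predictors, k)                       # random k of m attributes
    rpart(reformulate(feats, target), data = boot, method = "class")
  })
}
```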
Once the forest of trees has been created, predictions can be made using the following steps:
- Run the test data through the rules of each decision tree to predict the outcome.
- Store each predicted target outcome.
- Calculate the votes for each of the predicted targets
- Output the most highly voted predicted target as the final prediction (a voting sketch follows the list).
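Continuing the sketch above, the voting step might look like the following; `predict_forest` is a hypothetical helper that operates on the list of `rpart` trees returned by `build_forest`.

```r
# Majority vote over the trees returned by build_forest()
predict_forest <- function(forest, newdata) {
  # one column of class predictions per tree
  votes <- sapply(forest, function(tree)
    as.character(predict(tree, newdata, type = "class")))
  # for each row (observation), output the most frequent predicted class
  apply(votes, 1, function(v) names(which.max(table(v))))
}
```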
Similar to the decision tree model, the random forest has many ready-made implementations, so you do not need to write the model-construction code yourself. In R, you can use a package called `randomForest`. There are a number of terms used in random forest algorithms that need to be understood, such as:
Variance. A measure of how much the model changes when the training data changes. The parameters most commonly tuned to control it are `ntree` (the number of trees) and `mtry` (the number of attributes sampled at each split).
Bagging. A variance-reduction method that trains each model on a random sub-sample (a bootstrap sample) of the training data.
Out-of-bag (OOB) error estimate. The random forest classifier is trained using bootstrap aggregation, where each new tree is fit to a bootstrap sample of the training dataset. For each training observation, the OOB error averages the prediction errors from only those trees whose bootstrap samples did not contain that observation. This enables the random forest classifier to be validated during training, without a separate hold-out set.
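As an illustration of these terms, the snippet below fits a forest on R's built-in `iris` data (used here purely as a stand-in dataset) and reads off the OOB error estimate; `ntree` and `mtry` are the actual `randomForest` arguments.

```r
library(randomForest)

set.seed(42)
fit <- randomForest(Species ~ ., data = iris,
                    ntree = 500,   # number of trees in the forest
                    mtry  = 2)     # attributes sampled at each split

# err.rate holds one row per tree; the "OOB" column is the running
# out-of-bag error, so the last row gives the final OOB estimate
fit$err.rate[fit$ntree, "OOB"]
print(fit)  # the printed confusion matrix is also computed from OOB predictions
```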
Let’s now look at how we can implement the random forest algorithm for our Titanic prediction.
R provides the `randomForest` package; you can check the package documentation for full usage details. We will start with a direct function call using its default settings, and we may change those settings later. We will also use the original attributes first and then the re-engineered attributes to see whether we can improve the model.
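A minimal sketch of that first call might look like the following, assuming a data frame `train` holding the Titanic training set with `Survived` stored as a factor; the attribute names shown are the usual Titanic columns and should be adjusted to match your data.

```r
library(randomForest)

set.seed(1234)
# Default settings: ntree = 500 and, for classification, mtry = floor(sqrt(m))
titanic_rf <- randomForest(Survived ~ Pclass + Sex + Age + SibSp + Parch + Fare,
                           data = train,
                           na.action = na.omit,  # drop rows with missing values
                           importance = TRUE)    # record attribute importance
print(titanic_rf)  # reports the OOB error estimate and confusion matrix
```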