7.3 Attributes Correlation Analysis

We have re-engineered the Titanic dataset, so instead of using the original dataset, let us examine the correlations among the attributes of the re-engineered one.

## Rows: 1,309
## Columns: 18
## $ PassengerId  <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,...
## $ Survived     <int> 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, ...
## $ Pclass       <int> 3, 1, 3, 1, 3, 3, 1, 3, 3, 2, 3, 1, 3, 3, 3, 2, 3, 2, ...
## $ Sex          <fct> male, female, female, female, male, male, male, male, ...
## $ Age          <dbl> 22.00000, 38.00000, 26.00000, 35.00000, 35.00000, 27.4...
## $ SibSp        <int> 1, 1, 0, 1, 0, 0, 0, 3, 0, 1, 1, 0, 0, 1, 0, 0, 4, 0, ...
## $ Parch        <int> 0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 1, 0, 0, 5, 0, 0, 1, 0, ...
## $ Ticket       <fct> A/5 21171, PC 17599, STON/O2. 3101282, 113803, 373450,...
## $ Embarked     <fct> S, C, S, S, S, Q, S, S, S, C, S, S, S, S, S, S, Q, S, ...
## $ HasCabinNum  <fct> No, Yes, No, Yes, No, No, Yes, No, No, No, Yes, Yes, N...
## $ Friend_size  <int> 1, 2, 1, 2, 1, 1, 2, 5, 3, 2, 3, 1, 1, 7, 1, 1, 6, 1, ...
## $ Fare_pp      <dbl> 7.250000, 35.641650, 7.925000, 26.550000, 8.050000, 8....
## $ Title        <fct> Mr, Mrs, Miss, Mrs, Mr, Mr, Mr, Master, Mrs, Mrs, Miss...
## $ Deck         <fct> U, C, U, C, U, U, E, U, U, U, G, C, U, U, U, U, U, U, ...
## $ Ticket_class <fct> A, P, S, 1, 3, 3, 1, 3, 3, 2, P, 1, A, 3, 3, 2, 3, 2, ...
## $ Family_size  <int> 2, 2, 1, 2, 1, 1, 1, 5, 3, 2, 3, 1, 1, 7, 1, 1, 6, 1, ...
## $ Group_size   <int> 2, 2, 1, 2, 1, 1, 2, 5, 3, 2, 3, 1, 1, 7, 1, 1, 6, 1, ...
## $ Age_group    <fct> 20-29, 30-39, 20-29, 30-39, 30-39, 20-29, 50-59, 0-9, ...
##   PassengerId      Survived          Pclass          Sex           Age       
##  Min.   :   1   Min.   :0.0000   Min.   :1.000   female:466   Min.   : 0.17  
##  1st Qu.: 328   1st Qu.:0.0000   1st Qu.:2.000   male  :843   1st Qu.:22.00  
##  Median : 655   Median :0.0000   Median :3.000                Median :27.43  
##  Mean   : 655   Mean   :0.3838   Mean   :2.295                Mean   :29.63  
##  3rd Qu.: 982   3rd Qu.:1.0000   3rd Qu.:3.000                3rd Qu.:37.00  
##  Max.   :1309   Max.   :1.0000   Max.   :3.000                Max.   :80.00  
##                 NA's   :418                                                  
##      SibSp            Parch            Ticket     Embarked HasCabinNum
##  Min.   :0.0000   Min.   :0.000   CA. 2343:  11   C:272    No :1014   
##  1st Qu.:0.0000   1st Qu.:0.000   1601    :   8   Q:123    Yes: 295   
##  Median :0.0000   Median :0.000   CA 2144 :   8   S:914               
##  Mean   :0.4989   Mean   :0.385   3101295 :   7                       
##  3rd Qu.:1.0000   3rd Qu.:0.000   347077  :   7                       
##  Max.   :8.0000   Max.   :9.000   347082  :   7                       
##                                   (Other) :1261                       
##   Friend_size      Fare_pp           Title          Deck       Ticket_class
##  Min.   : 1.0   Min.   :  0.000   Master: 61   U      :1014   3      :429  
##  1st Qu.: 1.0   1st Qu.:  7.579   Miss  :260   C      :  94   2      :278  
##  Median : 1.0   Median :  8.050   Mr    :757   B      :  65   1      :210  
##  Mean   : 2.1   Mean   : 14.765   Mrs   :197   D      :  46   P      : 98  
##  3rd Qu.: 3.0   3rd Qu.: 15.000   Other : 34   E      :  41   S      : 98  
##  Max.   :11.0   Max.   :128.082                A      :  22   C      : 77  
##                                                (Other):  27   (Other):119  
##   Family_size       Group_size       Age_group  
##  Min.   : 1.000   Min.   : 1.000   20-29  :552  
##  1st Qu.: 1.000   1st Qu.: 1.000   30-39  :229  
##  Median : 1.000   Median : 1.000   40-49  :171  
##  Mean   : 1.884   Mean   : 2.194   10-19  :162  
##  3rd Qu.: 2.000   3rd Qu.: 3.000   0-9    :100  
##  Max.   :11.000   Max.   :11.000   50-59  : 62  
##                                    (Other): 33

A quick correlation plot of the numeric attributes gives an idea of how they relate to one another. You can see that we have dropped two non-numeric attributes, Title and Deck. We could include them by converting their values into numbers. For example, Title could be coded as integers, with 1 representing Mr, 2 representing Mrs, and so on.
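The conversion described above can be sketched in a few lines. This is a minimal illustration on a made-up toy data frame, not the code used to produce the figure; the names `toy`, `toy_num`, and `cor_toy` and the level order chosen for Title are assumptions.

```r
# Toy stand-in for the re-engineered data (values are made up for illustration)
toy <- data.frame(
  Survived = c(0, 1, 1, 1, 0, 0),
  Pclass   = c(3, 1, 3, 1, 3, 2),
  Title    = factor(c("Mr", "Mrs", "Miss", "Mrs", "Mr", "Master"),
                    levels = c("Mr", "Mrs", "Miss", "Master", "Other"))
)

# Replace every factor column by its integer level code (Mr = 1, Mrs = 2, ...)
toy_num <- data.frame(lapply(toy, function(col) {
  if (is.factor(col)) as.integer(col) else col
}))

# Pairwise Pearson correlations; "pairwise.complete.obs" drops NAs pair by pair,
# which matters in the real data because Survived has 418 missing values
cor_toy <- cor(toy_num, use = "pairwise.complete.obs")
round(cor_toy, 2)
```

Keep in mind that integer codes impose an order on categories that have none, so the correlation obtained for a coded attribute such as Title depends on how the levels happen to be numbered.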

Figure 7.1: Correlation among numerical attributes

# show correlation in table
library(kableExtra) # markdown tables 
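# `cor` is assumed to be the correlation matrix computed earlier for the plot
upper <- round(cor, 2)
# blank the lower triangle and the diagonal so that each pair is shown once
upper[lower.tri(cor, diag = TRUE)] <- ""
upper <- as.data.frame(upper)
knitr::kable(upper, booktabs = TRUE,
  caption = 'Correlations among attributes') %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"),
                font_size = 8)  # font_size is an argument of kable_styling, not a bootstrap option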
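Table 7.1: Correlations among attributes

| | Survived | Pclass | Sex | Age | SibSp | Parch | Embarked | HasCabinNum | Friend_size | Fare_pp | Title | Deck | Ticket_class | Family_size | Group_size | Age_group |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Survived | | -0.34 | -0.54 | -0.05 | -0.04 | 0.08 | -0.17 | 0.32 | 0.07 | 0.29 | -0.05 | -0.3 | -0.04 | 0.02 | 0.08 | -0.03 |
| Pclass | | | 0.12 | -0.43 | 0.06 | 0.02 | 0.19 | -0.71 | -0.08 | -0.77 | -0.22 | 0.73 | -0.02 | 0.05 | -0.07 | -0.45 |
| Sex | | | | 0.06 | -0.11 | -0.21 | 0.1 | -0.14 | -0.17 | -0.12 | 0.01 | 0.13 | 0 | -0.19 | -0.2 | 0.07 |
| Age | | | | | -0.27 | -0.15 | -0.07 | 0.3 | -0.2 | 0.38 | 0.49 | -0.32 | 0 | -0.26 | -0.2 | 0.98 |
| SibSp | | | | | | 0.37 | 0.07 | -0.01 | 0.68 | -0.05 | -0.2 | 0.01 | 0.05 | 0.86 | 0.73 | -0.27 |
| Parch | | | | | | | 0.05 | 0.04 | 0.65 | -0.03 | -0.09 | -0.03 | 0.06 | 0.79 | 0.67 | -0.14 |
| Embarked | | | | | | | | -0.21 | 0.01 | -0.3 | -0.03 | 0.24 | -0.04 | 0.07 | 0.02 | -0.06 |
| HasCabinNum | | | | | | | | | 0.1 | 0.65 | 0.14 | -0.96 | -0.03 | 0.01 | 0.09 | 0.3 |
| Friend_size | | | | | | | | | | 0.09 | -0.19 | -0.1 | 0.12 | 0.8 | 0.97 | -0.2 |
| Fare_pp | | | | | | | | | | | 0.18 | -0.7 | 0.1 | -0.05 | 0.09 | 0.38 |
| Title | | | | | | | | | | | | -0.15 | 0.01 | -0.18 | -0.18 | 0.48 |
| Deck | | | | | | | | | | | | | 0.02 | -0.01 | -0.1 | -0.32 |
| Ticket_class | | | | | | | | | | | | | | 0.06 | 0.11 | 0.01 |
| Family_size | | | | | | | | | | | | | | | 0.85 | -0.26 |
| Group_size | | | | | | | | | | | | | | | | -0.2 |
| Age_group | | | | | | | | | | | | | | | | |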

The plot and the table show not only the correlation between the dependent attribute Survived and the other attributes, which can potentially be used as predictors, but also the correlation among the potential predictors themselves. In terms of correlation with Survived, Sex has the largest absolute value, with a negative correlation of -0.54; next is Pclass with -0.34. So if we can have only two predictors for survival, the first two we should use are Sex and Pclass. If I wanted to choose five predictors for survival, I would choose Sex, Pclass, HasCabinNum, Deck, and Fare_pp.
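The ranking behind this choice can be reproduced directly from the Survived row of Table 7.1. The sketch below copies those fifteen values by hand into a named vector (`cor_survived` and `ranked` are illustrative names) and sorts them by absolute value.

```r
# Correlations with Survived, copied from the first row of Table 7.1
cor_survived <- c(
  Pclass = -0.34, Sex = -0.54, Age = -0.05, SibSp = -0.04, Parch = 0.08,
  Embarked = -0.17, HasCabinNum = 0.32, Friend_size = 0.07, Fare_pp = 0.29,
  Title = -0.05, Deck = -0.30, Ticket_class = -0.04, Family_size = 0.02,
  Group_size = 0.08, Age_group = -0.03
)

# Sort by absolute value so strong negative correlations rank as high as positive ones
ranked <- cor_survived[order(abs(cor_survived), decreasing = TRUE)]

# The five attributes most strongly correlated with Survived
head(ranked, 5)
```

With the full correlation matrix at hand, the same vector could presumably be taken from its Survived row instead of being typed in.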

The largest correlation value is between Age and Age_group with 0.98. It makes sense because Age_group is a grouping of Age.
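A small sketch shows why a binned copy of an attribute correlates so strongly with the original. The ages below are made up, and the ten-year bins built with `cut()` are only an assumption about how Age_group might have been derived.

```r
# Made-up ages spanning the range seen in the data (0.17 to 80)
age <- c(0.17, 5, 14, 22, 27, 35, 38, 45, 54, 62, 80)

# Ten-year bins [0,10), [10,20), ..., [80,Inf); assumed, not the author's exact binning
age_group <- cut(age, breaks = c(seq(0, 80, by = 10), Inf), right = FALSE)

# Correlate the raw ages with the integer code of their bin
r_age <- cor(age, as.integer(age_group))
r_age
```

Because the bin code is a coarse, monotone version of the age itself, the two carry almost the same information.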

We can also observe that Pclass has a high correlation with HasCabinNum (-0.71), Fare_pp (-0.77), and Deck (0.73). This suggests that if we have Pclass in our model, we may not need Fare_pp, HasCabinNum, or Deck, since they all effectively tell us what we suspected at the beginning: the “social class” of a passenger. This social class can be interpreted as richer people paying more for a ticket and having a better cabin.
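Pairs of strongly correlated predictors can also be listed mechanically. The following sketch rebuilds a five-attribute slice of Table 7.1 by hand and reports every pair whose absolute correlation reaches a cut-off; the cut-off of 0.7 and the names `cor_sub` and `high_pairs` are assumptions made for illustration.

```r
# Five-attribute slice of Table 7.1, entered by hand
vars <- c("Pclass", "Sex", "HasCabinNum", "Fare_pp", "Deck")
cor_sub <- matrix(1, nrow = 5, ncol = 5, dimnames = list(vars, vars))

# Upper triangle, filled column by column:
# (Pclass,Sex), (Pclass,HasCabinNum), (Sex,HasCabinNum), (Pclass,Fare_pp), ...
cor_sub[upper.tri(cor_sub)] <- c(0.12, -0.71, -0.14, -0.77, -0.12,
                                 0.65, 0.73, 0.13, -0.96, -0.70)
# Mirror the upper triangle into the lower one to make the matrix symmetric
cor_sub[lower.tri(cor_sub)] <- t(cor_sub)[lower.tri(cor_sub)]

# Illustrative cut-off for "high" correlation (an assumption, not from the text)
threshold <- 0.7

# Positions in the upper triangle whose absolute correlation reaches the cut-off
idx <- which(abs(cor_sub) >= threshold & upper.tri(cor_sub), arr.ind = TRUE)
high_pairs <- data.frame(
  var1 = rownames(cor_sub)[idx[, "row"]],
  var2 = colnames(cor_sub)[idx[, "col"]],
  r    = cor_sub[idx]
)
high_pairs
```

On this slice, every reported pair involves only Pclass, HasCabinNum, Fare_pp, and Deck, while Sex does not appear in any of them.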

A similar pattern can be read between Group_size and four other attributes: Friend_size (0.97), Family_size (0.85), SibSp (0.73), and Parch (0.67). Family_size also has a high correlation with both SibSp (0.86) and Parch (0.79) and, according to Table 7.1, with Friend_size as well (0.8).

The important point is that correlation analysis is very useful: it provides the basic rationale for our predictor selection. The idea is that we should choose attributes that have a high correlation with the response variable. For example, if we can choose only three predictors in a model to predict Survived, we should choose Sex, Pclass, and HasCabinNum, because they have the three highest absolute correlation values with Survived. If we have already chosen Pclass in a model, we may not need Fare_pp and Deck, because these three attributes are highly correlated with one another.
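One possible way to combine the two rules, preferring high correlation with the response and skipping predictors that largely duplicate one already chosen, is a simple greedy filter. This is only a sketch under stated assumptions: the function `select_predictors`, its `max_r` parameter, and the cut-off of 0.75 are hypothetical and do not come from the analysis above; the numbers are copied from Table 7.1.

```r
# Correlation with Survived for five candidate predictors (from Table 7.1)
cor_resp <- c(Pclass = -0.34, Sex = -0.54, HasCabinNum = 0.32,
              Fare_pp = 0.29, Deck = -0.30)

# Correlations among the same five candidates (from Table 7.1)
vars <- names(cor_resp)
cor_pred <- matrix(1, nrow = 5, ncol = 5, dimnames = list(vars, vars))
cor_pred[upper.tri(cor_pred)] <- c(0.12, -0.71, -0.14, -0.77, -0.12,
                                   0.65, 0.73, 0.13, -0.96, -0.70)
cor_pred[lower.tri(cor_pred)] <- t(cor_pred)[lower.tri(cor_pred)]

# Hypothetical helper: visit candidates from strongest to weakest correlation
# with the response and keep one only if its absolute correlation with every
# predictor already kept stays below max_r
select_predictors <- function(cor_resp, cor_pred, max_r = 0.75) {
  chosen <- character(0)
  for (v in names(sort(abs(cor_resp), decreasing = TRUE))) {
    if (all(abs(cor_pred[v, chosen]) < max_r)) {
      chosen <- c(chosen, v)
    }
  }
  chosen
}

selected <- select_predictors(cor_resp, cor_pred)
selected
```

The result is sensitive to the cut-off: lowering `max_r` to 0.7 would also reject HasCabinNum, because its correlation with Pclass is -0.71.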