7.3 Attributes Correlation Analysis
We have re-engineered the Titanic dataset, so instead of using the original dataset, let us consider the correlations among the attributes of the re-engineered dataset.
## Rows: 1,309
## Columns: 18
## $ PassengerId <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,...
## $ Survived <int> 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, ...
## $ Pclass <int> 3, 1, 3, 1, 3, 3, 1, 3, 3, 2, 3, 1, 3, 3, 3, 2, 3, 2, ...
## $ Sex <fct> male, female, female, female, male, male, male, male, ...
## $ Age <dbl> 22.00000, 38.00000, 26.00000, 35.00000, 35.00000, 27.4...
## $ SibSp <int> 1, 1, 0, 1, 0, 0, 0, 3, 0, 1, 1, 0, 0, 1, 0, 0, 4, 0, ...
## $ Parch <int> 0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 1, 0, 0, 5, 0, 0, 1, 0, ...
## $ Ticket <fct> A/5 21171, PC 17599, STON/O2. 3101282, 113803, 373450,...
## $ Embarked <fct> S, C, S, S, S, Q, S, S, S, C, S, S, S, S, S, S, Q, S, ...
## $ HasCabinNum <fct> No, Yes, No, Yes, No, No, Yes, No, No, No, Yes, Yes, N...
## $ Friend_size <int> 1, 2, 1, 2, 1, 1, 2, 5, 3, 2, 3, 1, 1, 7, 1, 1, 6, 1, ...
## $ Fare_pp <dbl> 7.250000, 35.641650, 7.925000, 26.550000, 8.050000, 8....
## $ Title <fct> Mr, Mrs, Miss, Mrs, Mr, Mr, Mr, Master, Mrs, Mrs, Miss...
## $ Deck <fct> U, C, U, C, U, U, E, U, U, U, G, C, U, U, U, U, U, U, ...
## $ Ticket_class <fct> A, P, S, 1, 3, 3, 1, 3, 3, 2, P, 1, A, 3, 3, 2, 3, 2, ...
## $ Family_size <int> 2, 2, 1, 2, 1, 1, 1, 5, 3, 2, 3, 1, 1, 7, 1, 1, 6, 1, ...
## $ Group_size <int> 2, 2, 1, 2, 1, 1, 2, 5, 3, 2, 3, 1, 1, 7, 1, 1, 6, 1, ...
## $ Age_group <fct> 20-29, 30-39, 20-29, 30-39, 30-39, 20-29, 50-59, 0-9, ...
## PassengerId Survived Pclass Sex Age
## Min. : 1 Min. :0.0000 Min. :1.000 female:466 Min. : 0.17
## 1st Qu.: 328 1st Qu.:0.0000 1st Qu.:2.000 male :843 1st Qu.:22.00
## Median : 655 Median :0.0000 Median :3.000 Median :27.43
## Mean : 655 Mean :0.3838 Mean :2.295 Mean :29.63
## 3rd Qu.: 982 3rd Qu.:1.0000 3rd Qu.:3.000 3rd Qu.:37.00
## Max. :1309 Max. :1.0000 Max. :3.000 Max. :80.00
## NA's :418
## SibSp Parch Ticket Embarked HasCabinNum
## Min. :0.0000 Min. :0.000 CA. 2343: 11 C:272 No :1014
## 1st Qu.:0.0000 1st Qu.:0.000 1601 : 8 Q:123 Yes: 295
## Median :0.0000 Median :0.000 CA 2144 : 8 S:914
## Mean :0.4989 Mean :0.385 3101295 : 7
## 3rd Qu.:1.0000 3rd Qu.:0.000 347077 : 7
## Max. :8.0000 Max. :9.000 347082 : 7
## (Other) :1261
## Friend_size Fare_pp Title Deck Ticket_class
## Min. : 1.0 Min. : 0.000 Master: 61 U :1014 3 :429
## 1st Qu.: 1.0 1st Qu.: 7.579 Miss :260 C : 94 2 :278
## Median : 1.0 Median : 8.050 Mr :757 B : 65 1 :210
## Mean : 2.1 Mean : 14.765 Mrs :197 D : 46 P : 98
## 3rd Qu.: 3.0 3rd Qu.: 15.000 Other : 34 E : 41 S : 98
## Max. :11.0 Max. :128.082 A : 22 C : 77
## (Other): 27 (Other):119
## Family_size Group_size Age_group
## Min. : 1.000 Min. : 1.000 20-29 :552
## 1st Qu.: 1.000 1st Qu.: 1.000 30-39 :229
## Median : 1.000 Median : 1.000 40-49 :171
## Mean : 1.884 Mean : 2.194 10-19 :162
## 3rd Qu.: 2.000 3rd Qu.: 3.000 0-9 :100
## Max. :11.000 Max. :11.000 50-59 : 62
## (Other): 33
A quick correlation table of the attributes gives an idea of how they relate to one another. Note that PassengerId and Ticket are left out of the table. Non-numeric attributes such as Title and Deck can only be included once their values are converted into numbers. For example, Title could be converted into integer codes, where 1 represents Mr, 2 represents Mrs, and so on.
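The conversion described above can be sketched as follows. This is a minimal illustration on a small made-up data frame, not the re-engineered dataset: the names `toy`, `toy_num`, and `cor_toy` and all values are hypothetical. Because `as.integer()` on a factor returns codes in the order of the factor's levels, the level order is set explicitly here to get 1 = Mr and 2 = Mrs.

```r
# Hypothetical toy data; the values are made up for illustration only
toy <- data.frame(
  Survived = c(0, 1, 1, 1, 0, 0),
  Title    = c("Mr", "Mrs", "Miss", "Mrs", "Mr", "Master"),
  Age      = c(22, 38, 26, 35, 35, 4)
)

# Fix the level order explicitly so that 1 = Mr, 2 = Mrs, 3 = Miss, 4 = Master
toy$Title <- factor(toy$Title, levels = c("Mr", "Mrs", "Miss", "Master"))

# Replace every factor column with its integer level codes
toy_num <- toy
is_fct <- sapply(toy_num, is.factor)
toy_num[is_fct] <- lapply(toy_num[is_fct], as.integer)

# Pairwise correlations; this setting of `use` tolerates missing values
cor_toy <- cor(toy_num, use = "pairwise.complete.obs")
round(cor_toy, 2)
```

Integer codes impose an arbitrary order on categories that have no natural order, so correlations involving such attributes should be read with some caution.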
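library(kableExtra)  # provides kable_styling() and re-exports %>%
# `cor` is assumed to hold the correlation matrix of the numerically coded attributes
lower <- round(cor, 2)
# Blank the lower triangle and the diagonal so that each pair appears only once
lower[lower.tri(cor, diag = TRUE)] <- ""
lower <- as.data.frame(lower)
knitr::kable(lower, booktabs = TRUE,
             caption = 'Correlations among attributes') %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"),
                font_size = 8)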
| | Survived | Pclass | Sex | Age | SibSp | Parch | Embarked | HasCabinNum | Friend_size | Fare_pp | Title | Deck | Ticket_class | Family_size | Group_size | Age_group |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Survived | | -0.34 | -0.54 | -0.05 | -0.04 | 0.08 | -0.17 | 0.32 | 0.07 | 0.29 | -0.05 | -0.3 | -0.04 | 0.02 | 0.08 | -0.03 |
| Pclass | | | 0.12 | -0.43 | 0.06 | 0.02 | 0.19 | -0.71 | -0.08 | -0.77 | -0.22 | 0.73 | -0.02 | 0.05 | -0.07 | -0.45 |
| Sex | | | | 0.06 | -0.11 | -0.21 | 0.1 | -0.14 | -0.17 | -0.12 | 0.01 | 0.13 | 0 | -0.19 | -0.2 | 0.07 |
| Age | | | | | -0.27 | -0.15 | -0.07 | 0.3 | -0.2 | 0.38 | 0.49 | -0.32 | 0 | -0.26 | -0.2 | 0.98 |
| SibSp | | | | | | 0.37 | 0.07 | -0.01 | 0.68 | -0.05 | -0.2 | 0.01 | 0.05 | 0.86 | 0.73 | -0.27 |
| Parch | | | | | | | 0.05 | 0.04 | 0.65 | -0.03 | -0.09 | -0.03 | 0.06 | 0.79 | 0.67 | -0.14 |
| Embarked | | | | | | | | -0.21 | 0.01 | -0.3 | -0.03 | 0.24 | -0.04 | 0.07 | 0.02 | -0.06 |
| HasCabinNum | | | | | | | | | 0.1 | 0.65 | 0.14 | -0.96 | -0.03 | 0.01 | 0.09 | 0.3 |
| Friend_size | | | | | | | | | | 0.09 | -0.19 | -0.1 | 0.12 | 0.8 | 0.97 | -0.2 |
| Fare_pp | | | | | | | | | | | 0.18 | -0.7 | 0.1 | -0.05 | 0.09 | 0.38 |
| Title | | | | | | | | | | | | -0.15 | 0.01 | -0.18 | -0.18 | 0.48 |
| Deck | | | | | | | | | | | | | 0.02 | -0.01 | -0.1 | -0.32 |
| Ticket_class | | | | | | | | | | | | | | 0.06 | 0.11 | 0.01 |
| Family_size | | | | | | | | | | | | | | | 0.85 | -0.26 |
| Group_size | | | | | | | | | | | | | | | | -0.2 |
| Age_group | | | | | | | | | | | | | | | | |
The table shows not only the correlation between the dependent attribute Survived and the other attributes, which can potentially be used as predictors, but also the correlations among the potential predictors themselves. In terms of correlation with Survived, Sex has the largest absolute value (-0.54), followed by Pclass (-0.34). So if we can only have two predictors for survival, the first two we should use are Sex and Pclass. If I want to choose five predictors for survival, I would choose Sex, Pclass, HasCabinNum, Deck, and Fare_pp.
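The ranking used above can be reproduced with a few lines of R. This is only a sketch: the values are copied by hand from the Survived row of the table, and the names `cor_survived` and `ranked` are introduced here for illustration.

```r
# Correlations with Survived, copied from the Survived row of the table
cor_survived <- c(
  Pclass = -0.34, Sex = -0.54, Age = -0.05, SibSp = -0.04, Parch = 0.08,
  Embarked = -0.17, HasCabinNum = 0.32, Friend_size = 0.07, Fare_pp = 0.29,
  Title = -0.05, Deck = -0.30, Ticket_class = -0.04, Family_size = 0.02,
  Group_size = 0.08, Age_group = -0.03
)

# Order by strength of association, ignoring the sign
ranked <- cor_survived[order(abs(cor_survived), decreasing = TRUE)]

# The five attributes most strongly correlated with Survived
head(ranked, 5)
```

Sorting on the absolute value matters here: a strong negative correlation, such as the one for Sex, is just as informative for prediction as a strong positive one.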
The largest correlation value is between Age and Age_group with 0.98. It makes sense because Age_group is a grouping of Age.
We can also observe that Pclass has a high correlation with HasCabinNum (-0.71), Fare_pp (-0.77), and Deck (0.73). This suggests that if we have Pclass in our model, we may not need Fare_pp, HasCabinNum, or Deck, since they effectively tell us the same thing, which we suspected at the beginning: the “social class” of a passenger. This social class can be interpreted as follows: richer people paid more for a ticket and had better cabins.
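One way to spot such overlapping predictors systematically is to list every pair whose absolute correlation exceeds a cutoff. The sketch below does this for the four attributes just discussed, with values copied by hand from the table. The cutoff of 0.7 and the names `m`, `idx`, and `high_pairs` are assumptions made for illustration, not part of the original analysis.

```r
# Pairwise correlations among four predictors, copied from the table
vars <- c("Pclass", "HasCabinNum", "Fare_pp", "Deck")
m <- diag(4)
dimnames(m) <- list(vars, vars)
m["Pclass", "HasCabinNum"]  <- -0.71
m["Pclass", "Fare_pp"]      <- -0.77
m["Pclass", "Deck"]         <-  0.73
m["HasCabinNum", "Fare_pp"] <-  0.65
m["HasCabinNum", "Deck"]    <- -0.96
m["Fare_pp", "Deck"]        <- -0.70

# Look only at the upper triangle so that each pair is reported once;
# 0.7 is an arbitrary cutoff chosen for this illustration
idx <- which(upper.tri(m) & abs(m) >= 0.7, arr.ind = TRUE)
high_pairs <- data.frame(
  var1 = rownames(m)[idx[, "row"]],
  var2 = colnames(m)[idx[, "col"]],
  r    = m[idx]
)
high_pairs
```

With this cutoff, every pair except HasCabinNum and Fare_pp (0.65) is flagged, which is consistent with reading these four attributes as different views of the same underlying social class.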
A similar pattern can be seen between Group_size and four other attributes: Friend_size (0.97), Family_size (0.85), SibSp (0.73), and Parch (0.67). Family_size also has a high correlation with both SibSp (0.86) and Parch (0.79). According to the table, Family_size is highly correlated with Friend_size as well (0.8).
The important point is that correlation analysis is very useful: it provides a basic rationale for our predictor selection. The idea is that we should choose attributes that have a high correlation with the response variable. For example, if we can only choose three predictors in a model to predict Survived, we should choose Sex, Pclass, and HasCabinNum because they have the three highest absolute correlation values with Survived. If we have already chosen Pclass for a model, we may not need Fare_pp and Deck because these three attributes are highly correlated with one another.