4.4 Examining the Actual Attribute Types

Now that our raw data is loaded in RStudio, we can examine the attributes' types. From Figure 4.4, we can see that the attributes fall into three types: int, Factor, and num.

  • Attributes of type int are: PassengerId, Survived, SibSp and Parch.

  • Attributes of type Factor are: Name, Sex, Ticket, Cabin and Embarked.

  • Attributes of type num are: Age and Fare.
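The type listing above can be reproduced in code. The following is a minimal sketch on a small hand-made data frame; the name `toy` and its three rows are illustrative assumptions, not the actual raw data. Factors are created explicitly here because `stringsAsFactors` defaults to `FALSE` from R 4.0 onwards.

```r
# Toy stand-in for the raw data; the values are illustrative only.
toy <- data.frame(
  PassengerId = 1:3,
  Survived    = c(0L, 1L, 1L),
  Sex         = factor(c("male", "female", "female")),
  Age         = c(22, 38, 26),
  Fare        = c(7.25, 71.2833, 7.925)
)

# str() prints each attribute with its abbreviated type (int, Factor, num).
str(toy)

# class() per column reports the same information with full type names.
types <- sapply(toy, class)
print(types)
```

Note that `str()` abbreviates the types as int, Factor and num, while `class()` reports them as integer, factor and numeric.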

As we know, int is the type for an attribute with integer values, and num is the type for a numeric attribute whose values are real numbers.

Factor is the R language's name for a categorical type. A Factor is an attribute that can take on one of a limited, and usually fixed, number of possible values, such as blood type.
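To make the idea concrete, here is a small sketch using the blood type example. The vector `blood` and its values are made up for illustration.

```r
# A factor with an explicitly fixed set of allowed values (levels).
blood <- factor(c("A", "O", "B", "O", "AB", "A"),
                levels = c("A", "B", "AB", "O"))

levels(blood)                # the fixed set of possible values
n_levels <- nlevels(blood)   # how many distinct values are allowed
counts <- table(blood)       # frequency of each level
codes <- as.integer(blood)   # internally, each value is an integer code

print(counts)
print(codes)
```

Internally, a factor stores integer codes that point into its levels, which is why counting and grouping work naturally on it while arithmetic does not.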

An attribute's type determines which operations we can apply to it. In other words, an inappropriate type can prevent us from analysing the attribute properly. For example, it makes no sense to calculate the average of Sex, so it is better stored as a categorical type, which in R is a Factor. Similarly, Survived has only two values, 0 or 1, representing death or survival, so it makes sense for it to be a Factor too. As an int, it cannot be used with the many methods that work only on Factor attributes.
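The effect of type on the available operations can be sketched as follows. The vectors and the labels "died" and "survived" are illustrative assumptions, not part of the original data.

```r
# Illustrative values only.
sex <- factor(c("male", "female", "female", "male", "male"))
survived_int <- c(0L, 1L, 1L, 0L, 0L)

# Averaging a factor is meaningless: R warns and returns NA.
mean_sex <- suppressWarnings(mean(sex))

# Converting Survived to a factor with readable labels (labels are hypothetical).
survived_fct <- factor(survived_int,
                       levels = c(0, 1),
                       labels = c("died", "survived"))

# Factor-oriented operations now give category counts and cross-tabulations.
counts <- table(survived_fct)
cross <- table(sex, survived_fct)

print(mean_sex)
print(counts)
print(cross)
```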

Another example is Name. Its original type is Factor, with each unique name treated as its own level. However, a Factor is poorly suited to string processing: it stores integer codes with a set of levels rather than free text, which gets in the way of applying regular expressions4 to it. So, it is appropriate to change it to chr, the character type.
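As a hedged illustration, the sketch below converts a factor of names to character and then applies a regular expression. The three sample names and the pattern (which assumes the form "Surname, Title. Given names") are assumptions made for this example.

```r
# Sample values for illustration; assumed format: "Surname, Title. Given names".
name_fct <- factor(c("Braund, Mr. Owen Harris",
                     "Heikkinen, Miss. Laina",
                     "Cumings, Mrs. John Bradley"))

# Convert from Factor to chr so that string functions see the actual text.
name_chr <- as.character(name_fct)

# Capture the text between the comma and the first full stop.
titles <- sub("^.*, *([^.]+)\\..*$", "\\1", name_chr)

print(class(name_chr))
print(titles)
```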

There are other inappropriate or wrong attribute types too. For example, SibSp and Parch are currently typed int, but perhaps they should be treated as Factor. It is common practice for data scientists to apply several analyses to an attribute and then change its type in order to apply different algorithms5. The goal is to dig insight out of the data.
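Switching an attribute between a numeric type and Factor has one well-known pitfall, sketched below with made-up SibSp values: converting a factor directly to integer returns its internal codes, not the original numbers.

```r
# Illustrative values only.
sibsp <- c(1L, 0L, 3L, 0L, 1L)

# int -> Factor: each distinct value becomes a level ("0", "1", "3").
sibsp_fct <- factor(sibsp)

# Pitfall: as.integer() on a factor returns the level codes.
back_wrong <- as.integer(sibsp_fct)

# Going through character first recovers the original values.
back_right <- as.integer(as.character(sibsp_fct))

print(levels(sibsp_fct))
print(back_wrong)
print(back_right)
```

Going through `as.character()` first is the safer route whenever the factor levels are themselves numbers.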

So, examining the attribute types and comparing them with the original meaning of each attribute helps us spot inappropriate or wrong types.

Thinking:

Is the int type appropriate for Survived?

Which other attributes do you think have the wrong type?

  4. A regular expression is a sequence of characters that defines a search pattern, which string-searching algorithms use to find a particular string or to validate an input string.

  5. As you will see later, we frequently change some attribute types between num and Factor.