5.1 General Data Preparation Tasks

Section 1.3 listed a number of tasks that need to be performed to make data suitable for analysis. Depending on how the problem is understood, these tasks can differ. Our previous analyses, at both the record level and the attribute level, uncovered several problems that need to be solved first:

  1. There are inappropriate data types that need conversion. For example, many features need to be converted into numeric ones so that machine learning algorithms can process them.
  2. There are errors or missing values.
  3. There are attribute values that need normalization. Some features have widely different value ranges, so their values need to be converted to roughly the same scale.
  4. There are also attribute values that need to be grouped or transformed into more manageable, meaningful groups.
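
To make the four tasks above concrete, here is a minimal sketch in Python using only the standard library. It is an illustration under stated assumptions, not the procedure this chapter follows: the rows are invented values that only imitate the shape of Titanic records, and the median fill, the min-max scaling, and the age thresholds in the hypothetical helper age_group are arbitrary choices made for the example.

```python
from statistics import median

# Invented toy rows that imitate Titanic-style columns (not real records).
rows = [
    {"Sex": "male", "Age": "22", "Fare": "7.25"},
    {"Sex": "female", "Age": "38", "Fare": "71.28"},
    {"Sex": "female", "Age": "", "Fare": "8.05"},   # missing Age
    {"Sex": "male", "Age": "4", "Fare": "21.00"},
]

# Task 1: convert text fields into numeric types.
for r in rows:
    r["Sex"] = 1 if r["Sex"] == "female" else 0
    r["Fare"] = float(r["Fare"])
    r["Age"] = float(r["Age"]) if r["Age"] else None

# Task 2: fill missing Age values (assumption: use the median of known ages).
age_median = median(r["Age"] for r in rows if r["Age"] is not None)
for r in rows:
    if r["Age"] is None:
        r["Age"] = age_median

# Task 3: rescale Fare into the range [0, 1] (min-max normalization).
fares = [r["Fare"] for r in rows]
fare_min, fare_max = min(fares), max(fares)
for r in rows:
    r["FareScaled"] = (r["Fare"] - fare_min) / (fare_max - fare_min)

# Task 4: group a continuous attribute into a few categories.
def age_group(age):
    """Hypothetical grouping; the thresholds are assumptions for illustration."""
    if age < 13:
        return "child"
    if age < 60:
        return "adult"
    return "senior"

for r in rows:
    r["AgeGroup"] = age_group(r["Age"])

for r in rows:
    print(r)
```

Each step reads the output of the previous one, which is why type conversion comes first: the missing-value fill, the scaling, and the grouping all assume numeric input.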

In this chapter, we will continue to use the Titanic problem to demonstrate the tasks to be performed and the methods that can be used to accomplish them. Remember that the ultimate goal of data preprocessing is to make the dataset suitable for analysis.

The analytical methods used in this chapter are again a mixture of descriptive data analysis and exploratory data analysis.