Summary

The purpose of this book is to provide hands-on, practical experience of doing a data science project. Clearly, we cannot cover every available method, model, and algorithm. The most important thing is to understand the process of doing a data science project. The first step, as indicated by the 6-step process in section 1.3, is “understand the problem”.

We have chosen the Titanic problem to demonstrate the whole data analysis process. However, a real-world problem is far more complicated than this well-defined one. Most business organizations either do not know the exact problem (which is part of the reason they want data analysis or business analysis in the first place), or they know the problem in general terms but cannot express it explicitly.

I once encountered a business organization that had built a data center and collected all of its operational data. The boss asked for these data to be analyzed to find out:

  1. Are there any problems?
  2. If yes, how can these problems be overcome?
  3. If not, how can business operations be improved?

You see, the problem here is how to define the problem: how to convert the business problem into a data science problem.

For example, the first question in the list above requires knowing what the normal or expected performance is. How should performance be evaluated? In terms of turnover or profit? Over what time scale? Profit could be low at the moment without causing alarm, because of a recent investment in developing a new market; in the long run, that investment is expected to deliver a great ROI (Return on Investment). The second question demands identifying the causes of the problems, and the third identifying the KPIs (Key Performance Indicators). Both come down to identifying the relationships between predictor and dependent variables, but the sets of variables involved can be completely different.

Understanding the problem is actually more complicated in the real world. Only when you have completely understood it and turned it into a list of analytical problems can you move to the next step.

With the Titanic problem, combining the story with the requirements on the Kaggle website, I would consider the following:

  • On April 14 and 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. The overall survival rate was about 32%.
  • One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew.
  • Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper class.
  • The story tells us that when people were boarding the lifeboats, a policy of “women and children first” and “ship's crew last” was applied.
  • Family members sometimes boarded a lifeboat together, and some stayed together in the water as well.
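As a quick sanity check on the first point, the overall survival rate can be recomputed from the two counts quoted there. This is only a minimal sketch; the variable names are illustrative and not taken from any dataset.

```python
# Counts quoted in the first bullet above.
total_on_board = 2224  # passengers and crew
deaths = 1502

survivors = total_on_board - deaths
survival_rate = survivors / total_on_board

print(f"survivors: {survivors}")                      # 722
print(f"overall survival rate: {survival_rate:.1%}")  # 32.5%
```

The exact ratio is a little under 32.5%, which is consistent with the rounded figure of about 32%.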

These observations form a set of working assumptions. They will guide the more detailed data exploration later.
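One way such an assumption could later be turned into a concrete check is to compare survival rates across groups. The sketch below is only an illustration: the helper `survival_rate_by` is hypothetical, the column names `Sex`, `Pclass`, and `Survived` are assumed to follow the Kaggle Titanic data files, and the rows are made-up toy values, not real passenger records.

```python
from collections import defaultdict


def survival_rate_by(records, key):
    """Return {group value: share of survivors} for the column `key`."""
    counts = defaultdict(lambda: [0, 0])  # group -> [survivors, total]
    for row in records:
        group = row[key]
        counts[group][0] += row["Survived"]  # Survived is 1 or 0
        counts[group][1] += 1
    return {group: survived / total for group, (survived, total) in counts.items()}


# Hypothetical toy rows, NOT real Titanic data; they only show the call shape.
toy_rows = [
    {"Sex": "female", "Pclass": 1, "Survived": 1},
    {"Sex": "female", "Pclass": 3, "Survived": 1},
    {"Sex": "female", "Pclass": 3, "Survived": 0},
    {"Sex": "male", "Pclass": 1, "Survived": 1},
    {"Sex": "male", "Pclass": 3, "Survived": 0},
    {"Sex": "male", "Pclass": 3, "Survived": 0},
]

print(survival_rate_by(toy_rows, "Sex"))
print(survival_rate_by(toy_rows, "Pclass"))
```

With the real training data, a large gap between groups would support the corresponding assumption, while similar rates would suggest revisiting it.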