3.3 The Titanic Problem
The objective of the Titanic problem is defined on the Kaggle website as follows:
"The sinking of the Titanic is one of the most infamous shipwrecks in history.
On April 15, 1912, during her maiden voyage, the widely considered “unsinkable” RMS Titanic sank after colliding with an iceberg. Unfortunately, there weren’t enough lifeboats for everyone on board, resulting in the death of 1502 out of 2224 passengers and crew.
While there was some element of luck involved in surviving, it seems some groups of people were more likely to survive than others.
In this challenge, we ask you to build a predictive model that answers the question: “what sorts of people were more likely to survive?” using passenger data (i.e. name, age, gender, socio-economic class, etc.)."
The Challenge
The competition is simple: we want you to use the Titanic passenger data (name, age, price of the ticket, etc.) to try to predict who will survive and who will die.
The requirement is to predict passengers’ survival. As in many other real data science problems, prediction means building a model that takes input data and produces an output. A prediction model is a mathematical formula that takes input from historical facts reflecting past events and produces an output that is used to make predictions about future or otherwise unknown events. A simple way to understand models is to think of them as falling into the following three kinds:
- The relationship between input and output can be expressed by some kind of math formula. This is generally called a definable model. The math formula can be as simple as a polynomial function or as complicated as a regression model or another statistical model.
- Some models cannot be explicitly expressed with math formulas; instead, they are expressed as rules. Those are rule-based models.
- Other models cannot be expressed in a math formula or in rules. The solution is to build a neural network to do the prediction. A neural network can be regarded as a “black box” that takes input and produces output, while its internal connections are hidden from users. Much of modern machine learning focuses on models rooted in neural networks.
Any model fundamentally expresses relationships between inputs and outputs. So, as part of understanding the problem, we can interpret the Kaggle Titanic challenge as finding credible relationships between the input data and the output data (whether a passenger survived or not). Once a relationship is found, we can express it using a math formula, a set of rules, or a neural network model.
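To make the first two kinds of model concrete, here is a minimal Python sketch. It is only an illustration: the column names `Sex` and `Pclass` are assumed to match the Kaggle files, and the coefficients in `formula_model` are made up for the example, not fitted to any data. A neural network is not shown because it would need a training step.

```python
import math


def rule_based_model(passenger):
    """Rule-based model: an explicit if/else rule on one attribute."""
    return 1 if passenger["Sex"] == "female" else 0


def formula_model(passenger, w_female=2.5, w_class=-0.9, bias=0.5):
    """Formula model: a logistic function of the inputs.

    The weights are hypothetical values chosen for illustration only.
    """
    is_female = 1.0 if passenger["Sex"] == "female" else 0.0
    score = bias + w_female * is_female + w_class * passenger["Pclass"]
    probability = 1.0 / (1.0 + math.exp(-score))
    return 1 if probability >= 0.5 else 0


# A hypothetical passenger, not a real record.
passenger = {"Sex": "female", "Pclass": 1}
print(rule_based_model(passenger), formula_model(passenger))
```

Both functions map the same input to a 0/1 output; they differ only in how the relationship between input and output is written down.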
The Data
Kaggle competitions usually provide the competition data. There is a “Data” tab on every competition site. Click on the Data tab at the top of the competition page and you will find the raw data provided and, most of the time, a brief explanation of the data attributes[1].
There are three files in the Titanic Challenge:
- train.csv,
- test.csv, and
- gender_submission.csv.
The training set is meant to be used to build your models. It provides the outcome (also known as the “ground truth”) for each passenger. Your model will be based on attributes like passengers’ gender and class. You can also use feature engineering to create new features.
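Because the training set carries the ground truth, you can measure how an attribute relates to survival before building any model. The sketch below groups rows by one attribute and computes the fraction of survivors in each group. It uses a handful of toy rows that are not real Titanic records, and it assumes the training file has `Survived` and `Sex` columns.

```python
import csv
import io
from collections import defaultdict

# Toy rows for illustration only; these are NOT real Titanic records.
TOY_TRAIN_CSV = """PassengerId,Survived,Sex
1,0,male
2,1,female
3,1,female
4,0,male
5,1,male
"""


def survival_rate_by(rows, attribute):
    """Return {attribute value: fraction of rows with Survived == 1}."""
    survived = defaultdict(int)
    total = defaultdict(int)
    for row in rows:
        key = row[attribute]
        total[key] += 1
        survived[key] += int(row["Survived"])
    return {key: survived[key] / total[key] for key in total}


rows = list(csv.DictReader(io.StringIO(TOY_TRAIN_CSV)))
rates = survival_rate_by(rows, "Sex")
print(rates)
```

To run it on the real data, replace the `io.StringIO(...)` source with an opened `train.csv` file.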
The test set should be used to see how well your model performs on unseen data. For the test set, no ground truth is provided for the passengers. It is your job to predict these outcomes. For each passenger in the test set, use the model you trained to predict whether or not they survived the sinking of the Titanic.
The data sets also include gender_submission.csv, a set of predictions that assumes all and only female passengers survived, as an example of what a submission file should look like.
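The rule behind gender_submission.csv is simple enough to reproduce yourself. The following sketch applies it to test rows and writes the result in the two-column submission layout. The `Sex` column name is an assumption about the test file, the passenger IDs are hypothetical, and `gender_baseline` and `write_submission` are helper names introduced here for illustration.

```python
import csv
import io


def gender_baseline(test_rows):
    """Predict 1 (survived) for female passengers and 0 for everyone else."""
    return [
        {
            "PassengerId": row["PassengerId"],
            "Survived": 1 if row["Sex"] == "female" else 0,
        }
        for row in test_rows
    ]


def write_submission(predictions, handle):
    """Write predictions as CSV with a PassengerId,Survived header."""
    writer = csv.DictWriter(
        handle, fieldnames=["PassengerId", "Survived"], lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(predictions)


# Hypothetical test rows, not taken from test.csv.
toy_test = [
    {"PassengerId": "9001", "Sex": "male"},
    {"PassengerId": "9002", "Sex": "female"},
]
buffer = io.StringIO()
write_submission(gender_baseline(toy_test), buffer)
print(buffer.getvalue())
```

Writing to an in-memory buffer keeps the example self-contained; passing an opened file instead would produce a submission file on disk.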
The Submission
A submission to the Titanic competition is equivalent to the final report of any data science project. What the final report requires is one of the questions you need to understand at the beginning of the project.
The Titanic competition requires the results to be submitted in a file. The file structure is demonstrated in “gender_submission.csv”, which is provided as an example of how you should structure your results, that is, your predictions.
The example submission in “gender_submission.csv” predicts that all female passengers survived and all male passengers died. It is clearly biased. Your hypotheses regarding survival will probably be different, which will lead to a different submission file. It is probably a good idea to rename the “gender_submission.csv” file to “my_submission.csv” now, so you know that you have to submit “my_submission.csv” as the final report of your project, and that the submission indicates the completion of your project.
Do it yourself:
- Download the data files from the Kaggle website (https://www.kaggle.com/c/titanic/data).
- Unzip them into your working directory (e.g. “./data/”).
- Rename the “gender_submission.csv” file to “my_submission.csv”.
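If you prefer to script the last two steps, they can be sketched in Python as follows. This assumes you have already downloaded the archive by hand. The archive name `titanic.zip` in the usage comment is a guess, so use whatever name your download has. `prepare_data` is a helper name introduced here, and the renamed file is written in lowercase as `my_submission.csv`.

```python
import zipfile
from pathlib import Path


def prepare_data(zip_path, data_dir="./data"):
    """Extract the downloaded archive and rename the sample submission.

    Returns the path of the renamed submission file.
    """
    data_dir = Path(data_dir)
    data_dir.mkdir(parents=True, exist_ok=True)

    # Unzip everything into the working data directory.
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(data_dir)

    # Rename the example submission so it becomes your own report file.
    sample = data_dir / "gender_submission.csv"
    target = data_dir / "my_submission.csv"
    if sample.exists():
        sample.replace(target)
    return target


# Usage (the archive name is an assumption):
# prepare_data("titanic.zip", "./data")
```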
Make sure your submission has:
- A “PassengerId” column containing the ID of each passenger from test.csv.
- A “Survived” column (that you will create!) with a “1” for the rows where you predict that the passenger survived, and a “0” where you predict that the passenger died.
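A quick self-check before submitting can catch format mistakes early. The sketch below tests only the two requirements listed above; it is not Kaggle's own validation, and `check_submission` is a helper name introduced here. The passenger IDs in the example are hypothetical.

```python
import csv
import io


def check_submission(rows, test_ids):
    """Return a list of problems found; an empty list means the checks passed."""
    problems = []
    for column in ("PassengerId", "Survived"):
        if any(column not in row for row in rows):
            problems.append(f"missing column: {column}")
    if problems:
        return problems

    # Every passenger from test.csv must appear exactly once.
    ids = [row["PassengerId"] for row in rows]
    if sorted(ids) != sorted(test_ids):
        problems.append("PassengerId values do not match test.csv")

    # csv.DictReader yields strings, so compare against "0" and "1".
    if any(row["Survived"] not in ("0", "1") for row in rows):
        problems.append("Survived must be 0 or 1")
    return problems


# Hypothetical submission content, not real predictions.
good = list(csv.DictReader(io.StringIO("PassengerId,Survived\n9001,0\n9002,1\n")))
print(check_submission(good, ["9001", "9002"]))
```

For a real check, read the rows from your submission file and take `test_ids` from the PassengerId column of test.csv.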
[1] We have used data science terminology here. Data represent objects in the natural world, and object properties are represented by attributes. That is, a data record has a number of attributes representing a natural object with a number of properties. Records are also called observations or samples in statistics; properties are also called variables, parameters, or dimensions.