Kaggle's Titanic Challenge

This notebook presents feature engineering using the well-known Titanic challenge from Kaggle. It documents feature analysis with Pandas and is aimed at beginners. The default settings lead to a test classification accuracy of about 80% when submitted to Kaggle, ranking among the top 3-4% of its leaderboard.

source: wikimedia.org

last update: 19/12/2020

Author: Christopher Hahne, PhD

Data acquisition

We first load the official Kaggle csv files into a pandas.DataFrame.
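A minimal loading step might look as follows (the file paths are assumptions and depend on where the Kaggle files were placed):

```python
import pandas as pd

# load the official Kaggle competition files into DataFrames
train_df = pd.read_csv('train.csv')
test_df = pd.read_csv('test.csv')
```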

For better result assessment, we further load the entire data set and extract the ground truth test data.
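A sketch of this step, assuming a hypothetical file titanic_complete.csv that contains the Survived label for all passengers:

```python
# hypothetical complete data set including labels for the official test split
full_df = pd.read_csv('titanic_complete.csv')

# ground-truth labels aligned to the test passengers via PassengerId
y_true = full_df.set_index('PassengerId').loc[test_df['PassengerId'], 'Survived']
```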

Kaggle's default classification

By default, Kaggle presents a neat and short solution using a RandomForestClassifier based on only four features, reaching an accuracy of 77.51%, which we aim to improve.
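The baseline roughly follows Kaggle's starter notebook; treat the snippet below as a sketch of that approach rather than a verbatim copy:

```python
from sklearn.ensemble import RandomForestClassifier

# the four features used by the starter solution
features = ['Pclass', 'Sex', 'SibSp', 'Parch']
X_train = pd.get_dummies(train_df[features])
X_test = pd.get_dummies(test_df[features])
y_train = train_df['Survived']

model = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=1)
model.fit(X_train, y_train)
baseline_pred = model.predict(X_test)
```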

Data analysis

At the beginning, it is important to familiarize oneself with the data. Let's specifically have a look at the column titles, which will act as features.

By reviewing the first entries as an excerpt, we observe the following plain features for each passenger:
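Listing the columns and previewing the first rows is straightforward:

```python
# available feature columns and a preview of the first passengers
print(train_df.columns.tolist())
train_df.head()
```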

Let's get a first impression of the numerical distributions. From this, we observe incomplete data (e.g., the Age count is lower than the total number of rows).
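```python
# numerical summary statistics; note that the Age count falls short of the row count
train_df.describe()
```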

Next, we visualize existing feature data to analyze relations and to make assumptions on potential underlying concepts.
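As one possible visualization (the plotting library and the exact plots are assumptions), survival rates per class, gender and sibling/spouse count could be drawn with seaborn:

```python
import matplotlib.pyplot as plt
import seaborn as sns

fig, axes = plt.subplots(1, 3, figsize=(12, 3))
# mean survival rate per class, gender and number of siblings/spouses on board
sns.barplot(data=train_df, x='Pclass', y='Survived', ax=axes[0])
sns.barplot(data=train_df, x='Sex', y='Survived', ax=axes[1])
sns.barplot(data=train_df, x='SibSp', y='Survived', ax=axes[2])
plt.show()
```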

From the plots above, we get an idea of how much of an impact class, gender and the number of family members on board (siblings/spouses, parents/children) had on the survival rate. To see how features depend on each other, it is instructive to glance at the correlation matrix.
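A correlation matrix over the numerical columns can be computed and plotted like this:

```python
# Pearson correlation over the numerical columns, shown as a heatmap
corr = train_df.select_dtypes('number').corr()
sns.heatmap(corr, annot=True, fmt='.2f', cmap='coolwarm')
plt.show()
```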

From the correlation matrix, we observe strong correlations between Pclass and Fare as well as SibSp and Parch, which appear reasonable. These links will be useful for data imputation as demonstrated in the following feature engineering stage.

Feature Engineering

The major task for a potential improvement of the survival prediction is to extract, transform, reduce or simplify the given passenger data. For that purpose, we make a copy of the original data frame, which is subject to change.
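In code, this is a single line; the working copy df is what the following snippets modify:

```python
# working copy so that the raw Kaggle data stays untouched
df = train_df.copy()
```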

Binary Label Conversion

First, an obvious step is to convert the given gender information from its original string type to binary values.
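```python
# map the gender strings to binary values
df['Sex'] = df['Sex'].map({'male': 0, 'female': 1})
```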

Feature Extraction (Title & Ethnicity from Passenger Name)

Passenger names contain rich information encoded as strings. One of the most common approaches is to extract the salutation and re-organize the resulting titles into groups.
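A sketch of this extraction; the regular expression, the title mapping and the rarity threshold are assumptions:

```python
# extract the salutation after the comma, e.g. 'Braund, Mr. Owen Harris' -> 'Mr'
df['Title'] = df['Name'].str.extract(r' ([A-Za-z]+)\.', expand=False)

# harmonize equivalent titles and collect infrequent ones into a 'Rare' group
df['Title'] = df['Title'].replace({'Mlle': 'Miss', 'Ms': 'Miss', 'Mme': 'Mrs'})
counts = df['Title'].value_counts()
df['Title'] = df['Title'].replace(list(counts[counts < 10].index), 'Rare')
```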

A variety of nationalities were on board the Titanic. Given that the ship was British-flagged, speaking English might have been a great advantage when seats in the lifeboats were allocated. Below are attempts to derive nationality data from the passenger names.
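The notebook's original derivation is not reproduced here; purely as an illustration, a crude surname-suffix heuristic (a toy assumption, not a validated method) could look like this:

```python
# toy heuristic: flag surnames with suffixes common in some non-English-speaking
# regions; for serious use, a dedicated name/nationality model would be needed
df['Surname'] = df['Name'].str.split(',').str[0].str.strip()
non_english_suffix = ('sson', 'sen', 'nen', 'vic', 'off', 'escu')
df['NonEnglishName'] = df['Surname'].str.lower().str.endswith(non_english_suffix).astype(int)
```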

As there seems to be a link between language and survival, we may regard people's ethnicity as an additional measure.

Imputation of missing data

From the presence of NaN values, we observe that our data is partially incomplete. Instead of removing incomplete data columns along with their useful entries, we may employ data imputation techniques and fill missing information with reasonable guesses.
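```python
# count missing entries per column
print(df.isna().sum())
```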

With regard to the ticket fare, we fill the single missing entry with the corresponding Pclass-grouped mean, since the fare feature strongly correlates with the passenger class.
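A grouped transform keeps the index alignment intact; the single gap actually sits in the test split, but the pattern is the same for either frame:

```python
# fill missing fares with the mean fare of the corresponding passenger class
for frame in (df, test_df):
    frame['Fare'] = frame['Fare'].fillna(frame.groupby('Pclass')['Fare'].transform('mean'))
```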

Missing age information is imputed similarly, except that the groups rely on more features, from which we infer the median value.
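A sketch of the grouped median imputation; which columns form the groups is an assumption here:

```python
# impute missing ages with the median of passengers sharing class, sex and title
df['Age'] = df['Age'].fillna(df.groupby(['Pclass', 'Sex', 'Title'])['Age'].transform('median'))
# fall back to the global median for groups that are entirely empty
df['Age'] = df['Age'].fillna(df['Age'].median())
```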

As an alternative to this, we may assign a dedicated class value for a group of passengers that lacks certain feature information.
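For example, applied to the Embarked column (the choice of column is an assumption):

```python
# mark missing entries with a dedicated category instead of guessing a value
df['Embarked'] = df['Embarked'].fillna('Unknown')
```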

Binning

To reduce feature complexity, numerical values may be grouped together in several bins, which act as feature categories. Let's have a look at the distribution of the Fare feature and make a reasonable choice on the boundaries.
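Quantile-based binning keeps the categories roughly equally populated; the number of bins below is an assumption and should follow the Fare histogram:

```python
# discretize the continuous fare into four quantile-based bins
df['FareBin'] = pd.qcut(df['Fare'], q=4, labels=False)
```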

Cabin
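The section body is not reproduced above; a common treatment (a sketch, not necessarily the author's exact method) keeps only the deck letter of the cabin and a placeholder for unknown cabins:

```python
# the first character of the cabin string encodes the deck; 'U' marks unknown cabins
df['Deck'] = df['Cabin'].str[0].fillna('U')
```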

One-Hot Encoding

Another technique often used for feature engineering is One-Hot Encoding (OHE), which is meant for categorical features. OHE transforms a single feature into several binary labels and thus yields a higher dimensionality. OHE helps avoid arbitrary numerical values, which is recommended in cases where the order of labels is irrelevant; otherwise, the numerical order may influence the decision making even though there is no hierarchical relationship.
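With pandas, this is a single call; the selection of columns to encode is an assumption:

```python
# expand categorical columns into binary indicator columns
df = pd.get_dummies(df, columns=['Title', 'Embarked', 'Deck'], dtype=int)
```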

Feature reduction

Due to the similarity between the SibSp and Parch features, as seen in the correlation plot above, we combine these two data columns into a single one and thus reduce the number of features.
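```python
# combine siblings/spouses and parents/children into one family size feature
df['FamilySize'] = df['SibSp'] + df['Parch'] + 1
df = df.drop(columns=['SibSp', 'Parch'])
```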

To further simplify the family size information, we convert the number of relatives per passenger into a binary flag stating whether a person is travelling alone or accompanied by others.
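```python
# binary flag: 1 if the passenger travels alone, 0 otherwise
df['IsAlone'] = (df['FamilySize'] == 1).astype(int)
```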

Family-connected survival from Ticket

According to a blog post from Vidhya, it is a viable approach to gather information about relatives from the ticket data. The idea behind this is that family members are likely to survive or die together as a collective.
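A minimal sketch of the idea, not the referenced blog post verbatim, grouping passengers by shared ticket on the training data only:

```python
# passengers sharing a ticket are treated as one travel group; the group's
# mean survival hints at a shared fate (ideally excluding the passenger itself)
ticket_survival = train_df.groupby('Ticket')['Survived'].transform('mean')
df['GroupSurvival'] = ticket_survival.fillna(0.5)
```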

Feature selection

Finally, we can reflect on how much the engineered features contribute to a model. Instead of heuristic trial and error, we analyze the importance of each feature prior to manual selection.
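One way to do this (the dropped columns and the forest settings are assumptions) is to rank features by a random forest's impurity-based importances:

```python
from sklearn.ensemble import RandomForestClassifier

# keep only numerical, engineered columns as model input
X = df.drop(columns=['Survived', 'PassengerId', 'Name', 'Ticket', 'Cabin', 'Surname'],
            errors='ignore').select_dtypes('number')
y = df['Survived']

forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
importances = pd.Series(forest.feature_importances_, index=X.columns).sort_values(ascending=False)
print(importances)
```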

Model selection and cross-validation
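A typical setup compares candidate estimators via k-fold cross-validation; the candidate models below are assumptions, reusing X and y from the feature-importance step:

```python
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

# 5-fold cross-validated accuracy for each candidate model
for clf in (RandomForestClassifier(random_state=0), LogisticRegression(max_iter=1000)):
    scores = cross_val_score(clf, X, y, cv=5, scoring='accuracy')
    print(type(clf).__name__, round(scores.mean(), 3))
```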

Hyper-parameter tuning

The estimation models above strongly rely on hyper-parameters, which we aim to optimize iteratively using the model_selection module from sklearn.
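For instance, a grid search over a random forest could look like this; the parameter grid is an assumption:

```python
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier

# exhaustive, cross-validated search over the hyper-parameter grid
param_grid = {'n_estimators': [100, 300], 'max_depth': [4, 6, 8], 'min_samples_leaf': [1, 3]}
search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=5, scoring='accuracy')
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```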

Kaggle accuracy benchmark
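As a sketch of the final evaluation, the tuned model can be scored against the extracted ground truth and exported as a submission file; X_test_engineered is a placeholder for the test frame after applying the same engineering steps as above:

```python
from sklearn.metrics import accuracy_score

# evaluate against the extracted ground truth and write the Kaggle submission
test_pred = search.best_estimator_.predict(X_test_engineered)
print('test accuracy:', round(accuracy_score(y_true, test_pred), 4))

pd.DataFrame({'PassengerId': test_df['PassengerId'], 'Survived': test_pred}) \
  .to_csv('submission.csv', index=False)
```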