Kaggle's Titanic Challenge

This notebook presents feature engineering using the well-known Titanic challenge from Kaggle. It is meant as beginner-friendly documentation of feature analysis with Pandas. The default settings lead to a test data classification accuracy of about 80% when submitted to Kaggle, which ranks among the top 3-4% of its leaderboard.

[Image of the RMS Titanic; source: wikimedia.org]

last update: 19/12/2020
Author: Christopher Hahne, PhD

Data acquisition

We first load the official Kaggle CSV files into pandas.DataFrame objects.

For better result assessment, we further load the entire data set and extract the ground truth test data.
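A minimal loading sketch is given below; the file names and paths are assumptions, and the complete passenger list (e.g. the public titanic3 data set) is not part of the official competition files:

    import pandas as pd

    # official Kaggle competition files, assumed to reside in the working directory
    train_df = pd.read_csv('train.csv')
    test_df = pd.read_csv('test.csv')

    # hypothetical: the complete passenger list (e.g. the public titanic3 set),
    # from which the ground truth of the test passengers can be looked up by name
    full_df = pd.read_csv('titanic3.csv')
    y_true = full_df.merge(test_df, left_on='name', right_on='Name')['survived']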

Kaggle's default classification

By default, Kaggle presents a neat and short solution using a RandomForestClassifier based on only 4 features, reaching an accuracy of 77.51%, which we aim to improve.
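For reference, the starter code runs along the following lines (mirroring Kaggle's "Titanic Tutorial" notebook; exact parameters may differ):

    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier

    # the four features used by Kaggle's starter solution
    features = ['Pclass', 'Sex', 'SibSp', 'Parch']
    X = pd.get_dummies(train_df[features])
    X_test = pd.get_dummies(test_df[features])

    model = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=1)
    model.fit(X, train_df['Survived'])
    predictions = model.predict(X_test)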

Data analysis

To begin with, it is important to familiarize oneself with the data. Let's specifically have a look at the column titles, which will act as features.

By reviewing the first entries as an excerpt, we observe the following plain features for each passenger:
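In pandas, both views are one-liners (assuming the training frame is named train_df as above):

    # column titles acting as feature names
    print(train_df.columns.tolist())

    # first rows as an excerpt of the plain features
    train_df.head()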

Let's get a first impression of the numerical distribution. From this, we observe sparse data (e.g. see Age count).
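The summary statistics can be obtained as follows; a count below the total number of rows (891 in the training set) reveals missing values:

    # summary statistics of the numerical columns;
    # e.g. the Age count falls short of the row total, indicating NaNs
    train_df.describe()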

Next, we visualize existing feature data to analyze relations and to make assumptions about potential underlying concepts.
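One way to produce such plots is via seaborn bar plots of the survival rate per feature value (a sketch; the actual plot selection may vary):

    import matplotlib.pyplot as plt
    import seaborn as sns

    fig, axs = plt.subplots(1, 4, figsize=(16, 4))
    for ax, feat in zip(axs, ['Pclass', 'Sex', 'SibSp', 'Parch']):
        sns.barplot(x=feat, y='Survived', data=train_df, ax=ax)
    plt.show()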

From the plots above, we get an idea of how much of an impact class, gender, and the number of family members on board (siblings/spouses, parents/children) had on the survival rate. To see how features depend on each other, it is worth glancing at the correlation matrix.
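A heatmap over the pairwise correlations of the numerical columns serves this purpose, for instance:

    import matplotlib.pyplot as plt
    import seaborn as sns

    # pairwise Pearson correlations of the numerical features
    corr = train_df.select_dtypes('number').corr()
    sns.heatmap(corr, annot=True, cmap='coolwarm')
    plt.show()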

From the correlation matrix, we observe strong correlations between Pclass and Fare as well as SibSp and Parch, which appear reasonable. These links will be useful for data imputation as demonstrated in the following feature engineering stage.

Feature Engineering

The major task for potentially improving the survival prediction is to extract, transform, reduce, or simplify the given passenger data. For that purpose, we make a copy of the original data frame, which is subject to change.
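For instance:

    # work on a copy so that the raw training frame remains untouched
    df = train_df.copy()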

Binary Label Conversion

An obvious first step is to convert the given gender information from its original string type to binary values.
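A straightforward mapping does the trick (the particular 0/1 assignment is an arbitrary convention):

    # convert gender strings to binary values
    df['Sex'] = df['Sex'].map({'male': 0, 'female': 1})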

Feature Extraction (Title & Ethnicity from Passenger Name)

Passenger names contain rich information encoded as strings. One of the most common approaches is to extract the salutation and re-organize the titles into a few groups.
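A widely used variant extracts the token before the period via a regular expression and merges rare salutations (the grouping below is one common choice, not the only sensible one):

    # the salutation sits between the comma and the period,
    # e.g. 'Braund, Mr. Owen Harris' -> 'Mr'
    df['Title'] = df['Name'].str.extract(r' ([A-Za-z]+)\.', expand=False)

    # collapse rare salutations into broader groups
    rare = ['Lady', 'Countess', 'Capt', 'Col', 'Don', 'Dona',
            'Dr', 'Major', 'Rev', 'Sir', 'Jonkheer']
    df['Title'] = df['Title'].replace(rare, 'Rare')
    df['Title'] = df['Title'].replace({'Mlle': 'Miss', 'Ms': 'Miss', 'Mme': 'Mrs'})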

A variety of nationalities were on board the Titanic. Given that the ship was British-flagged, speaking English might have been a great advantage when seats in the lifeboats were allocated. Below are attempts to derive nationality data from the passenger names.
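As a purely illustrative sketch, a crude surname-suffix heuristic could look as follows; the suffix lists and labels are assumptions for demonstration, and a serious attempt would rather use a dedicated name-classification model or data set:

    # guess a rough language group from typical surname endings (illustrative only)
    surnames = df['Name'].str.split(',').str[0].str.strip()

    def guess_origin(name):
        if name.endswith(('sson', 'sen', 'strom')):
            return 'scandinavian'
        if name.endswith(('ic', 'off', 'eff')):
            return 'slavic'
        return 'anglophone'  # crude catch-all default

    df['Ethnicity'] = surnames.apply(guess_origin)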

As there seems to be a link between language and survival, we may regard people's ethnicity as an additional measure.

Imputation of missing data

From the presence of NaN values, we observe that our data is partially incomplete. Instead of removing incomplete data columns along with their useful entries, we may employ data imputation techniques and fill missing information with reasonable guesses.

With regard to the ticket fare, we fill the single missing entry with the corresponding Pclass-grouped mean, since the fare feature strongly correlates with the passenger class.
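With pandas, a group-wise transform keeps the index aligned:

    # replace the missing fare with the mean fare of the same passenger class
    df['Fare'] = df['Fare'].fillna(df.groupby('Pclass')['Fare'].transform('mean'))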

Missing age information is imputed similarly, except that the groups rely on more features, from which we infer the median value.
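A sketch with an assumed grouping by class, gender, and title (the exact feature set is a design choice):

    # impute missing ages with the median of passengers sharing class, sex and title
    df['Age'] = df['Age'].fillna(
        df.groupby(['Pclass', 'Sex', 'Title'])['Age'].transform('median'))

    # fall back to the overall median for groups without any age entry
    df['Age'] = df['Age'].fillna(df['Age'].median())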

As an alternative, we may assign a dedicated class value to a group of passengers that lacks certain feature information.
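For example, cabin information is missing for most passengers, so the deck letter may be retained with a dedicated "unknown" category:

    # the first letter of the cabin denotes the deck; 'U' marks unknown cabins
    df['Deck'] = df['Cabin'].str[0].fillna('U')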

Binning

To reduce feature complexity, numerical values may be grouped into several bins, which act as feature categories. Let's have a look at the distribution of the Fare feature and make a reasonable choice of bin boundaries.
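A sketch using a histogram for inspection and pd.cut with hand-picked edges; the boundaries below are placeholders, and quantile-based bins via pd.qcut are a common alternative:

    import matplotlib.pyplot as plt

    # inspect the fare distribution before choosing bin edges
    df['Fare'].hist(bins=50)
    plt.show()

    # hypothetical boundaries chosen from the histogram
    df['FareBin'] = pd.cut(df['Fare'], bins=[0, 8, 15, 31, 600],
                           labels=[0, 1, 2, 3], include_lowest=True)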

Cabin