Alonso Silva (@alonsosilva) | 2017/08/18
Note: I'm using the Titanic dataset which can be downloaded here.
First of all, I'm going to import some packages that I'll need to perform the analysis of this dataset
# Import numpy, pandas, matplotlib, and seaborn to analyze and plot the data
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
%config InlineBackend.figure_format = 'retina'
Now, I'm going to load the dataset from the csv file (downloaded in the same directory as this notebook) to a dataframe called titanic
# Load the data from the csv file to a dataframe called titanic
titanic = pd.read_csv('titanic-data.csv')
I'll take a look at the first few rows of the dataset to see what the dataset contains
# Look at the head of the dataset
titanic.head()
A more thorough description of the dataset can be found here.
Let's see how many passengers this dataset has information about
# Number of passengers in the dataset
len(titanic)
The Titanic carried approximately 1,317 passengers, and approximately 2,222 people in total (passengers and crew) were on board (source: http://www.titanicfacts.net/titanic-passengers.html). This dataset is therefore a subset of 891 passengers, which represents approximately 68% of the passengers and 40% of the people on board.
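As a quick check of these percentages, the shares can be recomputed from the three counts quoted above. This is only a minimal sketch; the variable names are illustrative and are not part of the notebook.

```python
# Counts quoted in the text above
n_dataset = 891       # passengers in this dataset
n_passengers = 1317   # approximate number of passengers on the Titanic
n_on_board = 2222     # approximate number of people on board (passengers and crew)

share_of_passengers = 100.0 * n_dataset / n_passengers
share_of_on_board = 100.0 * n_dataset / n_on_board

print("Share of passengers: %.0f%%" % share_of_passengers)
print("Share of people on board: %.0f%%" % share_of_on_board)
```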
I would like to find out whether survivorship is related to the class in which you traveled.
I'll check if there is any missing information in these columns
# Count how many NaN values each Series (column) has
titanic.isnull().sum()
The ages of 177 people are missing from the dataset. Similarly, 687 people are missing information about their cabin numbers. Let's compute the percentage of values that are not NaN to evaluate how much data we are missing.
def percentage_of_is_not_NAN(column):
    '''
    Return the percentage of is-not-NaN values compared
    to the total quantity of values in the column
    '''
    length_column = len(column)
    length_column_with_no_NAN_values = len(column.dropna())
    return ((length_column_with_no_NAN_values * 1.0) / length_column) * 100
print("Percentage of not NaN in Age:", percentage_of_is_not_NAN(titanic['Age']))
print("Percentage of not NaN in Cabin numbers:", percentage_of_is_not_NAN(titanic['Cabin']))
We don't lose much data when we exclude the passengers whose ages are missing; however, we may lose quite a lot of data when we exclude the passengers whose cabin numbers are missing. We'll get back to how to handle both of these groups of passengers when we use those columns.
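pandas can also compute these percentages for every column at once, without a helper function: `notna()` marks the non-missing entries, and the mean of a boolean column is the fraction of True values. The sketch below uses a small made-up DataFrame rather than the real file.

```python
import numpy as np
import pandas as pd

# Toy data, made up for illustration (not taken from titanic-data.csv)
toy = pd.DataFrame({
    'Age':   [22.0, np.nan, 35.0, 54.0],
    'Cabin': ['C85', None, None, None],
})

# notna() marks the non-missing entries; the mean of a boolean column
# is the fraction of True values
pct_not_nan = toy.notna().mean() * 100
print(pct_not_nan)
```

Applied to the real data, the same expression on the `titanic` DataFrame would report one percentage per column.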
My initial thought is that different classes had different rates of survival because wealthy people may have been put on lifeboats before the rest of the passengers. Let's see if that's the case.
I go back to using the original DataFrame 'titanic' since neither of the variables survivorship and class has NaN values.
Let's take a look at how many passengers there are for each class
# How many passengers are there by each class
print(titanic.groupby('Pclass')['Pclass'].value_counts())
There are 216 passengers in the 1st class, 184 passengers in the 2nd class, and 491 passengers in the 3rd class.
Let's see how many passengers survived in each class
# How many passengers survived grouped by class
print(titanic.groupby('Pclass')['Survived'].value_counts())
In the 1st class, 63% of the passengers survived (136 out of 216); in 2nd class, 47% of passengers survived (87 out of 184); while in the 3rd class, only 24% of the passengers survived (119 out of 491).
We want to explore whether there is a correlation between survivorship and class. For this, we compute the correlation between survivorship and the class in which the passengers traveled.
# Standardize each column of the given DataFrame.
def standardize(df):
    return (df - df.mean()) / df.std(ddof=0)
# Correlation between survivorship and class
np.corrcoef(standardize(titanic['Survived']),
            standardize(titanic['Pclass']))
We see that there is indeed a negative correlation between survivorship and the class in which the passengers traveled ($r=-0.34$). Since the class is coded as 1, 2, or 3, a negative correlation means that if you traveled in 3rd class you were less likely to survive than if you traveled in 2nd class and, similarly, if you traveled in 2nd class you were less likely to survive than if you traveled in 1st class.
However, this correlation may be due to the fact that more women were traveling in first class and women survived more often than men, or that more children were traveling in first class and children survived more often than adults, or to where the first-class cabins were located on the ship. Let's explore these possibilities in what follows.
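The value of $r$ can be checked from the class counts alone. The following is a minimal sketch, assuming only the counts quoted earlier (216, 184, and 491 passengers, of whom 136, 87, and 119 survived); it rebuilds the two columns from those counts, and the variable names are illustrative.

```python
import numpy as np

# (class, passengers, survivors) as quoted in the text above
counts = [(1, 216, 136), (2, 184, 87), (3, 491, 119)]

pclass, survived = [], []
for cls, total, n_survived in counts:
    pclass += [cls] * total
    survived += [1] * n_survived + [0] * (total - n_survived)

pclass = np.array(pclass)
survived = np.array(survived)

# The Pearson correlation does not change when the columns are
# standardized, so np.corrcoef can be applied to the raw values
r = np.corrcoef(survived, pclass)[0, 1]
print("r = %.2f" % r)
```

Because the Pearson correlation is unchanged by standardizing the inputs, calling `np.corrcoef` on the raw columns should give the same value as the standardized version used in the notebook.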
As an initial hypothesis, I think that gender could be a good predictor of survival rate because women may have been given the priority to board the available boats first. Let's see if that's the case
# How many passengers of each gender
titanic.groupby('Sex')['Sex'].value_counts()
There are many more men (577 or 65%) than women (314 or 35%) in the dataset.
Let's see how many passengers survived in each gender class
# How many passengers of each gender survived
print(titanic.groupby('Sex')['Survived'].value_counts())
We notice that 74% of the women survived (233 out of 314) while only 19% of the men survived (109 out of 577).
Let's take a look at the correlation between survivorship and being female
# Correlation between survivorship and being female
titanic[['isFemale','isMale']] = pd.get_dummies(titanic['Sex'])
np.corrcoef(standardize(titanic['Survived']),
            standardize(titanic['isFemale']))
There seems to be a high correlation ($r=0.54$) between being female and survivorship. This suggests that it may be the case that for example women were given the priority to board the boats. This is confirmed by the article which says that
When the lifeboats were finally lowered officers gave the order that "women and children" should go first. One hundred and fifteen men in first class and 147 men from second class are recorded as having stood back to make space available and as a result died.
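For two 0/1 variables such as survivorship and being female, the Pearson correlation reduces to the phi coefficient of the 2x2 table of counts. As a rough check of the correlation reported above, here is a sketch that uses only the counts quoted in the text (233 of 314 women and 109 of 577 men survived); the variable names are illustrative.

```python
import math

# 2x2 table from the counts quoted above
women_survived, women_total = 233, 314
men_survived, men_total = 109, 577

a = women_survived                  # female, survived
b = women_total - women_survived    # female, did not survive
c = men_survived                    # male, survived
d = men_total - men_survived        # male, did not survive

# phi = (ad - bc) / sqrt(product of the four marginal totals)
phi = (a * d - b * c) / math.sqrt((a + b) * (c + d) * (a + c) * (b + d))
print("phi = %.2f" % phi)
```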
Perhaps, gender is a confounding variable that explains the previous correlation between class and survivorship. Let's study this in the following.
We have seen that class is negatively correlated with survivorship, but it may be that the presence of an important confounding variable (gender) explains that correlation. Let's explore this question
# How many passengers of each class and gender survived
grouped = titanic.groupby(['Pclass', 'Sex'])
grouped['Survived'].value_counts()
In 1st class 97% of women survived (91 out of 94), in 2nd class 92% of women survived (70 out of 76), and in 3rd class 50% of women survived (72 out of 144).
In 1st class 37% of men survived (45 out of 122), in 2nd class 16% of men survived (17 out of 108), and in 3rd class 14% of men survived (47 out of 347).
For women as for men, it seems that being in the 1st class increased the chances of survivorship.
Let's see the correlation.
I have found the partial correlation function implemented here by fabianp, which I copy in the following cell
"""
Partial Correlation in Python (clone of Matlab's partialcorr)
This uses the linear regression approach to compute the partial
correlation (might be slow for a huge number of variables). The
algorithm is detailed here:
http://en.wikipedia.org/wiki/Partial_correlation#Using_linear_regression
Taking X and Y two variables of interest and Z the matrix with all the variable minus {X, Y},
the algorithm can be summarized as
1) perform a normal linear least-squares regression with X as the target and Z as the predictor
2) calculate the residuals in Step #1
3) perform a normal linear least-squares regression with Y as the target and Z as the predictor
4) calculate the residuals in Step #3
5) calculate the correlation coefficient between the residuals from Steps #2 and #4;
The result is the partial correlation between X and Y while controlling for the effect of Z
Date: Nov 2014
Author: Fabian Pedregosa-Izquierdo, f@bianp.net
Testing: Valentina Borghesani, valentinaborghesani@gmail.com
"""
from scipy import stats, linalg
def partial_corr(C):
    """
    Returns the sample linear partial correlation coefficients between pairs of variables in C, controlling
    for the remaining variables in C.
    Parameters
    ----------
    C : array-like, shape (n, p)
        Array with the different variables. Each column of C is taken as a variable
    Returns
    -------
    P : array-like, shape (p, p)
        P[i, j] contains the partial correlation of C[:, i] and C[:, j] controlling
        for the remaining variables in C.
    """
    C = np.asarray(C)
    p = C.shape[1]
    # The aliases np.float and np.bool were removed from recent NumPy
    # releases; the builtins float and bool are equivalent
    P_corr = np.zeros((p, p), dtype=float)
    for i in range(p):
        P_corr[i, i] = 1
        for j in range(i+1, p):
            idx = np.ones(p, dtype=bool)
            idx[i] = False
            idx[j] = False
            beta_i = linalg.lstsq(C[:, idx], C[:, j])[0]
            beta_j = linalg.lstsq(C[:, idx], C[:, i])[0]
            res_j = C[:, j] - C[:, idx].dot(beta_i)
            res_i = C[:, i] - C[:, idx].dot(beta_j)
            corr = stats.pearsonr(res_i, res_j)[0]
            P_corr[i, j] = corr
            P_corr[j, i] = corr
    return P_corr
This implementation of partial correlation takes a DataFrame and computes the partial correlation between each pair of columns (ignoring the header), controlling for all the other variables. However, it fits the regressions without an intercept and doesn't standardize the data, therefore we first standardize (and thereby center) the columns of interest, and then compute the partial correlation between the different variables
C = standardize(titanic[['Survived', 'Pclass', 'isFemale']])
partial_corr(C)
The partial correlation between class and survivorship, controlling for gender, is still sizable in absolute value ($r_{X1,X2|X3}=-0.32$). That means that, even after taking gender into account, there is a negative correlation between survivorship and the class in which the passengers traveled, although it is slightly weaker than before considering gender ($r_{X1,X2}=-0.34$).
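With a single control variable, the partial correlation also has a closed form in terms of the three pairwise correlations: $r_{X1,X2|X3} = (r_{X1,X2} - r_{X1,X3}\,r_{X2,X3}) / \sqrt{(1-r_{X1,X3}^2)(1-r_{X2,X3}^2)}$. The sketch below applies this formula to columns rebuilt from the class-by-gender counts quoted earlier. It is an independent check under that assumption, not the function used in the notebook, and the variable names are illustrative.

```python
import numpy as np

# (Pclass, isFemale, survivors, total) from the counts quoted in the text
groups = [
    (1, 1, 91, 94), (2, 1, 70, 76), (3, 1, 72, 144),     # women
    (1, 0, 45, 122), (2, 0, 17, 108), (3, 0, 47, 347),   # men
]

rows = []
for cls, is_female, n_survived, total in groups:
    rows += [(1, cls, is_female)] * n_survived
    rows += [(0, cls, is_female)] * (total - n_survived)
data = np.array(rows, dtype=float)   # columns: Survived, Pclass, isFemale

R = np.corrcoef(data, rowvar=False)  # 3x3 matrix of pairwise correlations
r12, r13, r23 = R[0, 1], R[0, 2], R[1, 2]

# Partial correlation of Survived and Pclass, controlling for isFemale
r12_given_3 = (r12 - r13 * r23) / np.sqrt((1 - r13**2) * (1 - r23**2))
print("r(Survived, Pclass) = %.2f" % r12)
print("r(Survived, Pclass | isFemale) = %.2f" % r12_given_3)
```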
Since priority for boarding the boats was given to women and children, we study the correlation between survivorship and being a child. It's difficult to define precisely at which age (which is the information we have) one stops being a child. I tried different values and they gave similar results, so I'll consider a child to be anyone aged 12 or younger.
First, we exclude the NaN values.
'''
Exclude the rows with NaN values (the NaN values are in the Age column);
.copy() makes df2 an independent DataFrame so new columns can be added to it
'''
df2 = titanic[['Survived', 'Pclass', 'Sex', 'Age']].dropna().copy()
We create a new variable 'isChild' that indicates whether the passenger was 12 years old or younger. Similarly, we create the indicator variable 'isFemale', which indicates whether the passenger is female.
'''
Define two new variables 'isChild' if the age is <=12, and
'isFemale' if the gender of the passenger is female
'''
df2['isChild'] = (df2['Age'] <= 12)
df2[['isFemale','isMale']] = pd.get_dummies(df2['Sex'])
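Dropping the missing ages before building the indicator matters because a comparison with NaN evaluates to False, so a passenger of unknown age would silently be counted as a non-child. A toy example with made-up ages illustrates this:

```python
import numpy as np
import pandas as pd

# Made-up ages for illustration; the last one is missing
toy = pd.DataFrame({'Age': [4.0, 12.0, 30.0, np.nan]})

# NaN <= 12 is False, so the passenger of unknown age looks like an adult
naive_is_child = toy['Age'] <= 12
print(naive_is_child.tolist())

# Dropping the missing ages first keeps only passengers we can classify
known = toy.dropna(subset=['Age']).copy()
known['isChild'] = known['Age'] <= 12
print(known['isChild'].tolist())
```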
Let's see how many children there are in the dataset
# How many children in the dataset
df2.groupby('isChild')['isChild'].value_counts()
About 10% of the passengers with known age are children (69 out of 714).
Let's see how many of them survived
# How many passengers survived
print("Population survivorship:\n", df2['Survived'].value_counts())
# How many children passengers survived
print("Children survivorship:\n", df2.groupby('isChild')['Survived'].value_counts())
58% of children survived (40 out of 69), compared to 41% of all the passengers with known age (290 out of 714). There seems to be a positive correlation between being a child and survivorship. Let's see if that's the case
# Correlation between survivorship and being a child
np.corrcoef(standardize(df2['Survived']),
            standardize(df2['isChild']))
Indeed, there is a positive correlation $r=0.12$ between survivorship and being a child. Perhaps, being a child is a confounding variable that explains the previous correlation between class and survivorship. Let's study this in the following.
We have seen that class is negatively correlated with survivorship, but it may be that the presence of an important confounding variable (being a child) explains that correlation. Let's explore this question
# How many passengers survived, by class and by whether they are children
grouped = df2.groupby(['Pclass', 'isChild'])
grouped['Survived'].value_counts()
In 1st class 75% of children survived (3 out of 4), in 2nd class 100% of children survived (17 out of 17), and in 3rd class 42% of children survived (20 out of 48).
In the not-children population, in 1st class 65% of passengers survived (119 out of 182), in 2nd class 42% of passengers survived (66 out of 156), and in 3rd class 21% of passengers survived (65 out of 307).
Let's see if there is a partial correlation.
'''
Partial correlations between survivorship, class, and being a child
'''
C = standardize(df2[['Survived', 'Pclass', 'isChild']])
partial_corr(C)
The partial correlation between class and survivorship, controlling for being a child, is quite high in absolute value ($r_{X1,X2|X3}=-0.39$). That means that, even after taking into account being a child, there is a negative correlation between survivorship and the class in which the passengers traveled, and it is even stronger than the correlation found in the full dataset before considering being a child ($r_{X1,X2}=-0.34$).
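Note that the value $-0.34$ was computed on all 891 passengers, whereas the partial correlation uses only the 714 passengers with known age. The sketch below rebuilds that subset from the class-by-child counts quoted above (taking the number of 3rd class children to be 69 - 4 - 17 = 48) and computes both the simple correlation on this subset and the partial correlation from regression residuals. The helper `residuals` is hypothetical; it fits an intercept, so no standardization is needed.

```python
import numpy as np

# (Pclass, isChild, survivors, total) for the 714 passengers with known age,
# from the counts quoted in the text; 48 third-class children = 69 - 4 - 17
groups = [
    (1, 1, 3, 4), (2, 1, 17, 17), (3, 1, 20, 48),          # children
    (1, 0, 119, 182), (2, 0, 66, 156), (3, 0, 65, 307),    # non-children
]

rows = []
for cls, child, n_survived, total in groups:
    rows += [(1, cls, child)] * n_survived
    rows += [(0, cls, child)] * (total - n_survived)
data = np.array(rows, dtype=float)
survived, pclass, is_child = data[:, 0], data[:, 1], data[:, 2]

def residuals(y, z):
    """Residuals of a least-squares fit of y on z with an intercept."""
    Z = np.column_stack([np.ones_like(z), z])
    beta = np.linalg.lstsq(Z, y, rcond=None)[0]
    return y - Z.dot(beta)

r_simple = np.corrcoef(survived, pclass)[0, 1]
r_partial = np.corrcoef(residuals(survived, is_child),
                        residuals(pclass, is_child))[0, 1]
print("r(Survived, Pclass) on the age-known subset = %.2f" % r_simple)
print("r(Survived, Pclass | isChild) = %.2f" % r_partial)
```

Comparing the two numbers printed here is a like-for-like comparison, since both are computed on the same 714 passengers.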
We have seen that the correlation between survivorship and class is strong and that it cannot be explained by gender alone or by being a child alone, but perhaps the partial correlation changes when we combine both variables. Let's see if that's the case
'''
Partial correlations between survivorship, class, gender, and
being a child
'''
C = standardize(df2[['Survived', 'Pclass', 'isFemale', 'isChild']])
partial_corr(C)
The partial correlation between class and survivorship, controlling for gender and being a child, is quite high in absolute value ($r_{X1,X2|X3,X4}=-0.35$). That means that, even after taking into account gender and being a child, there is a negative correlation between survivorship and the class in which the passengers traveled, and it is even stronger in absolute value than the correlation found in the full dataset before considering both variables ($r_{X1,X2}=-0.34$).
Let's see how the location of the passengers' cabins changes these correlations.
As it was previously observed, 687 out of the 891 passengers are missing information about their cabin location.
'''
New DataFrame without the rows that have NaN values in Cabin (the other
variables don't have NaN values); .copy() makes df an independent DataFrame
'''
df = titanic[['Survived', 'Pclass', 'Sex', 'Cabin']].dropna().copy()
print("How many non nan values:", len(df))
print("How many unique values in the cabin:", len(df['Cabin'].unique()))
Only 23% of the passengers have a cabin value that is not NaN (204 values), and among those values there are 147 unique cabin numbers.
Let's see how the remaining values are distributed by class
# How many passengers are there by each class
print("Passengers by class:\n", titanic.groupby('Pclass')['Pclass'].value_counts())
# How many cabin passengers are there by each class
print("Remaining passengers by class:\n", df.groupby('Pclass')['Pclass'].value_counts())
The remaining values represent 81% of the 1st class (176 out of 216), 9% of the 2nd class (16 out of 184), and 2% of the 3rd class (12 out of 491). There are not many remaining passengers from 2nd and 3rd class.
Let's look at the correlation between survivorship and class in the remaining dataset
# Correlation between survivorship and class
np.corrcoef(standardize(df['Survived']),
            standardize(df['Pclass']))
There is no correlation between survivorship and the class in which the passengers traveled in the remaining values, therefore we cannot use these remaining values to control for location. One thing we can do is look at the locations shared by passengers from two or three different classes, and see if their survivorship rates differ in these common locations
# Unique cabin numbers for different classes
grouped = df.groupby('Pclass')
class_1st = grouped.get_group(1)
class_2nd = grouped.get_group(2)
class_3rd = grouped.get_group(3)
print("Cabin numbers of 3rd class:\n", class_3rd['Cabin'].unique())
print("Cabin numbers of 2nd class:\n", class_2nd['Cabin'].unique())
print("Cabin numbers of 1st class:\n", class_1st['Cabin'].unique())
We use the NumPy function np.intersect1d to find the intersection between the different cabin numbers
# Intersection between different cabin numbers
cabin_1st = class_1st['Cabin'].unique()
cabin_2nd = class_2nd['Cabin'].unique()
cabin_3rd = class_3rd['Cabin'].unique()
print("Intersection between 1st and 2nd class:", np.intersect1d(cabin_1st, cabin_2nd))
print("Intersection between 1st and 3rd class:", np.intersect1d(cabin_1st, cabin_3rd))
print("Intersection between 2nd and 3rd class:", np.intersect1d(cabin_2nd, cabin_3rd))
We notice that there are too many unique values compared to the total number of values, and that there is no intersection between the cabin numbers of the different classes. Perhaps we have been too restrictive. Let's consider just the deck on which they traveled. According to the website Encyclopedia Titanica, the deck is indicated by the first letter of the cabin number, so I'll use just that letter as an approximation of the location. Let's do that
'''
Deck keeps the first letter of the cabin number, the other variables are
indicators of the Deck at which the passengers belonged
'''
df['Deck'] = df['Cabin'].apply(lambda x: x[0])
df['isDeckA'] = df['Deck'].apply(lambda x: x == 'A')
df['isDeckB'] = df['Deck'].apply(lambda x: x == 'B')
df['isDeckC'] = df['Deck'].apply(lambda x: x == 'C')
df['isDeckD'] = df['Deck'].apply(lambda x: x == 'D')
df['isDeckE'] = df['Deck'].apply(lambda x: x == 'E')
df['isDeckF'] = df['Deck'].apply(lambda x: x == 'F')
df['isDeckG'] = df['Deck'].apply(lambda x: x == 'G')
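The same deck letter and indicator columns can be built more compactly with the `.str` accessor and `pd.get_dummies`. This is a sketch on made-up cabin numbers, shown as an alternative to the cell above rather than a replacement for it.

```python
import pandas as pd

# Made-up cabin numbers for illustration
toy = pd.DataFrame({'Cabin': ['C85', 'E46', 'D33', 'F2', 'E101']})

# .str[0] takes the first character of each string
toy['Deck'] = toy['Cabin'].str[0]

# One indicator column per deck letter that appears in the data
indicators = pd.get_dummies(toy['Deck'], prefix='isDeck', prefix_sep='')
toy = toy.join(indicators)
print(toy)
```

One difference to keep in mind: `pd.get_dummies` only creates a column for deck letters that actually occur, whereas the explicit assignments above always create 'isDeckA' through 'isDeckG'.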
Let's see on which decks each class traveled
# Unique decks for each class
grouped = df.groupby('Pclass')
class_1st = grouped.get_group(1)
class_2nd = grouped.get_group(2)
class_3rd = grouped.get_group(3)
deck_1st = class_1st['Deck'].unique()
deck_2nd = class_2nd['Deck'].unique()
deck_3rd = class_3rd['Deck'].unique()
print(deck_1st)
print(deck_2nd)
print(deck_3rd)
Let's see the decks in which there are intersections between the classes
# Intersection between different decks
print("Intersection between 1st and 2nd class:",
      np.intersect1d(deck_1st, deck_2nd))
print("Intersection between 1st and 3rd class:",
      np.intersect1d(deck_1st, deck_3rd))
print("Intersection between 2nd and 3rd class:",
      np.intersect1d(deck_2nd, deck_3rd))
We notice that all three classes intersect on deck 'E', that 1st and 2nd class also intersect on deck 'D', and that 2nd and 3rd class also intersect on deck 'F'.
We use DataFrame.query with the indicator variables to select the passengers who were on each of these intersection decks. Then, on these decks, we look at the survivorship rates by class
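To make the use of np.intersect1d concrete, here is a small sketch on illustrative arrays. The deck letters are assumptions chosen to be consistent with the intersections described above; they are not copied from the notebook output.

```python
import numpy as np

# Illustrative deck letters per class (assumed, not the notebook output)
deck_1st = np.array(['A', 'B', 'C', 'D', 'E'])
deck_2nd = np.array(['D', 'E', 'F'])
deck_3rd = np.array(['E', 'F', 'G'])

# np.intersect1d returns the sorted unique values common to both arrays
common_1_2 = np.intersect1d(deck_1st, deck_2nd)
common_2_3 = np.intersect1d(deck_2nd, deck_3rd)

# Chaining two calls gives the decks shared by all three classes
common_all = np.intersect1d(common_1_2, deck_3rd)
print(common_1_2, common_2_3, common_all)
```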
# Survivorship by class in the intersection decks: D, E, F
DeckE = df.query('isDeckE == 1')
DeckD = df.query('isDeckD == 1')
DeckF = df.query('isDeckF == 1')
print("E Deck: Survivorship by class:\n",
      DeckE.groupby('Pclass')['Survived'].value_counts())
print("D Deck: Survivorship by class:\n",
      DeckD.groupby('Pclass')['Survived'].value_counts())
print("F Deck: Survivorship by class:\n",
      DeckF.groupby('Pclass')['Survived'].value_counts())
In the E Deck: 72% of 1st class passengers survived (18 out of 25), 75% of 2nd class passengers survived (3 out of 4), and 100% of 3rd class passengers survived (3 out of 3).
In the D Deck: 76% of 1st class passengers survived (22 out of 29); while 75% of 2nd class passengers survived (3 out of 4).
In the F Deck: 88% of 2nd class passengers survived (7 out of 8); while 20% of 3rd class passengers survived (1 out of 5).
Although it's difficult to conclude much with such low numbers, the class to which you belonged doesn't seem to make a difference (at least from the numbers on the E Deck and on the D Deck). The deck on which you were located does seem to make a difference.
As we discover from the following article about myths related to the Titanic
Each class of passengers had access to their own decks and allocated lifeboats - although crucially no lifeboats were stored in the third class sections of the ship.
Third class passengers had to find their way through a maze of corridors and staircases to reach the boat deck. First and second class passengers were most likely to reach the lifeboats as the boat deck was a first and second class promenade.
This would explain the strong correlation that we found between class and survivorship. Apparently, the separation between classes was meant to comply with the immigration rules of the time
Gates did exist which barred the third class passengers from the other passengers. But this was not in anticipation of a shipwreck but in compliance with US immigration laws and the feared spread of infectious diseases.
Third class passengers included Armenians, Chinese, Dutch, Italians, Russians, Scandinavians and Syrians as well as those from the British Isles - all in search of a new life in America.
The inquiry that was held at the time concluded
"No evidence has been given in the course of this case that would substantiate a charge that any attempt was made to keep back the third class passengers."
There are several potential limitations with these results. For example, the dataset doesn't include information on all the people on board (passengers and crew). According to the website Titanic Facts there were 2,222 people on board (passengers and crew), while the dataset only has information on 891 passengers (which represents 40% of the total number of people on board).
There could be a bias in how this dataset was collected since it only considers the people who paid for their ticket. Other people were traveling on the Titanic without a ticket (for example, crew members).
Another limitation is that correlation doesn't imply causation and, conversely, lack of correlation doesn't imply lack of causation. To establish causation, we would need to run experiments, which is impossible in this particular case.
There are several questions that would be interesting to explore in the future with this dataset. One question is: if a passenger had a parent traveling on the Titanic and the parent survived, how likely is it that the passenger survived as well? Similarly, if a passenger had a sibling or spouse traveling on the Titanic and the sibling or spouse survived, how likely is it that the passenger survived as well? Another question is whether the port of embarkation correlates with survivorship. In my opinion, the most important potential future work would be to determine which factors, or combinations of factors, were most predictive of survivorship on the Titanic.
I invite you to download this notebook, the environment or the requirements.