Titanic: Survivorship vs Class

Alonso Silva (@alonsosilva) | 2017/08/18

Note: I'm using the Titanic dataset which can be downloaded here.

"Titanic" by formatc1 is licensed under CC BY-SA 2.0

First of all, I'm going to import some packages that I'll need to perform the analysis of this dataset

# Import numpy, pandas, matplotlib, and seaborn to analyze and plot the data
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
%config InlineBackend.figure_format = 'retina'

Now, I'm going to load the dataset from the csv file (downloaded in the same directory as this notebook) to a dataframe called titanic

# Load the data from the csv file to a dataframe called titanic
titanic = pd.read_csv('titanic-data.csv')

I'll take a look at the first few rows of the dataset to see what it contains

# Look at the head of the dataset
titanic.head()

A more thorough description of the dataset can be found here.

Let's see how many passengers this dataset contains

# Number of passengers in the dataset
len(titanic)

891

The Titanic carried approximately 1,317 passengers, and approximately 2,222 people in total (passengers and crew) were on board (source: http://www.titanicfacts.net/titanic-passengers.html). This dataset is therefore a subset of 891 passengers, which represents approximately 68% of the passengers and 40% of the people on board.
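The two coverage figures follow directly from the counts quoted above. A quick check, using the approximate totals from the cited source (the variable names are mine):

```python
# Counts quoted in the text (the two totals are approximate figures from the cited source)
n_dataset = 891
n_passengers = 1317
n_on_board = 2222

share_of_passengers = 100.0 * n_dataset / n_passengers
share_of_on_board = 100.0 * n_dataset / n_on_board

print("Share of passengers: {:.0f}%".format(share_of_passengers))
print("Share of people on board: {:.0f}%".format(share_of_on_board))
```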

I would like to find out whether survivorship is related to the class in which a passenger traveled.

Data Wrangling

I'll check if there is any missing information in these columns

# Count how many NaN values each Series (column) has
titanic.isnull().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

The ages of 177 passengers are missing from the dataset. Similarly, 687 passengers are missing information about their cabin numbers. Let's compute the percentage of values that are not NaN to evaluate how much data we are missing.

def percentage_of_is_not_NAN(column):
    '''
    Return the percentage of is-not-NaN values compared
    to the total quantity of values in the column
    '''
    length_column = len(column)
    length_column_with_no_NAN_values = len(column.dropna())
    return ((length_column_with_no_NAN_values * 1.0) / length_column) * 100

print "Percentage of not NaN in Age:", percentage_of_is_not_NAN(titanic['Age'])
print "Percentage of not NaN in Cabin numbers:", percentage_of_is_not_NAN(titanic['Cabin'])

Percentage of not NaN in Age: 80.1346801347
Percentage of not NaN in Cabin numbers: 22.8956228956

We don't lose much data when we exclude the passengers who are missing information about their ages. However, we may lose quite a lot of data when we exclude the passengers who are missing information about their cabin numbers. We'll get back to how to handle both groups of passengers when we use those columns.
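As a sketch of an alternative, the same percentages can be computed for every column at once, since the mean of a boolean mask is the fraction of True values. The small DataFrame below is made-up illustration data, not the Titanic file.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the real dataset
toy = pd.DataFrame({
    'Age': [22.0, np.nan, 26.0, 35.0],
    'Cabin': [np.nan, 'C85', np.nan, np.nan],
})

# notnull() gives a boolean mask; its column-wise mean is the share of non-missing values
percentage_not_nan = toy.notnull().mean() * 100
print(percentage_not_nan)
```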

Survivorship by class

My initial thought is that different classes had different rates of survival because wealthy people may have been put on lifeboats before the rest of the passengers. Let's see if that's the case.

I use the original DataFrame 'titanic' here, since neither the survivorship column nor the class column has NaN values.

Let's take a look at how many passengers there are for each class

# How many passengers are there by each class
print titanic.groupby('Pclass')['Pclass'].value_counts()

Pclass  Pclass
1       1         216
2       2         184
3       3         491
Name: Pclass, dtype: int64

There are 216 passengers in the 1st class, 184 passengers in the 2nd class, and 491 passengers in the 3rd class.

Let's see how many passengers survived in each class

# How many passengers survived grouped by class
print titanic.groupby('Pclass')['Survived'].value_counts()

Pclass  Survived
1       1           136
        0            80
2       0            97
        1            87
3       0           372
        1           119
Name: Survived, dtype: int64

In the 1st class, 63% of the passengers survived (136 out of 216); in 2nd class, 47% of passengers survived (87 out of 184); while in the 3rd class, only 24% of the passengers survived (119 out of 491).
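Because 'Survived' is coded as 0/1, the survival rate of a group is simply the mean of that column. The sketch below rebuilds a class/survival table from the counts printed above (not from the CSV file) and recovers the same percentages; the names counts and sample are introduced only for this illustration.

```python
import pandas as pd

# (Pclass, Survived) -> number of passengers, copied from the output above
counts = {(1, 1): 136, (1, 0): 80,
          (2, 1): 87, (2, 0): 97,
          (3, 1): 119, (3, 0): 372}

# Expand the counts into one row per passenger
rows = [key for key, n in sorted(counts.items()) for _ in range(n)]
sample = pd.DataFrame(rows, columns=['Pclass', 'Survived'])

# The mean of a 0/1 column is the fraction of ones, i.e. the survival rate
survival_rate = sample.groupby('Pclass')['Survived'].mean() * 100
print(survival_rate.round(0))
```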

Correlation between survivorship and class

We want to explore whether there is a correlation between survivorship and class. For this, we compute the correlation between the 'Survived' column and the class in which the passengers traveled ('Pclass').

# Standardize each column of the given DataFrame.
def standardize(df):
    return (df - df.mean()) / df.std(ddof=0)

# Correlation between survivorship and class
np.corrcoef(standardize(titanic['Survived']),\
            standardize(titanic['Pclass']))

array([[ 1.        , -0.33848104],
       [-0.33848104,  1.        ]])

We see that indeed there is a negative correlation between survivorship and the class in which the passengers traveled ($r=-0.34$). That means that if you traveled in 3rd class you were less likely to survive than if you traveled in 2nd class and similarly, if you traveled in 2nd class you were less likely to survive than if you traveled in 1st class.
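Two remarks on this computation, sketched below on data rebuilt from the class-by-survival counts rather than on the CSV file: the Pearson coefficient is unchanged by standardization, so np.corrcoef gives the same value on the raw columns, and the value $r=-0.34$ can be recovered from those counts alone.

```python
import numpy as np

# (Pclass, Survived, count) triples copied from the class-by-survival output
counts = [(1, 1, 136), (1, 0, 80), (2, 1, 87), (2, 0, 97), (3, 1, 119), (3, 0, 372)]
repeats = [c[2] for c in counts]
pclass = np.repeat([c[0] for c in counts], repeats).astype(float)
survived = np.repeat([c[1] for c in counts], repeats).astype(float)

def standardize(x):
    # Same convention as in the text: population standard deviation (ddof=0)
    return (x - x.mean()) / x.std(ddof=0)

r_raw = np.corrcoef(survived, pclass)[0, 1]
r_standardized = np.corrcoef(standardize(survived), standardize(pclass))[0, 1]
print("raw: {:.4f}, standardized: {:.4f}".format(r_raw, r_standardized))
```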

However, this correlation may be due to the fact that more women were traveling in first class and women survived more often than men, or that more children were traveling in first class and children survived more often than adults, or to where the first-class cabins were located on the ship. Let's explore these possibilities in what follows.

Survivorship by gender

As an initial hypothesis, I think that gender could be a good predictor of survival rate because women may have been given the priority to board the available boats first. Let's see if that's the case

# How many passengers of each gender
titanic.groupby('Sex')['Sex'].value_counts()

Sex     Sex   
female  female    314
male    male      577
Name: Sex, dtype: int64

There are many more men (577 or 65%) than women (314 or 35%) in the dataset.

Let's see how many passengers of each gender survived

# How many passengers of each gender survived
print titanic.groupby('Sex')['Survived'].value_counts()

Sex     Survived
female  1           233
        0            81
male    0           468
        1           109
Name: Survived, dtype: int64

We notice that 74% of the women survived (233 out of 314) while only 19% of the men survived (109 out of 577).

Let's take a look at the correlation between survivorship and being female

# Correlation between survivorship and being female
titanic[['isFemale','isMale']] = pd.get_dummies(titanic['Sex'])
np.corrcoef(standardize(titanic['Survived']),\
            standardize(titanic['isFemale']))

array([[ 1.        ,  0.54335138],
       [ 0.54335138,  1.        ]])
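For two 0/1 variables the Pearson correlation reduces to the phi coefficient of the 2x2 table, so the value above can be cross-checked from the gender-by-survival counts alone. A minimal sketch (the variable names are mine):

```python
import math

# Counts copied from the gender-by-survival output above
female_survived, female_died = 233, 81
male_survived, male_died = 109, 468

# Row and column totals of the 2x2 table
n_female = female_survived + female_died
n_male = male_survived + male_died
n_survived = female_survived + male_survived
n_died = female_died + male_died

# Phi coefficient: (ad - bc) / sqrt(product of the four marginal totals)
phi = (female_survived * male_died - female_died * male_survived) / math.sqrt(
    float(n_female) * n_male * n_survived * n_died)
print("phi = {:.4f}".format(phi))
```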

There seems to be a fairly high correlation ($r=0.54$) between being female and survivorship. This suggests that women may have been given priority to board the lifeboats. This is confirmed by the article, which says that

When the lifeboats were finally lowered officers gave the order that "women and children" should go first. One hundred and fifteen men in first class and 147 men from second class are recorded as having stood back to make space available and as a result died.

Perhaps, gender is a confounding variable that explains the previous correlation between class and survivorship. Let's study this in the following.

Survivorship by class, taking into account gender

We have seen that class is negatively correlated with survivorship, but it may be that the presence of an important confounding variable (gender) explains that correlation. Let's explore this question

# How many passengers of each class and gender survived
grouped = titanic.groupby(['Pclass', 'Sex'])
grouped['Survived'].value_counts()

Pclass  Sex     Survived
1       female  1            91
                0             3
        male    0            77
                1            45
2       female  1            70
                0             6
        male    0            91
                1            17
3       female  0            72
                1            72
        male    0           300
                1            47
Name: Survived, dtype: int64

In 1st class 97% of women survived (91 out of 94), in 2nd class 92% of women survived (70 out of 76), and in 3rd class 50% of women survived (72 out of 144).

In 1st class 37% of men survived (45 out of 122), in 2nd class 16% of men survived (17 out of 108), and in 3rd class 14% of men survived (47 out of 347).

For women as for men, it seems that being in the 1st class increased the chances of survivorship.
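The percentages above can be laid out as a class-by-gender table. The sketch below rebuilds the data from the printed counts (not from the CSV file) and uses unstack to pivot the grouped means; sample and rates are names chosen for this example.

```python
import pandas as pd

# (Pclass, Sex, Survived) -> count, copied from the output above
counts = {
    (1, 'female', 1): 91, (1, 'female', 0): 3,
    (1, 'male', 1): 45, (1, 'male', 0): 77,
    (2, 'female', 1): 70, (2, 'female', 0): 6,
    (2, 'male', 1): 17, (2, 'male', 0): 91,
    (3, 'female', 1): 72, (3, 'female', 0): 72,
    (3, 'male', 1): 47, (3, 'male', 0): 300,
}

# Expand the counts into one row per passenger
rows = [key for key, n in sorted(counts.items()) for _ in range(n)]
sample = pd.DataFrame(rows, columns=['Pclass', 'Sex', 'Survived'])

# Survival rate (in %) per class and gender, with one column per gender
rates = (sample.groupby(['Pclass', 'Sex'])['Survived'].mean() * 100).unstack('Sex')
print(rates.round(0))
```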

Let's see the correlation.

I have found the partial correlation function implemented here by fabianp, which I copy in the following cell

"""
Partial Correlation in Python (clone of Matlab's partialcorr)

This uses the linear regression approach to compute the partial 
correlation (might be slow for a huge number of variables). The 
algorithm is detailed here:

    http://en.wikipedia.org/wiki/Partial_correlation#Using_linear_regression

Taking X and Y two variables of interest and Z the matrix with all the variable minus {X, Y},
the algorithm can be summarized as

    1) perform a normal linear least-squares regression with X as the target and Z as the predictor
    2) calculate the residuals in Step #1
    3) perform a normal linear least-squares regression with Y as the target and Z as the predictor
    4) calculate the residuals in Step #3
    5) calculate the correlation coefficient between the residuals from Steps #2 and #4; 

    The result is the partial correlation between X and Y while controlling for the effect of Z


Date: Nov 2014
Author: Fabian Pedregosa-Izquierdo, f@bianp.net
Testing: Valentina Borghesani, valentinaborghesani@gmail.com
"""

from scipy import stats, linalg

def partial_corr(C):
    """
    Returns the sample linear partial correlation coefficients between pairs of variables in C, controlling 
    for the remaining variables in C.


    Parameters
    ----------
    C : array-like, shape (n, p)
        Array with the different variables. Each column of C is taken as a variable


    Returns
    -------
    P : array-like, shape (p, p)
        P[i, j] contains the partial correlation of C[:, i] and C[:, j] controlling
        for the remaining variables in C.
    """
    
    C = np.asarray(C)
    p = C.shape[1]
    P_corr = np.zeros((p, p), dtype=float)
    for i in range(p):
        P_corr[i, i] = 1
        for j in range(i+1, p):
            idx = np.ones(p, dtype=bool)
            idx[i] = False
            idx[j] = False
            beta_i = linalg.lstsq(C[:, idx], C[:, j])[0]
            beta_j = linalg.lstsq(C[:, idx], C[:, i])[0]

            res_j = C[:, j] - C[:, idx].dot( beta_i)
            res_i = C[:, i] - C[:, idx].dot(beta_j)
            
            corr = stats.pearsonr(res_i, res_j)[0]
            P_corr[i, j] = corr
            P_corr[j, i] = corr
        
    return P_corr

This implementation of partial correlation takes an array or DataFrame and computes the partial correlation between each pair of columns, controlling for all the other columns. However, it doesn't standardize the data, and since the regressions it performs have no intercept term, the columns need to be centered first. Therefore, we first standardize the columns of interest and then compute the partial correlations between the different variables

C = standardize(titanic[['Survived', 'Pclass', 'isFemale']])
partial_corr(C)

array([[ 1.        , -0.32062263,  0.53466047],
       [-0.32062263,  1.        ,  0.06584405],
       [ 0.53466047,  0.06584405,  1.        ]])

The partial correlation between class and survivorship, controlling for gender, is still sizable in absolute value ($r_{X1,X2|X3}=-0.32$). This means that even after taking gender into account, there is a negative correlation between survivorship and the class in which the passengers traveled, although it is slightly weaker than before considering gender ($r_{X1,X2}=-0.34$).
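With a single control variable, the partial correlation has a closed form in terms of the three pairwise correlations, $r_{X1,X2|X3} = (r_{12} - r_{13} r_{23}) / \sqrt{(1-r_{13}^2)(1-r_{23}^2)}$, which gives an independent check on the regression-based function. The sketch below rebuilds the data from the class/gender/survival counts printed earlier rather than from the CSV file; the helper name partial_corr_one_control is hypothetical.

```python
import numpy as np

# (Pclass, isFemale, Survived, count), copied from the grouped output printed earlier
counts = [
    (1, 1, 1, 91), (1, 1, 0, 3), (1, 0, 1, 45), (1, 0, 0, 77),
    (2, 1, 1, 70), (2, 1, 0, 6), (2, 0, 1, 17), (2, 0, 0, 91),
    (3, 1, 1, 72), (3, 1, 0, 72), (3, 0, 1, 47), (3, 0, 0, 300),
]
table = np.array(counts, dtype=float)
# One row per passenger: columns are Pclass, isFemale, Survived
data = np.repeat(table[:, :3], table[:, 3].astype(int), axis=0)
pclass, is_female, survived = data[:, 0], data[:, 1], data[:, 2]

def partial_corr_one_control(x, y, z):
    """Partial correlation of x and y controlling for z (closed form)."""
    r_xy = np.corrcoef(x, y)[0, 1]
    r_xz = np.corrcoef(x, z)[0, 1]
    r_yz = np.corrcoef(y, z)[0, 1]
    return (r_xy - r_xz * r_yz) / np.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))

r_partial = partial_corr_one_control(survived, pclass, is_female)
print("Partial correlation of survival and class given gender: {:.4f}".format(r_partial))
```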

Survivorship if you're a child

Since priority for boarding the lifeboats was given to women and children, we study the correlation between survivorship and being a child. It's difficult to define precisely at which age (the only information we have) one stops being a child. I tried different values and they gave similar results, so I'll consider a child to be any passenger aged between 0 and 12 years.

First, we exclude the NaN values.

'''
Exclude nan values from both columns (NaN values are in the age column)
'''
df2 = titanic[['Survived', 'Pclass', 'Sex', 'Age']].dropna()

We create a new variable 'isChild' that indicates whether the passenger was 12 years old or younger. Similarly, we create the indicator variable 'isFemale', which indicates whether the passenger is female.

'''
Define two new variables 'isChild' if the age is <=12, and
'isFemale' if the gender of the passenger is female
'''
df2['isChild'] = (df2['Age'] <= 12)
df2[['isFemale','isMale']] = pd.get_dummies(df2['Sex'])
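Dropping the missing ages before building 'isChild' matters: a comparison with NaN evaluates to False, so passengers of unknown age would otherwise be counted silently as non-children. A small illustration on made-up ages:

```python
import numpy as np
import pandas as pd

# Hypothetical ages; the second passenger's age is unknown
ages = pd.Series([4.0, np.nan, 30.0, 12.0])

# NaN <= 12 is False, so the unknown age is labelled "not a child"
is_child_with_nan = ages <= 12
# After dropna() only passengers with a known age are classified
is_child_known_age = ages.dropna() <= 12

print(is_child_with_nan.tolist())
print(is_child_known_age.tolist())
```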

Let's see how many children there are in the dataset

# How many children in the dataset
df2.groupby('isChild')['isChild'].value_counts()

isChild  isChild
False    False      645
True     True        69
Name: isChild, dtype: int64

Children make up about 10% of this dataset (69 out of 714).

Let's see how many of them survived

# How many passengers survived
print "Population survivorship:\n", df2['Survived'].value_counts()
# How many children passengers survived
print "Children survivorship:\n", df2.groupby('isChild')['Survived'].value_counts()

Population survivorship:
0    424
1    290
Name: Survived, dtype: int64
Children survivorship:
isChild  Survived
False    0           395
         1           250
True     1            40
         0            29
Name: Survived, dtype: int64

58% of children survived (40 out of 69), compared to 39% of the remaining passengers (250 out of 645) and 41% of all passengers with a known age (290 out of 714). There seems to be a positive correlation between being a child and survivorship. Let's see if that's the case

# Correlation between survivorship and being a child
np.corrcoef(standardize(df2['Survived']),\
           standardize(df2['isChild']))

array([[ 1.        ,  0.11557923],
       [ 0.11557923,  1.        ]])

Indeed, there is a positive, although weak, correlation ($r=0.12$) between survivorship and being a child. Perhaps being a child is a confounding variable that explains the previous correlation between class and survivorship. Let's study this in the following.

Survivorship by class, taking into account being a child

We have seen that class is negatively correlated with survivorship, but it may be that the presence of an important confounding variable (being a child) explains that correlation. Let's explore this question

# How many passengers survived, by class and by whether they are children
grouped = df2.groupby(['Pclass', 'isChild'])
grouped['Survived'].value_counts()

Pclass  isChild  Survived
1       False    1           119
                 0            63
        True     1             3
                 0             1
2       False    0            90
                 1            66
        True     1            17
3       False    0           242
                 1            65
        True     0            28
                 1            20
Name: Survived, dtype: int64

In 1st class 75% of children survived (3 out of 4), in 2nd class 100% of children survived (17 out of 17), and in 3rd class 42% of children survived (20 out of 48).

Among the passengers older than 12, in 1st class 65% survived (119 out of 182), in 2nd class 42% survived (66 out of 156), and in 3rd class 21% survived (65 out of 307).

Let's see if there is a partial correlation.

'''
Partial correlations between survivorship, class, and being a child
'''
C = standardize(df2[['Survived', 'Pclass', 'isChild']])
partial_corr(C)

array([[ 1.        , -0.38504619,  0.18651652],
       [-0.38504619,  1.        ,  0.21377876],
       [ 0.18651652,  0.21377876,  1.        ]])

The partial correlation between class and survivorship, controlling for being a child, is sizable in absolute value ($r_{X1,X2|X3}=-0.39$). This means that even after taking into account being a child, there is a negative correlation between survivorship and the class in which the passengers traveled, and it is even stronger than before considering being a child ($r_{X1,X2}=-0.34$).
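Since $r_{X1,X2}=-0.34$ was computed on all 891 passengers while this partial correlation uses only the 714 passengers with a known age, a like-for-like comparison needs the plain correlation on that same subsample. The sketch below rebuilds the subsample from the counts printed above (not from the CSV file) and computes both values; the closed-form expression for a single control variable is used here in place of partial_corr.

```python
import numpy as np

# (Pclass, isChild, Survived, count), copied from the grouped output above
counts = [
    (1, 0, 1, 119), (1, 0, 0, 63), (1, 1, 1, 3), (1, 1, 0, 1),
    (2, 0, 0, 90), (2, 0, 1, 66), (2, 1, 1, 17),
    (3, 0, 0, 242), (3, 0, 1, 65), (3, 1, 0, 28), (3, 1, 1, 20),
]
table = np.array(counts, dtype=float)
# One row per passenger: columns are Pclass, isChild, Survived
data = np.repeat(table[:, :3], table[:, 3].astype(int), axis=0)
pclass, is_child, survived = data[:, 0], data[:, 1], data[:, 2]

r_sc = np.corrcoef(survived, pclass)[0, 1]    # survival vs class
r_sk = np.corrcoef(survived, is_child)[0, 1]  # survival vs being a child
r_ck = np.corrcoef(pclass, is_child)[0, 1]    # class vs being a child

# Closed-form partial correlation with one control variable
r_partial = (r_sc - r_sk * r_ck) / np.sqrt((1 - r_sk ** 2) * (1 - r_ck ** 2))

print("Plain correlation on the 714 passengers: {:.4f}".format(r_sc))
print("Partial correlation given isChild:       {:.4f}".format(r_partial))
```

On this subsample the plain correlation is about $-0.36$, so controlling for being a child still moves the coefficient further from zero, but by less than the comparison with $-0.34$ suggests.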

Survivorship by class, taking into account gender and being a child

We have seen that the correlation between survivorship and class is sizable and that it cannot be explained by gender alone or by being a child alone, but perhaps the partial correlation changes when both variables are combined. Let's see if that's the case

'''
Partial correlations between survivorship, class, gender, and 
being a child
'''
C = standardize(df2[['Survived', 'Pclass', 'isFemale', 'isChild']])
partial_corr(C)

array([[ 1.        , -0.3542601 ,  0.51781537,  0.1622829 ],
       [-0.3542601 ,  1.        ,  0.04875831,  0.21377781],
       [ 0.51781537,  0.04875831,  1.        , -0.00525601],
       [ 0.1622829 ,  0.21377781, -0.00525601,  1.        ]])

The partial correlation between class and survivorship, controlling for gender and being a child, is sizable in absolute value ($r_{X1,X2|X3,X4}=-0.35$). This means that even after taking into account gender and being a child, there is a negative correlation between survivorship and the class in which the passengers traveled, and it is even slightly stronger in absolute value than before considering both variables ($r_{X1,X2}=-0.34$).

Let's see how the location of the cabin changes these correlations.

Survivorship by cabin location

As previously observed, 687 out of the 891 passengers are missing information about their cabin location.

'''
New DataFrame taking out NaN values from the Cabin, the other variables
don't have NaN values
'''
df = titanic[['Survived', 'Pclass', 'Sex', 'Cabin']].dropna()
print "How many non nan values:", len(df)
print "How many unique values in the cabin:", len(df['Cabin'].unique())

How many non nan values: 204
How many unique values in the cabin: 147

Only 23% of the values are not NaN (204 values), and among those there are 147 unique values.

Let's see how the remaining values are distributed by class

# How many passengers are there by each class
print "Passengers by class:\n", titanic.groupby('Pclass')['Pclass'].value_counts()
# How many cabin passengers are there by each class
print "Remaining passengers by class:\n", df.groupby('Pclass')['Pclass'].value_counts()

Passengers by class:
Pclass  Pclass
1       1         216
2       2         184
3       3         491
Name: Pclass, dtype: int64
Remaining passengers by class:
Pclass  Pclass
1       1         176
2       2          16
3       3          12
Name: Pclass, dtype: int64

The remaining values represent 81% of the 1st class (176 out of 216), 9% of the 2nd class (16 out of 184), and 2% of the 3rd class (12 out of 491). There are not many remaining passengers from 2nd and 3rd class.
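These shares are the per-class ratio of passengers with a known cabin to all passengers. A minimal sketch using the counts printed above (the Series names are mine):

```python
import pandas as pd

# Passengers per class: whole dataset vs. rows with a known cabin (from the output above)
all_passengers = pd.Series({1: 216, 2: 184, 3: 491})
with_cabin = pd.Series({1: 176, 2: 16, 3: 12})

# Division aligns the two Series on their index (the class)
cabin_coverage = 100.0 * with_cabin / all_passengers
print(cabin_coverage.round(0))
```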

Let's look at the correlation between survivorship and class in the remaining dataset

# Correlation between survivorship and class
np.corrcoef(standardize(df['Survived']),\
            standardize(df['Pclass']))

array([[ 1.        , -0.03303228],
       [-0.03303228,  1.        ]])

There is practically no correlation ($r=-0.03$) between survivorship and class among the remaining passengers, so we cannot use these remaining values to control for location. One thing we can do is look at the locations shared by passengers from two or three different classes, and see whether survivorship rates differ between classes in these common locations

# Unique cabin numbers for different classes
grouped = df.groupby('Pclass')
class_1st = grouped.get_group(1)
class_2nd = grouped.get_group(2)
class_3rd = grouped.get_group(3)
print "Cabin numbers of 3rd class:\n", class_3rd['Cabin'].unique()
print "Cabin numbers of 2nd class:\n", class_2nd['Cabin'].unique()
print "Cabin numbers of 1st class:\n", class_1st['Cabin'].unique()

Cabin numbers of 3rd class:
['G6' 'F G73' 'F E69' 'E10' 'F G63' 'E121' 'F38']
Cabin numbers of 2nd class:
['D56' 'F33' 'E101' 'F2' 'F4' 'D' 'E77']
Cabin numbers of 1st class:
['C85' 'C123' 'E46' 'C103' 'A6' 'C23 C25 C27' 'B78' 'D33' 'B30' 'C52' 'B28'
 'C83' 'E31' 'A5' 'D10 D12' 'D26' 'C110' 'B58 B60' 'D47' 'B86' 'C2' 'E33'
 'B19' 'A7' 'C49' 'A32' 'B4' 'B80' 'A31' 'D36' 'D15' 'C93' 'C78' 'D35'
 'C87' 'B77' 'E67' 'B94' 'C125' 'C99' 'C118' 'D7' 'A19' 'B49' 'C22 C26'
 'C106' 'C65' 'E36' 'C54' 'B57 B59 B63 B66' 'C7' 'E34' 'C32' 'B18' 'C124'
 'C91' 'E40' 'T' 'C128' 'D37' 'B35' 'E50' 'C82' 'B96 B98' 'E44' 'A34'
 'C104' 'C111' 'C92' 'E38' 'D21' 'E12' 'E63' 'A14' 'B37' 'C30' 'D20' 'B79'
 'E25' 'D46' 'B73' 'C95' 'B38' 'B39' 'B22' 'C86' 'C70' 'A16' 'C101' 'C68'
 'A10' 'E68' 'B41' 'A20' 'D19' 'D50' 'D9' 'A23' 'B50' 'A26' 'D48' 'E58'
 'C126' 'B71' 'B51 B53 B55' 'D49' 'B5' 'B20' 'C62 C64' 'E24' 'C90' 'C45'
 'E8' 'B101' 'D45' 'C46' 'D30' 'D11' 'B3' 'D6' 'B82 B84' 'D17' 'A36' 'B102'
 'B69' 'E49' 'C47' 'D28' 'E17' 'A24' 'C50' 'B42' 'C148']

We use the NumPy function np.intersect1d to find the intersection between the cabin numbers of the different classes

# Intersection between different cabin numbers
cabin_1st = class_1st['Cabin'].unique()
cabin_2nd = class_2nd['Cabin'].unique()
cabin_3rd = class_3rd['Cabin'].unique()
print "Intersection between 1st and 2nd class:", np.intersect1d(cabin_1st, cabin_2nd)
print "Intersection between 1st and 3rd class:", np.intersect1d(cabin_1st, cabin_3rd)
print "Intersection between 2nd and 3rd class:", np.intersect1d(cabin_2nd, cabin_3rd)

Intersection between 1st and 2nd class: []
Intersection between 1st and 3rd class: []
Intersection between 2nd and 3rd class: []

We notice that there are too many unique values compared to the total number of values, and that the cabin numbers of the different classes don't intersect. Perhaps we have been too restrictive. Let's consider just the deck on which they traveled. According to the website Encyclopedia Titanica, the deck is indicated by the first letter of the cabin number. I'll use just that letter as an approximation of the location. Let's do that

'''
Deck keeps the first letter of the cabin number, the other variables are
indicators of the deck to which the passengers belonged
'''
df['Deck'] = df['Cabin'].str[0]
for letter in 'ABCDEFG':
    df['isDeck' + letter] = (df['Deck'] == letter)
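A more compact way to build such columns, sketched on a handful of cabin strings taken from the listing above: the .str accessor extracts the first character, and pd.get_dummies creates one indicator column per deck. Note that entries such as 'F G73' are assigned to deck 'F' by this rule, and that get_dummies also creates a column for deck 'T', which the notebook's own code does not.

```python
import pandas as pd

# A few cabin values copied from the listing above
cabins = pd.Series(['C85', 'F G73', 'B57 B59 B63 B66', 'T', 'D', 'E101'])

# First character of each string = deck letter
deck = cabins.str[0]

# One indicator column per deck present in the data, named isDeckB, isDeckC, ...
indicators = pd.get_dummies(deck, prefix='isDeck', prefix_sep='')
print(deck.tolist())
print(sorted(indicators.columns))
```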

Let's see on which decks each class traveled

# Unique decks for each class
grouped = df.groupby('Pclass')
class_1st = grouped.get_group(1)
class_2nd = grouped.get_group(2)
class_3rd = grouped.get_group(3)
deck_1st = class_1st['Deck'].unique()
deck_2nd = class_2nd['Deck'].unique()
deck_3rd = class_3rd['Deck'].unique()
print deck_1st
print deck_2nd
print deck_3rd

['C' 'E' 'A' 'B' 'D' 'T']
['D' 'F' 'E']
['G' 'F' 'E']

Let's see the decks in which there are intersections between the classes

# Intersection between different decks
print "Intersection between 1st and 2nd class:",\
np.intersect1d(deck_1st, deck_2nd)
print "Intersection between 1st and 3rd class:",\
np.intersect1d(deck_1st, deck_3rd)
print "Intersection between 2nd and 3rd class:",\
np.intersect1d(deck_2nd, deck_3rd)

Intersection between 1st and 2nd class: ['D' 'E']
Intersection between 1st and 3rd class: ['E']
Intersection between 2nd and 3rd class: ['E' 'F']

We notice that all three classes share deck 'E', while 1st and 2nd class also share deck 'D', and 2nd and 3rd class also share deck 'F'.
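The same intersections can be obtained with plain Python sets. A small sketch using the deck letters printed above:

```python
# Deck letters per class, copied from the output above
deck_1st = {'C', 'E', 'A', 'B', 'D', 'T'}
deck_2nd = {'D', 'F', 'E'}
deck_3rd = {'G', 'F', 'E'}

# & is set intersection; sorted() gives a stable order for printing
shared_1st_2nd = sorted(deck_1st & deck_2nd)
shared_1st_3rd = sorted(deck_1st & deck_3rd)
shared_2nd_3rd = sorted(deck_2nd & deck_3rd)
shared_by_all = sorted(deck_1st & deck_2nd & deck_3rd)

print("1st and 2nd: {}".format(shared_1st_2nd))
print("1st and 3rd: {}".format(shared_1st_3rd))
print("2nd and 3rd: {}".format(shared_2nd_3rd))
print("All classes: {}".format(shared_by_all))
```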

We use DataFrame.query with the indicator variables to select the passengers who were on each of these shared decks. Then, for each of these decks, we look at the survivorship rates by class

# Survivorship by class in the intersection decks: D, E, F
DeckE = df.query('isDeckE == 1')
DeckD = df.query('isDeckD == 1')
DeckF = df.query('isDeckF == 1')
print "E Deck: Survivorship by class:\n",\
DeckE.groupby('Pclass')['Survived'].value_counts()
print "D Deck: Survivorship by class:\n",\
DeckD.groupby('Pclass')['Survived'].value_counts()
print "F Deck: Survivorship by class:\n",\
DeckF.groupby('Pclass')['Survived'].value_counts()

E Deck: Survivorship by class:
Pclass  Survived
1       1           18
        0            7
2       1            3
        0            1
3       1            3
Name: Survived, dtype: int64
D Deck: Survivorship by class:
Pclass  Survived
1       1           22
        0            7
2       1            3
        0            1
Name: Survived, dtype: int64
F Deck: Survivorship by class:
Pclass  Survived
2       1           7
        0           1
3       0           4
        1           1
Name: Survived, dtype: int64

In the E Deck: 72% of 1st class passengers survived (18 out of 25), 75% of 2nd class passengers survived (3 out of 4), and 100% of 3rd class passengers survived (3 out of 3).

In the D Deck: 76% of 1st class passengers survived (22 out of 29); while 75% of 2nd class passengers survived (3 out of 4).

In the F Deck: 88% of 2nd class passengers survived (7 out of 8); while 20% of 3rd class passengers survived (1 out of 5).

Although it's difficult to conclude much from such low numbers, the class you belonged to doesn't seem to make a difference (at least judging from the numbers in the E Deck and the D Deck). What seems to make a difference is the deck on which you were located.
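One way to make "such low numbers" concrete is an exact test on a 2x2 table. The sketch below applies scipy.stats.fisher_exact to the D Deck and F Deck counts printed above. This test is my addition and not part of the original analysis, and with samples this small it can only indicate whether the data are compatible with equal survival rates, not show that the rates are equal.

```python
from scipy import stats

# Rows: the two classes present on the deck; columns: survived, died
deck_d = [[22, 7],   # 1st class
          [3, 1]]    # 2nd class
deck_f = [[7, 1],    # 2nd class
          [1, 4]]    # 3rd class

# fisher_exact returns the sample odds ratio and the two-sided p-value
odds_d, p_d = stats.fisher_exact(deck_d)
odds_f, p_f = stats.fisher_exact(deck_f)

print("D Deck: odds ratio {:.2f}, p-value {:.3f}".format(odds_d, p_d))
print("F Deck: odds ratio {:.2f}, p-value {:.3f}".format(odds_f, p_f))
```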

As we discover from the following article about myths related to the Titanic

Each class of passengers had access to their own decks and allocated lifeboats - although crucially no lifeboats were stored in the third class sections of the ship.

Third class passengers had to find their way through a maze of corridors and staircases to reach the boat deck. First and second class passengers were most likely to reach the lifeboats as the boat deck was a first and second class promenade.

This would explain the strong correlation that we found between class and survivorship. Apparently, the separation between classes was meant to comply with the immigration rules of the time

Gates did exist which barred the third class passengers from the other passengers. But this was not in anticipation of a shipwreck but in compliance with US immigration laws and the feared spread of infectious diseases.

Third class passengers included Armenians, Chinese, Dutch, Italians, Russians, Scandinavians and Syrians as well as those from the British Isles - all in search of a new life in America.

The inquiry held at the time concluded

"No evidence has been given in the course of this case that would substantiate a charge that any attempt was made to keep back the third class passengers."

Summary

What was the average ticket fare for first class, second class, and third class? The average ticket fare was \$13.68 for 3rd class, \$20.66 for 2nd class, and \$84.15 for 1st class.
Did older people pay more than younger people? There is no correlation between the age of a passenger and the fare they paid.
Is survivorship related to the class you traveled in? There is a negative correlation between the class number and survivorship: if you traveled in 1st class you were more likely to survive than if you traveled in 2nd or 3rd class. However, this appears to be due to the location in which you traveled more than to the class of ticket you bought.

Potential limitations with these results

There are several potential limitations with these results. For example, the dataset doesn't include information on all the people on board (passengers and crew). According to the website Titanic Facts, there were 2,222 people on board, while the dataset has information on only 891 passengers (about 40% of the total number of people on board).

There could be a bias in how this dataset was collected, since it only considers people who paid for a ticket. Other people were traveling on the Titanic without a ticket (for example, crew members).

Another limitation is that correlation doesn't imply causation and, likewise, lack of correlation doesn't imply lack of causation. To conclude causation we would need to run experiments, which is impossible in this particular case.

Future work with this dataset

There are several questions that would be interesting to explore in the future with this dataset. One question is: if a passenger had a parent traveling on the Titanic and the parent survived, how likely is it that the passenger survived as well? Similarly, if a passenger had a sibling or spouse traveling on the Titanic who survived, how likely is it that the passenger survived as well? Another question is whether the port of embarkation correlates with survivorship. In my opinion, the most important future work would be to determine which factors, or combinations of factors, were most predictive of survivorship on the Titanic.

I invite you to download this notebook, the environment or the requirements.

First rows of the dataset (output of titanic.head()):

   PassengerId  Survived  Pclass  Name                                               Sex     Age   SibSp  Ticket            Fare     Cabin  Embarked
0  1            0         3       Braund, Mr. Owen Harris                            male    22.0  1      A/5 21171         7.2500   NaN    S
1  2            1         1       Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0  1      PC 17599          71.2833  C85    C
2  3            1         3       Heikkinen, Miss. Laina                             female  26.0  0      STON/O2. 3101282  7.9250   NaN    S
3  4            1         1       Futrelle, Mrs. Jacques Heath (Lily May Peel)       female  35.0  1      113803            53.1000  C123   S
4  5            0         3       Allen, Mr. William Henry                           male    35.0  0      373450            8.0500   NaN    S