Data Analysis: Kaggle Titanic

AUTHOR: SungwookLE
DATE: ‘21.6/21

  • binning techniques for numerical data
  • the map method for categorical data

KAGGLE #1

The Challenge

The sinking of the Titanic is one of the most infamous shipwrecks in history.

On April 15, 1912, during her maiden voyage, the widely considered “unsinkable” RMS Titanic sank after colliding with an iceberg. Unfortunately, there weren’t enough lifeboats for everyone onboard, resulting in the death of 1502 out of 2224 passengers and crew.

While there was some element of luck involved in surviving, it seems some groups of people were more likely to survive than others.

In this challenge, we ask you to build a predictive model that answers the question: “what sorts of people were more likely to survive?” using passenger data (i.e., name, age, gender, socio-economic class, etc.).


Overview

The data has been split into two groups:

  • training set (train.csv)
  • test set (test.csv)

The training set should be used to build your machine learning models. For the training set, we provide the outcome (also known as the “ground truth”) for each passenger. Your model will be based on “features” like passengers’ gender and class. You can also use feature engineering to create new features.

The test set should be used to see how well your model performs on unseen data. For the test set, we do not provide the ground truth for each passenger. It is your job to predict these outcomes. For each passenger in the test set, use the model you trained to predict whether or not they survived the sinking of the Titanic.

We also include gender_submission.csv, a set of predictions that assume all and only female passengers survive, as an example of what a submission file should look like.
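As an aside, the all-female-survive baseline that gender_submission.csv encodes can be rebuilt in a few lines. The sketch below uses a toy stand-in for test.csv (hypothetical rows; the real file has 418):

```python
import pandas as pd

# Toy stand-in for a few rows of test.csv (hypothetical values).
test = pd.DataFrame({
    'PassengerId': [892, 893, 894],
    'Sex': ['male', 'female', 'male'],
})

# Predict survival for all and only female passengers,
# which is exactly what gender_submission.csv encodes.
baseline = pd.DataFrame({
    'PassengerId': test['PassengerId'],
    'Survived': (test['Sex'] == 'female').astype(int),
})
print(baseline['Survived'].tolist())  # [0, 1, 0]
```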

Data Dictionary

Variable    Definition                                   Key
survival    Survival                                     0 = No, 1 = Yes
pclass      Ticket class                                 1 = 1st, 2 = 2nd, 3 = 3rd
sex         Sex
age         Age in years
sibsp       # of siblings / spouses aboard the Titanic
parch       # of parents / children aboard the Titanic
ticket      Ticket number
fare        Passenger fare
cabin       Cabin number
embarked    Port of Embarkation                          C = Cherbourg, Q = Queenstown, S = Southampton

Variable Notes

pclass: A proxy for socio-economic status (SES)
1st = Upper
2nd = Middle
3rd = Lower

age: Age is fractional if less than 1. If the age is estimated, it is in the form of xx.5

sibsp: The dataset defines family relations in this way…
Sibling = brother, sister, stepbrother, stepsister
Spouse = husband, wife (mistresses and fiancés were ignored)

parch: The dataset defines family relations in this way…
Parent = mother, father
Child = daughter, son, stepdaughter, stepson
Some children travelled only with a nanny, therefore parch=0 for them.

'''
Import the package libraries (pandas, numpy)
'''
import pandas as pd
import numpy as np
train_original = pd.read_csv('input/train.csv')
test_original = pd.read_csv('input/test.csv')

train = train_original.copy()
test = test_original.copy()

Data Analysis

Proceed in the order below, with data visualization & discussion at each step:

  • 1) Explore the data
  • 2) Extract features
  • 3) Train models

Step 1: Exploring the Data

  • Look at the data directly and check for missing values
  • Examine how each column relates to Survived, and decide which columns to keep as features
# Step 1: explore the data
print("SHAPE: ", train.shape)
print(train.isnull().sum())
train.head(3)
SHAPE:  (891, 12)
PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
train.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
print("SHAPE: ", test.shape)
print(test.isnull().sum())
test.head(3)
SHAPE:  (418, 11)
PassengerId      0
Pclass           0
Name             0
Sex              0
Age             86
SibSp            0
Parch            0
Ticket           0
Fare             1
Cabin          327
Embarked         0
dtype: int64
PassengerId Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 892 3 Kelly, Mr. James male 34.5 0 0 330911 7.8292 NaN Q
1 893 3 Wilkes, Mrs. James (Ellen Needs) female 47.0 1 0 363272 7.0000 NaN S
2 894 2 Myles, Mr. Thomas Francis male 62.0 0 0 240276 9.6875 NaN Q
test.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 11 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  418 non-null    int64  
 1   Pclass       418 non-null    int64  
 2   Name         418 non-null    object 
 3   Sex          418 non-null    object 
 4   Age          332 non-null    float64
 5   SibSp        418 non-null    int64  
 6   Parch        418 non-null    int64  
 7   Ticket       418 non-null    object 
 8   Fare         417 non-null    float64
 9   Cabin        91 non-null     object 
 10  Embarked     418 non-null    object 
dtypes: float64(2), int64(4), object(5)
memory usage: 36.0+ KB
# Import visualization libraries
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
sns.set() #Setting Seaborn default for plots
def bar_chart(df, feature):
    survived = df.loc[df['Survived']==1,feature].value_counts()
    dead = df.loc[df['Survived']==0, feature].value_counts()
    mat = pd.DataFrame([survived, dead])
    mat.index = ['Survived','Dead']
    mat.plot(kind='bar', stacked = True, figsize=(10,5))

bar_chart(train, 'Sex') # males mostly died and females mostly survived (strong signal)
bar_chart(train, 'Pclass') # 3rd-class passengers died at a noticeably higher rate
bar_chart(train, 'SibSp') # passengers with no siblings/spouse aboard died more often (weak signal)
bar_chart(train, 'Parch') 
bar_chart(train, 'Embarked') 

# bar_chart of embarkation port for 1st/2nd/3rd class passengers
pclass1 = train.loc[train['Pclass']==1, 'Embarked'].value_counts()
pclass2 = train.loc[train['Pclass']==2, 'Embarked'].value_counts()
pclass3 = train.loc[train['Pclass']==3, 'Embarked'].value_counts()
sset = pd.DataFrame([pclass1, pclass2, pclass3])
sset.index = ['Pclass1','Pclass2', 'Pclass3']
sset.plot(kind='bar', stacked = True, figsize=(10,5))
<AxesSubplot:>

Step 2: Feature Extraction

  • Fill in missing values or drop them
  • Derive new features from existing data (e.g., extract Mr/Ms titles from Name)
  • Convert string data to numbers (e.g., Mr → 1, Ms → 2, Dr → 3)
  • Adjust feature ranges (e.g., bin Age into 10s/20s/30s)
  • Separate the features from the label (ground truth)
# Extract the title (Mr/Miss/etc.) from Name into a new Title column
train_test_data = [train, test]

for dataset in train_test_data:
    dataset['Title'] = dataset['Name'].str.extract(r'([A-Za-z]+)\.', expand=False)

train['Title'].value_counts()
Mr          517
Miss        182
Mrs         125
Master       40
Dr            7
Rev           6
Mlle          2
Major         2
Col           2
Countess      1
Jonkheer      1
Lady          1
Capt          1
Ms            1
Sir           1
Don           1
Mme           1
Name: Title, dtype: int64
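The `str.extract` call above captures the alphabetic token immediately before a period, which is exactly where the title sits in these names. A minimal check:

```python
import pandas as pd

names = pd.Series([
    'Braund, Mr. Owen Harris',
    'Heikkinen, Miss. Laina',
])
# Capture the run of letters immediately followed by a period.
titles = names.str.extract(r'([A-Za-z]+)\.', expand=False)
print(titles.tolist())  # ['Mr', 'Miss']
```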
train.isnull().sum()
PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
Title            0
dtype: int64
title_mapping = {"Mr": 0, "Miss": 1, "Mrs": 2, "Master":3, "Dr": 3, "Rev": 3, "Col": 3 , "Major":3, "Mlle":3, "Countess":3, "Ms": 3, "Lady": 3 , "Jonkheer": 3, "Don": 3, "Dona": 3 , "Mme":3, "Capt": 3, "Sir":3}

for dataset in train_test_data:
    dataset['Title']=dataset['Title'].map(title_mapping)
bar_chart(train, 'Title')
train['Title'].value_counts() # Mr (0) clearly died at a much higher rate
train.drop('Name', axis=1, inplace=True)
test.drop('Name', axis=1, inplace=True)

sex_mapping = {'male': 0, "female": 1}

for dataset in train_test_data:
    dataset['Sex'] = dataset['Sex'].map(sex_mapping)
# Fill in missing values (dropping those rows is also an option)
# Here, fill missing Age with the median age of each Title group (Mr/Miss/Mrs/...)
train['Age'].fillna(train.groupby('Title')['Age'].transform('median'), inplace=True)
test['Age'].fillna(test.groupby('Title')['Age'].transform('median'), inplace=True)
train.isnull().sum()
PassengerId      0
Survived         0
Pclass           0
Sex              0
Age              0
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
Title            0
dtype: int64
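The groupby-transform fill used here replaces each missing Age with the median of that row's own Title group. On a toy frame (hypothetical values):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'Title': [0, 0, 0, 1, 1],
    'Age':   [20.0, 30.0, np.nan, 5.0, np.nan],
})
# transform('median') broadcasts each group's median back to its rows,
# so fillna picks the right value per group.
df['Age'] = df['Age'].fillna(df.groupby('Title')['Age'].transform('median'))
print(df['Age'].tolist())  # [20.0, 30.0, 25.0, 5.0, 5.0]
```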
def facet_plot(df,feature, range_opt=None):
    facet = sns.FacetGrid(df, hue='Survived', aspect=4)
    facet.map(sns.kdeplot, feature, shade = True)

    if not range_opt:
        facet.set(xlim=(0, train[feature].max()))
    else:
        facet.set(xlim=range_opt)
    facet.add_legend()

    plt.show()
facet_plot(train,'Age')
facet_plot(train,'Age', [0,20])
facet_plot(train,'Age', [20,35]) # ages 20-35 died in large numbers
facet_plot(train,'Age', [35,80])

Binning

  • Binning/Converting numerical Age into a categorical variable

  • feature vector map:

    • child: 0
    • young: 1
    • adult: 2
    • mid-age: 3
    • senior: 4
for dataset in train_test_data:
    dataset.loc[ dataset['Age'] <= 16 , 'Age'] =0
    dataset.loc[(dataset['Age'] > 16) & (dataset['Age'] <=26), 'Age'] = 1
    dataset.loc[(dataset['Age'] > 26) & (dataset['Age'] <=36), 'Age'] = 2
    dataset.loc[(dataset['Age'] > 36) & (dataset['Age'] <=62), 'Age'] = 3
    dataset.loc[ dataset['Age'] > 62  , 'Age'] = 4
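The chained `.loc` assignments above cut Age on the edges 16/26/36/62; `pd.cut` expresses the same binning more compactly (a sketch on toy ages):

```python
import pandas as pd

ages = pd.Series([10.0, 22.0, 30.0, 40.0, 70.0])
# Same edges as the .loc version: (-inf,16], (16,26], (26,36], (36,62], (62,inf)
edges = [-float('inf'), 16, 26, 36, 62, float('inf')]
binned = pd.cut(ages, bins=edges, labels=[0, 1, 2, 3, 4])
print(binned.astype(int).tolist())  # [0, 1, 2, 3, 4]
```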

# Fill in missing Embarked values
# Most passengers embarked at S, so fill missing values with S

embarked_mapping={"S": 0, "C": 1, "Q": 2}

for dataset in train_test_data:
    dataset['Embarked']=dataset['Embarked'].fillna('S')
    dataset['Embarked'] = dataset['Embarked'].map(embarked_mapping)

facet_plot(train, 'Fare')
facet_plot(train, 'Fare', [0,30]) # heavy casualties in this fare range
facet_plot(train, 'Fare', [30,150]) # above this range, most passengers survived

# binning
for dataset in train_test_data:
    dataset.loc[dataset['Fare'] <= 17, 'Fare'] =0 
    dataset.loc[(dataset['Fare'] > 17) & (dataset['Fare'] <=30), 'Fare'] =1 
    dataset.loc[(dataset['Fare'] > 30) & (dataset['Fare'] <=100), 'Fare'] =2
    dataset.loc[dataset['Fare'] > 100, 'Fare'] =3  
    
train['Fare'].value_counts()
0.0    496
2.0    181
1.0    161
3.0     53
Name: Fare, dtype: int64
facet_plot(train,'Fare')

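The Fare cut points (17/30/100) above were picked by eye from the KDE plots; a quantile-based alternative is `pd.qcut`, which chooses edges so the bins get roughly equal counts (a sketch on toy fares):

```python
import pandas as pd

fares = pd.Series([7.25, 8.05, 13.0, 26.0, 71.28, 512.33])
# Four equal-frequency bins instead of hand-picked edges.
binned = pd.qcut(fares, q=4, labels=[0, 1, 2, 3])
print(binned.astype(int).tolist())  # [0, 0, 1, 2, 3, 3]
```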
train['Cabin'].value_counts()
B96 B98        4
C23 C25 C27    4
G6             4
E101           3
D              3
              ..
D56            1
E10            1
A26            1
A6             1
F G63          1
Name: Cabin, Length: 147, dtype: int64
cabin_mapping = {"A": 0, "B": 0.4, "C": 0.8, "D": 1.2, "E": 1.6, "F": 2, "G": 2.4, "T": 2.8}
for dataset in train_test_data:
    dataset['Cabin'] = dataset['Cabin'].str[:1]   # keep only the deck letter
    dataset['Cabin'] = dataset['Cabin'].map(cabin_mapping)

train['Cabin'].fillna( train.groupby('Pclass')['Cabin'].transform('median'), inplace=True)
test['Cabin'].fillna( test.groupby('Pclass')['Cabin'].transform('median'), inplace=True)

test['Cabin'].value_counts()
2.0    308
0.8     62
0.4     18
1.2     13
1.6      9
0.0      7
2.4      1
Name: Cabin, dtype: int64
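The deck-letter step above works because `str[:1]` keeps only the first character of the cabin string, which is then mapped to a number. On a toy series:

```python
import pandas as pd

cabin_mapping = {"A": 0, "B": 0.4, "C": 0.8, "D": 1.2,
                 "E": 1.6, "F": 2, "G": 2.4, "T": 2.8}

cabins = pd.Series(['C85', 'B96 B98', 'G6'])
decks = cabins.str[:1].map(cabin_mapping)  # 'C85' -> 'C' -> 0.8
print(decks.tolist())  # [0.8, 0.4, 2.4]
```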
test.isnull().sum()
PassengerId    0
Pclass         0
Sex            0
Age            0
SibSp          0
Parch          0
Ticket         0
Fare           1
Cabin          0
Embarked       0
Title          0
dtype: int64
# Fill the remaining missing Fare values with the median fare of the same Pclass
train['Fare'].fillna(train.groupby('Pclass')['Fare'].transform('median'), inplace=True)
test['Fare'].fillna(test.groupby('Pclass')['Fare'].transform('median'), inplace=True)
train['FamilySize'] = train['SibSp'] + train['Parch']+1
test['FamilySize'] = test['SibSp'] + test['Parch'] +1
facet_plot(train, "FamilySize")

family_mapping = {1: 0, 2: 0.4, 3: 0.8, 4: 1.2, 5:1.6, 6:2, 7:2.4, 8:2.8, 9:3.2, 10:3.6, 11:4}

for dataset in train_test_data:
    dataset['FamilySize'] = dataset['FamilySize'].map(family_mapping)
features_drop = ['Ticket', 'SibSp', 'Parch', 'PassengerId']
train = train.drop(features_drop, axis= 1)
test = test.drop(features_drop, axis=1)
train_data = train.drop('Survived', axis =1)
target = train['Survived']

train_data.shape, target.shape, test.shape
((891, 8), (891,), (418, 8))

test.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 8 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   Pclass      418 non-null    int64  
 1   Sex         418 non-null    int64  
 2   Age         418 non-null    float64
 3   Fare        417 non-null    float64
 4   Cabin       418 non-null    float64
 5   Embarked    418 non-null    int64  
 6   Title       418 non-null    int64  
 7   FamilySize  418 non-null    float64
dtypes: float64(4), int64(4)
memory usage: 26.2 KB
train_data.head(3)
Pclass Sex Age Fare Cabin Embarked Title FamilySize
0 3 0 1.0 0.0 2.0 0 0 0.4
1 1 1 3.0 2.0 0.8 1 2 0.4
2 3 1 1.0 0.0 2.0 0 1 0.0

Step 3: Training

  • Prediction models (classifiers: kNN, Decision Tree, Random Forest, Naive Bayes, SVM)
    1) kNN (k-nearest neighbors)
    2) Decision Tree (rule-based splits)
    3) Random Forest (an ensemble of decision trees with majority voting)
    4) Naive Bayes (probabilistic, based on Bayes' rule)
    5) SVM (Support Vector Machine)
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
k_fold = KFold(n_splits = 10, shuffle = True, random_state = 0)
clf = KNeighborsClassifier(n_neighbors=13)
scoring = 'accuracy'
score = cross_val_score(clf, train_data, target, cv= k_fold, n_jobs =1 , scoring=scoring)
print(np.mean(score))
0.8260424469413232
clf = RandomForestClassifier(n_estimators=13)
score = cross_val_score(clf, train_data, target, cv= k_fold, n_jobs =1 , scoring=scoring)
print(np.mean(score))

0.8103370786516854
clf = GaussianNB()
score = cross_val_score(clf, train_data, target, cv= k_fold, n_jobs =1 , scoring=scoring)
print(np.mean(score))
0.7878027465667914
clf = SVC()
score = cross_val_score(clf, train_data, target, cv= k_fold, n_jobs =1 , scoring=scoring)
print(np.mean(score))
0.8350187265917602
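The four runs above repeat the same fit-and-score pattern, so they can be collapsed into one loop. The sketch below substitutes synthetic data (hypothetical `X`, `y`) so it runs standalone; on the real `train_data`/`target` the loop is identical:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in for (train_data, target): 200 samples, 8 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

k_fold = KFold(n_splits=10, shuffle=True, random_state=0)
results = {}
for name, clf in [('kNN', KNeighborsClassifier(n_neighbors=13)),
                  ('SVM', SVC())]:
    score = cross_val_score(clf, X, y, cv=k_fold, scoring='accuracy')
    results[name] = np.mean(score)
    print(f'{name}: {results[name]:.3f}')
```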
clf = SVC() # SVM had the best cross-validation accuracy (~83.5%), so use it for the final predictions
clf.fit(train_data, target)

prediction = clf.predict(test)
# Write the submission file
submission = pd.DataFrame({
    "PassengerId": test_original['PassengerId'],
    "Survived": prediction,
})

submission.to_csv('submission_wook.csv',index=False)
submission = pd.read_csv('submission_wook.csv')
submission.head()
PassengerId Survived
0 892 0
1 893 1
2 894 0
3 895 0
4 896 1