Data Analysis: Kaggle Titanic Survived Predict
- OVERVIEW
START
1. Data Analysis
2. Feature Engineering
3. Modeling
끝

Data Analysis: Kaggle Titanic `Survived` Predict

AUTHOR: SungwookLE
DATE: ‘21.6/23
PROBLEM: Classifier Kaggle LINK
REFERENCE:

#1 LECTURE
#2 LECTURE
#3 LECTURE
The Challenge
In this challenge, we ask you to build a predictive model that answers the question: “what sorts of people were more likely to survive?” using passenger data (ie name, age, gender, socio-economic class, etc).
Given Data
The data has been split into two groups:
training set (train.csv)
test set (test.csv)

Data Dictionary


Variable	Definition	Key
survival	Survival	0 = No, 1 = Yes
pclass	Ticket class	1 = 1st, 2 = 2nd, 3 = 3rd
sex	Sex
Age	Age in years
sibsp	# of siblings / spouses aboard the Titanic
parch	# of parents / children aboard the Titanic
ticket	Ticket number
fare	Passenger fare
cabin	Cabin number
embarked	Port of Embarkation	C = Cherbourg, Q = Queenstown, S = Southampton

Variable Notes
- pclass: A proxy for socio-economic status (SES)
  1st = Upper
  2nd = Middle
  3rd = Lower
- age: Age is fractional if less than 1. If the age is estimated, is it in the form of xx.5
- sibsp: The dataset defines family relations in this way…
- Sibling = brother, sister, stepbrother, stepsister
- Spouse = husband, wife (mistresses and fiancés were ignored)
- parch: The dataset defines family relations in this way…
- Parent = mother, father
- Child = daughter, son, stepdaughter, stepson Some children travelled only with a nanny, therefore parch=0 for them.

OVERVIEW

1) Data Analysis

데이터 차원, 형태 파악하기
그래프 그려서 에측변수 SalePrice와 다른 변수와의 상관관계 파악하기

2) Feature Engineering
2-1) categorical + numerical features 분리하기

using select_dtypes().
numerical 데이터 중 month나 year 등의 데이터는 categorical로 분류해주기 apply(str)

2-2) 비어있는 missing 데이터 채우기

numerical: mean, median, mode 를 활용하여 데이터 채우기 .fillna(xxx), mean(), median(), mode()
categorical: pd.get_dummies() 나 LabelEncoder를 활용해서 missing 데이터도 없애고, one-hot encoding도 완성하기

2-3) data의 skewness 줄이기

numerical data 의 skewness 줄이기

2-4) new feature / del feature

필요하다면

3) Modeling

CrossValidation using cross_val_score, KFold. train_test_split.
Regressor : LinearRegression, RidgeCV, LassoCV, ElasticNetCV
Classifier : KNN, RandomForest, ...
Techniques: StandardScaler, RobustScaler.
Easy modeling: make_pipeline

START

1. Data Analysis

데이터 차원, 형태 파악하기
그래프 그려서 에측변수 SalePrice와 다른 변수와의 상관관계 파악하기

from subprocess import check_output
import pandas as pd

print(check_output(["ls","input"]).decode('utf8'))

test.csv
train.csv

train = pd.read_csv('input/train.csv')
test = pd.read_csv('input/test.csv')

print("Initial train data shape is {}".format(train.shape))
n_train = train.shape[0]
train.head(3)

Initial train data shape is (891, 12)

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Ticket	Fare	Cabin	Embarked
0	1	0	3	Braund, Mr. Owen Harris	male	22.0	1	A/5 21171	7.2500	NaN	S
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	PC 17599	71.2833	C85	C
2	3	1	3	Heikkinen, Miss. Laina	female	26.0	0	STON/O2. 3101282	7.9250	NaN	S

print("Initial test data shape is {}".format(test.shape))
n_test = test.shape[0]
test.head(3)

Initial test data shape is (418, 11)

	PassengerId	Pclass	Name	Sex	Age	SibSp	Ticket	Fare	Cabin	Embarked
0	892	3	Kelly, Mr. James	male	34.5	0	330911	7.8292	NaN	Q
1	893	3	Wilkes, Mrs. James (Ellen Needs)	female	47.0	1	363272	7.0000	NaN	S
2	894	2	Myles, Mr. Thomas Francis	male	62.0	0	240276	9.6875	NaN	Q

all_data = pd.concat([train,test],axis=0).reset_index(drop=True)

unique_id = len(set(all_data['PassengerId']))
total_count = len(all_data)

diff = unique_id-total_count
print("Difference with unique-Id and total Count: {}".format(diff))

Difference with unique-Id and total Count: 0

import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
sns.set()

y_label = all_data['Survived'][:n_train]
all_data.drop('PassengerId', axis=1, inplace=True)
all_data.drop('Survived',axis=1, inplace=True)

def bar_chart(feature):
    survived = train.loc[train['Survived']==1, feature].value_counts()
    dead = train.loc[train['Survived']==0, feature].value_counts()

    df = pd.DataFrame([survived, dead], index=['Survived', 'Dead'])
    df.plot(kind='bar', stacked=True, figsize=(10,5),title=("Survived with "+feature))

bar_chart('Sex')

svg

corrmat= train.corr()
f, ax = plt.subplots(figsize=(10,8))
sns.heatmap(corrmat, vmax=0.8, annot=True)

abs(corrmat['Survived']).sort_values(ascending =False)

Survived       1.000000
Pclass         0.338481
Fare           0.257307
Parch          0.081629
Age            0.077221
SibSp          0.035322
PassengerId    0.005007
Name: Survived, dtype: float64

svg

def facet_plot(feature, range_opt=None):
    facet = sns.FacetGrid(train, hue='Survived', aspect=4)
    facet.map(sns.kdeplot, feature, shade = True)

    if not range_opt:
        facet.set(xlim=(0, train[feature].max()))
    else:
        facet.set(xlim=range_opt)
    facet.add_legend()
    plt.title("Survived with "+feature)
    plt.show()

facet_plot('Age')

svg

all_data.isnull().sum()

Pclass         0
Name           0
Sex            0
Age          263
SibSp          0
Parch          0
Ticket         0
Fare           1
Cabin       1014
Embarked       2
dtype: int64

all_data.dtypes.value_counts()

object     5
int64      3
float64    2
dtype: int64

print("Train Y Label Data is {}".format(y_label.shape))
print("All Data is {}".format(all_data.shape))

Train Y Label Data is (891,)
All Data is (1309, 10)

2. Feature Engineering

2-1. Categorical + numerical features 분리하기

using select_dtypes()
numerical 데이터 중 month나 year 등의 데이터는 categorical로 분류해주기 apply(str)

all_data['Pclass']=all_data['Pclass'].apply(str)
print("Numerical Feature is {}".format(len(all_data.select_dtypes(exclude=object).columns)))
numerical_features = all_data.select_dtypes(exclude=object).columns
numerical_features

Numerical Feature is 4

Index(['Age', 'SibSp', 'Parch', 'Fare'], dtype='object')

print("Categorical Feature is {}".format(len(all_data.select_dtypes(include=object).columns)))
categorical_features = all_data.select_dtypes(include=object).columns
categorical_features

Categorical Feature is 6

Index(['Pclass', 'Name', 'Sex', 'Ticket', 'Cabin', 'Embarked'], dtype='object')

2-2. 비어있는 missing 데이터 채우기

numerical: mean, median, mode 를 활용하여 데이터 채우기 .fillna(xxx), mean(), median(), mode()
categorical: pd.get_dummies() 나 LabelEncoder를 활용해서 missing 데이터도 없애고, one-hot encoding도 완성하기

all_data.isnull().sum()

Pclass         0
Name           0
Sex            0
Age          263
SibSp          0
Parch          0
Ticket         0
Fare           1
Cabin       1014
Embarked       2
dtype: int64

# 숫자 데이터
all_data['Age'].fillna(all_data.groupby('Sex')['Age'].transform('median'), inplace=True)

# 숫자 데이터
all_data['Fare'].fillna(all_data.groupby('Pclass')['Fare'].transform('median'), inplace=True)

all_data.drop('Cabin',axis=1, inplace=True)
all_data.drop('Ticket',axis=1, inplace=True)
categorical_features=categorical_features.drop('Cabin')
categorical_features=categorical_features.drop('Ticket')

# 카테고리칼 데이터
all_data['Embarked'].fillna(all_data['Embarked'].value_counts().sort_values(ascending=False).index[0],inplace=True)

all_data['Name']=all_data['Name'].str.extract('([A-Za-z]+)\.', expand=False)

all_data.isnull().sum()

Pclass      0
Name        0
Sex         0
Age         0
SibSp       0
Parch       0
Fare        0
Embarked    0
dtype: int64

print("Missing(NA) Data is {}".format(all_data.isnull().values.sum()))

Missing(NA) Data is 0

pd.get_dummies로 categorical데이터 one-hot encoding 해주기

all_data=pd.get_dummies(all_data)

print("After fill and ONE-HOT encoding data shape is {}".format(all_data.shape))

After fill and ONE-HOT encoding data shape is (1309, 30)

2-3. data의 skewness 줄이기

y_label 데이터도 skewness 가 있으면 줄인다음에 학습하는 것이 학습결과에 이득: classifier 문제에서는 skewness를 확인할 수는 없지
numerical data 의 skewness 줄이기

from scipy import stats
from scipy.stats import norm, skew # for some statistics

skewness = all_data[numerical_features].apply(lambda x: skew(x.dropna()))
skewness = skewness.sort_values(ascending=False)
skewness_features = skewness[abs(skewness.values)>1].index
print("skewness:")
print(skewness_features)

plt.figure(figsize=(10,5))
plt.xticks(rotation='90')
sns.barplot(x=skewness.index, y=skewness.values)
plt.title('Before skewness elimination using log1p')

skewness:
Index(['Fare', 'SibSp', 'Parch'], dtype='object')

Text(0.5, 1.0, 'Before skewness elimination using log1p')

svg

sns.distplot(all_data['Fare'], fit=norm)

(mu,sigma) = norm.fit(all_data['Fare'])
plt.legend(['Normal dist. mu={:.2f}, std={:.2f}'.format(mu,sigma)], loc='best')
plt.ylabel('Frequency')
plt.title('Before skewness in Fare')

fig = plt.figure()
res = stats.probplot(all_data['Fare'], plot=plt)
plt.show()

svg

import numpy as np

#Fare, SibSp, Parch

for col in skewness_features:
    all_data[col] = np.log1p(all_data[col])

skewness = all_data[numerical_features].apply(lambda x: skew(x.dropna()))
skewness = skewness.sort_values(ascending=False)
print(skewness)

plt.figure(figsize=(10,5))
plt.xticks(rotation='90')
sns.barplot(x=skewness.index, y=skewness.values)
plt.title('After skewness elimination using log1p')

Parch    1.787711
SibSp    1.634945
Age      0.552731
Fare     0.542519
dtype: float64

Text(0.5, 1.0, 'After skewness elimination using log1p')

svg

sns.distplot(all_data['Fare'], fit=norm)

(mu,sigma) = norm.fit(all_data['Fare'])
plt.legend(['Normal dist. mu={:.2f}, std={:.2f}'.format(mu,sigma)], loc='best')
plt.ylabel('Frequency')
plt.title('After skewness in Fare')


fig = plt.figure()
res = stats.probplot(all_data['Fare'], plot=plt)
plt.show()

svg

2-4. new feature / del feature

필요하다면 하는 것이고, 여기선 하지 않겠다.

train_data = all_data[:n_train]
test_data = all_data[n_train:]

3. Modeling

CrossValidation using cross_val_score, KFold. train_test_split.
Regressor : LinearRegression, RidgeCV, LassoCV, ElasticNetCV
Classifier :
1) kNN (가까운 이웃)
2) Decision Tree (논리 순서)
3) Random Forest (논리 순서, 여러개 세트를 두고 다수결)
4) 베이지안 룰 (확률)
5) SVM (서포트 벡터 머신))
Techniques: StandardScaler, RobustScaler.
Easy modeling: make_pipeline

from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler, RobustScaler

from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
k_fold = KFold(n_splits = 10, shuffle = True, random_state = 0)

kNN = make_pipeline(RobustScaler(),KNeighborsClassifier(n_neighbors=13) )
score = cross_val_score(kNN, train_data, y_label, cv= k_fold, n_jobs =1 , scoring='accuracy')
print(np.mean(score))

0.8047066167290886

RandomForest = make_pipeline(RobustScaler(),RandomForestClassifier(n_estimators=13) )
score = cross_val_score(RandomForest, train_data, y_label, cv= k_fold, n_jobs =1 , scoring='accuracy')
print(np.mean(score))

0.7979525593008739

Bayes = make_pipeline(RobustScaler(),GaussianNB())
score = cross_val_score(Bayes, train_data, y_label, cv= k_fold, n_jobs =1 ,  scoring='accuracy')
print(np.mean(score))

0.6971161048689138

SV_clf = make_pipeline(RobustScaler(),SVC())
score = cross_val_score(SV_clf, train_data, y_label, cv= k_fold, n_jobs =1 ,  scoring='accuracy')
print(np.mean(score))

0.8338826466916354

#   SVM 모델이 정확도가 제일 좋으니까 83%로,, 이걸로 예측을 하자!
clf =  make_pipeline(RobustScaler(),SVC())
clf.fit(train_data, y_label)
train_prediction = clf.predict(train_data)
test_prediction = clf.predict(test_data)

#plot between predicted values and label
error = abs(train_prediction - y_label)
error = pd.Series(error)
error = pd.DataFrame(error.value_counts().values, index=error.value_counts().index.map({0:"True", 1:"False"}), columns=['Count'])
error.plot(kind='bar', figsize=(10,5))

<AxesSubplot:>

svg

test_prediction=test_prediction.astype(np.int)

# 출력하기
submission = pd.DataFrame({"PassengerId": test['PassengerId'],
  "Survived": test_prediction})

submission.to_csv('submission_wook.csv',index=False)

submission = pd.read_csv('submission_wook.csv')
submission.head()

	PassengerId	Survived
0	892	0
1	893	0
2	894	0
3	895	0
4	896	1

DataAnalysis Kaggle Titanic by Myself

Classifier- Classification with feature engineering

2021-06-23 10:10

Data Analysis: Kaggle Titanic `Survived` Predict

OVERVIEW

START

1. Data Analysis

2. Feature Engineering

2-1. Categorical + numerical features 분리하기

2-2. 비어있는 missing 데이터 채우기

2-3. data의 skewness 줄이기

2-4. new feature / del feature

3. Modeling

끝

DataAnalysis Kaggle Titanic by Myself

Classifier- Classification with feature engineering

2021-06-23 10:10

Data Analysis: Kaggle Titanic Survived Predict

OVERVIEW

START

1. Data Analysis

2. Feature Engineering

2-1. Categorical + numerical features 분리하기

2-2. 비어있는 missing 데이터 채우기

2-3. data의 skewness 줄이기

2-4. new feature / del feature

3. Modeling

끝

Data Analysis: Kaggle Titanic `Survived` Predict