KAGGLE: HEART ATTACK PREDICTION

AUTHOR: SungwookLE
DATE: ‘21.7/4
DATASET: https://www.kaggle.com/rashikrahmanpritom/heart-attack-analysis-prediction-dataset


About this dataset

  • Age : Age of the patient
  • Sex : Sex of the patient
  • exang: exercise induced angina (1 = yes; 0 = no)
  • ca: number of major vessels (0-3)
  • cp : chest pain type
    Value 1: typical angina
    Value 2: atypical angina
    Value 3: non-anginal pain
    Value 4: asymptomatic

  • trtbps : resting blood pressure (in mm Hg)
  • chol : cholesterol in mg/dl fetched via BMI sensor
  • fbs : (fasting blood sugar > 120 mg/dl) (1 = true; 0 = false)
  • rest_ecg : resting electrocardiographic results
    Value 0: normal
    Value 1: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV)
    Value 2: showing probable or definite left ventricular hypertrophy by Estes' criteria

  • thalach : maximum heart rate achieved
  • target :
    0= less chance of heart attack
    1= more chance of heart attack

OVERVIEW

1) Data Analysis

  • Understand the dimensions and shape of the data
  • Plot graphs to examine how the target variable output correlates with the other variables

2) Feature Engineering
2-1) Separate categorical and numerical features

  • using select_dtypes().
  • Among numerical columns, reclassify ones like month or year as categorical with apply(str)

2-2) Fill missing data

  • numerical: fill gaps with the mean, median, or mode via .fillna(xxx), mean(), median(), mode()
  • categorical: use pd.get_dummies() or LabelEncoder to remove missing values and complete the one-hot encoding

2-3) Reduce skewness in the data

  • Reduce the skewness of numerical features

2-4) Add new features / delete features

  • If needed

3) Modeling

  • Cross-validation using cross_val_score, KFold, train_test_split.
  • Regressors: LinearRegression, RidgeCV, LassoCV, ElasticNetCV
  • Classifiers: KNN, RandomForest, ...
  • Techniques: StandardScaler, RobustScaler.
  • Easy modeling: make_pipeline
import numpy as np
import pandas as pd
from subprocess import check_output

import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
sns.set()
# list the files in the input directory
print(check_output(["ls", "input"]).decode('utf8'))
heart.csv
o2Saturation.csv

1) Data Analysis

Understand the dimensions and shape of the data
Plot graphs to examine how the target variable output correlates with the other variables

heart = pd.read_csv("input/heart.csv")
o2Saturation = pd.read_csv("input/o2Saturation.csv")  # loaded alongside; not used below
heart.shape, o2Saturation.shape
((303, 14), (3585, 1))
corr = heart.corr()
f, ax = plt.subplots(figsize=(10,8))
sns.heatmap(corr, vmax=0.8, annot=True)
abs(corr['output']).sort_values(ascending=False)
output      1.000000
exng        0.436757
cp          0.433798
oldpeak     0.430696
thalachh    0.421741
caa         0.391724
slp         0.345877
thall       0.344029
sex         0.280937
age         0.225439
trtbps      0.144931
restecg     0.137230
chol        0.085239
fbs         0.028046
Name: output, dtype: float64

[Figure: correlation heatmap of the features]
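As a small aside (not part of the original flow), the ranking above could be turned into a feature shortlist by thresholding the absolute correlation; the 0.3 cutoff below is an arbitrary assumption:

# Hypothetical shortlist: features whose |correlation with output| exceeds a cutoff
threshold = 0.3  # arbitrary illustration value
shortlist = abs(corr['output']).drop('output')
shortlist = shortlist[shortlist > threshold].index.tolist()
print(shortlist)  # per the table above: exng, cp, oldpeak, thalachh, caa, slp, thall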

# ALL DATA TYPES are numerical data
heart.dtypes
age           int64
sex           int64
cp            int64
trtbps        int64
chol          int64
fbs           int64
restecg       int64
thalachh      int64
exng          int64
oldpeak     float64
slp           int64
caa           int64
thall         int64
output        int64
dtype: object
heart.isnull().sum()
age         0
sex         0
cp          0
trtbps      0
chol        0
fbs         0
restecg     0
thalachh    0
exng        0
oldpeak     0
slp         0
caa         0
thall       0
output      0
dtype: int64
def facet_plot(feature, range_opt=None):
    # KDE plot of a single feature, colored by the target class
    facet = sns.FacetGrid(heart, hue='output', aspect=4)
    facet.map(sns.kdeplot, feature, shade=True)

    # default x-range is [0, max]; pass range_opt to override
    if not range_opt:
        facet.set(xlim=(0, heart[feature].max()))
    else:
        facet.set(xlim=range_opt)
    facet.add_legend()
    plt.title("Output: " + feature)

for i in heart.columns:
    if i != 'output':
        facet_plot(i)

[Figures: KDE plots of each of the 13 features, split by output class]

# note: cp is coded 0-3 in the data, though the description above lists values 1-4
heart['cp'].value_counts()
0    143
2     87
1     50
3     23
Name: cp, dtype: int64

2) Feature Engineering

2-1) Separate categorical and numerical features

  • using select_dtypes().
  • Among numerical columns, reclassify ones like month or year as categorical with apply(str)
heart['sex'] = heart['sex'].apply(str)
# not strictly necessary here; done just to demonstrate the step
heart.dtypes
age           int64
sex          object
cp            int64
trtbps        int64
chol          int64
fbs           int64
restecg       int64
thalachh      int64
exng          int64
oldpeak     float64
slp           int64
caa           int64
thall         int64
output        int64
dtype: object
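Since sex is now the only object column, the select_dtypes() split from the checklist can be shown directly; a minimal sketch:

# Split columns by dtype (sketch; at this point only 'sex' is an object column)
categorical_cols = heart.select_dtypes(include='object').columns.tolist()
numerical_cols = heart.select_dtypes(exclude='object').columns.tolist()
print(categorical_cols)  # ['sex']
print(numerical_cols)    # the remaining int64/float64 columns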
from sklearn.preprocessing import LabelEncoder

# encode 'sex' back to integer labels (0/1)
lbl = LabelEncoder()
lbl.fit(heart['sex'])
heart['sex'] = lbl.transform(heart['sex'].values)
heart['sex'].value_counts()
1    207
0     96
Name: sex, dtype: int64

2-2) Fill missing data

  • numerical: fill gaps with the mean, median, or mode via .fillna(xxx), mean(), median(), mode()
  • categorical: use pd.get_dummies() or LabelEncoder to remove missing values and complete the one-hot encoding

There is no missing data in this dataset, so nothing needs to be filled; a sketch of what this step would look like follows.
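A minimal sketch of what the imputation and encoding would look like on a dataset that does have gaps; the toy frame and column names below are hypothetical:

import numpy as np
import pandas as pd

# Toy frame with missing values (hypothetical; this heart dataset has none)
df = pd.DataFrame({'num': [1.0, np.nan, 3.0],
                   'cat': ['a', None, 'b']})

# numerical: fill with the median (mean() or mode() work the same way)
df['num'] = df['num'].fillna(df['num'].median())

# categorical: fill with the mode, then one-hot encode
df['cat'] = df['cat'].fillna(df['cat'].mode()[0])
df = pd.get_dummies(df, columns=['cat'])
print(df)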

2-3) Reduce skewness in the data

  • Reduce the skewness of numerical features
# confirm again: no missing data
heart.isnull().sum()
age         0
sex         0
cp          0
trtbps      0
chol        0
fbs         0
restecg     0
thalachh    0
exng        0
oldpeak     0
slp         0
caa         0
thall       0
output      0
dtype: int64
from scipy import stats
from scipy.stats import norm, skew  # for some statistics

# compute per-column skewness and flag features with |skew| > 1
skewness = heart.apply(lambda x: skew(x.dropna()))
skewness = skewness.sort_values(ascending=False)
skewness_features = skewness[abs(skewness.values) > 1].index
print("skewness:")
print(skewness_features)
skewness:
Index(['fbs', 'caa', 'oldpeak', 'chol'], dtype='object')
sns.distplot(heart['oldpeak'], fit=norm)
(mu, sigma) = norm.fit(heart['oldpeak'])
plt.legend(['Normal dist. mu={:.2f}, std={:.2f}'.format(mu, sigma)], loc='best')
plt.ylabel('Frequency')
plt.title('oldpeak before log1p, skew = {:.3f}'.format(skewness['oldpeak']))
plt.show()

heart['oldpeak'] = np.log1p(heart['oldpeak'])
sns.distplot(heart['oldpeak'], fit=norm)
(mu, sigma) = norm.fit(heart['oldpeak'])
plt.legend(['Normal dist. mu={:.2f}, std={:.2f}'.format(mu, sigma)], loc='best')
plt.ylabel('Frequency')
plt.title('oldpeak after log1p, skew = {:.3f}'.format(skew(heart['oldpeak'])))
plt.show()

sns.distplot(heart['chol'], fit=norm)
(mu, sigma) = norm.fit(heart['chol'])
plt.legend(['Normal dist. mu={:.2f}, std={:.2f}'.format(mu, sigma)], loc='best')
plt.ylabel('Frequency')
plt.title('chol before log1p, skew = {:.3f}'.format(skewness['chol']))
plt.show()

heart['chol'] = np.log1p(heart['chol'])
sns.distplot(heart['chol'], fit=norm)
(mu, sigma) = norm.fit(heart['chol'])
plt.legend(['Normal dist. mu={:.2f}, std={:.2f}'.format(mu, sigma)], loc='best')
plt.ylabel('Frequency')
plt.title('chol after log1p, skew = {:.3f}'.format(skew(heart['chol'])))
plt.show()

[Figures: oldpeak and chol distributions before/after the log1p transform, with fitted normal curves]

# QQ plot to check the transformed chol against a normal distribution
fig = plt.figure()
res = stats.probplot(heart['chol'], plot=plt)
plt.show()

[Figure: QQ plot of chol]

# re-check skewness after the log1p transforms
skewness = heart.apply(lambda x: skew(x.dropna()))
skewness = skewness.sort_values(ascending=False)
skewness_features = skewness[abs(skewness.values)>1].index
print("skewness:")
print(skewness_features)
skewness:
Index(['fbs', 'caa'], dtype='object')

Only fbs and caa remain flagged; both are binary/ordinal codes rather than continuous measurements, so a log transform is not meaningful for them.

3) Modeling

  • Cross-validation using cross_val_score, KFold, train_test_split.
  • Regressors: LinearRegression, RidgeCV, LassoCV, ElasticNetCV
  • Classifiers: KNN, RandomForest, MLPClassifier, ...
  • Techniques: StandardScaler, RobustScaler.
  • Easy modeling: make_pipeline
# separate the target from the features
label = heart['output']
heart.drop('output', axis=1, inplace=True)
heart.head()
age sex cp trtbps chol fbs restecg thalachh exng oldpeak slp caa thall
0 63 1 3 145 5.455321 1 0 150 0 1.193922 0 0 1
1 37 1 2 130 5.525453 0 1 187 0 1.504077 0 0 2
2 41 0 1 130 5.323010 0 0 172 0 0.875469 2 0 2
3 56 1 1 120 5.468060 0 1 178 0 0.587787 2 0 2
4 57 0 0 120 5.872118 0 1 163 1 0.470004 2 0 2
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score

k_fold = KFold(n_splits=10, shuffle=True, random_state=0)
RandomForest = make_pipeline(StandardScaler(), RandomForestClassifier(n_estimators=50))
score = cross_val_score(RandomForest, heart, label, cv=k_fold, n_jobs=1, scoring='accuracy')
print('RandomForest CrossValidation Score is {:.5f}'.format(np.mean(score)))
RandomForest CrossValidation Score is 0.82215
MLP = make_pipeline(StandardScaler(), MLPClassifier(learning_rate='adaptive'))
score = cross_val_score(MLP, heart, label, cv=k_fold, n_jobs=1, scoring='accuracy')
print('MLP CrossValidation Score is {:.5f}'.format(np.mean(score)))
MLP CrossValidation Score is 0.84172
SVM = make_pipeline(StandardScaler(), SVC())
score = cross_val_score(SVM, heart, label, cv=k_fold, n_jobs=1, scoring='accuracy')
print('SVM CrossValidation Score is {:.5f}'.format(np.mean(score)))
SVM CrossValidation Score is 0.83495
KNN = make_pipeline(StandardScaler(), KNeighborsClassifier())
score = cross_val_score(KNN, heart, label, cv=k_fold, n_jobs=1, scoring='accuracy')
print('KNN CrossValidation Score is {:.5f}'.format(np.mean(score)))
KNN CrossValidation Score is 0.83194
  • SVM (83.495%) is used as the final model below, even though the MLP scored slightly higher (84.172%) on this run.
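The modeling checklist also mentions train_test_split, which never appears above; a minimal holdout sketch (the 80/20 split and random_state are assumptions):

from sklearn.model_selection import train_test_split

# Holdout check as a complement to K-fold CV (sketch)
X_train, X_test, y_train, y_test = train_test_split(
    heart, label, test_size=0.2, random_state=0, stratify=label)
SVM.fit(X_train, y_train)
print('SVM holdout accuracy: {:.5f}'.format(SVM.score(X_test, y_test)))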
sts = StandardScaler()
sts.fit(heart)
feed = sts.transform(heart)
feed = pd.DataFrame(feed, columns=heart.columns)

#skewness = feed.apply(lambda x: skew(x.dropna()))
#print(skewness)
#from sklearn.cluster import KMeans
#KMM = KMeans(n_clusters=2)
#KMM.fit(feed)

# fit the SVM pipeline on the full scaled data and predict on the same data
# (the pipeline scales again internally, which is harmless on standardized input)
SVM.fit(feed, label)
pred = SVM.predict(feed)
pred = pd.DataFrame(pred, columns=['output'])

from sklearn.decomposition import PCA

# project the scaled features down to 2D for visualization
pca = PCA(n_components=2)
pca.fit(feed)

X = pca.transform(feed)
X = pd.DataFrame(X)
X = X.rename(columns={0: 'one', 1: 'two'})
X = pd.concat([X, pred], axis=1)
colors = ['red', 'blue']
labels = [0, 1]
legends = ['less chance of heart atk.', 'more chance of heart atk.']

# scatter the two principal components, colored by the predicted class
for la, color, leg in zip(labels, colors, legends):
    plt.scatter(x=X.loc[X['output']==la]['one'], y=X.loc[X['output']==la]['two'], c=color, s=3, label=leg)
plt.legend()

<matplotlib.legend.Legend at 0x7f5afa0d7b38>

[Figure: 2D PCA scatter, colored by SVM prediction]

  • Following this data-processing and machine-learning workflow yields decent predictive performance, but it plateaus around 83%; pushing beyond that requires more careful feature engineering, e.g. along the lines sketched below.
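As one hypothetical illustration of such feature engineering (not part of the original analysis), age could be binned into bands and one-hot encoded with the same pd.get_dummies tooling from step 2-2:

# Hypothetical engineered feature: age bands (bin edges are an assumption)
heart['age_band'] = pd.cut(heart['age'], bins=[0, 40, 55, 70, 120],
                           labels=['young', 'middle', 'senior', 'elder'])
heart = pd.get_dummies(heart, columns=['age_band'])
print(heart.filter(like='age_band').sum())  # rough counts per band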