[Fast Campus Course Review] Data Analysis Online Course 100% Refund Challenge, Mission 45

08. ch 02. Classification Analysis - 03. Classification Analysis and the Logistic Regression Model

09. ch 02. Classification Analysis - 04. Using Logistic Regression

3) Classification analysis based on supervised learning

3-1) Data preprocessing

Changing data types

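import numpy as np
import pandas as pd

# df is assumed to be the DataFrame already loaded earlier in the course
df['Legendary'] = df['Legendary'].astype(int)
df['Generation'] = df['Generation'].astype(str)
# .copy() avoids SettingWithCopyWarning when columns are added or scaled later
preprocessed_df = df[['Type 1', 'Type 2', 'Total', 'HP', 'Attack', 'Defense', 'Sp. Atk', 'Sp. Def', 'Speed', 'Generation', 'Legendary']].copy()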

preprocessed_df.head()

one-hot encoding

encoded_df = pd.get_dummies(preprocessed_df['Type 1'])
encoded_df.head()
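
To see what pd.get_dummies produces, here is a minimal sketch on a toy column. The values below are made up for illustration and are not taken from the course data; dtype=int is only there so the output prints as 0/1.

```python
import pandas as pd

# Hypothetical toy column standing in for 'Type 1'
toy_types = pd.Series(["Grass", "Fire", "Water", "Fire"], name="Type 1")

# One indicator column per distinct category (columns come out sorted)
toy_encoded = pd.get_dummies(toy_types, dtype=int)
print(toy_encoded)
```

Each row has exactly one 1, which is why a single call like this cannot represent a row that has both a 'Type 1' and a 'Type 2' value. That is the motivation for the multi-label approach that follows.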

def make_list(x1, x2):
    # Collect the types of one row into a list, skipping a missing 'Type 2'
    type_list = []
    type_list.append(x1)
    if pd.notna(x2):
        type_list.append(x2)
    return type_list

preprocessed_df['Type'] = preprocessed_df.apply(lambda x : make_list(x['Type 1'], x['Type 2']), axis=1)
preprocessed_df.head()

del preprocessed_df['Type 1']
del preprocessed_df['Type 2']

from sklearn.preprocessing import MultiLabelBinarizer

mlb = MultiLabelBinarizer()
preprocessed_df = preprocessed_df.join(pd.DataFrame(mlb.fit_transform(preprocessed_df.pop('Type')), columns=mlb.classes_))

preprocessed_df.head()
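
As a small illustration of what MultiLabelBinarizer does with lists of labels, here is a sketch on toy data. The three rows are hypothetical and only meant to show the shape of the output.

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical rows: each row is a list holding one or two types
toy_type_lists = [["Grass", "Poison"], ["Fire"], ["Fire", "Flying"]]

toy_mlb = MultiLabelBinarizer()
toy_matrix = toy_mlb.fit_transform(toy_type_lists)

print(toy_mlb.classes_)  # column order of the indicator matrix
print(toy_matrix)        # a row can contain more than one 1
```

Unlike plain one-hot encoding, a row with two types gets two 1s, so both types are kept in a single set of columns.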

# Encode only the 'Generation' column and keep every other column in the frame
preprocessed_df = pd.get_dummies(preprocessed_df, columns=['Generation'])
preprocessed_df.head()

Feature standardization

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scale_columns = ['Total', 'HP', 'Attack', 'Defense', 'Sp. Atk', 'Sp. Def', 'Speed']
preprocessed_df[scale_columns] = scaler.fit_transform(preprocessed_df[scale_columns])
preprocessed_df.head()
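
For reference, standardization here means subtracting each column's mean and dividing by its standard deviation. The sketch below checks this on a toy array (made-up numbers) and assumes the default StandardScaler settings.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical single-feature data
toy_values = np.array([[10.0], [20.0], [30.0], [40.0]])

toy_scaled = StandardScaler().fit_transform(toy_values)

# Same computation by hand: (x - mean) / std, with the population std (ddof=0)
toy_manual = (toy_values - toy_values.mean(axis=0)) / toy_values.std(axis=0)

print(np.allclose(toy_scaled, toy_manual))
```

After scaling, each column has mean 0 and standard deviation 1, which keeps features such as 'Total' and 'HP' on a comparable scale for the logistic regression model.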

Dataset split

from sklearn.model_selection import train_test_split

# Split the dataset into train and test sets
X = preprocessed_df.loc[:, preprocessed_df.columns != 'Legendary']
y = preprocessed_df['Legendary']
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=33)

print(x_train.shape)
print(x_test.shape)
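
One optional variation, not part of the course code: when the positive class is rare, passing stratify to train_test_split keeps the class ratio roughly the same in both splits. A minimal sketch on made-up labels:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 90 negatives and 10 positives
toy_X = np.arange(100).reshape(-1, 1)
toy_y = np.array([0] * 90 + [1] * 10)

# stratify=toy_y asks for the same positive ratio in train and test
tx_train, tx_test, ty_train, ty_test = train_test_split(
    toy_X, toy_y, test_size=0.25, random_state=33, stratify=toy_y
)

print(ty_train.mean(), ty_test.mean())
```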

3-2) Training a Logistic Regression model

Model training

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

lr = LogisticRegression(random_state=0)
lr.fit(x_train, y_train)

y_pred = lr.predict(x_test)
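
To make the prediction step less of a black box, the sketch below reproduces predict by hand for a binary model: a linear score is passed through the sigmoid function and thresholded at 0.5. It uses synthetic data from make_classification, not the course dataset.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary classification data (an assumption for illustration)
X_toy, y_toy = make_classification(n_samples=200, n_features=5, random_state=0)
toy_lr = LogisticRegression(random_state=0).fit(X_toy, y_toy)

# Linear score z = w . x + b, then the sigmoid maps it to a probability
z = X_toy @ toy_lr.coef_.ravel() + toy_lr.intercept_[0]
proba_manual = 1.0 / (1.0 + np.exp(-z))

# Predict class 1 when the probability exceeds 0.5
pred_manual = (proba_manual > 0.5).astype(int)

print(np.allclose(proba_manual, toy_lr.predict_proba(X_toy)[:, 1]))
print((pred_manual == toy_lr.predict(X_toy)).all())
```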

Model evaluation

print("accuracy: %.2f" % accuracy_score(y_test, y_pred))
print("Precision : %.3f" % precision_score(y_test, y_pred))
print("Recall : %.3f" % recall_score(y_test, y_pred))
print("F1 : %.3f" % f1_score(y_test, y_pred))

from sklearn.metrics import confusion_matrix

confmat = confusion_matrix(y_true=y_test, y_pred=y_pred)
print(confmat)
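
The four scores above can all be read off the confusion matrix. Here is a small sketch on made-up labels showing how precision, recall, and F1 follow from the TN/FP/FN/TP counts.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score

# Hypothetical ground truth and predictions
y_true_toy = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_pred_toy = np.array([0, 0, 0, 0, 0, 1, 1, 1, 0, 0])

# For binary labels, scikit-learn lays the matrix out as [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_true_toy, y_pred_toy).ravel()

precision = tp / (tp + fp)  # of the predicted positives, how many were right
recall = tp / (tp + fn)     # of the actual positives, how many were found
f1 = 2 * precision * recall / (precision + recall)

print(precision, recall, f1)
```

With an imbalanced target, accuracy can look high even when recall is poor, so it is worth checking the confusion matrix alongside the summary scores.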

3-3) Adjusting for class imbalance

preprocessed_df['Legendary'].value_counts()

1:1 sampling

positive_random_idx = preprocessed_df[preprocessed_df['Legendary']==1].sample(65, random_state=33).index.tolist()
negative_random_idx = preprocessed_df[preprocessed_df['Legendary']==0].sample(65, random_state=33).index.tolist()

positive_random_idx
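
The same 1:1 idea, shown on a toy frame so the effect on class counts is easy to see. The column names and sizes are hypothetical; the sketch keeps every minority row and draws an equal number of majority rows.

```python
import pandas as pd

# Hypothetical imbalanced data: 5 positives and 45 negatives
toy_df = pd.DataFrame({"label": [1] * 5 + [0] * 45, "value": range(50)})

toy_pos = toy_df[toy_df["label"] == 1]
# Undersample the majority class down to the size of the minority class
toy_neg = toy_df[toy_df["label"] == 0].sample(len(toy_pos), random_state=33)

toy_balanced = pd.concat([toy_pos, toy_neg])
print(toy_balanced["label"].value_counts())
```

The trade-off is that most majority-class rows are thrown away, so the model is trained and evaluated on a much smaller dataset.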

Dataset split

random_idx = positive_random_idx + negative_random_idx
X = preprocessed_df.loc[random_idx, preprocessed_df.columns != 'Legendary']
y = preprocessed_df['Legendary'][random_idx]
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=33)

print(x_train.shape)
print(x_test.shape)

Retraining the model

lr = LogisticRegression(random_state=0)
lr.fit(x_train, y_train)

y_pred = lr.predict(x_test)

print("accuracy: %.2f" % accuracy_score(y_test, y_pred))
print("Precision : %.3f" % precision_score(y_test, y_pred))
print("Recall : %.3f" % recall_score(y_test, y_pred))
print("F1 : %.3f" % f1_score(y_test, y_pred))

confmat = confusion_matrix(y_true=y_test, y_pred=y_pred)
print(confmat)

Fast Campus data analysis course link
bit.ly/3imy2uN

마진사의 공개적 사생활