[Fast Campus Course Review] Data Analysis Online Course 100% Refund Challenge, Mission 45
08. ch 02. Classification Analysis - 03. Classification Analysis and the Logistic Regression Model - 2
09. ch 02. Classification Analysis - 04. Using Logistic Regression
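3) Supervised-Learning-Based Classification Analysis
3-1) Data Preprocessing
Changing data types
# Assumes `df` and `import pandas as pd` from the earlier steps of the course notebook
df['Legendary'] = df['Legendary'].astype(int)
df['Generation'] = df['Generation'].astype(str)
# .copy() so that later column assignments act on an independent frame, not a slice of df
preprocessed_df = df[['Type 1', 'Type 2', 'Total', 'HP', 'Attack', 'Defense', 'Sp. Atk', 'Sp. Def', 'Speed', 'Generation', 'Legendary']].copy()
preprocessed_df.head()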
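One-hot encoding
# One-hot encode 'Type 1' alone, just to inspect the result
encoded_df = pd.get_dummies(preprocessed_df['Type 1'])
encoded_df.head()
# Combine 'Type 1' and 'Type 2' into one list per row, skipping a missing 'Type 2'
def make_list(x1, x2):
    type_list = []
    type_list.append(x1)
    if pd.notna(x2):
        type_list.append(x2)
    return type_list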
preprocessed_df['Type'] = preprocessed_df.apply(lambda x : make_list(x['Type 1'], x['Type 2']), axis=1)
preprocessed_df.head()
del preprocessed_df['Type 1']
del preprocessed_df['Type 2']
from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer()
preprocessed_df = preprocessed_df.join(pd.DataFrame(mlb.fit_transform(preprocessed_df.pop('Type')), columns=mlb.classes_))
preprocessed_df.head()
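# One-hot encode 'Generation' and join the dummy columns back onto the frame
preprocessed_df = preprocessed_df.join(pd.get_dummies(preprocessed_df.pop('Generation')))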
preprocessed_df.head()
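To make the multi-label step easier to follow, here is a minimal sketch on a three-row toy frame. The rows and the names `toy` and `encoded` are illustrative assumptions, not the course data; the calls mirror the ones used above.

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Toy stand-in for the 'Type 1' / 'Type 2' columns (values are illustrative).
toy = pd.DataFrame({
    "Type 1": ["Grass", "Fire", "Water"],
    "Type 2": ["Poison", None, "Flying"],
})

# Collect the non-missing types of each row into one list.
toy["Type"] = toy.apply(
    lambda row: [t for t in (row["Type 1"], row["Type 2"]) if pd.notna(t)],
    axis=1,
)

# Each distinct type becomes one 0/1 column; a row can have several 1s.
mlb = MultiLabelBinarizer()
encoded = pd.DataFrame(
    mlb.fit_transform(toy["Type"]),
    columns=mlb.classes_,
    index=toy.index,
)
print(encoded)
```

Unlike a plain `pd.get_dummies` on a single column, a row with two types gets two 1s here, so a dual-type entry is represented by both of its types.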
Feature standardization
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scale_columns = ['Total', 'HP', 'Attack', 'Defense', 'Sp. Atk', 'Sp. Def', 'Speed']
preprocessed_df[scale_columns] = scaler.fit_transform(preprocessed_df[scale_columns])
preprocessed_df.head()
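As a quick check of what the scaler does, the sketch below compares `StandardScaler` with a hand-computed z-score on made-up numbers. The array `values` and the name `manual` are assumptions for illustration only.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up values standing in for one numeric column such as 'HP'.
values = np.array([[45.0], [60.0], [80.0], [95.0]])

scaled = StandardScaler().fit_transform(values)

# Same result by hand: subtract the column mean, divide by the
# population standard deviation (ddof=0, NumPy's default).
manual = (values - values.mean(axis=0)) / values.std(axis=0)

print(scaled.ravel())
print(np.allclose(scaled, manual))
```

After scaling, each column has mean 0 and standard deviation 1, which keeps features with large ranges such as 'Total' from dominating the smaller ones.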
Splitting the dataset
from sklearn.model_selection import train_test_split
# Split the dataset into train and test sets
X = preprocessed_df.loc[:, preprocessed_df.columns != 'Legendary']
y = preprocessed_df['Legendary']
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=33)
print(x_train.shape)
print(x_test.shape)
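The effect of `test_size=0.25` and `random_state` can be seen on a small toy array. This is only a sketch; `X_toy`, `y_toy` and the 20-row size are assumptions chosen to keep the numbers easy to read.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 20 rows, 2 features (values are illustrative only).
X_toy = np.arange(40).reshape(20, 2)
y_toy = np.array([0] * 16 + [1] * 4)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=33
)
print(X_tr.shape, X_te.shape)  # (15, 2) (5, 2)

# Calling again with the same random_state reproduces the same split.
X_tr2, X_te2, _, _ = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=33
)
same_split = np.array_equal(X_te, X_te2)
print(same_split)
```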
3-2) Training the Logistic Regression Model
Model training
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
lr = LogisticRegression(random_state=0)
lr.fit(x_train, y_train)
y_pred = lr.predict(x_test)
Model evaluation
print("accuracy: %.2f" % accuracy_score(y_test, y_pred))
print("Precision : %.3f" % precision_score(y_test, y_pred))
print("Recall : %.3f" % recall_score(y_test, y_pred))
print("F1 : %.3f" % f1_score(y_test, y_pred))
from sklearn.metrics import confusion_matrix
confmat = confusion_matrix(y_true=y_test, y_pred=y_pred)
print(confmat)
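The four scores printed above all follow from the confusion matrix, which scikit-learn lays out as [[TN, FP], [FN, TP]] for 0/1 labels. The sketch below recomputes them by hand in plain Python; the helper `binary_metrics` and the toy labels are hypothetical and not from the course.

```python
def binary_metrics(y_true, y_pred):
    """Return confusion-matrix counts and derived metrics for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)

    # Guard against division by zero when a class is never predicted.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    accuracy = (tp + tn) / len(pairs)

    return {
        "matrix": [[tn, fp], [fn, tp]],  # same layout as confusion_matrix
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Toy labels: 6 negatives, 4 positives (made up for illustration).
y_true_toy = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred_toy = [0, 0, 0, 0, 0, 1, 1, 1, 0, 0]

m = binary_metrics(y_true_toy, y_pred_toy)
print(m["matrix"])  # [[5, 1], [2, 2]]
print("accuracy %.3f, precision %.3f, recall %.3f, f1 %.3f"
      % (m["accuracy"], m["precision"], m["recall"], m["f1"]))
```

In this toy case accuracy is 0.700 while recall is only 0.500, which shows why accuracy alone can look fine on imbalanced data even when half of the positive class is missed.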
3-3) Adjusting for Class Imbalance
preprocessed_df['Legendary'].value_counts()
1:1 sampling
positive_random_idx = preprocessed_df[preprocessed_df['Legendary']==1].sample(65, random_state=33).index.tolist()
negative_random_idx = preprocessed_df[preprocessed_df['Legendary']==0].sample(65, random_state=33).index.tolist()
positive_random_idx
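The same 1:1 undersampling idea can be written without hard-coding the sample size, by drawing the minority-class count from every class. This is a sketch on a toy frame using `groupby(...).sample`, a swapped-in pandas idiom rather than the course's index-list approach; `toy` and `balanced` are illustrative names.

```python
import pandas as pd

# Toy frame: 4 positive rows and 16 negative rows (made-up data).
toy = pd.DataFrame({
    "feature": range(20),
    "Legendary": [1] * 4 + [0] * 16,
})

# Size of the smaller class decides how many rows to keep per class.
n_minority = int(toy["Legendary"].value_counts().min())

# Draw n_minority rows from each class without replacement.
balanced = toy.groupby("Legendary").sample(n=n_minority, random_state=33)

print(balanced["Legendary"].value_counts())
```

The original row index is preserved in `balanced`, so it can be used with `.loc` in the same way as the index lists above.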
Splitting the dataset
random_idx = positive_random_idx + negative_random_idx
X = preprocessed_df.loc[random_idx, preprocessed_df.columns != 'Legendary']
y = preprocessed_df.loc[random_idx, 'Legendary']
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=33)
print(x_train.shape)
print(x_test.shape)
Retraining the model
lr = LogisticRegression(random_state=0)
lr.fit(x_train, y_train)
y_pred = lr.predict(x_test)
print("accuracy: %.2f" % accuracy_score(y_test, y_pred))
print("Precision : %.3f" % precision_score(y_test, y_pred))
print("Recall : %.3f" % recall_score(y_test, y_pred))
print("F1 : %.3f" % f1_score(y_test, y_pred))
confmat = confusion_matrix(y_true=y_test, y_pred=y_pred)
print(confmat)
Fast Campus data analysis course link
bit.ly/3imy2uN