
[Time Series Analysis] Time Series Data Preprocessing in Practice (Python) (2) - Removing Multicollinearity

YSY · 2021. 3. 8. 16:26

[Time Series Analysis] Time Series Data Preprocessing Direction - Reflecting Time Reality, Scaling, and Handling Multicollinearity: ysyblog.tistory.com/217

This post is a hands-on continuation of the post above.

The data coding continues from the posts below.

[Time Series Analysis] Extracting Time Series Variables in Practice (Python) (1) - Time Series Decomposition (bike-sharing-demand dataset): ysyblog.tistory.com/209
[Time Series Analysis] Extracting Time Series Variables in Practice (Python) (2) - Moving Averages / Lags / Differences / Grouping (bike-sharing-demand dataset): ysyblog.tistory.com/210
[Time Series Analysis] Extracting Time Series Variables in Practice (Python) (3) - Visualizing the Relationships Between the Dependent and Independent Variables (bike-sharing-demand dataset): ysyblog.tistory.com/211
[Time Series Analysis] Extracting Time Series Variables in Practice (Python) (4) - Preparing Time Series Data (train/test split) (bike-sharing-demand dataset): ysyblog.tistory.com/212
[Time Series Analysis] Basic Modeling in Practice (Python) - Modeling and Performance Evaluation (bike-sharing-demand dataset): ysyblog.tistory.com/215
[Time Series Analysis] Time Series Data Preprocessing in Practice (Python) (1) - Reflecting Time Reality and Scaling: ysyblog.tistory.com/218

 

Checking for Multicollinearity

# correlation of every feature with casual / registered / count
raw_feR.corr().loc[:, ['casual', 'registered', 'count']].style.background_gradient().set_precision(2).set_properties(**{'font-size': '11pt'})
# count_trend, count_seasonal, count_Day, count_Week, count_diff, Hour, count_lag1, and count_lag2 are highly
# correlated with count (casual and registered are excluded since they are essentially the target itself).

# plot the autocorrelation (ACF) and partial autocorrelation (PACF) of every numeric variable
for col in raw_feR.describe().columns:
    target = raw_feR[col]
    figure, axes = plt.subplots(2, 1, figsize=(16, 10))
    sm.graphics.tsa.plot_acf(target, lags=100, use_vlines=True, ax=axes[0], title=col)
    sm.graphics.tsa.plot_pacf(target, lags=100, use_vlines=True, ax=axes[1], title=col)
    plt.show()

The autocorrelation of season is very high, while its partial autocorrelation is low.
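If you want the numbers behind the plots rather than a visual read, statsmodels also exposes the raw coefficients. A minimal sketch for the season column (the lag count of 24 is an arbitrary choice):

from statsmodels.tsa.stattools import acf, pacf

# numeric ACF/PACF values for the 'season' column, matching the plots above
acf_vals = acf(raw_feR['season'], nlags=24)
pacf_vals = pacf(raw_feR['season'], nlags=24)
print('lag-1 autocorrelation        :', round(acf_vals[1], 3))
print('lag-1 partial autocorrelation:', round(pacf_vals[1], 3))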

# extract effective features using the variance inflation factor (VIF)
from statsmodels.stats.outliers_influence import variance_inflation_factor

vif = pd.DataFrame()
vif['VIF_Factor'] = [variance_inflation_factor(X_train_feRS.values, i) 
                     for i in range(X_train_feRS.shape[1])]
vif['Feature'] = X_train_feRS.columns
vif.sort_values(by='VIF_Factor', ascending=True) 
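For intuition: the VIF of feature i is 1 / (1 − R²_i), where R²_i is the R² from regressing feature i on all the other features, so a high VIF means the feature is almost fully explained by the rest. A minimal sketch that reproduces statsmodels' computation by hand:

import numpy as np

def vif_by_hand(X, i):
    # auxiliary regression: column i on all remaining columns
    # (statsmodels' variance_inflation_factor does the same and does not add an
    #  intercept itself, matching this by-hand version exactly)
    y_i = X[:, i]
    X_others = np.delete(X, i, axis=1)
    r2 = sm.OLS(y_i, X_others).fit().rsquared
    return 1.0 / (1.0 - r2)

# should match variance_inflation_factor(X_train_feRS.values, 0)
print(vif_by_hand(X_train_feRS.values, 0))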

Code summary

def feature_engineering_XbyVIF(X_train, num_variables):
    # rank features by VIF and return the num_variables columns with the lowest VIF
    vif = pd.DataFrame()
    vif['VIF_Factor'] = [variance_inflation_factor(X_train.values, i) 
                         for i in range(X_train.shape[1])]
    vif['Feature'] = X_train.columns
    X_colname_vif = vif.sort_values(by='VIF_Factor', ascending=True)['Feature'][:num_variables].values
    return X_colname_vif
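A hypothetical usage example (the argument 5 is just an illustrative number):

# keep the 5 features with the lowest VIF
print(feature_engineering_XbyVIF(X_train_feRS, 5))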

 

Summary of results

The VIF table alone does not tell us how many columns should go into the model. So we add variables one at a time, starting from the lowest VIF, refit the model at each step, and check its performance.

# helper functions (feature_engineering, datasplit_ts, evaluation_trte, ...) come from the earlier posts in this series
raw_all = pd.read_csv(location)

# Feature Engineering
raw_fe = feature_engineering(raw_all)
### Reality ###
target = ['count_trend', 'count_seasonal', 'count_Day', 'count_Week', 'count_diff']
raw_feR = feature_engineering_year_duplicated(raw_fe, target)
###############

# Data Split
# Confirm of input and output
Y_colname = ['count']
X_remove = ['datetime', 'DateTime', 'temp_group', 'casual', 'registered']
X_colname = [x for x in raw_fe.columns if x not in Y_colname+X_remove]
X_train_feR, X_test_feR, Y_train_feR, Y_test_feR = datasplit_ts(raw_feR, Y_colname, X_colname, '2012-07-01')
### Reality ###
target = ['count_lag1', 'count_lag2']
X_test_feR = feature_engineering_lag_modified(Y_test_feR, X_test_feR, target)
###############
### Scaling ###
X_train_feRS, X_test_feRS = feature_engineering_scaling(preprocessing.Normalizer(), X_train_feR, X_test_feR)
###############

# add features one at a time in increasing-VIF order and score an OLS model at each step
eval_tr = pd.DataFrame()
eval_te = pd.DataFrame()
for i in tqdm(range(1,len(X_train_feRS.columns)+1)):
    X_colname_vif = feature_engineering_XbyVIF(X_train_feRS, i)
#     print('Number_of_Selected_X: ', len(X_colname_vif))
    X_train_feRSM, X_test_feRSM = X_train_feRS[X_colname_vif].copy(), X_test_feRS[X_colname_vif].copy()

    # Applying Base Model
    fit_reg1_feRSM = sm.OLS(Y_train_feR, X_train_feRSM).fit()
    pred_tr_reg1_feRSM = fit_reg1_feRSM.predict(X_train_feRSM).values
    pred_te_reg1_feRSM = fit_reg1_feRSM.predict(X_test_feRSM).values

    # Evaluation
    Score_reg1_feRSM, Resid_tr_reg1_feRSM, Resid_te_reg1_feRSM = evaluation_trte(Y_train_feR, pred_tr_reg1_feRSM,
                                                                       Y_test_feR, pred_te_reg1_feRSM, graph_on=False)
    eval_tr = pd.concat([eval_tr, Score_reg1_feRSM.loc[['Train']]], axis=0)
    eval_te = pd.concat([eval_te, Score_reg1_feRSM.loc[['Test']]], axis=0)
eval_tr.index = range(1,len(X_train_feRS.columns)+1)
eval_te.index = range(1,len(X_train_feRS.columns)+1)

plt.figure(figsize=(12,5))
plt.plot(eval_tr.index, eval_tr/eval_tr.max())
plt.legend(eval_tr.columns)
plt.title('Evaluation of Train Set')
plt.show()

plt.figure(figsize=(12,5))
plt.plot(eval_te.index, eval_te/eval_te.max())
plt.legend(eval_te.columns)
plt.title('Evaluation of Test Set')
plt.show()

From about the 13th variable on, the error stops decreasing. Using only 12 variables is therefore sufficient; the remaining variables add multicollinearity rather than information.
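Rather than reading the cutoff off the plot, it can be chosen programmatically. A minimal sketch, assuming evaluation_trte produces an RMSE column in eval_te (the column name 'RMSE' is an assumption):

# pick the smallest feature count whose test RMSE is within 1% of the best RMSE
rmse = eval_te['RMSE']                      # assumed column name from evaluation_trte
n_features = rmse[rmse <= rmse.min() * 1.01].index.min()
print('selected number of features:', n_features)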

 

Revised code (dropping high-VIF variables -> removing multicollinearity)

  • Model using only the 12 selected variables.
raw_all = pd.read_csv(location)

# Feature Engineering
raw_fe = feature_engineering(raw_all)
### Reality ###
target = ['count_trend', 'count_seasonal', 'count_Day', 'count_Week', 'count_diff']
raw_feR = feature_engineering_year_duplicated(raw_fe, target)
###############

# Data Split
# Confirm of input and output
Y_colname = ['count']
X_remove = ['datetime', 'DateTime', 'temp_group', 'casual', 'registered']
X_colname = [x for x in raw_fe.columns if x not in Y_colname+X_remove]
X_train_feR, X_test_feR, Y_train_feR, Y_test_feR = datasplit_ts(raw_feR, Y_colname, X_colname, '2012-07-01')
### Reality ###
target = ['count_lag1', 'count_lag2']
X_test_feR = feature_engineering_lag_modified(Y_test_feR, X_test_feR, target)
###############
### Scaling ###
X_train_feRS, X_test_feRS = feature_engineering_scaling(preprocessing.Normalizer(), X_train_feR, X_test_feR)
###############
### Multicollinearity ### keep only the 12 lowest-VIF variables
print('Number_of_Total_X: ', len(X_train_feRS.columns))
X_colname_vif = feature_engineering_XbyVIF(X_train_feRS, 12)
print('Number_of_Selected_X: ', len(X_colname_vif))
X_train_feRSM, X_test_feRSM = X_train_feRS[X_colname_vif].copy(), X_test_feRS[X_colname_vif].copy()
#########################

# Applying Base Model
fit_reg1_feRSM = sm.OLS(Y_train_feR, X_train_feRSM).fit()
display(fit_reg1_feRSM.summary())
pred_tr_reg1_feRSM = fit_reg1_feRSM.predict(X_train_feRSM).values
pred_te_reg1_feRSM = fit_reg1_feRSM.predict(X_test_feRSM).values

# Evaluation
Score_reg1_feRSM, Resid_tr_reg1_feRSM, Resid_te_reg1_feRSM = evaluation_trte(Y_train_feR, pred_tr_reg1_feRSM,
                                                                   Y_test_feR, pred_te_reg1_feRSM, graph_on=True)
display(Score_reg1_feRSM)

# Error Analysis
error_analysis(Resid_tr_reg1_feRSM, ['Error'], X_train_feRSM, graph_on=True)

 

 

Comparison of preprocessing results (base, reality, scaling, multicollinearity)

# comparison of precision
display(Score_reg1_rd)
display(Score_reg1_feR)
display(Score_reg1_feRS)
display(Score_reg1_feRSM)

The last model, with multicollinearity removed, performs best.

# comparison of metrics
comparison_r2 = pd.DataFrame([fit_reg1_rd.rsquared_adj, fit_reg1_feR.rsquared_adj, 
                              fit_reg1_feRS.rsquared_adj, fit_reg1_feRSM.rsquared_adj], 
                             index=['rd', 'feR', 'feRS', 'feRSM'], columns=['R^2_adj']).T
comparison_fvalue = pd.DataFrame([fit_reg1_rd.fvalue, fit_reg1_feR.fvalue, 
                                  fit_reg1_feRS.fvalue, fit_reg1_feRSM.fvalue], 
                                 index=['rd', 'feR', 'feRS', 'feRSM'], columns=['F-statistics']).T
comparison_fpvalue = pd.DataFrame([fit_reg1_rd.f_pvalue, fit_reg1_feR.f_pvalue, 
                                   fit_reg1_feRS.f_pvalue, fit_reg1_feRSM.f_pvalue], 
                                  index=['rd', 'feR', 'feRS', 'feRSM'], columns=['prob(F-stat.)']).T
comparison_aic = pd.DataFrame([fit_reg1_rd.aic, fit_reg1_feR.aic, 
                              fit_reg1_feRS.aic, fit_reg1_feRSM.aic], 
                             index=['rd', 'feR', 'feRS', 'feRSM'], columns=['aic']).T
comparison_bic = pd.DataFrame([fit_reg1_rd.bic, fit_reg1_feR.bic, 
                               fit_reg1_feRS.bic, fit_reg1_feRSM.bic], 
                              index=['rd', 'feR', 'feRS', 'feRSM'], columns=['bic']).T
pd.concat([comparison_r2, comparison_fvalue, comparison_fpvalue, comparison_aic, comparison_bic], axis=0)


Compared with the base model, the preprocessed models have a higher adjusted R² and lower AIC/BIC values.
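As a reminder of why lower is better for the information criteria: AIC = 2k − 2·ln(L̂) and BIC = k·ln(n) − 2·ln(L̂), where k is the number of estimated parameters, n the number of observations, and L̂ the maximized likelihood. Both reward fit but penalize extra parameters, and BIC penalizes model size more heavily as n grows.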

# comparison of coefficients
comparison_fit_rd = pd.concat([pd.DataFrame(fit_reg1_rd.params, columns=['coef']), 
                               pd.DataFrame(fit_reg1_rd.pvalues, columns=['prob(coef)'])], axis=1)
comparison_fit_feR = pd.concat([pd.DataFrame(fit_reg1_feR.params, columns=['coef']), 
                               pd.DataFrame(fit_reg1_feR.pvalues, columns=['prob(coef)'])], axis=1)
comparison_fit_feRS = pd.concat([pd.DataFrame(fit_reg1_feRS.params, columns=['coef']), 
                               pd.DataFrame(fit_reg1_feRS.pvalues, columns=['prob(coef)'])], axis=1)
comparison_fit_feRSM = pd.concat([pd.DataFrame(fit_reg1_feRSM.params, columns=['coef']), 
                               pd.DataFrame(fit_reg1_feRSM.pvalues, columns=['prob(coef)'])], axis=1)
pd.concat([comparison_fit_rd, comparison_fit_feR, comparison_fit_feRS, comparison_fit_feRSM], axis=1)

 

target = Y_train_feR.copy()
stationarity_adf_test(target, 'count')

 

  • Since the p-value is below 0.05, the null hypothesis of the ADF test (a unit root) is rejected, i.e., the series can be regarded as stationary.
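For reference, stationarity_adf_test was defined in an earlier post of this series; below is a minimal sketch of what such a helper might look like, assuming it simply wraps statsmodels' adfuller:

from statsmodels.tsa.stattools import adfuller

def stationarity_adf_test(Y_Data, Target_name):
    # null hypothesis of the ADF test: the series has a unit root (non-stationary)
    stat, pvalue, usedlag, nobs, crit, icbest = adfuller(Y_Data[Target_name])
    result = pd.DataFrame([[stat, pvalue, usedlag, nobs]], index=['ADF'],
                          columns=['Test Statistic', 'p-value', 'Used Lag', 'Used Observations'])
    display(result)
    return result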

Which of the three models (time-reality reflection, scaling, multicollinearity removal) performs best is ultimately something the analyst has to verify for themselves.

 

This post is a write-up of the FastCampus course <파이썬을 활용한 시계열 데이터 분석 A-Z 올인원 패키지> (Time Series Data Analysis with Python A-Z All-in-One Package).
