[PYTHON]-(06) Data Mining [Multiple Regression Analysis]

2022-11-10

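Multiple Regression Analysis

In this post we look at multiple regression analysis, an extension of simple regression analysis.

The basic concept is the same as in simple regression; the only difference is that two or more independent variables are used instead of one.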
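To make the idea concrete, here is a minimal sketch on synthetic data. The true coefficients (2.0, 1.5, -0.7), the sample size, the seed, and the names `X_demo` and `y_demo` are illustrative assumptions, not values from this post. It fits y = b0 + b1*x1 + b2*x2 by ordinary least squares using NumPy only.

```python
import numpy as np

# Synthetic data: two predictors with a known linear relationship (illustrative values)
rng = np.random.default_rng(0)
n = 500
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y_demo = 2.0 + 1.5 * x1 - 0.7 * x2 + rng.normal(scale=0.1, size=n)

# Design matrix with an intercept column (the role sm.add_constant plays later in the post)
X_demo = np.column_stack([np.ones(n), x1, x2])

# Ordinary least squares: minimise ||y - X b||^2
beta, *_ = np.linalg.lstsq(X_demo, y_demo, rcond=None)
print(beta)  # estimates of [intercept, b1, b2]
```

With one predictor this reduces to simple regression; adding columns to the design matrix is all that changes.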

import statsmodels.api as sm
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
import scipy.stats
import math

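2. Multiple regression example: boston

[01] CRIM Per-capita crime rate by town
[02] ZN Proportion of residential land zoned for lots over 25,000 sq. ft.
[03] INDUS Proportion of non-retail business acres per town
[04] CHAS Charles River dummy variable (1 if the tract bounds the river, 0 otherwise)
[05] NOX Nitric oxides concentration (parts per 10 million)
[06] RM Average number of rooms per dwelling
[07] AGE Proportion of owner-occupied units built before 1940
[08] DIS Weighted distances to five Boston employment centres
[09] RAD Index of accessibility to radial highways
[10] TAX Full-value property-tax rate per $10,000
[11] PTRATIO Pupil-teacher ratio by town
[12] B 1000(Bk - 0.63)^2, where Bk is the proportion of Black residents by town
[13] LSTAT Percentage of the population with lower socioeconomic status
[14] MEDV Median value of owner-occupied homes (in $1,000s); stored as the Target column in this dataset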

boston = pd.read_csv('https://raw.githubusercontent.com/Sketchjar/MachineLearningHD/main/boston_data.csv')

boston.drop(['Unnamed: 0'], axis=1, inplace=True)
df = boston 

df.info()
df.describe()
df.hist(density=True)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 506 entries, 0 to 505
Data columns (total 14 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   CRIM     506 non-null    float64
 1   ZN       506 non-null    float64
 2   INDUS    506 non-null    float64
 3   CHAS     506 non-null    float64
 4   NOX      506 non-null    float64
 5   RM       506 non-null    float64
 6   AGE      506 non-null    float64
 7   DIS      506 non-null    float64
 8   RAD      506 non-null    float64
 9   TAX      506 non-null    float64
 10  PTRATIO  506 non-null    float64
 11  B        506 non-null    float64
 12  LSTAT    506 non-null    float64
 13  Target   506 non-null    float64
dtypes: float64(14)
memory usage: 55.5 KB





[Figure: histograms of the 14 variables: CRIM, ZN, INDUS, CHAS, NOX, RM, AGE, DIS, RAD, TAX, PTRATIO, B, LSTAT, Target]

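The data has no missing values, but many variables have skewed distributions, so both X and y are log-transformed (except the CHAS dummy variable).
Apart from a few outliers, the Target distribution is not seriously problematic, but it is log-transformed as well for accuracy and ease of interpretation.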
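As a rough illustration of why a log transform helps with skewed variables, the sketch below compares sample skewness before and after `np.log1p`. The log-normal sample stands in for a right-skewed column such as CRIM; the data, the seed, and the helper name `skewness` are assumptions made for this example only.

```python
import numpy as np

def skewness(a):
    """Sample skewness: the third standardised moment."""
    a = np.asarray(a, dtype=float)
    centered = a - a.mean()
    return (centered ** 3).mean() / (centered ** 2).mean() ** 1.5

# Hypothetical right-skewed variable (log-normal draws)
rng = np.random.default_rng(0)
skewed = rng.lognormal(mean=0.0, sigma=1.0, size=5000)

skew_before = skewness(skewed)
skew_after = skewness(np.log1p(skewed))  # log1p(x) = log(1 + x), safe at x = 0
print(skew_before, skew_after)
```

`log1p` is used rather than `log` so that zero values (ZN has many) do not produce negative infinity.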

2.1 EDA and preprocessing

# Extract only the numeric columns
df.select_dtypes([int,float]).columns

# Log transform (every column except the CHAS dummy)
df[df.columns.difference(['CHAS'])] = df[df.columns.difference(['CHAS'])].apply(np.log1p)

# Create the independent and dependent variables
X = df[df.select_dtypes([int,float]).columns].drop('Target', axis=1)
y = df['Target']

# Check the relationships between variables
grid = sns.pairplot(X)
grid = grid.map_upper(sns.regplot)
grid = grid.map_lower(sns.kdeplot, fill=True)

The skew in the variables has eased somewhat,
and several pairs of variables appear to have a linear relationship.

# Heatmap
sns.heatmap(X.corr(), annot=True, fmt='.2f', cmap = sns.color_palette("RdBu", 10, as_cmap=True))

[Figure: correlation heatmap of the predictors]

No pairs of variables show a correlation high enough to be a problem.
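Reading a 13-by-13 heatmap by eye is error-prone, so a small helper that lists the strongly correlated pairs can serve as a cross-check. The sketch below is one way to do that; the function name `high_corr_pairs`, the 0.8 threshold, and the synthetic demo columns are all assumptions, not part of the original analysis.

```python
import numpy as np

def high_corr_pairs(data, names, threshold=0.8):
    """Return (name_i, name_j, r) for column pairs with |r| >= threshold."""
    corr = np.corrcoef(data, rowvar=False)
    pairs = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if abs(corr[i, j]) >= threshold:
                pairs.append((names[i], names[j], float(corr[i, j])))
    return pairs

# Synthetic demo: "b" is nearly a copy of "a", "c" is independent of both
rng = np.random.default_rng(0)
a = rng.normal(size=300)
b = a + rng.normal(scale=0.1, size=300)
c = rng.normal(size=300)
demo = np.column_stack([a, b, c])

flagged = high_corr_pairs(demo, ["a", "b", "c"])
print(flagged)
```

On the data in this post it could be called as `high_corr_pairs(X.values, list(X.columns))`; an empty list would support the reading of the heatmap.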

# box plot
plt.figure(figsize=(15, 15))

for idx, col in enumerate(list(X)):
    plt.subplot(4, 4, idx+1)
    sns.boxplot(x=X[col])  # pass the data as a keyword argument (required from seaborn 0.12)


Outliers exist, but we have no domain knowledge of the data, and judging from the variables none of the values is implausible, so we do not remove them.
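The points a box plot draws beyond its whiskers follow the 1.5 x IQR rule (the seaborn default). As a sketch of how the same points could be counted numerically, the helper below flags values outside that range; `iqr_outlier_mask` and the sample array are hypothetical and only illustrate the rule.

```python
import numpy as np

def iqr_outlier_mask(values, k=1.5):
    """Boolean mask of values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)

# Hypothetical sample with one extreme value
sample = np.array([1.0, 2.0, 2.5, 3.0, 3.5, 4.0, 50.0])
mask = iqr_outlier_mask(sample)
print(sample[mask])  # → [50.]
```

Flagging a point this way says nothing about whether it is an error, which is why the decision above still rests on plausibility rather than on the rule itself.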

2.2 Fitting the regression model

# Create the independent and dependent variables
X = df[df.select_dtypes([int,float]).columns].drop('Target', axis=1)
X = sm.add_constant(X)
y = df['Target']

# Fit the regression model
model = sm.OLS(y,X)
fitted = model.fit()
fitted.summary()

OLS Regression Results
Dep. Variable:	Target	R-squared:	0.799
Model:	OLS	Adj. R-squared:	0.793
Method:	Least Squares	F-statistic:	150.0
Date:	Wed, 16 Nov 2022	Prob (F-statistic):	1.09e-161
Time:	09:30:41	Log-Likelihood:	168.22
No. Observations:	506	AIC:	-308.4
Df Residuals:	492	BIC:	-249.3
Df Model:	13
Covariance Type:	nonrobust

	coef	std err	t	P>|t|	[0.025	0.975]
const	6.1629	0.460	13.387	0.000	5.258	7.067
CRIM	-0.1391	0.019	-7.400	0.000	-0.176	-0.102
ZN	-0.0041	0.008	-0.534	0.593	-0.019	0.011
INDUS	-0.0111	0.023	-0.478	0.633	-0.057	0.035
CHAS	0.0894	0.032	2.783	0.006	0.026	0.152
NOX	-0.9453	0.241	-3.928	0.000	-1.418	-0.472
RM	0.3868	0.114	3.399	0.001	0.163	0.610
AGE	0.0272	0.021	1.325	0.186	-0.013	0.068
DIS	-0.2493	0.042	-5.932	0.000	-0.332	-0.167
RAD	0.1717	0.024	7.184	0.000	0.125	0.219
TAX	-0.1311	0.043	-3.064	0.002	-0.215	-0.047
PTRATIO	-0.6046	0.090	-6.731	0.000	-0.781	-0.428
B	0.0409	0.012	3.375	0.001	0.017	0.065
LSTAT	-0.4266	0.026	-16.512	0.000	-0.477	-0.376

Omnibus:	39.017	Durbin-Watson:	1.065
Prob(Omnibus):	0.000	Jarque-Bera (JB):	163.146
Skew:	0.119	Prob(JB):	3.74e-36
Kurtosis:	5.772	Cond. No.	665.

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

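The model's goodness of fit, measured by R-squared, is 0.799: it explains roughly 80% of the variance in the log-transformed Target.
The coefficients of ZN, INDUS, and AGE are not significant at the 0.05 level. Statistically it is reasonable to drop them, but from a business standpoint the decision deserves more thought. In this example we drop them and move on.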
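The summary above reports both R-squared and adjusted R-squared. As a sketch of how the two relate, the helper below applies the usual adjustment 1 - (1 - R^2)(n - 1)/(n - k - 1). The inputs (0.799, 506 observations, 13 predictors) are read from the summary; the function name `adjusted_r2` is hypothetical. Because the input R-squared is already rounded to three decimals, the result can differ from the reported 0.793 in the last digit.

```python
def adjusted_r2(r2, n_obs, n_predictors):
    """Adjusted R-squared: penalises R-squared for the number of predictors."""
    return 1.0 - (1.0 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)

# Values taken from the OLS summary of the 13-predictor model
adj = adjusted_r2(0.799, n_obs=506, n_predictors=13)
print(round(adj, 3))
```

Unlike R-squared, this quantity can go down when an uninformative predictor is added, which makes it the more useful figure when comparing models of different sizes.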

# Refit the regression model after dropping the non-significant variables
model = sm.OLS(y,X.drop(['ZN', 'INDUS', 'AGE'], axis=1))
fitted = model.fit()
fitted.summary()

OLS Regression Results
Dep. Variable:	Target	R-squared:	0.798
Model:	OLS	Adj. R-squared:	0.793
Method:	Least Squares	F-statistic:	194.9
Date:	Wed, 16 Nov 2022	Prob (F-statistic):	1.44e-164
Time:	09:32:49	Log-Likelihood:	166.97
No. Observations:	506	AIC:	-311.9
Df Residuals:	495	BIC:	-265.5
Df Model:	10
Covariance Type:	nonrobust

	coef	std err	t	P>|t|	[0.025	0.975]
const	6.1758	0.452	13.656	0.000	5.287	7.064
CRIM	-0.1406	0.019	-7.573	0.000	-0.177	-0.104
CHAS	0.0905	0.032	2.855	0.004	0.028	0.153
NOX	-0.9063	0.231	-3.931	0.000	-1.359	-0.453
RM	0.4211	0.111	3.807	0.000	0.204	0.638
DIS	-0.2698	0.036	-7.567	0.000	-0.340	-0.200
RAD	0.1723	0.024	7.214	0.000	0.125	0.219
TAX	-0.1424	0.041	-3.469	0.001	-0.223	-0.062
PTRATIO	-0.5873	0.080	-7.301	0.000	-0.745	-0.429
B	0.0415	0.012	3.432	0.001	0.018	0.065
LSTAT	-0.4156	0.024	-17.353	0.000	-0.463	-0.369

Omnibus:	38.516	Durbin-Watson:	1.049
Prob(Omnibus):	0.000	Jarque-Bera (JB):	161.816
Skew:	0.098	Prob(JB):	7.28e-36
Kurtosis:	5.763	Cond. No.	586.

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

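The coefficients show no problems (all are significant at the 0.05 level), and the small gap between adjusted R-squared and R-squared suggests the number of variables is appropriate.
The Notes at the bottom flag no multicollinearity or other warnings, so we settle on this model.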
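One additional way to check that dropping ZN, INDUS, and AGE together costs little is a likelihood-ratio test on the two log-likelihoods reported in the summaries above (168.22 and 166.97). This test is not part of the original analysis; it is a sketch using only the standard library. The closed-form survival function below is specific to 3 degrees of freedom, one per dropped variable, and the helper name `chi2_sf_df3` is hypothetical.

```python
import math

def chi2_sf_df3(x):
    """Survival function P(X > x) of a chi-square variable with 3 degrees of freedom."""
    return math.erfc(math.sqrt(x / 2.0)) + math.sqrt(2.0 * x / math.pi) * math.exp(-x / 2.0)

llf_full = 168.22     # log-likelihood of the 13-predictor model (first summary)
llf_reduced = 166.97  # log-likelihood after dropping ZN, INDUS, AGE (second summary)

# Likelihood-ratio statistic: twice the drop in log-likelihood
lr_stat = 2.0 * (llf_full - llf_reduced)
p_value = chi2_sf_df3(lr_stat)
print(lr_stat, p_value)
```

A p-value well above 0.05 indicates the three variables do not jointly improve the fit, which agrees with the lower AIC of the reduced model (-311.9 versus -308.4).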

2.3 Checking multicollinearity

# Multicollinearity
from statsmodels.stats.outliers_influence import variance_inflation_factor

X = X.drop(['ZN', 'INDUS', 'AGE'], axis=1)

vif = pd.DataFrame()
vif["VIF"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
vif['X'] = X.columns
vif.sort_values(by='VIF', ascending=False)

	VIF	X
0	3345.180580	const
1	5.889274	CRIM
6	5.262951	RAD
3	4.616675	NOX
7	4.297266	TAX
5	3.546188	DIS
10	2.721180	LSTAT
4	1.860059	RM
8	1.468802	PTRATIO
9	1.273420	B
2	1.059002	CHAS

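Since the coefficients are already significant, multicollinearity is less of a concern here, but no predictor has a VIF above 10, so there is no multicollinearity problem either. (The large VIF on const belongs to the intercept column and can be ignored.)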
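To show what the VIF measures, the sketch below computes it directly from its definition, VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing predictor j on the remaining predictors. The helper `vif_manual` and the synthetic columns are assumptions for illustration. Unlike `variance_inflation_factor`, which does not add an intercept on its own (hence the const column kept in X above), this helper adds one internally.

```python
import numpy as np

def vif_manual(data):
    """VIF for each column: regress it on the other columns plus an intercept."""
    data = np.asarray(data, dtype=float)
    n, p = data.shape
    result = []
    for j in range(p):
        target = data[:, j]
        others = np.column_stack([np.ones(n), np.delete(data, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ coef
        r2 = 1.0 - resid.var() / target.var()
        result.append(1.0 / (1.0 - r2))
    return np.array(result)

# Synthetic demo: "b" is nearly a copy of "a", "c" is independent of both
rng = np.random.default_rng(0)
a = rng.normal(size=300)
b = a + rng.normal(scale=0.1, size=300)
c = rng.normal(size=300)

vifs = vif_manual(np.column_stack([a, b, c]))
print(vifs)
```

The near-duplicate columns get a very large VIF while the independent one stays close to 1, which is the pattern the threshold of 10 is meant to catch.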

2.4 Results

Analyzing Boston house prices with multiple regression, the variables that affect the price are CRIM, CHAS, NOX, RM, DIS, RAD, TAX, PTRATIO, B, and LSTAT.

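To interpret a coefficient, take RM as an example: because both RM and the target are log-transformed, a 1% increase in RM corresponds to roughly a 0.42% increase in Boston house prices, holding the other variables fixed.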
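The percentage reading is an approximation, partly because the transform used was `log1p` rather than a plain logarithm. The sketch below works through one case using the RM coefficient from the final summary (0.4211). The starting values `rm = 6.0` and `price = 22.0` are hypothetical and chosen only to make the arithmetic concrete.

```python
import math

beta_rm = 0.4211  # RM coefficient from the final OLS summary
rm = 6.0          # hypothetical average number of rooms
price = 22.0      # hypothetical median price in $1,000s

# Change in log1p(RM) when RM rises by 1%
delta_log_rm = math.log1p(rm * 1.01) - math.log1p(rm)

# Implied change in log1p(price), holding the other predictors fixed
delta_log_price = beta_rm * delta_log_rm

# Back-transform to the original price scale
new_price = math.expm1(math.log1p(price) + delta_log_price)
pct_change = (new_price / price - 1.0) * 100.0
print(round(pct_change, 3))
```

For these inputs the implied change is somewhat below 0.42%, so the coefficient is best read as an approximate elasticity whose accuracy depends on the level of the variables.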

The explanatory power (R-squared) of the model built from these variables is 0.798.
