Data file: Multiple linear regression.csv
Compare with the simple linear regression result from the earlier post.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm
import seaborn as sns
sns.set()
##use sklearn
from sklearn.linear_model import LinearRegression
data = pd.read_csv('C:\\Users\\Python_practice\\1.02. Multiple linear regression.csv')
data.describe() ##The dataset has 84 samples
| | SAT | GPA | Rand 1,2,3 |
|---|---|---|---|
| count | 84.000000 | 84.000000 | 84.000000 |
| mean | 1845.273810 | 3.330238 | 2.059524 |
| std | 104.530661 | 0.271617 | 0.855192 |
| min | 1634.000000 | 2.400000 | 1.000000 |
| 25% | 1772.000000 | 3.190000 | 1.000000 |
| 50% | 1846.000000 | 3.380000 | 2.000000 |
| 75% | 1934.000000 | 3.502500 | 3.000000 |
| max | 2050.000000 | 3.810000 | 3.000000 |
x = data[['SAT','Rand 1,2,3']]
y = data['GPA']
reg = LinearRegression()
reg.fit(x,y)
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)
reg.coef_
array([ 0.00165354, -0.00826982])
reg.intercept_ # We get y = 0.296 + 0.0017*SAT - 0.0083*(Rand 1,2,3)
0.29603261264909486
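With the coefficients and intercept in hand, the fitted model can also be used to predict GPA for new observations via `reg.predict`. The sketch below uses synthetic data standing in for the original CSV (the generating coefficients are assumptions for illustration, not the actual dataset):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical data standing in for the original CSV (not the actual file)
rng = np.random.default_rng(0)
sat = rng.integers(1634, 2051, size=84).astype(float)
rand = rng.integers(1, 4, size=84).astype(float)
gpa = 0.296 + 0.0017 * sat + rng.normal(0, 0.2, size=84)

x = pd.DataFrame({'SAT': sat, 'Rand 1,2,3': rand})
y = pd.Series(gpa, name='GPA')

reg = LinearRegression().fit(x, y)

# Predict GPA for a hypothetical student: SAT = 1700, Rand value = 1
# Passing a DataFrame with the same column names avoids a feature-name warning
new_obs = pd.DataFrame({'SAT': [1700.0], 'Rand 1,2,3': [1.0]})
prediction = reg.predict(new_obs)
print(prediction)
```

Note that `predict` expects the same two columns, in the same order, that the model was trained on.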
reg.score(x,y) ##This is R-squared, not adjusted R-squared; for multiple linear regression we usually report the adjusted version
0.4066811952814285
##Adding a feature with little explanatory power still increases R-squared.
##The adjusted R-squared penalizes this excess by accounting for the number of predictors:
R2_adj = 1 - (1 - R2) * (n - 1) / (n - p - 1)
x.shape ##n=84(the number of observations), p=2(the number of predictors)
(84, 2)
r2 = reg.score(x,y)
n = x.shape[0]
p = x.shape[1]
Rsquare_adj = 1 - (1 - r2)*(n - 1)/(n - p - 1)
Rsquare_adj
0.39203134825134023
##Conclusion: the adjusted R-squared is considerably lower than R-squared,
##so one or more of the predictors have little or no explanatory power.
##We need to eliminate those unnecessary features.
##A feature with p-value > 0.05 can be disregarded; sklearn provides f_regression for this
from sklearn.feature_selection import f_regression
f_regression(x,y) ##the second array contains the p-values, one per feature
(array([56.04804786, 0.17558437]), array([7.19951844e-11, 6.76291372e-01]))
p_value = f_regression(x,y)[1]
p_value
array([7.19951844e-11, 6.76291372e-01])
p_value.round(3) ##Round to 3 digits for readability
array([0. , 0.676])
##Make a conclusion table
reg_summary = pd.DataFrame(data = x.columns.values, columns = ['Features'])
reg_summary['Coefficients'] = reg.coef_
reg_summary['p-value'] = p_value.round(3)
reg_summary
| | Features | Coefficients | p-value |
|---|---|---|---|
| 0 | SAT | 0.001654 | 0.000 |
| 1 | Rand 1,2,3 | -0.008270 | 0.676 |
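Given the p-value of 0.676 for Rand 1,2,3, the natural next step is to drop it and refit with SAT alone, then compare the adjusted R-squared. A sketch of that step, again on synthetic stand-in data rather than the original CSV:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical data standing in for the original CSV (not the actual file)
rng = np.random.default_rng(0)
sat = rng.integers(1634, 2051, size=84).astype(float)
rand = rng.integers(1, 4, size=84).astype(float)
gpa = 0.296 + 0.0017 * sat + rng.normal(0, 0.2, size=84)
data = pd.DataFrame({'SAT': sat, 'Rand 1,2,3': rand, 'GPA': gpa})

# Refit using SAT only, after dropping the insignificant feature
x_simple = data[['SAT']]
y = data['GPA']
reg_simple = LinearRegression().fit(x_simple, y)

r2 = reg_simple.score(x_simple, y)
n, p = x_simple.shape
r2_adj = 1 - (1 - r2) * (n - 1) / (n - p - 1)
print(r2, r2_adj)
```

With only one informative predictor left, R-squared and adjusted R-squared should be nearly identical, confirming that the dropped feature contributed nothing.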