
Use sklearn to do multiple linear regression
(with a randomly generated variable added as a second feature)

Data file: Multiple linear regression.csv

Compare with the simple linear regression result: click here

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm
import seaborn as sns
sns.set() 
##use sklearn
from sklearn.linear_model import LinearRegression   
In [23]:
data = pd.read_csv('C:\\Users\\Python_practice\\1.02. Multiple linear regression.csv')   
In [24]:
data.describe()   ##The dataset has 84 samples
Out[24]:
               SAT        GPA  Rand 1,2,3
count    84.000000  84.000000   84.000000
mean   1845.273810   3.330238    2.059524
std     104.530661   0.271617    0.855192
min    1634.000000   2.400000    1.000000
25%    1772.000000   3.190000    1.000000
50%    1846.000000   3.380000    2.000000
75%    1934.000000   3.502500    3.000000
max    2050.000000   3.810000    3.000000
In [25]:
x = data[['SAT','Rand 1,2,3']]
y = data['GPA']
In [26]:
reg = LinearRegression()
reg.fit(x,y)
Out[26]:
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)
In [27]:
reg.coef_
Out[27]:
array([ 0.00165354, -0.00826982])
In [28]:
reg.intercept_   # We get y = 0.296 + 0.0017*SAT - 0.0083*(Rand 1,2,3)
Out[28]:
0.29603261264909486
In [29]:
reg.score(x,y)    ##This is R-squared, not adjusted R-squared. Usually we use the adjusted one to evaluate a multiple linear regression
Out[29]:
0.4066811952814285
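As a quick sanity check, R-squared is simply 1 - SS_res/SS_tot. A minimal sketch (not part of the original notebook, reusing the fitted reg, x and y from above):

y_hat = reg.predict(x)                  # predictions of the fitted model
ss_res = np.sum((y - y_hat)**2)         # residual sum of squares
ss_tot = np.sum((y - y.mean())**2)      # total sum of squares
1 - ss_res/ss_tot                       # should reproduce reg.score(x,y)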
In [30]:
##If we add a feature with little explanatory power, R-squared would still increase. 
##Thus we need to penalize this excessive usage through the adjusted R-squared 
 

Formula for adjusted R^2

$R^2_{adj.} = 1 - (1 - R^2)\cdot\dfrac{n-1}{n-p-1}$

In [31]:
x.shape  ##n = 84 (the number of observations), p = 2 (the number of predictors)
Out[31]:
(84, 2)
In [32]:
r2 = reg.score(x,y)
n = x.shape[0]
p = x.shape[1]
Rsquare_adj = 1 - (1 - r2)*(n-1)/(n-p-1)
Rsquare_adj
Out[32]:
0.39203134825134023
In [33]:
##Conclusion: Adjusted R-squared is considerably less than R-squared 
##Thus one or more of the predictors have little or no explanatory power
##We need to eliminate those unnecessary features  
##If a feature's p-value > 0.05, we disregard it; in sklearn, p-values come from f_regression
In [34]:
from sklearn.feature_selection import f_regression
In [35]:
f_regression(x,y)   ##the second array contains the p-values
Out[35]:
(array([56.04804786,  0.17558437]), array([7.19951844e-11, 6.76291372e-01]))
In [36]:
p_value = f_regression(x,y)[1]
p_value
Out[36]:
array([7.19951844e-11, 6.76291372e-01])
In [37]:
p_value.round(3)   ##We don't need so many digits; round to 3 decimal places
Out[37]:
array([0.   , 0.676])
 

We find that Rand 1,2,3 is a useless feature (its p-value of 0.676 is far above 0.05)
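Note that f_regression tests each feature against y one at a time. As a cross-check (a sketch using the statsmodels import above, not part of the original notebook), the same conclusion can be reached from an OLS fit that includes both predictors:

x_const = sm.add_constant(x)            # add an intercept column
ols_results = sm.OLS(y, x_const).fit()  # fit the same multiple regression with statsmodels
ols_results.pvalues                     # p-value for 'Rand 1,2,3' should again be far above 0.05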

In [38]:
##Make a conclusion table
reg_summary = pd.DataFrame(data = x.columns.values, columns = ['Features'])
reg_summary['Coefficients'] = reg.coef_ 
reg_summary['p-value'] = p_value.round(3)
reg_summary
Out[38]:
     Features  Coefficients  p-value
0         SAT      0.001654    0.000
1  Rand 1,2,3     -0.008270    0.676
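A natural next step (a sketch, not part of the original notebook, reusing data, y and LinearRegression from above) is to drop Rand 1,2,3 and refit on SAT alone; the adjusted R-squared should be roughly unchanged, confirming the dropped feature carried no explanatory power:

x_simple = data[['SAT']]                                 # keep only the significant feature
reg_simple = LinearRegression()
reg_simple.fit(x_simple, y)
r2_simple = reg_simple.score(x_simple, y)                # R-squared of the SAT-only model
n, p = x_simple.shape
r2_simple_adj = 1 - (1 - r2_simple)*(n - 1)/(n - p - 1)  # adjusted R-squared with p = 1
r2_simple, r2_simple_adj                                 # compare with the two-feature model above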