
Linear regression between SAT and GPA

CSV data: linear_regression.csv

Compare with the multiple linear regression result: click here

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm
## beautify the plot made with matplotlib 
import seaborn as sns
sns.set()    
In [9]:
data = pd.read_csv('C:\\Users\\Python_practice\\1.01. Simple linear regression.csv')   ## read the CSV file
In [10]:
data.describe()   ## pandas method that returns the most useful descriptive statistics of the DataFrame
Out[10]:
               SAT        GPA
count    84.000000  84.000000
mean   1845.273810   3.330238
std     104.530661   0.271617
min    1634.000000   2.400000
25%    1772.000000   3.190000
50%    1846.000000   3.380000
75%    1934.000000   3.502500
max    2050.000000   3.810000
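As a quick sanity check, the figures `describe()` reports can be reproduced directly with NumPy. The sketch below uses a small synthetic sample (an assumption, not the SAT column) just to show which NumPy calls match which rows of the table:

```python
import numpy as np

## synthetic sample standing in for a column such as data['SAT']
sample = np.array([1700.0, 1800.0, 1900.0, 2000.0])

count = sample.size
mean = sample.mean()
std = sample.std(ddof=1)   ## describe() reports the sample std (ddof=1), not the population std
q25, q50, q75 = np.percentile(sample, [25, 50, 75])

print(count, mean, std, q25, q50, q75)
```

Note the `ddof=1`: `np.std` defaults to the population standard deviation, so it will not match pandas unless the degrees of freedom are adjusted.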
In [11]:
## Predict GPA from SAT
## Reason: SAT is one of the best estimators of intellectual capacity
## Create a scatter plot of x1 vs y with matplotlib.pyplot
y = data['GPA']
x1 = data['SAT']
plt.scatter(x1,y)
plt.xlabel('SAT',fontsize=20)
plt.ylabel('GPA',fontsize=20)
plt.show()
 
In [12]:
## fit an OLS (ordinary least squares) model with statsmodels.api
x = sm.add_constant(x1.to_numpy())  ## build the regression  y = b0 + b1*x1 -- add_constant adds the intercept column; convert x1 to a NumPy array
result = sm.OLS(y,x).fit()   ## apply the OLS estimation technique to obtain the fitted model
result.summary()
Out[12]:
                            OLS Regression Results
==============================================================================
Dep. Variable:                    GPA   R-squared:                       0.406
Model:                            OLS   Adj. R-squared:                  0.399
Method:                 Least Squares   F-statistic:                     56.05
Date:                Fri, 10 Jan 2020   Prob (F-statistic):           7.20e-11
Time:                        17:10:15   Log-Likelihood:                 12.672
No. Observations:                  84   AIC:                            -21.34
Df Residuals:                      82   BIC:                            -16.48
Df Model:                           1
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          0.2750      0.409      0.673      0.503      -0.538       1.088
x1             0.0017      0.000      7.487      0.000       0.001       0.002
==============================================================================
Omnibus:                       12.839   Durbin-Watson:                   0.950
Prob(Omnibus):                  0.002   Jarque-Bera (JB):               16.155
Skew:                          -0.722   Prob(JB):                     0.000310
Kurtosis:                       4.590   Cond. No.                     3.29e+04
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 3.29e+04. This might indicate that there are
strong multicollinearity or other numerical problems.

In [13]:
## from the summary above, the slope is b1 = 0.0017 and the intercept is b0 = 0.275 (the const row)
## use matplotlib.pyplot to draw the regression line
plt.scatter(x1,y)
yhat = 0.0017*x1 + 0.275
fig = plt.plot(x1, yhat, lw=4, c='orange', label='regression line')
plt.xlabel('SAT', fontsize = 20) 
plt.ylabel('GPA', fontsize = 20)
plt.show()