CSV data: `linear_regression.csv`
Compare with the multilinear regression result.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm
## beautify the plot made with matplotlib
import seaborn as sns
sns.set()
data = pd.read_csv('C:\\Users\\Python_practice\\1.01. Simple linear regression.csv') ## read the CSV file
data.describe() ## pandas method that returns the most useful descriptive statistics of the DataFrame
|  | SAT | GPA |
|---|---|---|
| count | 84.000000 | 84.000000 |
| mean | 1845.273810 | 3.330238 |
| std | 104.530661 | 0.271617 |
| min | 1634.000000 | 2.400000 |
| 25% | 1772.000000 | 3.190000 |
| 50% | 1846.000000 | 3.380000 |
| 75% | 1934.000000 | 3.502500 |
| max | 2050.000000 | 3.810000 |
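Before fitting anything, a quick numeric check of how linear the SAT/GPA relationship is can be done with the Pearson correlation. A minimal sketch, using a small hypothetical sample in place of the full CSV (the file path above is local):

```python
import pandas as pd

## Hypothetical mini-sample standing in for the SAT/GPA CSV
sample = pd.DataFrame({
    'SAT': [1634, 1772, 1846, 1934, 2050],
    'GPA': [2.40, 3.19, 3.38, 3.50, 3.81],
})

## Pearson correlation between the two columns
r = sample['SAT'].corr(sample['GPA'])
print(f'corr(SAT, GPA) = {r:.3f}')
```

A correlation close to 1 suggests a simple linear model is a reasonable starting point.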
## Predict GPA from SAT
## Reason: SAT is one of the best predictors of intellectual capacity
## Create a scatter plot of x1 and y with "matplotlib.pyplot"
y = data['GPA']
x1 = data['SAT']
plt.scatter(x1,y)
plt.xlabel('SAT',fontsize=20)
plt.ylabel('GPA',fontsize=20)
plt.show()
## use OLS (ordinary least squares) with "statsmodels.api"
x = sm.add_constant(x1.to_numpy()) ## add a constant column so the regression is y = b0 + b1*x1; convert x1 to a NumPy array first
result = sm.OLS(y,x).fit() ## apply a specific estimation technique (OLS) to obtain the fit of the model
result.summary()
| Dep. Variable: | GPA | R-squared: | 0.406 |
|---|---|---|---|
| Model: | OLS | Adj. R-squared: | 0.399 |
| Method: | Least Squares | F-statistic: | 56.05 |
| Date: | Fri, 10 Jan 2020 | Prob (F-statistic): | 7.20e-11 |
| Time: | 17:10:15 | Log-Likelihood: | 12.672 |
| No. Observations: | 84 | AIC: | -21.34 |
| Df Residuals: | 82 | BIC: | -16.48 |
| Df Model: | 1 | | |
| Covariance Type: | nonrobust | | |
|  | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | 0.2750 | 0.409 | 0.673 | 0.503 | -0.538 | 1.088 |
| x1 | 0.0017 | 0.000 | 7.487 | 0.000 | 0.001 | 0.002 |
| Omnibus: | 12.839 | Durbin-Watson: | 0.950 |
|---|---|---|---|
| Prob(Omnibus): | 0.002 | Jarque-Bera (JB): | 16.155 |
| Skew: | -0.722 | Prob(JB): | 0.000310 |
| Kurtosis: | 4.590 | Cond. No. | 3.29e+04 |
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 3.29e+04. This might indicate that there are
strong multicollinearity or other numerical problems.
## according to the results above, b1 = 0.0017 and b0 (const) = 0.275
## use matplotlib.pyplot to draw the regression line
plt.scatter(x1,y)
yhat = 0.0017*x1 + 0.275
fig = plt.plot(x1, yhat, lw=4, c='orange', label='regression line')
plt.xlabel('SAT', fontsize = 20)
plt.ylabel('GPA', fontsize = 20)
plt.show()