CSV data: linear_regression.csv
Compare with the multiple linear regression result: click here
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm
## beautify the plot made with matplotlib
import seaborn as sns
sns.set()
data = pd.read_csv('C:\\Users\\Python_practice\\1.01. Simple linear regression.csv')
##read csv file
data.describe() ## Pandas method to give the most useful descriptive statistics in the data frame
| | SAT | GPA |
|---|---|---|
| count | 84.000000 | 84.000000 |
| mean | 1845.273810 | 3.330238 |
| std | 104.530661 | 0.271617 |
| min | 1634.000000 | 2.400000 |
| 25% | 1772.000000 | 3.190000 |
| 50% | 1846.000000 | 3.380000 |
| 75% | 1934.000000 | 3.502500 |
| max | 2050.000000 | 3.810000 |
## Predict GPA from SAT
## Reason: SAT is regarded as one of the best estimators of intellectual capacity
## Create a scatter plot of x1 (SAT) against y (GPA) with "matplotlib.pyplot"
y = data['GPA']
x1 = data['SAT']
plt.scatter(x1,y)
plt.xlabel('SAT',fontsize=20)
plt.ylabel('GPA',fontsize=20)
plt.show()
## Use OLS (ordinary least squares) from "statsmodels.api" to fit the regression
x = sm.add_constant(x1.to_numpy()) ## Convert x1 to a NumPy array and prepend a column of ones, so the model is y = b0 + b1*x1 with intercept b0
result = sm.OLS(y,x).fit() ## Apply the OLS estimation technique to obtain the fit of the model
result.summary()
| Dep. Variable: | GPA | R-squared: | 0.406 |
|---|---|---|---|
| Model: | OLS | Adj. R-squared: | 0.399 |
| Method: | Least Squares | F-statistic: | 56.05 |
| Date: | Fri, 10 Jan 2020 | Prob (F-statistic): | 7.20e-11 |
| Time: | 17:10:15 | Log-Likelihood: | 12.672 |
| No. Observations: | 84 | AIC: | -21.34 |
| Df Residuals: | 82 | BIC: | -16.48 |
| Df Model: | 1 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | 0.2750 | 0.409 | 0.673 | 0.503 | -0.538 | 1.088 |
| x1 | 0.0017 | 0.000 | 7.487 | 0.000 | 0.001 | 0.002 |

| Omnibus: | 12.839 | Durbin-Watson: | 0.950 |
|---|---|---|---|
| Prob(Omnibus): | 0.002 | Jarque-Bera (JB): | 16.155 |
| Skew: | -0.722 | Prob(JB): | 0.000310 |
| Kurtosis: | 4.590 | Cond. No. | 3.29e+04 |
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 3.29e+04. This might indicate that there are
strong multicollinearity or other numerical problems.
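
For a single predictor, the `const` and `x1` coefficients and the R-squared value in the summary can be reproduced by hand from the least-squares formulas: b1 = sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2), b0 = mean(y) - b1 * mean(x), and R-squared = 1 - SS_res / SS_tot. The sketch below is only an illustration of those formulas. The SAT/GPA values are made up rather than read from the CSV file, and the helper names `simple_ols` and `r_squared` are hypothetical.

```python
import numpy as np

# Hypothetical SAT/GPA pairs for illustration only; not taken from the CSV file.
sat = np.array([1700.0, 1750.0, 1800.0, 1850.0, 1900.0, 1950.0, 2000.0])
gpa = np.array([3.05, 3.20, 3.25, 3.40, 3.38, 3.55, 3.60])


def simple_ols(x, y):
    """Return (intercept b0, slope b1) that minimise the sum of squared residuals."""
    x_mean, y_mean = x.mean(), y.mean()
    b1 = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
    b0 = y_mean - b1 * x_mean
    return b0, b1


def r_squared(x, y, b0, b1):
    """Share of the variance of y explained by the fitted line."""
    ss_res = np.sum((y - (b0 + b1 * x)) ** 2)  # residual sum of squares
    ss_tot = np.sum((y - y.mean()) ** 2)       # total sum of squares
    return 1.0 - ss_res / ss_tot


b0, b1 = simple_ols(sat, gpa)
r2 = r_squared(sat, gpa, b0, b1)
print(b0, b1, r2)
```

Running the same two functions on `data['SAT']` and `data['GPA']` should give values that agree with the summary table up to rounding.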
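## From the summary above: slope b1 = 0.0017 and intercept b0 = 0.275 (both rounded for display)
## Take the unrounded coefficients from the fit: with SAT near 1850, the rounded slope 0.0017 would put the line about 0.08 GPA points too high
b0, b1 = result.params
## Use matplotlib.pyplot to draw the regression line over the scatter plot
plt.scatter(x1,y)
yhat = b1*x1 + b0
plt.plot(x1, yhat, lw=4, c='orange', label='regression line')
plt.xlabel('SAT', fontsize = 20)
plt.ylabel('GPA', fontsize = 20)
plt.legend() ## show the 'regression line' label
plt.show()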
