
Add a dummy variable (attendance) to examine
the relationship between SAT and GPA

Data file: Dummies.csv

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm
## beautify the plot made with matplotlib 
import seaborn as sns
sns.set()    
In [32]:
raw_data = pd.read_csv('C:\\Users\\Python_practice\\1.03. Dummies.csv')
In [33]:
raw_data
# Attendance is "Yes" if the student attended more than 75% of the lessons
Out[33]:
      SAT   GPA Attendance
0    1714  2.40         No
1    1664  2.52         No
2    1760  2.54         No
3    1685  2.74         No
4    1693  2.83         No
..    ...   ...        ...
79   1936  3.71        Yes
80   1810  3.71        Yes
81   1987  3.73         No
82   1962  3.76        Yes
83   2050  3.81        Yes

84 rows × 3 columns

In [34]:
# create a copy so the raw data stays unchanged when Yes/No is mapped to 0/1
data = raw_data.copy()
In [35]:
#change Yes/No into 0/1
data['Attendance'] = data['Attendance'].map({'Yes':1, 'No':0})
data.describe()
Out[35]:
               SAT        GPA  Attendance
count    84.000000  84.000000   84.000000
mean   1845.273810   3.330238    0.464286
std     104.530661   0.271617    0.501718
min    1634.000000   2.400000    0.000000
25%    1772.000000   3.190000    0.000000
50%    1846.000000   3.380000    0.000000
75%    1934.000000   3.502500    1.000000
max    2050.000000   3.810000    1.000000
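
Once the column is 0/1, its mean is the share of rows equal to 1; the 0.464286 in the table above corresponds to 39 of the 84 students. A minimal sketch of the same mapping on a toy frame (the five rows are hypothetical, not taken from Dummies.csv), which also counts labels the dictionary fails to cover:

```python
import pandas as pd

# Hypothetical toy data, not the contents of Dummies.csv
toy = pd.DataFrame({"Attendance": ["No", "Yes", "Yes", "No", "Yes"]})

# Same mapping as in the cell above
toy["Attendance"] = toy["Attendance"].map({"Yes": 1, "No": 0})

# A label missing from the dict (for example a lower-case "yes") would become NaN
unmapped = int(toy["Attendance"].isna().sum())

# The mean of a 0/1 column is the fraction of rows equal to 1
share_yes = toy["Attendance"].mean()
print(unmapped, share_yes)
```

Checking `unmapped` before fitting is a cheap guard, since NaN values in the dummy column would otherwise reach the regression.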
In [36]:
y = data['GPA']
x1 = data[['SAT','Attendance']]
In [37]:
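## fit an OLS (ordinary least squares) regression with "statsmodels.api"
## to_numpy() drops the column names, so the summary labels SAT as x1 and Attendance as x2
x = sm.add_constant(x1.to_numpy())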
result = sm.OLS(y,x).fit()
result.summary()
Out[37]:
OLS Regression Results

Dep. Variable:                    GPA   R-squared:                       0.565
Model:                            OLS   Adj. R-squared:                  0.555
Method:                 Least Squares   F-statistic:                     52.70
Date:                Fri, 24 Jan 2020   Prob (F-statistic):           2.19e-15
Time:                        15:48:14   Log-Likelihood:                 25.798
No. Observations:                  84   AIC:                            -45.60
Df Residuals:                      81   BIC:                            -38.30
Df Model:                           2
Covariance Type:            nonrobust

                 coef    std err          t      P>|t|      [0.025      0.975]
const          0.6439      0.358      1.797      0.076      -0.069       1.357
x1             0.0014      0.000      7.141      0.000       0.001       0.002
x2             0.2226      0.041      5.451      0.000       0.141       0.304

Omnibus:                       19.560   Durbin-Watson:                   1.009
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               27.189
Skew:                          -1.028   Prob(JB):                     1.25e-06
Kurtosis:                       4.881   Cond. No.                     3.35e+04


Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 3.35e+04. This might indicate that there are
strong multicollinearity or other numerical problems.
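
The model above puts the 0/1 Attendance column next to SAT. A minimal sketch of why the coefficient on such a column acts as a shift in the intercept, using synthetic noise-free data (the values 0.5, 0.0015 and 0.2 are made up for illustration) and `numpy.linalg.lstsq` in place of statsmodels:

```python
import numpy as np

# Synthetic, noise-free data (hypothetical numbers, not Dummies.csv)
sat = np.array([1600.0, 1700.0, 1800.0, 1900.0, 2000.0, 2100.0])
attendance = np.array([0.0, 1.0, 0.0, 1.0, 1.0, 0.0])
gpa = 0.5 + 0.0015 * sat + 0.2 * attendance

# Design matrix: constant, SAT, dummy (the column order sm.add_constant produces)
X = np.column_stack([np.ones_like(sat), sat, attendance])

# Ordinary least squares via numpy
coef, *_ = np.linalg.lstsq(X, gpa, rcond=None)
b0, b1, b2 = coef

# Both groups share the slope b1; only the intercept differs
intercept_no = b0        # Attendance = 0
intercept_yes = b0 + b2  # Attendance = 1
print(intercept_no, intercept_yes, b1)
```

Because the data contain no noise, the fit should recover the made-up coefficients up to floating-point error, so the two groups end up with parallel lines that are 0.2 apart.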
In [38]:
## from the summary above (x1 = SAT, x2 = Attendance dummy):
## yhat = 0.6439 + 0.0014*SAT + 0.2226*Attendance
## Dummy=0, yhat_no = 0.6439 + 0.0014*SAT
## Dummy=1, yhat_yes = 0.8665 + 0.0014*SAT   (0.6439 + 0.2226 = 0.8665)
## use matplotlib.pyplot to draw the regression lines

plt.scatter(data['SAT'],y,c=data['Attendance'],cmap='RdYlGn_r')
yhat_no = 0.6439 + 0.0014*data['SAT']
yhat_yes = 0.8665 + 0.0014*data['SAT']
yhat = 0.0017*data['SAT'] + 0.275   # regression line without the dummy variable (attendance)
fig = plt.plot(data['SAT'],yhat_no, lw=2, c='#006837')  # green line: Attendance = No
fig = plt.plot(data['SAT'],yhat_yes, lw=2, c='#a50026') # red line: Attendance = Yes
fig = plt.plot(data['SAT'], yhat, lw=2, c='orange', label='regression line') # orange line: no dummy
plt.xlabel('SAT', fontsize = 20) 
plt.ylabel('GPA', fontsize = 20)
plt.show()
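
The two equations in the comments above can be wrapped in a small helper for quick predictions. This is only a sketch: `predict_gpa` is a hypothetical name, and the coefficients are the rounded values copied from the summary table, so the results are approximate.

```python
# Rounded coefficients copied from the OLS summary (const, x1 = SAT, x2 = Attendance)
CONST, B_SAT, B_ATT = 0.6439, 0.0014, 0.2226

def predict_gpa(sat, attended):
    """Predicted GPA from the fitted equation; attended is True or False."""
    return CONST + B_SAT * sat + B_ATT * int(attended)

# Two students with the same SAT score who differ only in attendance
gpa_no = predict_gpa(1700, False)
gpa_yes = predict_gpa(1700, True)
gap = gpa_yes - gpa_no
print(round(gpa_no, 4), round(gpa_yes, 4), round(gap, 4))
```

For any SAT score the gap between the two predictions equals the dummy coefficient, 0.2226, which is the vertical distance between the green and red lines in the plot.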