(updated 12/7/10)
Chapter 10 Notes
©2008 by S. Gramlich
Correlation and Regression
! = Important Note
! These Notes are not meant to replace
10-2
Correlation (Pearson Product Moment) = measures strength of linear relationship between 2 variables
population parameter is ρ, sample statistic is r
correlation can only be a value between -1 & +1, that is, -1 <= r <= +1
2 ways to inspect: 1) Scatterplot or 2) formula
Scatterplot (Scatter Diagram) = plot (x,y) ordered pairs like in Algebra
horizontal axis represents the independent variable (X-values or explanatory or predictor)
vertical axis represents the dependent variable (Y-values or response or criterion)
if the pattern of the scatterplot rises from your left to right (like a + slope) then r will be positive (close to +1 when the points fall tightly along a line)
if the pattern of the scatterplot falls from your left to right (like a - slope) then r will be negative (close to -1 when the points fall tightly along a line)
if there is no visible linear pattern then r will be close to 0
r = sxy / (sx * sy)
where
sxy = covariance = SSxy / (n-1) = Σ[(x-xbar)*(y-ybar)] / (n-1)
SSxy = sum of cross products = Σ[(x-xbar)*(y-ybar)]
sx = standard deviation for x = √[Σ(x-xbar)^2 / (n-1)]
sy = standard deviation for y = √[Σ(y-ybar)^2 / (n-1)]
! I prefer a variation of the formula on p. 529, whereas the text uses formula 10-1 on page 520
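As a rough check on the formula above, here is a minimal Python sketch that computes r step by step. The five (x, y) pairs are made up for illustration and do not come from the text; the variable names are my own.

```python
from math import sqrt

# Hypothetical illustration data (not from the notes or the text)
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

n = len(x)
xbar = sum(x) / n
ybar = sum(y) / n

# SSxy = sum of cross products
ss_xy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))

# covariance and the two sample standard deviations
s_xy = ss_xy / (n - 1)
s_x = sqrt(sum((xi - xbar) ** 2 for xi in x) / (n - 1))
s_y = sqrt(sum((yi - ybar) ** 2 for yi in y) / (n - 1))

r = s_xy / (s_x * s_y)
print(round(r, 4))  # 0.7746
```

The points rise from left to right but do not sit exactly on a line, so r is positive and less than +1.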
To see if there is a significant relationship between 2 variables, perform a correlation hypothesis test (HT):
State Hypotheses:
H0: ρ = 0
H1: ρ ≠ 0
! (2 tailed test)
Find the critical value (cv) from Table A-6
r will be used as the test statistic (ts)
Traditional Decision Rule (Compare r to cv):
If r falls inside the critical region (|r| > cv), Reject Null (significant relationship)
If r falls outside the critical region (|r| <= cv), Fail to Reject Null (relationship not significant)
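The decision rule can be sketched in a few lines of Python. The function name is hypothetical, the value of r is an illustrative sample correlation, and the critical value 0.878 is assumed to be the two-tailed entry for n = 5 and alpha = 0.05; confirm it against your own copy of Table A-6.

```python
def correlation_decision(r, cv):
    """Two-tailed rule: reject H0 (rho = 0) when |r| exceeds the critical value."""
    if abs(r) > cv:
        return "Reject H0: significant linear relationship"
    return "Fail to reject H0: relationship not significant"

r = 0.7746   # hypothetical sample correlation from n = 5 pairs
cv = 0.878   # assumed table value for n = 5, alpha = 0.05 (check Table A-6)

decision = correlation_decision(r, cv)
print(decision)  # Fail to reject H0: relationship not significant
```

Because the test is two-tailed, a strongly negative r (for example -0.95) would also land in the critical region.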
10-3
Regression Line = the line that best fits through the scatterplot; it minimizes the sum of the squared vertical distances between the observed y values and the y values on the line (least squares property)
Yhat = b0 + b1X {recall Y = mX + b from Algebra}
b1 = slope, found by b1 = SSxy / SSx
where SSxy = sum of cross products = Σ[(x-xbar)*(y-ybar)]
and SSx = Sum of Squares for X = Σ(x-xbar)^2
! I prefer using this formula whereas the text uses formula 10-2 on page 542
b0 = y-intercept (or constant) found by b0 = ybar - b1 * xbar
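A minimal Python sketch of the slope and intercept formulas, using made-up data (not from the text) and variable names of my own choosing:

```python
# Hypothetical illustration data (not from the notes or the text)
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

n = len(x)
xbar = sum(x) / n
ybar = sum(y) / n

ss_xy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))  # sum of cross products
ss_x = sum((xi - xbar) ** 2 for xi in x)                        # sum of squares for X

b1 = ss_xy / ss_x        # slope
b0 = ybar - b1 * xbar    # y-intercept

print(round(b1, 4), round(b0, 4))  # 0.6 2.2
```

For this data the regression line would be Yhat = 2.2 + 0.6X.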
Only find the Regression line if there is a significant linear relationship between the iv and dv from the correlation HT above.
The regression line is used to predict Y for other values of X, and Yhat is called the Predicted Value. If the correlation isn't significant then the best predicted value for any X is just the mean of Y (Ybar).
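The prediction rule can be written as a small Python function. This is only a sketch: the function name and the `significant` flag are hypothetical, and the coefficients and mean are illustrative values rather than results from the text.

```python
def predict(x_new, b0, b1, ybar, significant):
    """Use the regression line only when the correlation HT was significant."""
    if significant:
        return b0 + b1 * x_new   # Yhat from the regression line
    return ybar                  # otherwise fall back to the mean of Y

# Hypothetical fitted values: Yhat = 2.2 + 0.6X, with Ybar = 4.0
with_line = predict(6, b0=2.2, b1=0.6, ybar=4.0, significant=True)
with_mean = predict(6, b0=2.2, b1=0.6, ybar=4.0, significant=False)
print(round(with_line, 4), with_mean)  # 5.8 4.0
```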
10-4
residual (error) = difference between the observed Y and Predicted Y (e = y - yhat)
ANOVA (Analysis of Variance) = analyzes the variance between the observed (y), predicted (yhat), and mean (ybar)
Sum of Squares Total: SST = Σ(y-ybar)^2 {same SST from chapter 3}
Sum of Squares Regression (Explained or Model): SSreg = Σ(yhat-ybar)^2
Sum of Squares Residual (Unexplained or Error): SSres = Σ(y-yhat)^2
in regression, SST = SSreg + SSres
Coefficient of Determination = R^2 = tells how much of the variance in Y is explained by X
also R^2 = SSreg / SST or just square the correlation
Standard error of estimate = standard deviation of the residuals
se = √[Σ(y-yhat)^2 / (n-k-1)] = √(SSres / DFres)
where k = # of predictors (k = 1 in simple regression) and DFres = n-k-1
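The sums of squares, R^2, and the standard error of estimate can be checked with a short Python sketch. The data are made up for illustration (not from the text), and the variable names are my own.

```python
from math import sqrt

# Hypothetical illustration data (not from the notes or the text)
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n, k = len(x), 1  # k = 1 predictor in simple regression

xbar, ybar = sum(x) / n, sum(y) / n
b1 = (sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
      / sum((xi - xbar) ** 2 for xi in x))
b0 = ybar - b1 * xbar
yhat = [b0 + b1 * xi for xi in x]  # predicted values on the line

sst = sum((yi - ybar) ** 2 for yi in y)                  # total
ss_reg = sum((yh - ybar) ** 2 for yh in yhat)            # explained
ss_res = sum((yi - yh) ** 2 for yi, yh in zip(y, yhat))  # unexplained

r_squared = ss_reg / sst
s_e = sqrt(ss_res / (n - k - 1))

print(round(sst, 4), round(ss_reg, 4), round(ss_res, 4))  # 6.0 3.6 2.4
print(round(r_squared, 4), round(s_e, 4))                 # 0.6 0.8944
```

Note that 3.6 + 2.4 = 6.0, which matches SST = SSreg + SSres.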
10-5
Simple Linear Regression = finding the regression line when there is only 1 Y variable and 1 X variable
Multiple Regression = finding the regression line when there is 1 Y variable and more than 1 X variable
Yhat = b0 + b1X1 + b2X2 + ... + bkXk
k = # of predictors (x variables)
b0 = y-intercept (or constant)
b1 = slope of variable X1
b2 = slope of variable X2
bk = slope of variable Xk
The formulas to find the b values are beyond the scope of this course, so we use Technology (e.g., StatCrunch or Excel) instead.
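For the curious only (this is not required for the course): one standard way to get least-squares b values is to solve the normal equations (X'X)b = X'y. The Python sketch below does that with plain Gauss-Jordan elimination. The function name and the data are hypothetical, and this is not a claim about how StatCrunch or Excel compute their output.

```python
def fit_multiple_regression(rows, y):
    """Least-squares coefficients [b0, b1, ..., bk] via the normal equations."""
    # Design matrix: a leading 1 for the intercept, then the k predictor values
    X = [[1.0] + list(row) for row in rows]
    p = len(X[0])
    # Augmented system [X'X | X'y]
    A = [[sum(xr[i] * xr[j] for xr in X) for j in range(p)]
         + [sum(xr[i] * yi for xr, yi in zip(X, y))]
         for i in range(p)]
    # Gauss-Jordan elimination with partial pivoting
    # (assumes the predictors are not perfectly collinear)
    for col in range(p):
        pivot = max(range(col, p), key=lambda idx: abs(A[idx][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for row in range(p):
            if row != col:
                factor = A[row][col] / A[col][col]
                A[row] = [a - factor * c for a, c in zip(A[row], A[col])]
    return [A[i][p] / A[i][i] for i in range(p)]

# Hypothetical data built exactly from Y = 1 + 2*X1 + 3*X2 (no noise)
rows = [(1, 2), (2, 1), (3, 4), (4, 3), (5, 6)]
y = [1 + 2 * x1 + 3 * x2 for x1, x2 in rows]

b0, b1, b2 = fit_multiple_regression(rows, y)
print(round(b0, 6), round(b1, 6), round(b2, 6))
```

Because the data were generated without noise, the recovered coefficients should come out very close to 1, 2, and 3.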
From the Technology output, we can identify the Regression Line.
"Constant" is the y-intercept (b0) and the rest of the Coefficients represent the slopes (b1,b2 ....bk).
An HT must be employed to see if there is an overall significant relationship between all the X variables and Y.
If there is (found by looking at the ANOVA table below), then proceed with using the Regression line for Prediction.
ANOVA TABLE
SOURCE       SS      DF       Mean Square (like Variance)   F (test statistic)   P-Value
Regression   SSreg   k        MSreg = SSreg / k             F = MSreg / MSres
Residual     SSres   n-k-1    MSres = SSres / (n-k-1)
Total        SST     n-1
use the P-value from the ANOVA table to make the decision:
if p-val <= alpha, Reject null (significant relationship)
if p-val > alpha, Fail to Reject null (relationship not significant)
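The Mean Square and F entries of the ANOVA table are simple arithmetic, sketched here in Python. The function name and the input values are hypothetical. The P-value itself requires the F distribution, so it is left to Technology in this sketch.

```python
def anova_f(ss_reg, ss_res, n, k):
    """Return (MSreg, MSres, F) from the sums of squares, sample size n, and k predictors."""
    df_reg = k
    df_res = n - k - 1
    ms_reg = ss_reg / df_reg
    ms_res = ss_res / df_res
    return ms_reg, ms_res, ms_reg / ms_res

# Hypothetical values: SSreg = 3.6, SSres = 2.4, n = 5 observations, k = 1 predictor
ms_reg, ms_res, f_stat = anova_f(ss_reg=3.6, ss_res=2.4, n=5, k=1)
print(round(ms_reg, 4), round(ms_res, 4), round(f_stat, 4))  # 3.6 0.8 4.5
```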
Guidelines for finding the best combination of variables
when comparing different combinations of the predictors look at:
1) lowest P-value (and must be significant)
2) highest Adjusted R^2 (a new R^2 adjusted to take into account # of predictors and sample size)
3) use the equation with the fewest variables if the adjusted R^2 values are the same
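Technology reports Adjusted R^2 directly, but the usual formula is Adjusted R^2 = 1 - (1 - R^2)(n-1)/(n-k-1). That formula is the standard one and is not given in these notes. The Python sketch below applies it to two made-up candidate models; the function name and all numbers are hypothetical.

```python
def adjusted_r_squared(r_squared, n, k):
    """Standard adjustment of R^2 for sample size n and k predictors."""
    return 1 - (1 - r_squared) * (n - 1) / (n - k - 1)

# Hypothetical candidate models fit to the same n = 20 observations
model_a = adjusted_r_squared(0.80, n=20, k=2)  # 2 predictors
model_b = adjusted_r_squared(0.81, n=20, k=4)  # 4 predictors, slightly higher R^2
print(round(model_a, 4), round(model_b, 4))  # 0.7765 0.7593
```

Here the 4-predictor model has the higher raw R^2, but the 2-predictor model has the higher Adjusted R^2, so the extra predictors do not earn their keep.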
TECHNOLOGY
using StatCrunch:
! data has to be entered in columns in StatCrunch spreadsheet
for Correlation HT:
Stat - Summary Stats - Correlation - (Select Columns) - Next - (check Display 2 sided P-val from sig test) - Calculate
for Regression HT:
Stat - Regression - Multiple Linear - (select X Variables & Y Variable column) - Calculate
Excel commands:
! highlight the data you want to calculate inside the parentheses
independent variable data set = iv, dependent variable data set = dv
Statistic            Excel Command
Correlation          =correl(dv,iv)
Slope                =slope(dv,iv)
Y-intercept          =intercept(dv,iv)
Sum Squares Total    =DevSq(dv)
! dv must be entered first
EXCEL Data Analysis Procedure
Tools - Data Analysis - Regression - ok - (highlight & enter data) - ok
! x-variables have to be in adjacent columns
! The Data Analysis add-in must be added in first
! in Excel 2007 the Data Analysis procedures are found in the Data menu, not the Tools menu