Exploring bivariate numerical data

Correlation coefficient r

The correlation coefficient $r$ measures the direction and strength of a linear relationship. Calculating $r$ by hand is tedious, so we usually rely on technology for the computation and focus on understanding what $r$ says about a scatterplot. Here are some facts about $r$:

  • It always has a value between $-1$ and $1$.

  • Strong positive linear relationships have values of $r$ closer to $1$.

  • Strong negative linear relationships have values of $r$ closer to $-1$.

  • Weaker relationships have values of $r$ closer to $0$.

$$r = \frac{1}{n-1}\sum \left(\frac{x_i - \bar{x}}{s_x}\right)\left(\frac{y_i - \bar{y}}{s_y}\right)$$

where $\left(\frac{x_i - \bar{x}}{s_x}\right)$ is the z-score of each $x$ value and $\left(\frac{y_i - \bar{y}}{s_y}\right)$ is the z-score of each $y$ value.

Example:

Data points: $(1, 1)$, $(2, 2)$, $(2, 3)$, $(3, 6)$

$$\bar x = 2, \quad s_x = 0.816, \quad \bar y = 3, \quad s_y = 2.160$$

$$r = \frac{1}{3}\left(\left(\frac{1-2}{0.816}\right)\left(\frac{1-3}{2.160}\right) + \dots\right) = \frac{1}{3}\left(\frac{5}{0.816 \cdot 2.160}\right) \approx 0.946$$

This result uses the rounded standard deviations; with unrounded values, $r \approx 0.945$.
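The z-score formula can be checked with a few lines of Python. This is a minimal sketch using only the standard library; the function name `correlation` is chosen here for illustration. Because it uses unrounded standard deviations, it gives about 0.945, slightly different from a hand calculation with rounded values.

```python
from statistics import mean, stdev

def correlation(xs, ys):
    """Pearson's r: sum of z-score products divided by n - 1."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)  # sample standard deviations (n - 1)
    z_products = [
        ((x - x_bar) / s_x) * ((y - y_bar) / s_y)
        for x, y in zip(xs, ys)
    ]
    return sum(z_products) / (n - 1)

# The four example points (1,1), (2,2), (2,3), (3,6)
xs = [1, 2, 2, 3]
ys = [1, 2, 3, 6]
r = correlation(xs, ys)
print(round(r, 3))  # 0.945
```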

Residuals and least squares regression

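The least-squares regression line is the line that minimizes the sum of the squared residuals:

$$\sum (r_n)^2$$

Here $r_n$ is the residual of the $n$-th point (not the correlation coefficient). Calculate each residual as the difference between the point's $y$-value and the line's $y$-value at the same $x$-value.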

For example, the residual for the point $(4,3)$ is $-2$.

The closer a data point's residual is to 0, the better the fit. In this case, the line fits the point $(4,3)$ better than it fits the point $(2,8)$.

residual = actual - predicted

$$\hat{y} = \frac{1}{3}+\frac{1}{3}x \quad \text{(given equation)}$$

$$\text{residual} = 51 - \left(\frac{1}{3}+\frac{1}{3}(155)\right) = 51 - \left(\frac{1}{3}+\frac{155}{3}\right) = 51 - 52 = -1$$
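The same residual can be reproduced in Python. This is a small sketch; the helper names `predict` and `residual` are illustrative, and `Fraction` is used only so the thirds stay exact.

```python
from fractions import Fraction

def predict(x):
    """Predicted y from the given line y_hat = 1/3 + (1/3)x."""
    return Fraction(1, 3) + Fraction(1, 3) * x

def residual(x, y_actual):
    """Residual = actual - predicted."""
    return y_actual - predict(x)

# Point with x = 155 and actual y = 51
print(residual(155, 51))  # -1
```

A negative residual means the actual value lies below the line.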

Example:

$$\hat{y} = mx + b$$

$$m = r\frac{s_y}{s_x} \Longrightarrow r\frac{\Delta y}{\Delta x}$$

Data points: $(1, 1)$, $(2, 2)$, $(2, 3)$, $(3, 6)$

$$\bar x = 2, \quad s_x = 0.816, \quad \bar y = 3, \quad s_y = 2.160$$

$$m = 0.946 \cdot \frac{2.160}{0.816} \approx 2.5$$

The line passes through $(\bar x, \bar y) = (2, 3)$, so:

$$3 = 2.5 \cdot 2 + b \quad \Rightarrow \quad b = -2$$

$$\hat y = 2.5x - 2$$
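As a check on this example, the sketch below computes the slope and intercept from the raw points and then sums the squared residuals of the resulting line. It assumes only the standard library; the variable names are illustrative.

```python
from statistics import mean, stdev

xs = [1, 2, 2, 3]
ys = [1, 2, 3, 6]
n = len(xs)

x_bar, y_bar = mean(xs), mean(ys)
s_x, s_y = stdev(xs), stdev(ys)  # sample standard deviations

# Correlation coefficient from the z-score formula
r = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / ((n - 1) * s_x * s_y)

m = r * s_y / s_x      # slope
b = y_bar - m * x_bar  # intercept: the line passes through (x_bar, y_bar)

# Residual = actual - predicted, for every point
residuals = [y - (m * x + b) for x, y in zip(xs, ys)]
ssr = sum(e ** 2 for e in residuals)  # sum of squared residuals

print(round(m, 3), round(b, 3), round(ssr, 3))  # 2.5 -2.0 1.5
```

Any other line through these points would give a sum of squared residuals larger than 1.5.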

Quiz

A limnologist takes samples from a creek on several days and counts the number of flatworms in each sample. The limnologist wants to look at the relationship between the temperature of the creek and the number of flatworms in the sample. The data show a linear pattern with the summary statistics shown below:

| | Mean | Standard deviation |
| --- | --- | --- |
| $x =$ creek temperature ($^\circ\text{C}$) | $\bar{x} = 10.2$ | $s_x = 2.8$ |
| $y =$ number of flatworms | $\bar{y} = 37.6$ | $s_y = 30.8$ |

$$r = -0.98$$

Find the equation of the least-squares regression line for predicting the number of flatworms from the creek temperature.

Least-squares regression equation

The equation for the least-squares regression line for predicting $y$ from $x$ is of the form $\hat y = a + bx$, where $b$ is the slope and $a$ is the y-intercept.

Finding the slope

We can determine the slope as follows:

$$b = r\left(\frac{s_y}{s_x}\right)$$

In our case,

$$b = -0.98\left(\frac{30.8}{2.8}\right) = -10.78$$

Finding the y-intercept

Because the regression line passes through the point $(\bar x, \bar y)$, we can find the y-intercept as follows:

$$a = \bar y - b\bar x$$

In our case,

$$a = 37.6 - (-10.78 \cdot 10.2) = 147.556$$

Answer

$$\hat y = 147.56 - 10.78x$$
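The two steps of this solution fit in one short function. This is a sketch under the quiz's notation ($a$ for the intercept, $b$ for the slope); the function name `least_squares_line` is made up for illustration.

```python
def least_squares_line(x_bar, s_x, y_bar, s_y, r):
    """Return (a, b) for y_hat = a + b*x from summary statistics."""
    b = r * s_y / s_x      # slope
    a = y_bar - b * x_bar  # intercept: line passes through (x_bar, y_bar)
    return a, b

# Summary statistics from the flatworm quiz
a, b = least_squares_line(x_bar=10.2, s_x=2.8, y_bar=37.6, s_y=30.8, r=-0.98)
print(f"y_hat = {a:.2f} + ({b:.2f})x")  # y_hat = 147.56 + (-10.78)x
```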

Residual

Joe sells used cars. He recorded the age (in years) of each car on his lot along with the number of kilometers it had been driven. After plotting his results, Joe noticed that the relationship between the two variables was fairly linear, so he used the data to calculate the following least squares regression equation for predicting distance driven from the age of the car:

$$\hat{y} = 11{,}000 + 18{,}000x$$

What is the residual of a car that is 2 years old and has been driven $50{,}000\text{ km}$?

  1. What is a residual?

    Residuals are errors. More specifically, they are the differences between the observed value of the response variable and the value predicted by the least squares regression line.

    $$\fbox{\text{residual}=(\text{observed }y)-(\text{predicted }y)}$$

    or

    $$\text{residual} = y - \hat y$$

  2. Calculating the predicted value

    We can predict the distance driven for a 2 year old car using the least squares regression line like this:

    $$\hat y = 11{,}000 + 18{,}000(2) = 47{,}000\text{ km}$$

  3. Calculating the residual

    $$\text{residual} = \text{observed} - \text{predicted}$$

    $$\text{residual} = 50{,}000\text{ km} - 47{,}000\text{ km}$$

    $$\text{residual} = 3{,}000\text{ km}$$
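The three steps above can be sketched in Python as follows; the names `predicted_km`, `observed_km`, and `residual_km` are illustrative only.

```python
def predicted_km(age_years):
    """Joe's regression line: predicted distance driven (km) from age (years)."""
    return 11_000 + 18_000 * age_years

observed_km = 50_000
residual_km = observed_km - predicted_km(2)  # observed - predicted
print(residual_km)  # 3000
```

The positive residual means this car has been driven 3,000 km more than the line predicts for a 2-year-old car.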
