Exploring bivariate numerical data

Correlation coefficient r

The correlation coefficient $r$ measures the direction and strength of a linear relationship. Calculating $r$ by hand is tedious, so we usually rely on technology for the computations and focus on understanding what $r$ says about a scatterplot. Here are some facts about $r$:

It always has a value between $-1$ and $1$.
Strong positive linear relationships have values of $r$ closer to $1$.
Strong negative linear relationships have values of $r$ closer to $-1$.
Weaker relationships have values of $r$ closer to $0$.

$\large\bf r = \frac{1}{n-1}\sum (\frac{x_i - \bar{x} }{s_x})(\frac{y_i - \bar{y}}{s_y})$

where $(\frac{x_i - \bar{x} }{s_x})$ is the z-score of each $x$-value and $(\frac{y_i - \bar{y}}{s_y})$ is the z-score of each $y$-value.

Example:

$(1, 1)(2, 2)(2, 3)(3, 6) \newline \bar x = 2 \newline s_x = 0.816 \newline \bar y = 3 \newline s_y = 2.160$

$r = \frac{1}{3} ((\frac{1-2}{0.816})(\frac{1-3}{2.160})...)\newline r=\frac{1}{3}(\frac{5}{0.816 * 2.160})\newline r \approx 0.946$
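The same calculation can be sketched in Python. This is a minimal illustration using only the standard library; the function name `correlation` is chosen here for the example and is not part of the original notes. Because it does not round $s_x$ and $s_y$ before multiplying, it gives roughly 0.945 rather than the 0.946 obtained from the rounded values above.

```python
from statistics import mean, stdev

def correlation(xs, ys):
    """Pearson r as the sum of z-score products divided by n - 1."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)  # sample standard deviations (n - 1)
    total = sum(
        ((x - x_bar) / s_x) * ((y - y_bar) / s_y)
        for x, y in zip(xs, ys)
    )
    return total / (n - 1)

xs = [1, 2, 2, 3]
ys = [1, 2, 3, 6]
r = correlation(xs, ys)
print(round(r, 3))
```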

Residuals and least squares regression

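Least squares regression chooses the line that minimizes the sum of the squared residuals:

$\sum (r_n)^2$

Calculate each residual $r_n$ as the difference between the point's $y$-value and the line's $y$-value at a given $x$-value. Here $r_n$ denotes a residual, not the correlation coefficient $r$.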

For example, the residual for the point $(4,3)$ is $\redD{-2}$.

The closer a data point's residual is to 0, the better the fit. In this case, the line fits the point $(4,3)$ better than it fits the point $(2,8)$.

residual = actual - predicted

$\hat{y} = \frac{1}{3}+\frac{1}{3}x$ (given equation)

For the observed point $(155, 51)$:

residual = $51 - (\frac{1}{3}+\frac{1}{3}(155)) = 51 - (\frac{1}{3}+\frac{155}{3}) = 51 - 52 = -1$
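As a quick check of this arithmetic, here is a small Python sketch. The helper names `predict` and `residual` are made up for illustration, and `Fraction` is used only so that the thirds stay exact.

```python
from fractions import Fraction

def predict(x):
    """Predicted y from the given line y_hat = 1/3 + (1/3) x."""
    return Fraction(1, 3) + Fraction(1, 3) * x

def residual(x, y_observed):
    """residual = actual - predicted."""
    return y_observed - predict(x)

print(predict(155))       # 52
print(residual(155, 51))  # -1
```

A negative residual means the observed point lies below the line.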

Example:

$\hat{y}=mx+ b$

$m = r\frac{s_y}{s_x} \Longrightarrow r\frac{\Delta y}{\Delta x}$

$(1, 1)(2, 2)(2, 3)(3, 6) \newline \bar x = 2 \newline s_x = 0.816 \newline \bar y = 3 \newline s_y = 2.160$

$m=0.946 * \frac{2.160}{0.816}\newline \approx2.5\newline 3 = 2.5*2+b\newline b = -2\newline \hat y = 2.5x-2$
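The steps above can be sketched in Python for the same four points. This is an illustrative sketch, not a prescribed method: the variable names are chosen here, and the last lines also compute the sum of squared residuals for the fitted line.

```python
from statistics import mean, stdev

xs = [1, 2, 2, 3]
ys = [1, 2, 3, 6]
n = len(xs)

x_bar, y_bar = mean(xs), mean(ys)
s_x, s_y = stdev(xs), stdev(ys)  # sample standard deviations

# Correlation coefficient from the z-score formula
r = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / ((n - 1) * s_x * s_y)

m = r * s_y / s_x      # slope
b = y_bar - m * x_bar  # intercept: the line passes through (x_bar, y_bar)

# Residuals (actual - predicted) and their sum of squares
residuals = [y - (m * x + b) for x, y in zip(xs, ys)]
sse = sum(e ** 2 for e in residuals)

print(round(m, 3), round(b, 3), round(sse, 3))
```

The slope and intercept should come out very close to 2.5 and -2, matching the hand calculation.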

Quiz

A limnologist takes samples from a creek on several days and counts the numbers of flatworms in each sample. The limnologist wants to look at the relationship between the temperature of the creek and the number of flatworms in the sample. The data show a linear pattern with the summary statistics shown below:

| | mean | standard deviation |
| --- | --- | --- |
| $x=$ creek temperature $(^\circ\text{C})$ | $\bar{x}=10.2$ | $s_x=2.8$ |
| $y=$ number of flatworms | $\bar{y}=37.6$ | $s_y=30.8$ |

$r=-0.98$

Find the equation of the least-squares regression line for predicting the number of flatworms from the creek temperature.

Least-squares regression equation

The equation for the least-squares regression line for predicting $y$ from $x$ is of the form:

$\hat y=a+bx$

where $a$ is the y-intercept and $b$ is the slope.

Finding the slope

We can determine the slope as follows:

$b=r(\frac{s_y}{s_x})$

In our case,

$b=-0.98(\frac{30.8}{2.8})=-10.78$

Finding the y-intercept

Because the regression line passes through the point $(\bar x, \bar y)$, we can find the y-intercept as follows:

$a=\bar y-b\bar x$

In our case, $a=37.6 - (-10.78*10.2) =147.556$

Answer

$\hat y=147.56 -10.78 x$
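The quiz solution uses only summary statistics, so it can be checked with a few lines of Python. The function name `regression_from_summary` is hypothetical and chosen here for illustration.

```python
def regression_from_summary(x_bar, s_x, y_bar, s_y, r):
    """Return (a, b) for the least-squares line y_hat = a + b x."""
    b = r * s_y / s_x      # slope
    a = y_bar - b * x_bar  # intercept, since the line passes through (x_bar, y_bar)
    return a, b

a, b = regression_from_summary(x_bar=10.2, s_x=2.8, y_bar=37.6, s_y=30.8, r=-0.98)
print(f"y_hat = {a:.2f} + ({b:.2f}) x")
```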

Residual

Joe sells used cars. He recorded the age (in years) of each car on his lot along with the number of kilometers it had been driven. After plotting his results, Joe noticed that the relationship between the two variables was fairly linear, so he used the data to calculate the following least squares regression equation for predicting distance driven from the age of the car:

$\hat{y} = 11{,}000+18{,}000x$

What is the residual of a car that is 2 years old and has been driven $50{,}000\text{ km}$?

What is a residual?

Residuals are errors. More specifically, they are the differences between the observed value of the response variable and the value predicted by the least squares regression line.

$\fbox{\text{residual}=(\text{observed }y)-(\text{predicted }y)}$

or, in symbols,

$\text{residual}=y-\hat{y}$

Calculating the predicted value

We can predict the distance driven for a 2-year-old car using the least squares regression line like this:

$\hat y = 11{,}000+18{,}000(2) = 47{,}000$

Calculating the residual

$\text{residual}=\text{observed}-\text{predicted}\newline \text{residual}=50{,}000\text{ km}-47{,}000\text{ km} \newline \text{residual}=3{,}000\text{ km}$
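The same residual can be reproduced in Python. The function and variable names below are illustrative only; the coefficients come from Joe's regression equation above.

```python
def predicted_distance_km(age_years):
    """Predicted distance from y_hat = 11,000 + 18,000 x."""
    return 11_000 + 18_000 * age_years

observed_km = 50_000
predicted_km = predicted_distance_km(2)
residual_km = observed_km - predicted_km  # observed - predicted

print(predicted_km)  # 47000
print(residual_km)   # 3000
```

The positive residual means this car has been driven 3,000 km more than the line predicts for its age.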

