# Exploring bivariate numerical data

## Correlation coefficient r

The correlation coefficient $$r$$ measures the direction and strength of a linear relationship. Calculating $$r$$ by hand is tedious, so we usually rely on technology for the computations and focus on understanding what $$r$$ says about a scatterplot. Here are some facts about $$r$$:

* It always has a value between $$-1$$ and $$1$$.
* Strong positive linear relationships have values of $$r$$ closer to $$1$$.
* Strong negative linear relationships have values of $$r$$ closer to $$-1$$.
* Weaker relationships have values of $$r$$ closer to $$0$$.

$$\large\bf r = \frac{1}{n-1}\sum (\frac{x\_i - \bar{x} }{s\_x})(\frac{y\_i - \bar{y}}{s\_y})$$

where $$(\frac{x\_i - \bar{x} }{s\_x})$$ is the z-score of each $$x$$ value, $$(\frac{y\_i - \bar{y}}{s\_y})$$ is the z-score of each $$y$$ value, and $$n$$ is the number of data points.

**Example:**

$$(1, 1)(2, 2)(2, 3)(3, 6) \newline  \bar x = 2 \newline s\_x = 0.816 \newline \bar y = 3 \newline s\_y = 2.160$$

$$r = \frac{1}{3} ((\frac{1-2}{0.816})(\frac{1-3}{2.160})...)\newline  r=\frac{1}{3}(\frac{5}{0.816 \* 2.160})\newline r \approx 0.946$$
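
The same calculation can be sketched in Python. This is a minimal illustration, not a library routine: the function name `correlation` is made up for this note, and `statistics.stdev` is assumed to be the right choice because it uses the same $$n-1$$ divisor as $$s\_x$$ and $$s\_y$$ above.

```python
from statistics import mean, stdev

def correlation(xs, ys):
    """Pearson r: sum of z-score products divided by n - 1."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)  # sample standard deviations (n - 1 divisor)
    z_products = [
        ((x - x_bar) / s_x) * ((y - y_bar) / s_y)
        for x, y in zip(xs, ys)
    ]
    return sum(z_products) / (n - 1)

# Data points from the example: (1, 1), (2, 2), (2, 3), (3, 6)
xs = [1, 2, 2, 3]
ys = [1, 2, 3, 6]
r = correlation(xs, ys)
print(round(r, 3))
```

Carried at full precision this gives about 0.945. The 0.946 in the hand calculation comes from rounding $$s\_x$$ and $$s\_y$$ to three decimals before dividing.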

## Residuals and least squares regression

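Least squares regression chooses the line that minimizes the sum of the squared residuals:

$$\sum (r\_n)^2$$

Here each residual $$r\_n$$ (not to be confused with the correlation coefficient $$r$$) is the difference between a point's $$y$$-value and the line's $$y$$-value at the same $$x$$-value.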

For example, the residual for the point $$(4,3)$$ is $$-2$$:

![](https://ka-perseus-graphie.s3.amazonaws.com/cd2dad117468d28f29d17427ca5240f8ca6ccb76.svg)

The closer a data point's residual is to 0, the better the fit. In this case, the line fits the point $$(4,3)$$ better than it fits the point $$(2,8)$$.

residual = actual - predicted

$$\hat{y} = \frac{1}{3}+\frac{1}{3}x$$ (given equation)

For the observed point $$(155, 51)$$:

residual = $$51 - (\frac{1}{3}+\frac{1}{3}(155)) = 51 - (\frac{1}{3}+\frac{155}{3}) = 51 - 52 = -1$$
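
As a quick check of the arithmetic, here is a small Python sketch. The helper names `predict` and `residual` are illustrative only, and `Fraction` is used so the thirds stay exact instead of becoming rounded floats.

```python
from fractions import Fraction

def predict(x):
    """Predicted y from the given line y_hat = 1/3 + (1/3) x."""
    return Fraction(1, 3) + Fraction(1, 3) * x

def residual(x, y):
    """Residual = actual y minus predicted y."""
    return y - predict(x)

print(predict(155))       # 52
print(residual(155, 51))  # -1
```

A negative residual means the observed point lies below the line.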

**Example:**

$$\hat{y}=mx+ b$$

$$m = r\frac{s\_y}{s\_x} \Longrightarrow r\frac{\triangle y}{\triangle x}$$

$$(1, 1)(2, 2)(2, 3)(3, 6) \newline  \bar x = 2 \newline s\_x = 0.816 \newline \bar y = 3 \newline s\_y = 2.160$$

$$m=0.946 \* \frac{2.160}{0.816} \approx 2.5\newline 3 = 2.5\*2+b\newline b = -2\newline \hat y = 2.5x-2$$
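
The sketch below recomputes the slope and intercept from the four data points. It then compares the sum of squared residuals against a few nearby lines, to illustrate (not prove) that the fitted line has the smallest total. The variable names and the choice of nearby lines are assumptions made for this illustration.

```python
from statistics import mean, stdev

points = [(1, 1), (2, 2), (2, 3), (3, 6)]
xs = [x for x, _ in points]
ys = [y for _, y in points]

x_bar, y_bar = mean(xs), mean(ys)
s_x, s_y = stdev(xs), stdev(ys)  # sample standard deviations
n = len(points)

# Correlation coefficient, then slope and intercept of y_hat = m x + b
r = sum((x - x_bar) * (y - y_bar) for x, y in points) / ((n - 1) * s_x * s_y)
m = r * s_y / s_x
b = y_bar - m * x_bar  # the line passes through (x_bar, y_bar)

def sum_squared_residuals(slope, intercept):
    """Sum of (actual - predicted)^2 over all data points."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in points)

best = sum_squared_residuals(m, b)
# A few arbitrary nearby lines for comparison
nearby = [(2.4, -2.0), (2.6, -2.0), (2.5, -1.9), (2.5, -2.1)]

print(round(m, 3), round(b, 3), round(best, 3))
for slope, intercept in nearby:
    print(slope, intercept, round(sum_squared_residuals(slope, intercept), 3))
```

The fitted line comes out as $$\hat y = 2.5x - 2$$ with a sum of squared residuals of 1.5, and each nearby line tried here gives a larger sum.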

## Quiz

A limnologist takes samples from a creek on several days and counts the number of flatworms in each sample. The limnologist wants to look at the relationship between the temperature of the creek and the number of flatworms in the sample. The data show a linear pattern with the summary statistics shown below:

|                                               | mean              | standard deviation |
| --------------------------------------------- | ----------------- | ------------------ |
| $$x=$$ creek temperature $$(^\circ\text{C})$$ | $$\bar{x}=10.2$$  | $$s\_x=2.8$$       |
| $$y=$$ number of flatworms                    | $$\bar{y}=37.6$$  | $$s\_y=30.8$$      |
|                                               |                   | $$r=-0.98$$        |

**Find the equation of the least-squares regression line for predicting the number of flatworms from the creek temperature.**

### Least-squares regression equation

The equation for the least-squares regression line for predicting $$y$$ from $$x$$ is of the form:

$$\hat{y} = a + bx$$

### Finding the slope

We can determine the slope as follows:

$$b=r(\frac{s\_y}{s\_x})$$

In our case:

$$b=-0.98(\frac{30.8}{2.8})=-10.78$$

### Finding the y-intercept

Because the regression line passes through the point $$(\bar x, \bar y)$$, we can find the y-intercept as follows:

$$a=\bar y-b\bar x$$

In our case, $$a=37.6 - (-10.78\*10.2) =147.556$$

### Answer

$$\hat y=147.56 -10.78 x$$
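
The two steps above (slope from $$r$$ and the standard deviations, then intercept from the means) can be sketched as one small Python function. The name `regression_line` and its parameter names are illustrative, not taken from any library.

```python
def regression_line(x_bar, y_bar, s_x, s_y, r):
    """Return (a, b) for y_hat = a + b x from summary statistics."""
    b = r * s_y / s_x      # slope
    a = y_bar - b * x_bar  # intercept: the line passes through (x_bar, y_bar)
    return a, b

# Summary statistics from the flatworm table
a, b = regression_line(x_bar=10.2, y_bar=37.6, s_x=2.8, s_y=30.8, r=-0.98)
print(f"y_hat = {a:.2f} + ({b:.2f})x")
```

With the values from the table this reproduces a slope of about $$-10.78$$ and an intercept of about $$147.56$$.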

## Residual

Joe sells used cars. He recorded the age (in years) of each car on his lot along with the number of kilometers it had been driven. After plotting his results, Joe noticed that the relationship between the two variables was fairly linear, so he used the data to calculate the following **least squares regression equation** for predicting distance driven from the age of the car:

$$\hat{y} = 11{,}000+18{,}000x$$

**What is the residual of a car that is 2 years old and has been driven** $$50{,}000\text{ km}$$**?**

1. **What is a residual?**

   Residuals are errors. More specifically, they are the differences between the observed value of the response variable and the value predicted by the least squares regression line.

   $$\fbox{\text{residual}=(\text{observed }y)-(\text{predicted }y)}$$

   or

   $$\fbox{  \text{residual}=y-\hat y}$$
2. **Calculating the predicted value**

   We can predict the distance driven for a 2-year-old car using the least squares regression line like this:

   $$\hat y = 11{,}000+18{,}000(2) = 47{,}000$$
3. **Calculating the residual**

   $$\text{residual}=\text{observed}-\text{predicted}\newline  \text{residual}=50{,}000\text{ km}-47{,}000\text{ km} \newline  \text{residual}=3{,}000\text{ km}$$

