
Looking for a correlation between data sets can help to determine if different phenomena are related. If a correlation exists, it is important to be able to measure it. This lesson will explore one of the quantities that measure the strength of a correlation. Also, it will investigate whether or not one data set is directly affected by another data set.
### Catch-Up and Review

**Here are a few recommended readings before getting started with this lesson.**

Explore

Consider the following four data sets, known as Anscombe's quartet. Notice that the first three sets have the same $x$-values. Explore the options in the following applet to investigate and compare the characteristics of each set.

The four sets have the same mean, standard deviation, and the same line of regression. Therefore, it seems reasonable to think that the scatter plots of these sets are similar. Use the following graph to discover each scatter plot and be ready for a surprise.

In noticing the contrast between what the statistical characteristics suggest and what the diagrams show, what conclusion can be drawn about analyzing data sets, in general?

Discussion

When a graphing calculator is used to find the line of best fit of a data set, it usually also gives a constant $r$ that measures the strength of the correlation. This constant is called the *correlation coefficient*.

Concept

The correlation coefficient, usually denoted by $r,$ measures the direction and strength of a linear relationship between two variables. It can take on values between $-1$ and $1.$ Values near $-1$ mean that the correlation is strong and negative, while values close to $1$ mean that it is strong and positive. Values close to $0$ represent a weak or very weak correlation, but $r=0$ represents no correlation.

When there is a linear model that describes the relationship between two variables well, the correlation coefficient indicates how close the points are to the line of best fit. The closer the value of $r$ is to $-1$ or $1,$ the closer the points are to the line of best fit.

Keep in mind that the correlation coefficient is useful *only* when a linear model describes the data well. Although a graphing calculator can find the correlation coefficient along with the line of best fit, it is also beneficial to learn how to calculate the coefficient by hand. Consider the following formula.

### Extra

Formula for Finding the Pearson Correlation Coefficient

Although there are different types of correlation coefficients, the most commonly used is the Pearson correlation coefficient.

Given a data set with $n$ points $\{(x_{1},y_{1}),\ldots,(x_{n},y_{n})\},$ the *Pearson correlation coefficient* can be found by dividing the covariance of $x$ and $y$ by the product of their standard deviations.

$r=\dfrac{\text{Cov}(x,y)}{\sigma(x)\sigma(y)}$

Alternatively, the formula can be rewritten as follows.
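To make the formula concrete, here is a minimal Python sketch that computes $r$ in two ways: directly from the definition (covariance divided by the product of the standard deviations) and from one standard expanded form, $r=\dfrac{n\sum x_iy_i-\sum x_i\sum y_i}{\sqrt{\left(n\sum x_i^2-(\sum x_i)^2\right)\left(n\sum y_i^2-(\sum y_i)^2\right)}}.$ Whether this expanded form is the exact rewrite the lesson intends is an assumption. The function names and the five data points are hypothetical and chosen only for illustration.

```python
from math import sqrt

def pearson_definition(xs, ys):
    """r = Cov(x, y) / (sigma(x) * sigma(y)), using population formulas."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    sigma_x = sqrt(sum((x - mean_x) ** 2 for x in xs) / n)
    sigma_y = sqrt(sum((y - mean_y) ** 2 for y in ys) / n)
    return cov / (sigma_x * sigma_y)

def pearson_sums(xs, ys):
    """Expanded form that needs only the sums of x, y, xy, x^2, and y^2."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    syy = sum(y * y for y in ys)
    return (n * sxy - sx * sy) / sqrt((n * sxx - sx ** 2) * (n * syy - sy ** 2))

# Hypothetical data, not taken from the lesson.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
r_def = pearson_definition(xs, ys)
r_sum = pearson_sums(xs, ys)
print(round(r_def, 3), round(r_sum, 3))  # prints: 0.775 0.775
```

Both functions return the same value because the factor $n$ cancels between the covariance and the standard deviations. The expanded form is usually the more convenient one for hand calculation with a table of sums.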

Knowing the correlation coefficient can give a clue about how well the regression line models the data. Nevertheless, it is always best practice to make a scatter plot before drawing any conclusions. As an example, each set in Anscombe's quartet has a correlation coefficient of $0.816,$ implying a strong positive correlation.

However, a linear model does not seem to fit the second set of data well.

Example

Pair each scatter plot with the most appropriate correlation coefficient.

Scatter plots: Scatter Plot A, Scatter Plot B, Scatter Plot C, Scatter Plot D

Correlation coefficients: $r=0.08,$ $r=-0.65,$ $r=0.84,$ $r=-0.92$

Try drawing a line that fits each data set. If the points are grouped near the line, the correlation coefficient should be close either to $1$ or $-1.$ If the slope of the line is positive, the correlation coefficient is positive, and if the slope of the line is negative, the correlation coefficient is negative.

To pair each scatter plot with its corresponding correlation coefficient, analyze each scatter plot one at a time.

There is no apparent correlation between the points of Scatter Plot $A.$ That implies the correlation coefficient is close to $0.$ Considering the given options, only $r=0.08$ meets this condition.

$\text{Scatter Plot } A \longrightarrow r=0.08$

In Scatter Plot $B,$ the points seem to follow a certain direction. As $x$ increases, $y$ tends to decrease which implies a negative correlation. Here, a line with a negative slope can be drawn to model the data.

Hence, the correlation coefficient must also be negative. Considering the given options, there are two possible values for $r:$ $-0.92$ or $-0.65.$ At this time, however, it cannot be concluded which corresponds to Scatter Plot $B.$

$\text{Scatter Plot } B \longrightarrow r=-0.92 \ \text{ or } \ r=-0.65$

The points in Scatter Plot $C$ follow an ascending trajectory — as $x$ increases, $y$ tends to increase as well. This relationship implies a positive correlation. Here, a line with a positive slope seems to be a good candidate for the line of best fit.

Therefore, the correlation coefficient is positive. Likewise, since all the points are close to the line, the correlation coefficient must be close to $1.$ Consequently, the best choice for the correlation coefficient of Scatter Plot $C$ is $r=0.84.$

$\text{Scatter Plot } C \longrightarrow r=0.84$

In Scatter Plot $D,$ the points follow a descending direction — as $x$ increases, $y$ tends to decrease. This relationship implies a negative correlation. Like Scatter Plot $B,$ a line with negative slope is a good candidate to be the line of best fit.

Therefore, the correlation coefficient is negative. As before, there are two options.

$\text{Scatter Plot } D \longrightarrow r=-0.92 \ \text{ or } \ r=-0.65$

When comparing Scatter Plots $B$ and $D,$ the points in Scatter Plot $D$ are more tightly clustered and closer to following a linear pattern than those in Scatter Plot $B.$ Thus, the correlation in Scatter Plot $D$ is stronger than the correlation in Scatter Plot $B.$ Consequently, $r=-0.92$ is the most appropriate correlation coefficient for Scatter Plot $D,$ which leaves $r=-0.65$ for Scatter Plot $B.$

$\text{Scatter Plot } D \longrightarrow r=-0.92 \qquad \text{Scatter Plot } B \longrightarrow r=-0.65$

In the following table, each scatter plot is paired with its correlation coefficient, and the strength of the correlation is described.

Scatter Plot | Correlation Coefficient | Strength |
---|---|---|
$A$ | $0.08$ | Very weak positive correlation |
$B$ | $-0.65$ | Strong negative correlation |
$C$ | $0.84$ | Very strong positive correlation |
$D$ | $-0.92$ | Very strong negative correlation |
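The verbal labels in the table can be mimicked with a small helper. The lesson does not state numeric cutoffs, so the thresholds below ($0.2,$ $0.4,$ $0.6,$ and $0.8$ on $|r|$) are an assumption chosen only because they reproduce the four labels above. The function name `describe_correlation` is likewise hypothetical.

```python
def describe_correlation(r):
    """Label r using illustrative cutoffs on |r| (an assumption, not a standard)."""
    if not -1 <= r <= 1:
        raise ValueError("r must lie between -1 and 1")
    if r == 0:
        return "No correlation"
    size = abs(r)
    if size < 0.2:
        strength = "Very weak"
    elif size < 0.4:
        strength = "Weak"
    elif size < 0.6:
        strength = "Moderate"
    elif size < 0.8:
        strength = "Strong"
    else:
        strength = "Very strong"
    direction = "positive" if r > 0 else "negative"
    return f"{strength} {direction} correlation"

# The four coefficients from the table above.
for plot, r in [("A", 0.08), ("B", -0.65), ("C", 0.84), ("D", -0.92)]:
    print(plot, r, describe_correlation(r))
```

Different textbooks draw these boundaries in different places, so the labels near a cutoff should be read as rough guidance rather than a strict classification.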

Pop Quiz

For each data set, find the correlation coefficient. Give the answer rounded to two decimal places.

Example

Maya works at an eSports company that has offered a customized gaming chair for each player on the Los Angeles Lakers $2020–2021$ season roster. The measurements of each player are listed in the following table.

Height (in) | $72.83$ | $74.02$ | $75.20$ | $75.20$ | $75.20$ | $75.98$ | $75.98$ | $75.98$ | $75.98$ | $77.17$ | $77.17$ | $77.17$ | $79.13$ | $79.92$ | $81.89$ | $81.98$ | $83.07$ |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Weight (lbs) | $178.57$ | $189.60$ | $182.98$ | $198.42$ | $198.42$ | $194.01$ | $205.03$ | $233.69$ | $218.26$ | $178.57$ | $213.85$ | $196.21$ | $235.90$ | $213.85$ | $251.33$ | $264.56$ | $264.56$ |

a Find the line of best fit and the correlation coefficient — rounded to three decimal places — for the given data. Then, classify its strength.

b Maya noticed that LeBron James' measurements are missing. She gave him a call and asked for his height and weight. Oh no! Before he could say his weight, the phone reception was lost! Unable to reach LeBron again, she decided to approximate his weight. If LeBron is $81.10$ inches tall, what is his approximate weight? Is it a good prediction?

a **Line of Best Fit:** $y=8.160x−417.812$

**Correlation Coefficient:** $r=0.848$

**Strength:** Very strong positive correlation

b **Approximated Weight:** $243.964$ pounds

**Is it a good prediction?** Yes. The scatter plot shows that a linear model describes the data well enough, and the correlation coefficient is close to $1.$

a Use a graphing calculator to find the line of best fit and the correlation coefficient. When the value of $∣r∣$ is close to $1,$ the correlation is strong.

b To determine LeBron's weight, use the line of best fit to predict. To determine whether the prediction is good, make a scatter plot of the data. The prediction is good if a linear model describes the data well enough and the correlation is strong.

a The line of best fit and the correlation coefficient can be found using a graphing calculator. First, the data values need to be entered into the calculator. This is done by pressing the $\text{STAT}$ button and selecting the option **Edit**.

Finally, by pressing the $\text{STAT}$ button and then selecting the menu item **CALC**, the option **LinReg($ax+b$)** can be found. This option gives a line of best fit, expressed as a linear function in slope-intercept form, and the correlation coefficient.

When rounding to three decimal places, the equation for the line of best fit and the correlation coefficient are expressed as follows.

$y=8.160x-417.812 \quad \text{and} \quad r=0.848$

Based on the correlation coefficient, and without looking at the scatter plot of the data, it can be concluded that the correlation is very strong and positive.
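As a cross-check on the calculator output, the least-squares formulas can be applied to the table directly. The following is a minimal Python sketch. The helper name `linear_fit` is hypothetical, and the two lists are copied from the height and weight table above.

```python
from math import sqrt

# Heights (in) and weights (lbs) copied from the table above.
heights = [72.83, 74.02, 75.20, 75.20, 75.20, 75.98, 75.98, 75.98, 75.98,
           77.17, 77.17, 77.17, 79.13, 79.92, 81.89, 81.98, 83.07]
weights = [178.57, 189.60, 182.98, 198.42, 198.42, 194.01, 205.03, 233.69,
           218.26, 178.57, 213.85, 196.21, 235.90, 213.85, 251.33, 264.56,
           264.56]

def linear_fit(xs, ys):
    """Return slope, intercept, and Pearson r of the least-squares line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / sqrt(sxx * syy)
    return slope, intercept, r

slope, intercept, r = linear_fit(heights, weights)
print(f"slope = {slope:.3f}, intercept = {intercept:.3f}, r = {r:.3f}")
```

The printed values should agree, up to rounding, with the calculator results reported above. Note that the intercept is sensitive to small changes in the slope because the heights are far from zero.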

b To predict LeBron's weight, the line of best fit found in Part A can be used. In doing so, substitute LeBron's height into the equation of the line.

$y=8.160x-417.812$

Substitute $81.10$ for $x$

$y=8.160(81.10)-417.812$

Multiply

$y=661.776-417.812$

Subtract terms

$y=243.964$

After analyzing the scatter plot, it can be concluded that the data is well described by a linear model. Therefore, the conclusion made in Part A is correct — the data has a very strong positive correlation. This would imply that the prediction of LeBron's weight is good enough. What clever work by Maya!

Example

Maya's eSports company is interested in promoting their games through movies. Her supervisor has asked her to study movie releases so they can prepare a business strategy. The number of films released by $20th$ Century Fox from $2000$ to $2018$ is given in the following table.

a Using the given data, determine the line of best fit and the correlation coefficient. Be sure to round to three decimal places. Then, classify the strength of the correlation.

b Estimate the number of films released by $20th$ Century Fox in $2019$ by using the line of best fit. Is it a reliable estimation?

a **Line of Best Fit:** $y=-0.067x+150.354$

**Correlation Coefficient:** $r=-0.127$

**Strength:** Very weak negative correlation

b **Estimated Number of Films Released:** $15$

**Is It Reliable?** No, the correlation is very weak, as $r=-0.127.$ Also, the scatter plot representing the given data and the scatter plot of the residuals reveal that a linear model is not appropriate to describe the data.

a Use a graphing calculator to find the line of best fit and the correlation coefficient. If $∣r∣$ is close to $1$ the correlation is strong; if $∣r∣$ is close to zero, the correlation is weak or there is no correlation.

b To determine whether the estimation obtained with the line of best fit is good or not, make a scatter plot of the data. The estimation is good if the correlation is strong and a linear model describes the data well enough.

a With the help of a graphing calculator, the line of best fit and the correlation coefficient can be found. To start, press the $\text{STAT}$ button, select the option **Edit**, and enter the data.

Next, press the $\text{STAT}$ button, select the menu item **CALC**, and choose the option **LinReg($ax+b$)**. This option gives the line of best fit expressed as a linear equation in slope-intercept form. It also gives the correlation coefficient.

$y=-0.067x+150.354 \quad \text{and} \quad r=-0.127$

Because the correlation coefficient is both close to $0$ and negative, the data has a very weak and negative correlation.

b To estimate the number of films released by $20th$ Century Fox in $2019,$ evaluate the line of best fit found in Part A at $x=2019.$

$y=-0.067x+150.354$

Substitute $2019$ for $x$ and evaluate

$y=15.081$
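Written out step by step, the evaluation summarized above proceeds as follows. The intermediate values are computed from the rounded coefficients given in Part A.

```latex
\begin{aligned}
y &= -0.067x + 150.354 \\
y &= -0.067(2019) + 150.354 \\
y &= -135.273 + 150.354 \\
y &= 15.081
\end{aligned}
```

Since the number of films must be a whole number, this value rounds to the estimate of $15$ films.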

The scatter plot reveals that the data is not well described by a linear model. This could be expected since the correlation between the data is very weak. Consequently, the estimation made is not reliable. In fact, the actual number of films released by $20th$ Century Fox in $2019$ was $10.$

Additionally, by analyzing the scatter plot of the residuals, it can also be concluded that a linear model is not appropriate to describe the data.

Note that several points are far from the $x$-axis. Hence, the line of best fit is not reliable for making predictions.
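The idea of a residual plot can be sketched in a few lines of Python. A residual is the observed $y$-value minus the value predicted by the line of best fit. The film data is not reproduced here, so the five points below are hypothetical and deliberately curved, and the helper names are assumptions made for illustration.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def residuals(xs, ys, slope, intercept):
    """Observed y minus the y predicted by the line, for each point."""
    return [y - (slope * x + intercept) for x, y in zip(xs, ys)]

# Hypothetical data that follows a curve (y = x^2), not a line.
xs = [1, 2, 3, 4, 5]
ys = [1, 4, 9, 16, 25]
slope, intercept = fit_line(xs, ys)       # line of best fit: y = 6x - 7
res = residuals(xs, ys, slope, intercept)
print(res)  # prints: [2.0, -1.0, -2.0, -1.0, 2.0]
```

Here the residuals are positive at the ends and negative in the middle. A systematic pattern like this, or residuals that lie far from zero, suggests that a linear model is not appropriate.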

Discussion

There might be cases where a data set presents more than a correlation. For example, when the change in one variable directly affects the other, it is said that there is a *causation*.

Concept

Causation is a relationship between two quantities where one quantity is affected by the other. That is, the change in one quantity *directly* affects the other. When there is causation between two data sets, that is said to be a causal relationship. Lastly, causation implies correlation.

$\text{Causal relationships always produce a strong correlation.}$

Consider a situation where an employee earns an hourly wage. Since the number of hours worked directly affects the worker's income, there is both a positive correlation and a causal relationship. The more hours the employee works, the more income the worker makes.
However, the converse is not true. While two data sets can be correlated, they might not have a causal relation.

$\text{Correlation does not always imply a causal relationship.}$

For example, consider the number of factories and the number of teachers in a city. There could be a positive correlation between these two numbers because both tend to increase as the city's population increases.
However, it is unlikely that building more factories will cause an increase in the number of teachers. Thus, it can be said that there is no causal relationship. In this case, a third factor, population size, seems to directly affect the two data sets in question. That provides further evidence against a causal relationship between the number of factories and the number of teachers.
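The effect of a third factor can be illustrated numerically. In the sketch below, all figures are hypothetical: five imaginary cities of increasing population, each with a number of factories and teachers that grows roughly with its size. The helper `pearson_r` is an assumption written for this illustration.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sqrt(sxx * syy)

# Hypothetical cities: population in thousands, factories, teachers.
population = [50, 80, 120, 200, 350]
factories = [6, 7, 13, 19, 36]
teachers = [240, 410, 590, 1010, 1740]

r_factories_teachers = pearson_r(factories, teachers)
r_population_factories = pearson_r(population, factories)
r_population_teachers = pearson_r(population, teachers)
print(round(r_factories_teachers, 2))
print(round(r_population_factories, 2), round(r_population_teachers, 2))
```

Factories and teachers come out strongly correlated even though neither was generated from the other. Both simply follow the population, which is the hidden variable in this made-up data.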

Example

Before Maya's eSports company can consider expanding, there is a problem. A few parents funded a study against video games! This scatter plot shows the relationship between the number of hours a group of students spend playing video games each week and their grade point averages. The coordinates of each point can be viewed by hovering over it.

a Try to prove the study wrong. Otherwise, there goes time spent playing video games! Determine how strong the correlation is for their data.

Options: No correlation, Very weak positive correlation, Very strong negative correlation, Strong positive correlation, Moderate negative correlation

b Is there a causal relationship? Explain the reasoning.

Options: Yes, Not necessarily

a Note that as the number of hours playing video games increases, the grade point average tends to decrease. Use a graphing calculator to find the correlation coefficient.

b Does the time spent playing video games directly affect the grades of a student? Could a student have good grades and play video games for many hours each week?

a The scatter plot shows a negative correlation: as the number of hours spent playing video games increases, the grade point average tends to decrease. The data also seems to be well described by a linear model. By using a graphing calculator, both the line of best fit and the correlation coefficient can be found.

From the graph and the correlation coefficient value, it can be concluded that the given data has a very strong negative correlation.

b Although the correlation between the data is strong, that does not necessarily indicate a causal relationship. Spending a lot of time playing video games does not directly cause a student to get lower grades. A student could play $10$ hours each week and still have a GPA above $4.$

$\text{The data is strongly correlated, but there is not a causal relationship.}$

However, as discussed previously, another factor might be influencing the GPA. It is likely that students who spend a lot of time playing video games spend less time studying, and therefore they tend to have lower GPAs.

Example

After the study came out showing a negative impact on grades because of too many hours playing video games, Maya's eSports company lost revenue. They will now try to reduce their electric bill for their lights, computers, and air conditioning. Maya created a table that shows the monthly average outdoor temperature and last year's prices.

Month | January | February | March | April | May | June | July | August | September | October | November | December |
---|---|---|---|---|---|---|---|---|---|---|---|---|
Outdoor Temperature $(^{\circ}\text{F})$ | $70$ | $74$ | $80$ | $82$ | $86$ | $88$ | $92$ | $90$ | $84$ | $78$ | $76$ | $72$ |
Bill $(\$)$ | $185$ | $220$ | $260$ | $263$ | $275$ | $280$ | $310$ | $290$ | $272$ | $240$ | $230$ | $194$ |

Maya is told that the office was open the same number of hours every month and the electrical appliances of the office always had the same settings throughout the entire year.

a How strong is the correlation between the two data sets?

Options: Very Strong Positive Correlation, Very Strong Negative Correlation, Strong Positive Correlation, Moderate Negative Correlation, Very Weak Positive Correlation

b If the office was open the same number of hours every month and the price of electricity was stable last year, is there a causal relationship between the outside temperature and the bills? Explain the reasoning.

Options: Yes, Not necessarily

a Note that as outdoor temperatures rise, bill amounts tend to increase. Use a graphing calculator to find the correlation coefficient.

b Does the outdoor temperature directly affect the bill amount? Notice that the office was open the same number of hours every month and the electrical appliances of the office always had the same settings. Also, the price of electricity fluctuated very little.

a With the help of a graphing calculator, the line of best fit and the correlation coefficient can be found. First, enter the data in the calculator. To do so, press the $\text{STAT}$ button, select the option **Edit**, and enter the data.

Next, press the $\text{STAT}$ button, select the menu item **CALC**, and choose the option **LinReg($ax+b$)**. This option gives the line of best fit, which is expressed as a linear equation in slope-intercept form. It also gives the correlation coefficient.

The correlation coefficient is almost $1.$ That means that the data has a very strong positive correlation. Also, as seen in the scatter plot of the data, as the temperature increases, the bill amount increases.
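The same coefficient can be checked without a calculator. The following minimal Python sketch applies the Pearson formula to the twelve monthly values from the table. The helper name `pearson_r` is hypothetical.

```python
from math import sqrt

# Monthly values copied from the table, January through December.
temperature = [70, 74, 80, 82, 86, 88, 92, 90, 84, 78, 76, 72]
bill = [185, 220, 260, 263, 275, 280, 310, 290, 272, 240, 230, 194]

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sqrt(sxx * syy)

r = pearson_r(temperature, bill)
print(f"r = {r:.3f}")
```

The result should be close to $1,$ in line with the very strong positive correlation described above.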

b Two conditions are given about the eSports office. Firstly, the same electrical appliances are always on and with the same settings. Secondly, the office was open the same number of hours each month and the price of electricity was stable.

Every Month | |
---|---|
✓ | The same electrical appliances |
✓ | The same settings |
✓ | The same number of hours worked |
✓ | The price of electricity was stable |

Under these conditions, the electricity consumption should have been the same every month. That would then indicate that the bill amounts should have been roughly the same. However, when analyzing the scatter plot in Part A, it is seen that the bill amounts increased when the temperature increased.
Therefore, the outdoor temperature seems to influence the bill amount, and thus, there is a causal relationship. This seems logical because when the outdoor temperature increases, the air conditioner has to work harder to maintain the interior temperature, which increases electricity consumption.
Notice that the points do not perfectly follow a linear model; this can be due to the electricity price, which was stable but probably fluctuated a bit from month to month.

Closure

Notice that the correlation between the given data was very strong in the last two examples, *Video Games vs. Grades* and *Temperature vs. the Electric Bill*. However, a causal relationship could be established only in the second example. Therefore, it is always important to make a general study of the situation and look for *hidden variables* when analyzing data.

Although the app used for text messaging might have some kind of error that causes the phone to freeze, there is likely another cause for the freezing. For example, it could be that the phone is running several apps in the background, and it freezes due to the lack of *Random-access memory* (RAM).
