Looking for a correlation between data sets can help to determine if different phenomena are related. If a correlation exists, it is important to be able to measure it. This lesson will explore one of the quantities that measure the strength of a correlation. Also, it will investigate whether or not one data set is directly affected by another data set.


Explore

Anscombe's Quartet

Consider the following four data sets, known as Anscombe's quartet. Notice that the first three sets share the same x-values. Explore the following applet's options to investigate and compare the characteristics of each set.
Anscombe's Quartet
The four sets have the same mean, standard deviation, and the same line of regression. Therefore, it seems reasonable to think that the scatter plots of these sets are similar. Use the following graph to discover each scatter plot and be ready for a surprise.
Scatter Plot of the Anscombe's Quartet
Given the contrast between what the statistical characteristics suggest and what the diagrams show, what conclusion can be drawn about analyzing data sets in general?

Discussion

Strength of Correlation

When a graphing calculator is used to find the line of best fit of a data set, it usually also reports a constant that measures the strength of the correlation. This constant is called the correlation coefficient.

Concept

Correlation Coefficient

The correlation coefficient, usually denoted by r, measures the direction and strength of a linear relationship between two variables. It can take on values between -1 and 1. Values near -1 mean that the correlation is strong and negative, while values close to 1 mean that it is strong and positive. Values close to 0 represent a weak or very weak correlation, and 0 itself represents no correlation.
r = 1: perfect positive correlation
0.75 ≤ r < 1: strong positive correlation
0.3 ≤ r < 0.75: moderate positive correlation
0.15 ≤ r < 0.3: weak positive correlation
0 ≤ r < 0.15: no correlation
The same bands apply to negative values of r, with "negative" in place of "positive."
When there is a linear model that describes the relationship between two variables well, the correlation coefficient indicates how close the points are to the line of best fit. The closer the value is to -1 or 1, the closer the points are to the line of best fit.
Group of points moving as the correlation coefficient changes
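The bands listed above can be turned into a small lookup. The following is a minimal Python sketch; the function name `classify_correlation` is hypothetical, and the cut-off values are simply the ones from the list in this lesson, not a universal standard.

```python
def classify_correlation(r: float) -> str:
    """Describe r using the bands listed in this lesson."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("a correlation coefficient must lie between -1 and 1")
    size = abs(r)
    # Values with |r| below 0.15 are treated as no correlation in the list.
    if size < 0.15:
        return "no correlation"
    direction = "positive" if r > 0 else "negative"
    if size == 1.0:
        strength = "perfect"
    elif size >= 0.75:
        strength = "strong"
    elif size >= 0.3:
        strength = "moderate"
    else:
        strength = "weak"
    return f"{strength} {direction} correlation"


for value in (1.0, 0.82, -0.5, 0.2, -0.1):
    print(value, "->", classify_correlation(value))
```

Note that the worked examples in this lesson also use the labels "very strong" and "very weak," which the list does not define, so the sketch does not produce them.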
Keep in mind that the correlation coefficient is useful only when a linear model describes the data well. Although a graphing calculator can find the correlation coefficient, it is also beneficial to learn how to calculate it by hand. Consider the following formula.

Extra

Formula for Finding the Pearson Correlation Coefficient
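Given a data set with n points (x_i, y_i), the Pearson correlation coefficient can be found by dividing the covariance of x and y by the product of their standard deviations.
\[ r = \frac{\operatorname{cov}(x, y)}{s_x \, s_y} \]
Alternatively, the formula can be rewritten as follows.
\[ r = \frac{n \sum x_i y_i - \sum x_i \sum y_i}{\sqrt{\left[ n \sum x_i^2 - \left( \sum x_i \right)^2 \right] \left[ n \sum y_i^2 - \left( \sum y_i \right)^2 \right]}} \]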
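As a rough check that the two ways of writing the formula agree, here is a minimal Python sketch. It computes r once as the covariance divided by the product of the standard deviations, and once from the sums of x, y, xy, x squared, and y squared. The function names and the five data points are made up for illustration and do not come from the lesson.

```python
from math import sqrt


def pearson_from_covariance(xs, ys):
    """r = cov(x, y) / (s_x * s_y), using sample (n - 1) statistics."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)
    s_x = sqrt(sum((x - mean_x) ** 2 for x in xs) / (n - 1))
    s_y = sqrt(sum((y - mean_y) ** 2 for y in ys) / (n - 1))
    return cov / (s_x * s_y)


def pearson_from_sums(xs, ys):
    """Equivalent form that uses only running sums of the data."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator


# Hypothetical data, chosen only to keep the arithmetic easy to follow.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

r_cov = pearson_from_covariance(xs, ys)
r_sums = pearson_from_sums(xs, ys)
print(round(r_cov, 4), round(r_sums, 4))
```

For these points the numerator is 5(66) - (15)(20) = 30 and the denominator is the square root of (50)(30) = 1500, so both functions give a value of about 0.7746.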

Although there are different types of correlation coefficients, the most commonly used is the Pearson correlation coefficient.

Knowing the correlation coefficient can give a clue about how well the regression line models the data. Nevertheless, it is always best practice to make a scatter plot before drawing any conclusions. As an example, each set in Anscombe's quartet has a correlation coefficient of approximately 0.816, implying a strong positive correlation.

Scatter Plot of the Anscombe's Quartet
However, a linear model does not seem to fit the second set of data well.
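The claim that the four sets share nearly the same statistics can be checked numerically. The Python sketch below uses the standard published values of Anscombe's quartet, which are assumed here to match the data shown in the applet. The helper `summarize` is a hypothetical name.

```python
from math import sqrt

# Standard published values of Anscombe's quartet (assumed to match the applet).
X123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
QUARTET = {
    "I": (X123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (X123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (X123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def summarize(xs, ys):
    """Return (mean_x, mean_y, slope, intercept, r) for one data set."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / sqrt(sxx * syy)
    return mean_x, mean_y, slope, intercept, r


results = {name: summarize(xs, ys) for name, (xs, ys) in QUARTET.items()}
for name, (mean_x, mean_y, slope, intercept, r) in results.items():
    print(f"Set {name}: mean x = {mean_x:.2f}, mean y = {mean_y:.2f}, "
          f"y = {slope:.2f}x + {intercept:.2f}, r = {r:.3f}")
```

Every set gives a line close to y = 0.5x + 3 and a correlation coefficient of about 0.816, even though the scatter plots look very different. The numbers alone cannot reveal that difference, which is why plotting the data matters.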

Example

Correlation Coefficient Analysis

Pair each scatter plot with the most appropriate correlation coefficient.

Four Scatter Plots: One with strong positive correlation, one with no correlation, one with strong negative correlation, and one with very strong negative correlation

Hint

Try drawing a line that fits each data set. If the points are grouped near the line, the correlation coefficient should be close to either -1 or 1. If the slope of the line is positive, the correlation coefficient is positive, and if the slope of the line is negative, the correlation coefficient is negative.

Solution

To pair each scatter plot with its corresponding correlation coefficient, analyze each scatter plot one at a time.

Scatter Plot A

The points in Scatter Plot A do not seem to follow a particular direction since they are randomly distributed. Different lines can be drawn; however, none of them seem to describe the relationship between the points.
Scatter Plot A and three different lines drawn. No line models the data well
Therefore, there is no correlation between the points of Scatter Plot A. That implies the correlation coefficient is close to 0. Considering the given options, only one value meets this condition.

Scatter Plot B

In Scatter Plot B, the points seem to follow a certain direction. As x increases, y tends to decrease, which implies a negative correlation. Here, a line with a negative slope can be drawn to model the data.

Scatter Plot B and one line with negative slope approaching the data

Hence, the correlation coefficient must also be negative. Considering the given options, there are two possible negative values. At this point, however, it cannot be concluded which one corresponds to Scatter Plot B.

Scatter Plot C

The points in Scatter Plot C follow an ascending trajectory: as x increases, y tends to increase as well. This relationship implies a positive correlation. Here, a line with a positive slope seems to be a good candidate for the line of best fit.

Scatter Plot C and one line with positive slope modeling the data very well

Therefore, the correlation coefficient is positive. Likewise, since all the points are close to the line, the correlation coefficient must be close to 1. Consequently, the best choice for the correlation coefficient of Scatter Plot C is the given option closest to 1.

Scatter Plot D

In Scatter Plot D, the points follow a descending direction: as x increases, y tends to decrease. This relationship implies a negative correlation. Like Scatter Plot B, a line with a negative slope is a good candidate to be the line of best fit.

Scatter Plot D and one line with negative slope approaching the data very well

Therefore, the correlation coefficient is negative. As before, there are two options.

When comparing Scatter Plots B and D, the points on D are more clustered and closer to following a linear pattern than those on Scatter Plot B. Thus, the correlation in Scatter Plot D is stronger than the correlation in Scatter Plot B. Consequently, the negative option closer to -1 is the most appropriate correlation coefficient for Scatter Plot D, and the remaining negative option corresponds to Scatter Plot B.

In the following table, each scatter plot is paired with the strength of its correlation.

Scatter Plot | Strength
A | Very weak positive correlation
B | Strong negative correlation
C | Very strong positive correlation
D | Very strong negative correlation

Pop Quiz

Finding the Correlation Coefficient

For each data set, find the correlation coefficient. Give the answer rounded to two decimal places.

random scatter plots and a table of values

Example

Predicting LeBron James' Weight

Maya works at an eSports company that has offered a customized gaming chair for each player on the Los Angeles Lakers season roster. The measurements of each player are listed in the following table.

Height (in)
Weight (lbs)
a Find the line of best fit and the correlation coefficient, rounded to three decimal places, for the given data. Then, classify its strength.
b Maya noticed that LeBron James' measurements are missing. She gave him a call and asked for his height and weight. Oh no! Before he could say his weight, the phone reception was lost! Unable to reach LeBron again, she decided to approximate his weight. Using LeBron's height, approximately what is LeBron's weight? Is it a good prediction?

Answer

a Line of Best Fit:

Correlation Coefficient:
Strength: Very strong positive correlation

b Approximated Weight: pounds

Is it a good prediction? Yes, by analyzing the scatter plot, a linear model seems to describe the data well enough. Also, the correlation coefficient is close to 1.

Hint

a Use a graphing calculator to find the line of best fit and the correlation coefficient. When the value of r is close to -1 or 1, the correlation is strong.
b To determine LeBron's weight, use the line of best fit to predict. To determine whether the prediction is good, make a scatter plot of the data. The prediction is good if a linear model describes the data well enough and the correlation is strong.

Solution

a The line of best fit and the correlation coefficient can be found using a graphing calculator. First, the data values need to be entered into the calculator. This is done by pressing the STAT button and selecting the option Edit.
The window in the calculator, which shows Stat and then Edit
Calculator that shows two lists where you entered values

Finally, by pressing the STAT button and then selecting the menu item CALC, the option LinReg() can be found. This option gives the line of best fit, expressed as a linear function in slope-intercept form, and the correlation coefficient.

When rounding to three decimal places, the equation for the line of best fit and the correlation coefficient are expressed as follows.

Based on the correlation coefficient, and without looking at the scatter plot of the data, it can be concluded that the correlation is very strong and positive.

b To predict LeBron's weight, the line of best fit found in Part A can be used. Substitute LeBron's height for x in the equation of the line and evaluate.
The result is LeBron's approximate weight in pounds. Although the data is strongly correlated, it is best practice to draw a scatter plot together with the line of best fit before deciding whether this prediction is good.
Scatter Plot of the data and line of Best Fit. The data is very close to the line

After analyzing the scatter plot, it can be concluded that the data is well described by a linear model. Therefore, the conclusion made in Part A is correct — the data has a very strong positive correlation. This would imply that the prediction of LeBron's weight is good enough. What clever work by Maya!

Scatter Plot with the estimated and actual weight of LeBron James

Example

Films Released per Year

Maya's eSports company is interested in promoting their games through movies. Her supervisor has asked her to study movie releases so they can prepare a business strategy. The number of films released by Century Fox per year is given in the following table.

Number of Movies [14,16,15,14,15,19,25,17,21,16,16,15,14,17,18,18,16,14,12]
a Using the given data, determine the line of best fit and the correlation coefficient. Be sure to round to three decimal places. Then, classify the strength of the correlation.
b Estimate the number of films released by Century Fox in by using the line of best fit. Is it a reliable estimation?

Answer

a Line of Best Fit:

Correlation Coefficient:
Strength: Very weak negative correlation

b Estimated Number of Films Released:

Is It Reliable? No, the correlation is very weak, as r is close to 0. Also, the scatter plot representing the given data and the scatter plot of the residuals reveal that a linear model is not appropriate to describe the data.

Hint

a Use a graphing calculator to find the line of best fit and the correlation coefficient. If r is close to -1 or 1, the correlation is strong; if r is close to zero, the correlation is weak or there is no correlation.
b To determine whether the estimation obtained with the line of best fit is good or not, make a scatter plot of the data. The estimation is good if the correlation is strong and a linear model describes the data well enough.

Solution

a With the help of a graphing calculator, the line of best fit and the correlation coefficient can be found. To start, press the STAT button, select the option Edit, and enter the data.
The window in the calculator, which shows Stat and then Edit
Calculator that shows two lists where you entered values

Next, press the STAT button, select the menu item CALC, and choose the option LinReg(). This option gives the line of best fit expressed as a linear equation in slope-intercept form. It also gives the correlation coefficient.

After rounding to three decimal places, the line of best fit and the correlation coefficient can be expressed as follows.
Because the correlation coefficient is both close to 0 and negative, the data has a very weak negative correlation.
b To estimate the number of films released by Century Fox in the requested year, evaluate the line of best fit found in Part A at that year.
Since the number of films needs to be an integer, the value given by the line of best fit is rounded to the nearest whole number. To determine whether the estimation is reliable, make a scatter plot of the data and draw the line of best fit.
Scatter plot of the movies released per year

The scatter plot reveals that the data is not well described by a linear model. This could be expected since the correlation between the data is very weak. Consequently, the estimation made is not reliable. In fact, the actual number of films released by Century Fox in was

Scatter plot of the movies released per year with the estimated number and the actual number

Additionally, by analyzing the scatter plot of the residuals, it can also be concluded that a linear model is not appropriate to describe the data.

Scatter plot of residuals

Note that several points lie far from the horizontal axis, which means their residuals are large. Hence, the line of best fit is not reliable for making predictions.
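The residual analysis can be sketched in Python using the film counts from the table. Because the years are not listed here, the sketch replaces them with the index 0, 1, 2, and so on, which assumes that the counts belong to consecutive years. Under that assumption the correlation coefficient and the residuals are the same as with the actual years; only the intercept of the line changes.

```python
from math import sqrt

# Film counts from the table. Year index 0..18 is an assumption
# (consecutive years); the actual years are not listed in the text.
films = [14, 16, 15, 14, 15, 19, 25, 17, 21, 16, 16, 15, 14, 17, 18, 18, 16, 14, 12]
years = list(range(len(films)))

n = len(films)
mean_x = sum(years) / n
mean_y = sum(films) / n
sxx = sum((x - mean_x) ** 2 for x in years)
syy = sum((y - mean_y) ** 2 for y in films)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(years, films))

slope = sxy / sxx
intercept = mean_y - slope * mean_x
r = sxy / sqrt(sxx * syy)

# A residual is the observed value minus the value predicted by the line.
residuals = [y - (slope * x + intercept) for x, y in zip(years, films)]

print(f"r = {r:.3f}, slope = {slope:.3f}")
print("largest residual:", round(max(residuals, key=abs), 2))
```

With this indexing, r comes out at about -0.127, and the year with 25 films sits more than 8 films above the line. A residual that large, compared with counts that mostly range from 12 to 21, is a sign that the line is a poor tool for prediction.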

Discussion

Correlation and Causation

There might be cases where a data set presents more than a correlation. For example, when a change in one variable directly affects the other, there is said to be causation.

Concept

Causation

Causation is a relationship between two quantities in which a change in one quantity directly affects the other. When there is causation between two data sets, they are said to have a causal relationship. Lastly, causation implies correlation.
Consider a situation where an employee earns an hourly wage. Since the number of hours worked directly affects the worker's income, there is both a positive correlation and a causal relationship. The more hours the employee works, the more income the worker makes.
A clock and the total income. As hours worked increases, the income increases.
However, the converse is not true. Two data sets can be correlated without having a causal relationship.
For example, consider the number of factories and the number of teachers in a city. There could be a positive correlation between these two numbers because both tend to increase as the city's population increases.
The number of factories and teachers in a city could present a positive correlation

However, it is unlikely that building more factories will cause an increase in the number of teachers. Thus, it can be said that there is no causal relationship. In this case, a third factor, population size, seems to directly affect the two data sets in question. That provides further evidence against a causal relationship between the number of factories and the number of teachers.

A better cause for the increase in the number of factories and teachers is the population size
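The effect of a hidden variable can be illustrated with a short Python sketch. All the numbers below are invented for illustration only: six hypothetical cities whose factory and teacher counts both grow roughly in proportion to population. The helper `pearson_r` is also a hypothetical name.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Hypothetical cities (made-up values, not real statistics).
population_thousands = [50, 80, 120, 200, 350, 500]
factories = [4, 7, 9, 17, 27, 41]
teachers = [410, 630, 990, 1580, 2850, 3960]

r_factories_teachers = pearson_r(factories, teachers)
r_population_factories = pearson_r(population_thousands, factories)
r_population_teachers = pearson_r(population_thousands, teachers)

print("factories vs teachers:  ", round(r_factories_teachers, 3))
print("population vs factories:", round(r_population_factories, 3))
print("population vs teachers: ", round(r_population_teachers, 3))
```

In this made-up data, factories and teachers are very strongly correlated with each other even though neither count was built from the other. Both simply follow population, which is the pattern described for the hidden variable above.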

Example

Video Games and Grades: Parents Against Video Games!

Before Maya's eSports company can consider expanding, there is a problem. A few parents funded a study against video games! This scatter plot shows the relationship between the number of hours a group of students spend playing video games each week and their grade point averages. The coordinates of each point can be viewed by hovering over it.
Data = {[2,3.7],[3,3.2],[4,3.4],[5,3.2],[6,3.3],[7,2.8],[8,2.9],[9,3.1],[10,2.7],[11,3.0],[12,2.8],[14,2.5],[15,2.8],[16,2.4],[18,2.2],[19,2]} first component is hours playing video games and second component grades
a Try to prove the study wrong. Otherwise, there goes time spent playing video games! Determine how strong the correlation is for their data.
b Is there a causal relationship? Explain the reasoning.

Hint

a Note that as the number of hours playing video games increases, the grade point average tends to decrease. Use a graphing calculator to find the correlation coefficient.
b Does the time spent playing video games directly affect the grades of a student? Could a student have good grades and play video games for many hours each week?

Solution

a From the scatter plot, the data shows a negative correlation: as the number of hours spent playing video games increases, the grade point average tends to decrease. In addition, the data seems to be well described by a linear model. By using a graphing calculator, both the line of best fit and the correlation coefficient can be found.
Data = {[2,3.7],[3,3.2],[4,3.4],[5,3.2],[6,3.3],[7,2.8],[8,2.9],[9,3.1],[10,2.7],[11,3.0],[12,2.8],[14,2.5],[15,2.8],[16,2.4],[18,2.2],[19,2]} first component is hours playing video games and second component grades; r=-0.92; y=-0.08x+3.65
From the graph and the correlation coefficient value, it can be concluded that the given data has a very strong negative correlation.
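The calculator result can be reproduced by hand. Here is a minimal Python sketch that applies the least-squares formulas to the sixteen points listed for the scatter plot. The variable names are illustrative only.

```python
from math import sqrt

# (hours of video games per week, grade point average), as listed for the plot.
data = [(2, 3.7), (3, 3.2), (4, 3.4), (5, 3.2), (6, 3.3), (7, 2.8), (8, 2.9),
        (9, 3.1), (10, 2.7), (11, 3.0), (12, 2.8), (14, 2.5), (15, 2.8),
        (16, 2.4), (18, 2.2), (19, 2.0)]

hours = [h for h, _ in data]
gpas = [g for _, g in data]

n = len(data)
mean_x = sum(hours) / n
mean_y = sum(gpas) / n
sxx = sum((x - mean_x) ** 2 for x in hours)
syy = sum((y - mean_y) ** 2 for y in gpas)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(hours, gpas))

slope = sxy / sxx                      # slope of the line of best fit
intercept = mean_y - slope * mean_x    # y-intercept of the line of best fit
r = sxy / sqrt(sxx * syy)              # Pearson correlation coefficient

print(f"y = {slope:.2f}x + {intercept:.2f}")
print(f"r = {r:.2f}")
```

Rounded to two decimal places, this gives y = -0.08x + 3.65 and r = -0.92, which agrees with the values shown on the graph.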
b Although the correlation between the data is strong, that does not necessarily indicate a causal relationship. Spending a lot of time playing video games does not directly cause a student to get lower grades. A student could play many hours each week and still have a high GPA.
However, as discussed previously, another factor might be influencing the GPA. It is likely that students who spend a lot of time playing video games spend less time studying, and therefore they tend to have lower GPAs.

Example

Temperature and the Electric Bill

After the study came out showing a negative impact on grades because of too many hours playing video games, Maya's eSports company lost revenue. They will now try to reduce their electric bill for their lights, computers, and air conditioning. Maya created a table that shows the monthly average outdoor temperature and last year's prices.

January February March April May June July August September October November December
Outdoor Temperature (°F)
Bill

Maya is told that the office was open the same number of hours every month and the electrical appliances of the office always had the same settings throughout the entire year.

a How strong is the correlation between the two data sets?
b If the office was open the same number of hours every month and the price of electricity was stable last year, is there a causal relationship between the outside temperature and the bills? Explain the reasoning.

Hint

a Note that as outdoor temperatures rise, bill amounts tend to increase. Use a graphing calculator to find the correlation coefficient.
b Does the outdoor temperature directly affect the bill amount? Notice that the office was open the same number of hours every month and the electrical appliances of the office always had the same settings. Also, the price of electricity fluctuated very little.

Solution

a With the help of a graphing calculator, the line of best fit and the correlation coefficient can be found. First, enter the data in the calculator. To do so, press the STAT button and select the option Edit.
The window in the calculator, which shows Stat and then Edit
Calculator that shows two lists where you entered values

Next, press the STAT button, select the menu item CALC, and choose the option LinReg(). This option gives the line of best fit, which is expressed as a linear equation in slope-intercept form. It also gives the correlation coefficient.

The correlation coefficient is almost 1. That means that the data has a very strong positive correlation. Also, as seen in the scatter plot of the data, as the temperature increases, the bill amount increases.

Scatter Plot Temperature vs Amount of the Bill
b Two conditions are given about the eSports office. Firstly, the same electrical appliances are always on and with the same settings. Secondly, the office was open the same number of hours each month and the price of electricity was stable.
Under these conditions, the electricity consumption should have been the same every month. That would then indicate that the bill amounts should have been roughly the same. However, when analyzing the scatter plot in Part A, it is seen that the bill amounts increased when the temperature increased.
Scatter Plot Temperature vs Price Bill
Therefore, the outdoor temperature seems to influence the bill amount, and thus, there is a causal relationship. This seems logical because when the outdoor temperature increases, the air conditioner has to work harder to maintain the interior temperature, generating more electricity consumption.
Air Conditioning and Thermometer. As the temperature increases, the air conditioner works harder
Notice that the points do not perfectly follow a linear model; this can be due to the electricity price, which was stable but probably fluctuated a bit from month to month.

Closure

Causation: Looking for Hidden Variables

Notice that the correlation between the given data was very strong in the last two examples, Video Games vs. Grades and Temperature vs. the Electric Bill. However, a causal relationship could be established only in the second example. Therefore, it is always important to make a general study of the situation and look for hidden variables when analyzing data.

Maya says: When I want to chat from my phone, most of the time, the texting app fails and it freezes the phone. So, the cause of the freezes is the texting app. I'm going to uninstall it.

Although the app used for text messaging might have some kind of error that causes the phone to freeze, there is likely another cause for the freezing. For example, it could be that the phone is running several apps in the background and freezes due to a lack of random-access memory (RAM).

Image of a Phone. Lack of RAM is the cause of the texting app failing and the phone freezes.
Also, it might be that the app used for text messaging is the one that Maya uses the most. If that is the case, it is more likely that when her phone freezes, she will be coincidentally using the text messaging app. To avoid misinterpreting causal relations, it is essential to always look for hidden variables before coming to a conclusion.