Looking for a correlation between data sets can help to determine if different phenomena are related. If a correlation exists, it is important to be able to measure it. This lesson will explore one of the quantities that measure the strength of a correlation. Also, it will investigate whether or not one data set is directly affected by another data set.

### Catch-Up and Review

Here are a few recommended readings before getting started with this lesson.

## Anscombe's Quartet

Consider the following four data sets, known as Anscombe's quartet. Notice that the first three sets have the same $x$-values. Explore the options of the following applet to investigate and compare the characteristics of each set. The four sets have nearly the same mean, standard deviation, and line of regression. Therefore, it seems reasonable to expect their scatter plots to look similar. Use the following graph to view each scatter plot, and be ready for a surprise. Given the contrast between what the statistical characteristics suggest and what the diagrams show, what conclusion can be drawn about analyzing data sets in general?

## Strength of Correlation

When a graphing calculator finds the line of best fit of a data set, it usually also reports a constant that measures the strength of the correlation. This constant is called the correlation coefficient.

## Correlation Coefficient

The correlation coefficient, usually denoted by $r$, measures the direction and strength of a linear relationship between two variables. It can take on values between $-1$ and $1$. Values near $-1$ mean that the correlation is strong and negative, while values close to $1$ mean that it is strong and positive. Values close to $0$ represent a weak or very weak correlation, and $r = 0$ represents no linear correlation. When there is a linear model that describes the relationship between two variables well, the correlation coefficient indicates how close the points are to the line of best fit. The closer the value is to $-1$ or $1$, the closer the points are to the line of best fit. Keep in mind that the correlation coefficient is useful only when a linear model describes the data well. Although a graphing calculator can find the correlation coefficient, it is also beneficial to learn how to calculate it by hand. Consider the following formula.

### Extra

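Formula for Finding the Pearson Correlation Coefficient
Given a data set with $n$ points $(x_i, y_i)$, the Pearson correlation coefficient can be found by dividing the covariance of $x$ and $y$ by the product of their standard deviations $s_x$ and $s_y$.

$$r = \frac{\operatorname{cov}(x, y)}{s_x \, s_y}$$

Alternatively, the formula can be rewritten as follows, where $\bar{x}$ and $\bar{y}$ are the means of the $x$- and $y$-values.

$$r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2} \, \sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}$$

Although there are different types of correlation coefficients, the most commonly used is the Pearson correlation coefficient.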
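The description above (covariance divided by the product of the standard deviations) can be traced step by step in code. The following is a minimal Python sketch and not part of the original lesson. The function name `pearson_r` and the sample points are made up for illustration, and the sketch assumes the sample versions of covariance and standard deviation, which divide by $n - 1$.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation: covariance of x and y over the product of their standard deviations."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equally long lists with at least two points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sample covariance and sample standard deviations (all divide by n - 1).
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / (n - 1))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / (n - 1))
    return cov_xy / (sd_x * sd_y)

# Hypothetical points, chosen only to illustrate the calculation.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
print(round(pearson_r(xs, ys), 3))  # 0.775
```

Because the factors of $n - 1$ cancel, dividing by $n$ instead would give the same value of $r$.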

Knowing the correlation coefficient can give a clue about how well the regression line models the data. Nevertheless, it is always best practice to make a scatter plot before drawing any conclusions. As an example, each set in Anscombe's quartet has a correlation coefficient of about $0.816$, implying a strong positive correlation. However, a linear model does not seem to fit the second set of data well.
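The claim about Anscombe's quartet can be checked numerically. The sketch below is an illustration rather than part of the lesson: it uses the published quartet values, while the helper name `summarize` and the set labels I to IV are choices made here. It computes the least-squares line and the correlation coefficient for each set.

```python
import math

# Anscombe's quartet (published data; the first three sets share their x-values).
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

def summarize(xs, ys):
    """Return (slope, intercept, r) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / math.sqrt(sxx * syy)
    return slope, intercept, r

for name, (xs, ys) in quartet.items():
    slope, intercept, r = summarize(xs, ys)
    print(f"Set {name}: y = {slope:.2f}x + {intercept:.2f}, r = {r:.3f}")
```

All four sets give nearly the same line and nearly the same value of $r$, even though their scatter plots look very different. The summary numbers alone cannot reveal that difference.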

## Correlation Coefficient Analysis

Pair each scatter plot with the most appropriate correlation coefficient.

### Hint

Try drawing a line that fits each data set. If the points are grouped near the line, the correlation coefficient should be close to either $-1$ or $1$. If the slope of the line is positive, the correlation coefficient is positive, and if the slope of the line is negative, the correlation coefficient is negative.

### Solution

To pair each scatter plot with its corresponding correlation coefficient, analyze each scatter plot one at a time.

### First Scatter Plot

The points in the first scatter plot do not seem to follow a particular direction since they are randomly distributed. Different lines can be drawn; however, none of them seem to describe the relationship between the points. Therefore, there is no clear correlation between the points of this scatter plot. That implies the correlation coefficient is close to $0$. Among the given options, only one value meets this condition.

### Second Scatter Plot

In the second scatter plot, the points seem to follow a certain direction. As $x$ increases, $y$ tends to decrease, which implies a negative correlation. Here, a line with a negative slope can be drawn to model the data. Hence, the correlation coefficient must also be negative. Among the given options, there are two negative values. At this point, however, it cannot be concluded which of them corresponds to this scatter plot.

### Third Scatter Plot

The points in the third scatter plot follow an ascending trajectory: as $x$ increases, $y$ tends to increase as well. This relationship implies a positive correlation. Here, a line with a positive slope seems to be a good candidate for the line of best fit. Therefore, the correlation coefficient is positive. Likewise, since all the points are close to the line, the correlation coefficient must be close to $1$. Consequently, the best choice for this scatter plot is the option closest to $1$.

### Fourth Scatter Plot

In the fourth scatter plot, the points follow a descending direction: as $x$ increases, $y$ tends to decrease. This relationship implies a negative correlation. As in the second scatter plot, a line with a negative slope is a good candidate for the line of best fit. Therefore, the correlation coefficient is negative. As before, there are two options.

When comparing the second and fourth scatter plots, the points on the fourth are more clustered and closer to following a linear pattern than those on the second. Thus, the correlation in the fourth scatter plot is stronger than the correlation in the second. Consequently, the negative option closer to $-1$ is the most appropriate correlation coefficient for the fourth scatter plot, and the remaining negative option corresponds to the second.

In the following table, each scatter plot, listed in order of appearance, is paired with the strength of its correlation.

| Scatter Plot | Strength |
|---|---|
| First | Very weak positive correlation |
| Second | Strong negative correlation |
| Third | Very strong positive correlation |
| Fourth | Very strong negative correlation |
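The strength labels used in the table can be sketched as a small classifier. The lesson does not state where one label ends and the next begins, so the cutoffs below are assumptions made for this illustration; textbooks differ on the exact boundaries. The function name `describe_correlation` is also hypothetical.

```python
def describe_correlation(r):
    """Label the direction and strength of a correlation coefficient r in [-1, 1]."""
    if not -1 <= r <= 1:
        raise ValueError("r must be between -1 and 1")
    if r == 0:
        return "no correlation"
    size = abs(r)
    # Assumed cutoffs; the lesson itself does not define them.
    if size >= 0.9:
        strength = "very strong"
    elif size >= 0.7:
        strength = "strong"
    elif size >= 0.4:
        strength = "moderate"
    elif size >= 0.2:
        strength = "weak"
    else:
        strength = "very weak"
    direction = "positive" if r > 0 else "negative"
    return f"{strength} {direction} correlation"

print(describe_correlation(-0.95))  # very strong negative correlation
print(describe_correlation(0.1))    # very weak positive correlation
```

A label like this is only meaningful when a scatter plot confirms that a linear model describes the data reasonably well.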

## Finding the Correlation Coefficient

For each data set, find the correlation coefficient. Give the answer rounded to two decimal places.

## Predicting LeBron James' Weight

Maya works at an eSports company that has offered a customized gaming chair for each player on the Los Angeles Lakers season roster. The measurements of each player are listed in the following table.

a Find the line of best fit and the correlation coefficient, rounded to three decimal places, for the given data. Then, classify its strength.
b Maya noticed that LeBron James' measurements are missing. She gave him a call and asked for his height and weight. Oh no! Before he could say his weight, the phone reception was lost! Unable to reach LeBron again, she decided to approximate his weight. Given LeBron's height in inches, approximately what is his weight? Is it a good prediction?

a Line of Best Fit:

Correlation Coefficient:
Strength: Very strong positive correlation

b Approximated Weight: pounds

Is it a good prediction? Yes, by analyzing the scatter plot, a linear model seems to describe the data well enough. Also, the correlation coefficient is close to $1$.

### Hint

a Use a graphing calculator to find the line of best fit and the correlation coefficient. When the value of $r$ is close to $-1$ or $1$, the correlation is strong.
b To approximate LeBron's weight, use the line of best fit. To determine whether the prediction is good, make a scatter plot of the data. The prediction is good if a linear model describes the data well enough and the correlation is strong.

### Solution

a The line of best fit and the correlation coefficient can be found using a graphing calculator. First, the data values need to be entered into the calculator. This is done by pressing the STAT button and selecting the option Edit. Then, by pressing the STAT button again and selecting the menu item CALC, the option LinReg() can be found. This option gives the line of best fit, expressed as a linear function in slope-intercept form, and the correlation coefficient. When rounding to three decimal places, the equation for the line of best fit and the correlation coefficient are expressed as follows.

Based on the correlation coefficient, and without looking at the scatter plot of the data, it can be concluded that the correlation is very strong and positive.
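What the calculator's linear regression option computes can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the calculator's actual implementation: the function name `lin_reg` is made up, and the heights and weights below are hypothetical values, not the roster data from the lesson.

```python
import math

def lin_reg(xs, ys):
    """Least-squares line y = a*x + b together with the correlation coefficient r."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx            # slope
    b = mean_y - a * mean_x  # y-intercept
    r = sxy / math.sqrt(sxx * syy)
    return a, b, r

# Hypothetical heights (inches) and weights (pounds), for illustration only.
heights = [72, 74, 75, 77, 78, 80, 82]
weights = [185, 195, 200, 215, 220, 235, 250]

a, b, r = lin_reg(heights, weights)
print(f"y = {a:.3f}x + {b:.3f}, r = {r:.3f}")

# Evaluating the line at a height gives a predicted weight (79 is a made-up input).
print(round(a * 79 + b))
```

The same function works for any paired data set, so it can be used to check a calculator result by hand-entered values.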

b To predict LeBron's weight, the line of best fit found in Part A can be used. To do so, substitute LeBron's height into the equation of the line and evaluate. The resulting value approximates LeBron's weight in pounds. Despite the data being strongly correlated, to decide whether this prediction is good, it is best practice to draw a scatter plot and the line of best fit. After analyzing the scatter plot, it can be concluded that the data is well described by a linear model. Therefore, the conclusion made in Part A is correct: the data has a very strong positive correlation. This implies that the prediction of LeBron's weight is good enough. What clever work by Maya!

## Films Released per Year

Maya's eSports company is interested in promoting its games through movies. Her supervisor has asked her to study movie releases so they can prepare a business strategy. The number of films released per year by Century Fox is given in the following table.

a Using the given data, determine the line of best fit and the correlation coefficient. Be sure to round to three decimal places. Then, classify the strength of the correlation.
b Estimate the number of films released by Century Fox in by using the line of best fit. Is it a reliable estimation?

a Line of Best Fit:

Correlation Coefficient:
Strength: Very weak negative correlation

b Estimated Number of Films Released:

Is It Reliable? No, the correlation is very weak, as the correlation coefficient is close to $0$. Also, the scatter plot representing the given data and the scatter plot of the residuals reveal that a linear model is not appropriate to describe the data.

### Hint

a Use a graphing calculator to find the line of best fit and the correlation coefficient. If $r$ is close to $-1$ or $1$, the correlation is strong; if $r$ is close to zero, the correlation is weak or there is no correlation.
b To determine whether the estimation obtained with the line of best fit is good or not, make a scatter plot of the data. The estimation is good if the correlation is strong and a linear model describes the data well enough.

### Solution

a With the help of a graphing calculator, the line of best fit and the correlation coefficient can be found. To start, press the STAT button, select the option Edit, and enter the data. Next, press the STAT button again, select the menu item CALC, and choose the option LinReg(). This option gives the line of best fit expressed as a linear equation in slope-intercept form. It also gives the correlation coefficient. After rounding to three decimal places, the line of best fit and the correlation coefficient can be expressed as follows.
Because the correlation coefficient is both close to $0$ and negative, the data has a very weak negative correlation.
b To estimate the number of films released by Century Fox in the year in question, evaluate the line of best fit found in Part A at that year.
Since the number of films needs to be an integer, the estimate is taken as a whole number of films. To determine whether the estimation is reliable, make a scatter plot of the data and draw the line of best fit. The scatter plot reveals that the data is not well described by a linear model. This could be expected since the correlation between the data is very weak. Consequently, the estimation made is not reliable. In fact, the actual number of films released that year differs from the estimate. Additionally, by analyzing the scatter plot of the residuals, it can also be concluded that a linear model is not appropriate to describe the data. Note that there are points that are not close to the horizontal axis. Hence, the line of best fit is not reliable for making predictions.
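The residual check described above can be sketched in Python. A residual is the observed value minus the value predicted by the line of best fit. The data below are hypothetical yearly counts that rise and then fall, chosen so that no straight line fits well; they are not the film data from the lesson, and the function name `fit_line` is made up for this illustration.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Hypothetical data: year index and number of releases, for illustration only.
years = [1, 2, 3, 4, 5, 6, 7]
counts = [10, 16, 20, 22, 20, 16, 10]

slope, intercept = fit_line(years, counts)
# Residual = observed value minus the value predicted by the line.
residuals = [y - (slope * x + intercept) for x, y in zip(years, counts)]
print([round(e, 2) for e in residuals])
```

Here the residuals are negative at both ends and positive in the middle. A systematic pattern like this, rather than a random scatter around zero, suggests that a linear model is not appropriate.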

## Correlation and Causation

There might be cases where a data set presents more than a correlation. For example, when a change in one variable directly affects the other, there is said to be causation.

## Causation

Causation is a relationship between two quantities where one quantity is affected by the other. That is, a change in one quantity directly affects the other. When there is causation between two data sets, they are said to have a causal relationship. Lastly, causation implies correlation.
Consider a situation where an employee earns an hourly wage. Since the number of hours worked directly affects the worker's income, there is both a positive correlation and a causal relationship. The more hours the employee works, the more income the worker makes. However, the converse is not true: correlation does not imply causation. While two data sets can be correlated, they might not have a causal relationship.
For example, consider the number of factories and the number of teachers in a city. There could be a positive correlation between these two numbers because both tend to increase as the city's population increases. However, it is unlikely that building more factories will cause an increase in the number of teachers. Thus, it can be said that there is no causal relationship. In this case, a third factor, population size, seems to directly affect the two data sets in question. That provides further evidence against a causal relationship between the number of factories and the number of teachers.

## Video Games and Grades: Parents Against Video Games!

Before Maya's eSports company can consider expanding, there is a problem. A few parents funded a study against video games! This scatter plot shows the relationship between the number of hours a group of students spend playing video games each week and their grade point averages.

a Try to prove the study wrong. Otherwise, there goes time spent playing video games! Determine how strong the correlation is for their data.
b Is there a causal relationship? Explain the reasoning.

### Hint

a Note that as the number of hours playing video games increases, the grade point average tends to decrease. Use a graphing calculator to find the correlation coefficient.
b Does the time spent playing video games directly affect the grades of a student? Could a student have good grades and play video games for many hours each week?

### Solution

a From the scatter plot, the data shows a negative correlation: as the number of hours spent playing video games increases, the grade point average tends to decrease. In addition, the data seems to be well described by a linear model. By using a graphing calculator, both the line of best fit and the correlation coefficient can be found. From the graph and the correlation coefficient value, it can be concluded that the given data has a very strong negative correlation.
b Although the correlation between the data is strong, that does not necessarily indicate a causal relationship. Spending a lot of time playing video games does not directly cause a student to get lower grades. A student could play many hours each week and still have a high GPA.
However, as discussed previously, another factor might be influencing the GPA. It is likely that students who spend a lot of time playing video games spend less time studying, and therefore they tend to have lower GPAs.

## Temperature and the Electric Bill

After the study came out showing a negative impact on grades because of too many hours playing video games, Maya's eSports company lost revenue. They will now try to reduce their electric bill for their lights, computers, and air conditioning. Maya created a table that shows the monthly average outdoor temperature and the monthly bill amounts from last year.

January February March April May June July August September October November December
Outdoor Temperature (°F)
Bill

Maya is told that the office was open the same number of hours every month and the electrical appliances of the office always had the same settings throughout the entire year.

a How strong is the correlation between the two data sets?
b If the office was open the same number of hours every month and the price of electricity was stable last year, is there a causal relationship between the outside temperature and the bills? Explain the reasoning.

### Hint

a Note that as outdoor temperatures rise, bill amounts tend to increase. Use a graphing calculator to find the correlation coefficient.
b Does the outdoor temperature directly affect the bill amount? Notice that the office was open the same number of hours every month and the electrical appliances of the office always had the same settings. Also, the price of electricity was stable.

### Solution

a With the help of a graphing calculator, the line of best fit and the correlation coefficient can be found. First, enter the data in the calculator. To do so, press the STAT button and select the option Edit. Next, press the STAT button again, select the menu item CALC, and choose the option LinReg(). This option gives the line of best fit, which is expressed as a linear equation in slope-intercept form. It also gives the correlation coefficient. The correlation coefficient is almost $1$. That means that the data has a very strong positive correlation. Also, as seen in the scatter plot of the data, as the temperature increases, the bill amount increases.
b Two conditions are given about the eSports office. Firstly, the same electrical appliances are always on and with the same settings. Secondly, the office was open the same number of hours each month and the price of electricity was stable.
Under these conditions, the electricity consumption should have been the same every month. That would then indicate that the bill amounts should have been roughly the same. However, when analyzing the scatter plot in Part A, it is seen that the bill amounts increased when the temperature increased. Therefore, the outdoor temperature seems to influence the bill amount, and thus, there is a causal relationship. This seems logical because when the outdoor temperature increases, the air conditioner has to work harder to maintain the interior temperature, which consumes more electricity. Notice that the points do not perfectly follow a linear model; this can be due to the electricity price, which was stable but probably fluctuated a bit from month to month.

## Causation: Looking for Hidden Variables

Notice that the correlation between the given data was very strong in the last two examples, Video Games vs. Grades and Temperature vs. the Electric Bill. However, a causal relationship could be established only in the second example. Therefore, it is always important to make a general study of the situation and look for hidden variables when analyzing data.

As another example, suppose Maya's phone tends to freeze while she is using her text messaging app. Although the app might have some kind of error that causes the phone to freeze, there is likely another cause for the freezing. For example, it could be that the phone is running several apps in the background, and it freezes due to a lack of random-access memory (RAM). Also, it might be that the text messaging app is the one that Maya uses the most. If that is the case, it is more likely that when her phone freezes, she will coincidentally be using the text messaging app. To avoid misinterpreting causal relations, it is essential to always look for hidden variables before coming to a conclusion.
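The factories and teachers example from earlier in the lesson shows how a hidden variable can produce a strong correlation without causation, and this can be sketched numerically. Everything below is hypothetical: the city populations, the ratios, and the small offsets are invented for illustration. Both counts are computed from population alone, and neither one is computed from the other.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical cities: population is the hidden variable behind both counts.
populations = [50_000, 120_000, 200_000, 350_000, 500_000, 800_000]
factory_offsets = [1, -2, 3, 0, -4, 2]
teacher_offsets = [40, -60, 25, -80, 90, -30]

# Assumed ratios: about one factory per 10,000 people and one teacher per 100 people.
factories = [p // 10_000 + d for p, d in zip(populations, factory_offsets)]
teachers = [p // 100 + d for p, d in zip(populations, teacher_offsets)]

print(round(pearson_r(factories, teachers), 3))
```

The resulting coefficient is close to $1$ even though, by construction, the number of factories has no direct effect on the number of teachers.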