| 11 Theory slides |
| 8 Exercises - Grade E - A |
| Each lesson is meant to take 1-2 classroom sessions |
When finding the line of best fit of a data set, a graphing calculator usually also reports a constant r that measures the strength of the correlation. This constant is called the correlation coefficient.
Although there are different types of correlation coefficients, the most commonly used is the Pearson correlation coefficient.
Knowing the correlation coefficient can give a clue about how well the regression line models the data. Nevertheless, it is always best practice to make a scatter plot before drawing any conclusions. As an example, each set in Anscombe's quartet has a correlation coefficient of about 0.816, implying a very strong positive correlation, yet the four scatter plots look very different from one another.
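As a rough illustration of what the calculator computes, the Pearson correlation coefficient can be sketched in a few lines of Python. The function name `pearson_r` is a hypothetical helper, and the data are the first two sets of Anscombe's quartet as commonly published (they are not listed in this lesson), so treat this as a sketch rather than a reference implementation.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of products of deviations from the means
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_yy = sum((y - mean_y) ** 2 for y in ys)
    return s_xy / sqrt(s_xx * s_yy)

# Assumed data: Anscombe's sets I and II, which share the same x-values
x = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y1 = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
y2 = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]

r1 = pearson_r(x, y1)
r2 = pearson_r(x, y2)
print(round(r1, 3), round(r2, 3))
```

Both values should come out near 0.816 even though the second set follows a curve rather than a line, which is why the scatter plot still matters.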
Pair each scatter plot with the most appropriate correlation coefficient.
Try drawing a line that fits each data set. If the points are grouped near the line, the correlation coefficient should be close to either 1 or -1. If the slope of the line is positive, the correlation coefficient is positive, and if the slope of the line is negative, the correlation coefficient is negative.
To pair each scatter plot with its corresponding correlation coefficient, analyze each scatter plot one at a time.
Scatter Plot A⟶r=0.08
In Scatter Plot B, the points seem to follow a certain direction. As x increases, y tends to decrease, which implies a negative correlation. Here, a line with a negative slope can be drawn to model the data.
Hence, the correlation coefficient must also be negative. Considering the given options, there are two possible values for r: -0.92 or -0.65. At this time, however, it cannot be concluded which corresponds to Scatter Plot B.
The points in Scatter Plot C follow an ascending trajectory — as x increases, y tends to increase as well. This relationship implies a positive correlation. Here, a line with a positive slope seems to be a good candidate for the line of best fit.
Therefore, the correlation coefficient is positive. Likewise, since all the points are close to the line, the correlation coefficient must be close to 1. Consequently, the best choice for the correlation coefficient of Scatter Plot C is r=0.84.
Scatter Plot C⟶r=0.84
In Scatter Plot D, the points follow a descending direction — as x increases, y tends to decrease. This relationship implies a negative correlation. Like Scatter Plot B, a line with negative slope is a good candidate to be the line of best fit.
Therefore, the correlation coefficient is negative. As before, there are two options.
When comparing Scatter Plots B and D, the points on D are more clustered and closer to following a linear pattern than those on Scatter Plot B. Thus, the correlation in Scatter Plot D is stronger than the correlation in Scatter Plot B. Consequently, r=-0.92 is the most appropriate correlation coefficient for Scatter Plot D, which leaves r=-0.65 for Scatter Plot B.
In the following table, each scatter plot is paired with its correlation coefficient, and the strength of the correlation is described.
Scatter Plot | Correlation Coefficient | Strength |
---|---|---|
A | 0.08 | Very weak positive correlation |
B | -0.65 | Strong negative correlation |
C | 0.84 | Very strong positive correlation |
D | -0.92 | Very strong negative correlation |
For each data set, find the correlation coefficient. Give the answer rounded to two decimal places.
Maya works at an eSports company that has offered a customized gaming chair for each player on the Los Angeles Lakers 2020–2021 season roster. The measurements of each player are listed in the following table.
Height (in) | 72.83 | 74.02 | 75.20 | 75.20 | 75.20 | 75.98 | 75.98 | 75.98 | 75.98 | 77.17 | 77.17 | 77.17 | 79.13 | 79.92 | 81.89 | 81.98 | 83.07 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Weight (lbs) | 178.57 | 189.60 | 182.98 | 198.42 | 198.42 | 194.01 | 205.03 | 233.69 | 218.26 | 178.57 | 213.85 | 196.21 | 235.90 | 213.85 | 251.33 | 264.56 | 264.56 |
Correlation Coefficient: r=0.848
Strength: Very strong positive correlation
Is it a good prediction? Yes, by analyzing the scatter plot, a linear model seems to describe the data well enough. Also, the correlation coefficient is close to 1.
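The correlation coefficient can be found using a graphing calculator. First, enter the data values by pressing the STAT button and selecting the option Edit.
Finally, by pressing the STAT button and then selecting the menu item CALC, the option LinReg(ax+b) can be found. This option gives a line of best fit, expressed as a linear function in slope-intercept form, and the correlation coefficient.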
When rounding to three decimal places, the equation for the line of best fit and the correlation coefficient are expressed as follows.
Based on the correlation coefficient, and without looking at the scatter plot of the data, it can be concluded that the correlation is very strong and positive.
After analyzing the scatter plot, it can be concluded that the data is well described by a linear model. Therefore, the conclusion made in Part A is correct — the data has a very strong positive correlation. This would imply that the prediction of LeBron's weight is good enough. What clever work by Maya!
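For readers without a graphing calculator, the same least-squares fit can be sketched in Python using the height and weight values from the table. The helper name `fit_line` is hypothetical; this is an illustration under that assumption, not the calculator's actual routine.

```python
from math import sqrt

# Heights (in) and weights (lbs) copied from the roster table
heights = [72.83, 74.02, 75.20, 75.20, 75.20, 75.98, 75.98, 75.98, 75.98,
           77.17, 77.17, 77.17, 79.13, 79.92, 81.89, 81.98, 83.07]
weights = [178.57, 189.60, 182.98, 198.42, 198.42, 194.01, 205.03, 233.69,
           218.26, 178.57, 213.85, 196.21, 235.90, 213.85, 251.33, 264.56,
           264.56]

def fit_line(xs, ys):
    """Return (slope, intercept, r) of the least-squares line y = slope*x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_yy = sum((y - mean_y) ** 2 for y in ys)
    slope = s_xy / s_xx
    intercept = mean_y - slope * mean_x
    r = s_xy / sqrt(s_xx * s_yy)
    return slope, intercept, r

slope, intercept, r = fit_line(heights, weights)
print(f"r = {r:.3f}")
```

The printed value should agree with the r=0.848 reported above, up to rounding.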
Maya's eSports company is interested in promoting their games through movies. Her supervisor has asked her to study movie releases so they can prepare a business strategy. The number of films released by 20th Century Fox from 2000 to 2018 is given in the following table.
Correlation Coefficient: r=-0.127
Strength: Very weak negative correlation
Is It Reliable? No, the correlation is very weak, as r=-0.127. Also, the scatter plot representing the given data and the scatter plot of the residuals reveal that a linear model is not appropriate to describe the data.
First, press the STAT button, select the option Edit, and enter the data.
Next, press the STAT button, select the menu item CALC, and choose the option LinReg(ax+b). This option gives the line of best fit expressed as a linear equation in slope-intercept form. It also gives the correlation coefficient.
The scatter plot reveals that the data is not well described by a linear model. This could be expected since the correlation between the data is very weak. Consequently, the estimation made is not reliable. In fact, the actual number of films released by 20th Century Fox in 2019 was 10.
Additionally, by analyzing the scatter plot of the residuals, it can also be concluded that a linear model is not appropriate to describe the data.
Note that several points are far from the x-axis. Hence, the line of best fit is not reliable for making predictions.
There might be cases where a data set presents more than a correlation. For example, when a change in one variable directly causes a change in the other, the relationship is called causation.
However, it is unlikely that building more factories will cause an increase in the number of teachers. Thus, it can be said that there is no causal relationship. In this case, a third factor, population size, seems to directly affect the two data sets in question. That provides further evidence against a causal relationship between the number of factories and the number of teachers.
After the study came out showing a negative impact on grades because of too many hours playing video games, Maya's eSports company lost revenue. They will now try to reduce their electric bill for their lights, computers, and air conditioning. Maya created a table that shows the average outdoor temperature and the electric bill for each month of last year.
Month | January | February | March | April | May | June | July | August | September | October | November | December |
---|---|---|---|---|---|---|---|---|---|---|---|---|
Outdoor Temperature (°F) | 70 | 74 | 80 | 82 | 86 | 88 | 92 | 90 | 84 | 78 | 76 | 72 |
Bill ($) | 185 | 220 | 260 | 263 | 275 | 280 | 310 | 290 | 272 | 240 | 230 | 194 |
Maya is told that the office was open the same number of hours every month and the electrical appliances of the office always had the same settings throughout the entire year.
First, press the STAT button, select the option Edit, and enter the data.
Next, press the STAT button, select the menu item CALC, and choose the option LinReg(ax+b). This option gives the line of best fit, which is expressed as a linear equation in slope-intercept form. It also gives the correlation coefficient.
The correlation coefficient is almost 1. That means that the data has a very strong positive correlation. Also, as seen in the scatter plot of the data, as the temperature increases, the bill increases as well.
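The claim that the coefficient is almost 1 can be checked directly from the table. The short Python sketch below computes the Pearson coefficient by hand; the variable names are illustrative only.

```python
from math import sqrt

# Monthly values copied from the table (January through December)
temps = [70, 74, 80, 82, 86, 88, 92, 90, 84, 78, 76, 72]
bills = [185, 220, 260, 263, 275, 280, 310, 290, 272, 240, 230, 194]

n = len(temps)
mean_t = sum(temps) / n
mean_b = sum(bills) / n

# Deviation sums used in the Pearson formula
s_tb = sum((t - mean_t) * (b - mean_b) for t, b in zip(temps, bills))
s_tt = sum((t - mean_t) ** 2 for t in temps)
s_bb = sum((b - mean_b) ** 2 for b in bills)

r = s_tb / sqrt(s_tt * s_bb)
print(f"r = {r:.2f}")
```

A value this close to 1 only measures how linear the relationship is. The causal reading still rests on the context, namely that the opening hours and appliance settings stayed the same all year.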
Notice that the correlation between the given data was very strong in the last two examples, Video Games vs. Grades and Temperature vs. the Electric Bill. However, a causal relationship could be established only in the second example. Therefore, it is always important to make a general study of the situation and look for hidden variables when analyzing data.
Although the app used for text messaging might have some kind of error that causes the phone to freeze, there is likely another cause for the freezing. For example, it could be that the phone is running several apps in the background, and it freezes due to a lack of random-access memory (RAM).
Magdalena will create a model that estimates how much newborn girls weigh during their first six months.
To make the model, she collected some statistics from a stressed nurse at a hospital who has just weighed seven babies of different ages.
👶🏻 Newborn Girls 👶🏽 | |||||||
---|---|---|---|---|---|---|---|
Months | 0 | 1 | 2 | 3 | 4 | 5 | 6 |
Weight (lbs) | 7.01 | 7.89 | 17.85 | 10.58 | 11.73 | 13.51 | 15.06 |
The line of best fit can be found using a graphing calculator. First, let's enter the data values into our calculator. This is done by pressing the STAT button and selecting the option Edit.
Next, we press the STAT button, and then select the menu item CALC. There, we can find the LinReg(ax+b) option. This option gives us a line of best fit, expressed as a linear function in slope-intercept form, and the correlation coefficient.
By rounding each coefficient to two decimal places, the equation for the line of best fit is expressed as follows.
Line of best fit: y = 1.05x + 8.81
The strength of the correlation is measured by the correlation coefficient. Let's recall the classifications.
Correlation Strength | |
---|---|
Value | Strength |
|r|=0 | No correlation |
0 < |r| < 0.2 | Very Weak |
0.2 ≤ |r| < 0.4 | Weak |
0.4 ≤ |r| < 0.6 | Moderate |
0.6 ≤ |r| < 0.8 | Strong |
0.8 ≤ |r| < 1 | Very Strong |
|r|=1 | Perfect |
From part A, we know that the correlation coefficient of the data collected by Magdalena is r=0.58. This means the strength of the correlation is moderate.
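The classification table can also be written as a small lookup. In the Python sketch below, the function name `correlation_strength` is hypothetical and the thresholds are copied from the table above.

```python
def correlation_strength(r):
    """Classify a correlation coefficient using the table's |r| thresholds."""
    size = abs(r)
    if size > 1:
        raise ValueError("a correlation coefficient must lie between -1 and 1")
    if size == 0:
        return "No correlation"
    if size == 1:
        return "Perfect"
    # Upper bounds are exclusive, matching rows such as 0.4 <= |r| < 0.6
    for upper, label in [(0.2, "Very Weak"), (0.4, "Weak"),
                         (0.6, "Moderate"), (0.8, "Strong")]:
        if size < upper:
            return label
    return "Very Strong"

print(correlation_strength(0.58))
```

The sign of r is ignored here on purpose: it gives the direction of the correlation, while the absolute value gives its strength.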
It is likely that Magdalena is not so happy with the line of best fit because the strength of the correlation is moderate. Maybe she expected a stronger correlation. Let's investigate a bit more and plot the line y=1.05x+8.81 along with the points in the coordinate plane.
As we can see, the point (2,17.85) is quite far from the line. In addition, this point represents a two-month-old baby weighing 17.85 pounds, which is pretty unlikely. This could also be the reason why Magdalena thinks there is something wrong with the data. Given that, let's remove this point and calculate the line of best fit again.
By removing the point (2,17.85), the strength of the correlation increased and gave us a line that better fits the rest of the data.
Line of best fit: y = 1.35x + 6.70
Correlation coefficient: r ≈ 0.9963 → Very Strong
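The effect of the suspicious point can be reproduced with a short Python sketch that fits the data twice, once with all seven points and once without (2,17.85). The helper name `fit_line` is hypothetical.

```python
from math import sqrt

def fit_line(points):
    """Return (slope, intercept, r) of the least-squares line for (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    s_xx = sum((x - mean_x) ** 2 for x, _ in points)
    s_yy = sum((y - mean_y) ** 2 for _, y in points)
    slope = s_xy / s_xx
    return slope, mean_y - slope * mean_x, s_xy / sqrt(s_xx * s_yy)

# (months, weight in lbs) copied from the table
babies = [(0, 7.01), (1, 7.89), (2, 17.85), (3, 10.58),
          (4, 11.73), (5, 13.51), (6, 15.06)]

# Same data without the implausible two-month-old measurement
cleaned = [p for p in babies if p != (2, 17.85)]

for label, data in [("all points", babies), ("outlier removed", cleaned)]:
    a, b, r = fit_line(data)
    print(f"{label}: y = {a:.2f}x + {b:.2f}, r = {r:.4f}")
```

Dropping a point is only justified here because the measurement itself looks like a recording error. Removing data simply to raise r would not be sound practice.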
The following table shows the mileage x, in thousands of kilometers, and the selling prices y, in thousands of dollars, for several used motorbikes of the same year and model.
🛵🛵 Motorbikes 🛵🛵 | ||||||
---|---|---|---|---|---|---|
Mileage (in thousands) | 29 | 19 | 15 | 36 | 10 | 25 |
Price (in thousands) | 8 | 11 | 13 | 5 | 18 | 9 |
The line of best fit can be found using a graphing calculator. To get started, let's enter the data values into our calculator. This is done by pressing the STAT button and selecting the option Edit.
Next, we press the STAT button, and then select the menu item CALC. There, we can find the LinReg(ax+b) option. This option gives us a line of best fit, expressed as a linear function in slope-intercept form, and the correlation coefficient.
By rounding each coefficient to two decimal places, the equation for the line of best fit is expressed as follows.
Line of best fit: y = -0.46x + 20.89
When we found the line of best fit in Part A, we also found the correlation coefficient. It is the r-value shown on the screen of the calculator.
By rounding the correlation coefficient to two decimal places, it is expressed as follows.
r ≈ -0.97
Since r is negative, the correlation is negative. Additionally, r is very close to -1, which means that the correlation is very strong. This can also be seen graphically.
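The calculator output for the motorbike data can be cross-checked with the least-squares formulas written out directly. The following Python sketch uses illustrative variable names and only the values from the table.

```python
from math import sqrt

# Mileage (thousands of km) and price (thousands of dollars) from the table
mileage = [29, 19, 15, 36, 10, 25]
price = [8, 11, 13, 5, 18, 9]

n = len(mileage)
mean_x = sum(mileage) / n
mean_y = sum(price) / n

s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(mileage, price))
s_xx = sum((x - mean_x) ** 2 for x in mileage)
s_yy = sum((y - mean_y) ** 2 for y in price)

slope = s_xy / s_xx                   # a in LinReg(ax+b)
intercept = mean_y - slope * mean_x   # b in LinReg(ax+b)
r = s_xy / sqrt(s_xx * s_yy)

print(f"y = {slope:.2f}x + {intercept:.2f}, r = {r:.2f}")
```

The sign of r always matches the sign of the slope, since both share the numerator `s_xy`.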
Diego, an appliance store owner, recently expanded his business to online shopping with free home delivery. After ten days, he checked the statistics of the articles shipped. However, he noticed that some data were missing.
🚚🚚 Deliveries 🚚🚚 | ||||||
---|---|---|---|---|---|---|
Day, x | 1 | 2 | 5 | 6 | 8 | 10 |
Articles Shipped, y | 6 | 9 | 11 | 18 | 15 | 16 |
The line of best fit can be found using a graphing calculator. First, let's enter the data values into our calculator. This is done by pressing the STAT button and selecting the option Edit.
Next, we press the STAT button, and then select the menu item CALC. There, we can find the LinReg(ax+b) option. This option gives us a line of best fit, expressed as a linear function in slope-intercept form, and the correlation coefficient.
By rounding each coefficient to two decimal places, the equation for the line of best fit is expressed as follows.
Line of best fit: y = 1.13x + 6.48
Let's plot the line along with the given data on a coordinate plane.
When we found the line of best fit in Part A, we also found the correlation coefficient. It is the r-value shown on the screen of the calculator.
By rounding the correlation coefficient to two decimal places, it is expressed as follows.
r ≈ 0.85
In order to estimate the number of articles shipped on the third day, we have to substitute 3 for x into the equation for the line of best fit.
Substituting gives y = 1.13(3) + 6.48 = 9.87. This means that Diego shipped about 10 articles on the third day.
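The estimate for a missing day can be sketched in Python by fitting the line to the six known days and evaluating it at x = 3. The function name `predict` is a hypothetical helper introduced for this example.

```python
# Known days and articles shipped, copied from the table
days = [1, 2, 5, 6, 8, 10]
shipped = [6, 9, 11, 18, 15, 16]

n = len(days)
mean_x = sum(days) / n
mean_y = sum(shipped) / n

# Least-squares slope and intercept
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(days, shipped))
         / sum((x - mean_x) ** 2 for x in days))
intercept = mean_y - slope * mean_x

def predict(day):
    """Estimated number of articles shipped on the given day."""
    return slope * day + intercept

estimate = predict(3)
print(f"day 3: about {round(estimate)} articles")
```

Because day 3 lies between observed days, this is an interpolation, which is generally safer than predicting beyond the range of the data.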
Paulina began managing The Beefy Lifters gym last week. To seek improvements for her clients, Paulina created a data sheet reflecting the influx of people who visit the gym each day.
💪🏽🤼🏋🏻 The Beefy Guys 🏋🏾🤼 | ||||||||
---|---|---|---|---|---|---|---|---|
Day, x | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
Number of People, y | 149 | 148 | 149 | 150 | 149 | 149 | 150 | 149 |
The line of best fit can be found using a graphing calculator. First, let's enter the data values into our calculator. This is done by pressing the STAT button and selecting the option Edit.
Next, we press the STAT button, and then select the menu item CALC. There, we can find the LinReg(ax+b) option. This option gives us a line of best fit, expressed as a linear function in slope-intercept form, and the correlation coefficient.
By rounding each coefficient to two decimal places, the equation for the line of best fit is expressed as follows.
Line of best fit: y = 0.11x + 148.64
Let's plot the line along with the given data on the coordinate plane.
In Part A, we already found the correlation coefficient. It is the r-value shown on the screen of the calculator.
By rounding the correlation coefficient to two decimal places, it is expressed as follows.
r ≈ 0.41
As we can see, the correlation coefficient is between 0.4 and 0.6. This means that the strength of the correlation is moderate.
To predict the day on which 151 people will visit the gym, we have to substitute 151 for y into the equation for the line of best fit and solve it for x.
Solving 151 = 0.11x + 148.64 gives x = 2.36/0.11 ≈ 21.45. Therefore, 151 people are expected to go to the gym on the 21st day after Paulina started managing it. However, due to what we found in Part B, the correlation between the data is not as strong as we would like. This means that the prediction might not be reliable.
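Solving the line of best fit for x can also be sketched in Python. The example below does it twice: once with the coefficients rounded to two decimal places, as in the text, and once with the unrounded coefficients. The variable names are illustrative.

```python
# Days and visitor counts copied from the table
days = [1, 2, 3, 4, 5, 6, 7, 8]
people = [149, 148, 149, 150, 149, 149, 150, 149]

n = len(days)
mean_x = sum(days) / n
mean_y = sum(people) / n

slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(days, people))
         / sum((x - mean_x) ** 2 for x in days))
intercept = mean_y - slope * mean_x

target = 151

# Rounded coefficients, matching y = 0.11x + 148.64
a, b = round(slope, 2), round(intercept, 2)
x_rounded = (target - b) / a

# Unrounded coefficients, for comparison
x_full = (target - intercept) / slope

print(f"rounded: x = {x_rounded:.2f}, unrounded: x = {x_full:.2f}")
```

With the rounded coefficients the result is about 21.45, while the unrounded fit lands on about 22. The slope is so small that rounding it shifts the answer noticeably, which is one more reason to treat this prediction with caution.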
Consider the correlation coefficients 0.063, 0.959, -0.966, and 0.793. Pair each scatter plot with the most appropriate correlation coefficient.
To pair each scatter plot with its corresponding correlation coefficient, let's analyze each scatter plot one at a time.
In this scatter plot, the points seem to follow a certain direction. As x increases, y tends to decrease, which implies a negative correlation.
In addition to the direction, we can see that the line of best fit has a negative slope, which confirms that the correlation coefficient is negative. Considering the given options, only r=-0.966 meets this condition.
A → r=-0.966
We can see that the points in this scatter plot do not seem to follow a particular direction — they are randomly distributed. Apart from the line of best fit, different lines can also be drawn, but none of them seem to describe the relationship between the points.
Therefore, there is almost no correlation between the points of Scatter Plot B. This implies that the correlation coefficient is close to 0. Considering the given options, only r=0.063 meets this condition.
B → r=0.063
The points in this plot follow an ascending trajectory: as x increases, y tends to increase as well. This relationship implies a positive correlation.
Additionally, the line of best fit has a positive slope, confirming that the correlation coefficient is positive. Likewise, since all the points are close to the line, the correlation coefficient must be close to 1. Considering the given options, there are two possible values for r: 0.793 or 0.959. At this time, however, we cannot conclude which corresponds to Scatter Plot C.
C → r = 0.793 or r = 0.959
In this plot, as before, the points follow an ascending direction: as x increases, y tends to increase. This relationship implies a positive correlation.
Consequently, the correlation coefficient is positive. Once more, we have two possible options.
D → r = 0.793 or r = 0.959
When we compare Scatter Plots C and D, the points on D are more clustered and closer to the line of best fit than those on C. This implies that the correlation in D is stronger than the correlation in C. Therefore, r=0.959 is the most appropriate correlation coefficient for Scatter Plot D.
D → r = 0.959
C → r = 0.793
In the following table, each scatter plot is paired with its correlation coefficient, and the strength of the correlation is described.
Scatter Plot | Correlation Coefficient | Strength |
---|---|---|
A | -0.966 | Very strong negative correlation |
B | 0.063 | Very weak positive correlation |
C | 0.793 | Strong positive correlation |
D | 0.959 | Very strong positive correlation |