When a graphing calculator finds the line of best fit for a data set, it usually also reports a constant r that measures the strength of the correlation. This constant is called the correlation coefficient.
Although there are different types of correlation coefficients, the most commonly used is the Pearson correlation coefficient.
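For reference, the Pearson correlation coefficient of n paired observations is commonly written as shown below. The symbols are notation introduced here, not taken from the lesson: x_i and y_i are the paired data values, and x̄ and ȳ are their means.

```latex
r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
         {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}
```

The numerator is positive when x and y tend to rise together and negative when one falls as the other rises. The denominator scales the result so that r always lies between -1 and 1.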
Knowing the correlation coefficient can give a clue about how well the regression line models the data. Nevertheless, it is always best practice to make a scatter plot before drawing any conclusions. As an example, each of the four data sets in Anscombe's quartet has a correlation coefficient of about 0.816, implying a strong positive correlation.
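The shared value can be checked with a short Python sketch. The helper name `pearson_r` is an illustrative choice, and the quartet values are typed in from the commonly published version of the data set, which the lesson itself does not list.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length samples."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Anscombe's quartet (assumed values, commonly published version).
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
quartet = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": (x4, [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

# One coefficient per data set; all four should be nearly identical.
coefficients = {name: pearson_r(xs, ys) for name, (xs, ys) in quartet.items()}
for name, r in coefficients.items():
    print(f"Set {name}: r = {r:.3f}")
```

The four coefficients agree to about three decimal places even though the scatter plots look very different, which is why the coefficient alone is not enough to judge a model.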
However, a linear model does not seem to fit the second set of data well.
Pair each scatter plot with the most appropriate correlation coefficient.
Try drawing a line that fits each data set. If the points are grouped near the line, the correlation coefficient should be close to either 1 or -1. If the slope of the line is positive, the correlation coefficient is positive, and if the slope of the line is negative, the correlation coefficient is negative.
To pair each scatter plot with its corresponding correlation coefficient, analyze each scatter plot one at a time.
Scatter Plot A⟶r=0.08
In Scatter Plot B, the points seem to follow a certain direction. As x increases, y tends to decrease, which implies a negative correlation. Here, a line with a negative slope can be drawn to model the data.
Hence, the correlation coefficient must also be negative. Considering the given options, there are two possible values for r: -0.92 or -0.65. At this time, however, it cannot be concluded which corresponds to Scatter Plot B.
The points in Scatter Plot C follow an ascending trajectory — as x increases, y tends to increase as well. This relationship implies a positive correlation. Here, a line with a positive slope seems to be a good candidate for the line of best fit.
Therefore, the correlation coefficient is positive. Likewise, since all the points are close to the line, the correlation coefficient must be close to 1. Consequently, the best choice for the correlation coefficient of Scatter Plot C is r=0.84.
Scatter Plot C⟶r=0.84
In Scatter Plot D, the points follow a descending direction — as x increases, y tends to decrease. This relationship implies a negative correlation. Like Scatter Plot B, a line with negative slope is a good candidate to be the line of best fit.
Therefore, the correlation coefficient is negative. As before, there are two options.
When comparing Scatter Plots B and D, the points on Scatter Plot D are more clustered and closer to following a linear pattern than those on Scatter Plot B. Thus, the correlation in Scatter Plot D is stronger than the correlation in Scatter Plot B. Consequently, r=-0.92 is the most appropriate correlation coefficient for Scatter Plot D, which leaves r=-0.65 for Scatter Plot B.
In the following table, each scatter plot is paired with its correlation coefficient, and the strength of the correlation is described.
Scatter Plot | Correlation Coefficient | Strength |
---|---|---|
A | 0.08 | Very weak positive correlation |
B | -0.65 | Strong negative correlation |
C | 0.84 | Very strong positive correlation |
D | -0.92 | Very strong negative correlation |
For each data set, find the correlation coefficient. Give the answer rounded to two decimal places.
Maya works at an eSports company that has offered a customized gaming chair for each player on the Los Angeles Lakers 2020–2021 season roster. The measurements of each player are listed in the following table.
Height (in) | 72.83 | 74.02 | 75.20 | 75.20 | 75.20 | 75.98 | 75.98 | 75.98 | 75.98 | 77.17 | 77.17 | 77.17 | 79.13 | 79.92 | 81.89 | 81.98 | 83.07 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Weight (lbs) | 178.57 | 189.60 | 182.98 | 198.42 | 198.42 | 194.01 | 205.03 | 233.69 | 218.26 | 178.57 | 213.85 | 196.21 | 235.90 | 213.85 | 251.33 | 264.56 | 264.56 |
Correlation Coefficient: r=0.848
Strength: Very strong positive correlation
Is it a good prediction? Yes, by analyzing the scatter plot, a linear model seems to describe the data well enough. Also, the correlation coefficient is close to 1.
First, press the STAT button and choose Edit to enter the data.
Finally, by pressing the STAT button and then selecting the menu item CALC, the option LinReg(ax+b) can be found. This option gives a line of best fit, expressed as a linear function in slope-intercept form, and the correlation coefficient.
When rounding to three decimal places, the equation for the line of best fit and the correlation coefficient are expressed as follows.
Based on the correlation coefficient, and without looking at the scatter plot of the data, it can be concluded that the correlation is very strong and positive.
After analyzing the scatter plot, it can be concluded that the data is well described by a linear model. Therefore, the conclusion made in Part A is correct: the data has a very strong positive correlation. This would imply that the prediction of LeBron's weight is reasonably reliable. What clever work by Maya!
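As a cross-check of the calculator result, the same least-squares computation can be sketched in Python using the height and weight table from this example. The function name `lin_reg` is hypothetical and only mirrors what the LinReg(ax+b) option reports. It is a minimal sketch, not the calculator's actual implementation.

```python
import math

# Heights (in) and weights (lbs) copied from the roster table, in order.
heights = [72.83, 74.02, 75.20, 75.20, 75.20, 75.98, 75.98, 75.98, 75.98,
           77.17, 77.17, 77.17, 79.13, 79.92, 81.89, 81.98, 83.07]
weights = [178.57, 189.60, 182.98, 198.42, 198.42, 194.01, 205.03, 233.69,
           218.26, 178.57, 213.85, 196.21, 235.90, 213.85, 251.33, 264.56,
           264.56]


def lin_reg(xs, ys):
    """Return slope a, intercept b, and correlation r for y = ax + b."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx                   # least-squares slope
    b = mean_y - a * mean_x         # line passes through the mean point
    r = sxy / math.sqrt(sxx * syy)  # Pearson correlation coefficient
    return a, b, r


a, b, r = lin_reg(heights, weights)
print(f"y = {a:.3f}x + {b:.3f}")
print(f"r = {r:.3f}")
```

The printed coefficient should agree with the value r=0.848 stated in the answer, up to rounding.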
Maya's eSports company is interested in promoting their games through movies. Her supervisor has asked her to study movie releases so they can prepare a business strategy. The number of films released by 20th Century Fox from 2000 to 2018 is given in the following table.
Correlation Coefficient: r=-0.127
Strength: Very weak negative correlation
Is It Reliable? No, the correlation is very weak, as r=-0.127. Also, the scatter plot representing the given data and the scatter plot of the residuals reveal that a linear model is not appropriate to describe the data.
First, press the STAT button, choose Edit, and enter the data.
Next, press the STAT button, select the menu item CALC, and choose the option LinReg(ax+b). This option gives the line of best fit expressed as a linear equation in slope-intercept form. It also gives the correlation coefficient.
The scatter plot reveals that the data is not well described by a linear model. This could be expected since the correlation between the data is very weak. Consequently, the estimation made is not reliable. In fact, the actual number of films released by 20th Century Fox in 2019 was 10.
Additionally, by analyzing the scatter plot of the residuals, it can also be concluded that a linear model is not appropriate to describe the data.
Note that several points lie far from the x-axis, which means the corresponding residuals are large. Hence, the line of best fit is not reliable for making predictions.
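How residuals expose a poor linear fit can be illustrated with a minimal Python sketch. The data below is made up and deliberately curved. It is not the film-release data, and the variable names are illustrative only.

```python
# Made-up, deliberately curved data (y = x squared); NOT the film data.
xs = [0, 1, 2, 3, 4, 5, 6]
ys = [x ** 2 for x in xs]

n = len(xs)
mean_x, mean_y = sum(xs) / n, sum(ys) / n
sxx = sum((x - mean_x) ** 2 for x in xs)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))

slope = sxy / sxx                     # least-squares slope
intercept = mean_y - slope * mean_x   # least-squares intercept

# Residual = observed value minus the value predicted by the line.
residuals = [y - (slope * x + intercept) for x, y in zip(xs, ys)]

for x, res in zip(xs, residuals):
    print(f"x = {x}: residual = {res:+.2f}")
```

In a residual plot, each residual is the vertical distance of a point from the x-axis. Here the residuals are positive at both ends and negative in the middle, a systematic pattern that signals a line is the wrong kind of model even though the residuals sum to zero.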
There might be cases where a data set shows more than a correlation. For example, when a change in one variable directly causes a change in the other, the relationship is said to be causal, or a causation.
However, it is unlikely that making more factories will cause an increase in the number of teachers. Thus, it can be said that there is no causal relationship. In this case, a third factor — population size — seems to directly affect the two data sets in question. That further provides evidence against a causal relationship between the number of factories and the number of teachers.
After the study came out showing a negative impact on grades because of too many hours playing video games, Maya's eSports company lost revenue. They will now try to reduce their electric bill for their lights, computers, and air conditioning. Maya created a table that shows the monthly average outdoor temperature and last year's monthly electric bills.
Month | January | February | March | April | May | June | July | August | September | October | November | December |
---|---|---|---|---|---|---|---|---|---|---|---|---|
Outdoor Temperature (°F) | 70 | 74 | 80 | 82 | 86 | 88 | 92 | 90 | 84 | 78 | 76 | 72 |
Bill ($) | 185 | 220 | 260 | 263 | 275 | 280 | 310 | 290 | 272 | 240 | 230 | 194 |
Maya is told that the office was open the same number of hours every month and the electrical appliances of the office always had the same settings throughout the entire year.
First, press the STAT button, choose Edit, and enter the data.
Next, press the STAT button, select the menu item CALC, and choose the option LinReg(ax+b). This option gives the line of best fit, which is expressed as a linear equation in slope-intercept form. It also gives the correlation coefficient.
The correlation coefficient is almost 1, which means that the data has a very strong positive correlation. Also, as seen in the scatter plot of the data, the bill increases as the temperature increases.
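The claim that the coefficient is almost 1 can be checked directly from the temperature and bill table. The following Python sketch is one way to do that; the variable names are illustrative and the computation follows the standard Pearson definition rather than any calculator-specific routine.

```python
import math

# Monthly values copied from the table, January through December.
temps_f = [70, 74, 80, 82, 86, 88, 92, 90, 84, 78, 76, 72]
bills_usd = [185, 220, 260, 263, 275, 280, 310, 290, 272, 240, 230, 194]

n = len(temps_f)
mean_t = sum(temps_f) / n
mean_b = sum(bills_usd) / n

# Sums of squared deviations and of cross-deviations.
s_tt = sum((t - mean_t) ** 2 for t in temps_f)
s_bb = sum((b - mean_b) ** 2 for b in bills_usd)
s_tb = sum((t - mean_t) * (b - mean_b) for t, b in zip(temps_f, bills_usd))

r = s_tb / math.sqrt(s_tt * s_bb)        # correlation coefficient
dollars_per_degree = s_tb / s_tt         # slope of the line of best fit

print(f"r = {r:.2f}")
print(f"slope = {dollars_per_degree:.2f} dollars per degree Fahrenheit")
```

A positive slope together with a coefficient near 1 matches the pattern in the scatter plot. Whether the relationship is causal still has to be argued from the context, such as the fixed office hours and appliance settings mentioned above.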
Notice that the correlation between the given data was very strong in the last two examples, Video Games vs. Grades and Temperature vs. the Electric Bill. However, a causal relationship could be established only in the second example. Therefore, it is always important to make a general study of the situation and look for hidden variables when analyzing data.
Although the app used for text messaging might have some kind of error that causes the phone to freeze, there is likely another cause for the freezing. For example, it could be that the phone is running several apps in the background and freezes because it runs out of random-access memory (RAM).
Also, it might be that the app used for text messaging is the one that Maya uses the most. If that is the case, it is more likely that when her phone freezes, she will be coincidentally using the text messaging app. To avoid misinterpreting causal relations, it is essential to always look for hidden variables before coming to a conclusion.