
There are several ways to analyze how well a line of fit models a data set. In this lesson, the differences between observed data points and the points predicted by the line of fit will be calculated to determine if the regression line models the data well.
### Catch-Up and Review

**Here are a few recommended readings before getting started with this lesson.**

Explore

Consider the scatter plot of bivariate data. Move the line in such a way that it models the data.

For the different data sets, draw the line of fit. How can it be determined if this line models the data well?

Discussion

One way to assess how well a line of fit describes the data is to analyze *residuals*.

Concept

A residual is the vertical distance between a data point and the line of fit. When a line of fit has been drawn on a scatter plot, not all of the data points lie exactly on the line — some of them are above the line and some below. Therefore, each data point has one residual, which can be positive, negative, or zero.

A residual can also be defined as the observed $y-$value of a data point minus its predicted $y-$value, found using the line of fit.

$\text{Residual}=\text{Observed } y\text{-value}-\text{Predicted } y\text{-value}$
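The definition above can be sketched in a few lines of Python. The helper name `residuals` and the three data points are hypothetical, chosen only so that the arithmetic is easy to follow by hand.

```python
def residuals(xs, ys, slope, intercept):
    """Return observed y minus predicted y for each data point."""
    return [y - (slope * x + intercept) for x, y in zip(xs, ys)]

# Hypothetical data, used only for illustration
xs = [1, 2, 3]
ys = [3, 4, 8]

# Line of fit y = 2x + 1 predicts 3, 5, and 7
res = residuals(xs, ys, slope=2, intercept=1)
print(res)  # [0, -1, 1]
```

The first point lies on the line (residual zero), the second lies below it (negative residual), and the third lies above it (positive residual).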

Whether a linear model is appropriate for a data set can be determined by looking at the scatter plot of its residuals.

- If the points in the plot are randomly placed about the $x-$axis, then a linear model describes the data set well.
- If some kind of pattern appears in the scatter plot, a non-linear model is more appropriate for the data.

Pop Quiz

The applet generates bivariate data and the equation of a line of fit for the data. Calculate the sum of the squares of the residuals for the given equation.

For a line of fit, large negative residuals are as bad as large positive ones. By squaring the residual values, positive and negative residuals are treated in the same way. A high sum of squares indicates high variability in the set of observations. Therefore, the lower the sum of squared residuals, the better the line of fit.
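As a small illustration of why squaring is used, the following sketch compares two hypothetical sets of residuals. The function name `sum_squared_residuals` and the numbers are assumptions made for this example only.

```python
def sum_squared_residuals(residuals):
    """Add up the squares of the residuals."""
    return sum(r ** 2 for r in residuals)

# Hypothetical residuals: the plain sum of each list is 0,
# so the plain sum cannot tell the two fits apart
small_misses = [1, -1]
large_misses = [5, -5]

print(sum(small_misses), sum(large_misses))  # 0 0
print(sum_squared_residuals(small_misses))   # 2
print(sum_squared_residuals(large_misses))   # 50
```

Because positive and negative residuals cancel in a plain sum, only the squared version reveals that the second line misses the data by much more.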

Example

The table below shows the finishing times, in seconds, for the Olympic gold medalist in the men's $100-$meter dash for the last six Olympic games. Olympic Game $1$ represents the $2000$ Olympic games and Olympic Game $6$ represents the $2020$ Olympic games.

Olympic Game | $1$ | $2$ | $3$ | $4$ | $5$ | $6$ |
---|---|---|---|---|---|---|
Finishing Time (sec) | $9.87$ | $9.85$ | $9.69$ | $9.63$ | $9.81$ | $9.80$ |

**Equation I:** $y=-0.1x+10$

**Equation II:** $y=-0.05x+10$

In these equations, $x$ represents the Olympic Game number as described above, and $y$ represents the finishing time in seconds.

a For each equation, calculate the sum of squared residuals. Which of the equations is a better fit?

b Make a scatter plot for the residuals.

a **Sum of Squared Residuals for Equation I:** $0.2605$

**Sum of Squared Residuals for Equation II:** $0.077$

**Which equation is a better fit?** Equation II

b **Graph:**

a Make a table of the residual values for each equation and then find the sum of the squares of the residuals.

b Plot the points $(x,residual)$ on a coordinate plane.

a For the given data set, there are two possible lines of fit.

**Equation I:** $y=-0.1x+10$

**Equation II:** $y=-0.05x+10$

In order to determine which line of fit is better, the residuals for both lines will be calculated first. For a data point, its residual is the difference between the $y-$value of the data point and the $y-$value predicted by the line of fit.

$x$ | $y$ (Actual) | $y$ Predicted by $y=-0.1x+10$ | Residual for $y=-0.1x+10$ |
---|---|---|---|
$1$ | $9.87$ | $y=-0.1(1)+10=9.9$ | $9.87−9.9=-0.03$ |
$2$ | $9.85$ | $y=-0.1(2)+10=9.8$ | $9.85−9.8=0.05$ |
$3$ | $9.69$ | $y=-0.1(3)+10=9.7$ | $9.69−9.7=-0.01$ |
$4$ | $9.63$ | $y=-0.1(4)+10=9.6$ | $9.63−9.6=0.03$ |
$5$ | $9.81$ | $y=-0.1(5)+10=9.5$ | $9.81−9.5=0.31$ |
$6$ | $9.80$ | $y=-0.1(6)+10=9.4$ | $9.80−9.4=0.40$ |

$(-0.03)^{2}+0.05^{2}+(-0.01)^{2}+0.03^{2}+0.31^{2}+0.40^{2}$

Calculate power

$0.0009+0.0025+0.0001+0.0009+0.0961+0.16$

Add terms

$0.2605$

$x$ | $y$ (Actual) | $y$ Predicted by $y=-0.05x+10$ | Residual for $y=-0.05x+10$ |
---|---|---|---|
$1$ | $9.87$ | $y=-0.05(1)+10=9.95$ | $9.87−9.95=-0.08$ |
$2$ | $9.85$ | $y=-0.05(2)+10=9.9$ | $9.85−9.9=-0.05$ |
$3$ | $9.69$ | $y=-0.05(3)+10=9.85$ | $9.69−9.85=-0.16$ |
$4$ | $9.63$ | $y=-0.05(4)+10=9.8$ | $9.63−9.8=-0.17$ |
$5$ | $9.81$ | $y=-0.05(5)+10=9.75$ | $9.81−9.75=0.06$ |
$6$ | $9.80$ | $y=-0.05(6)+10=9.70$ | $9.80−9.70=0.10$ |

$(-0.08)^{2}+(-0.05)^{2}+(-0.16)^{2}+(-0.17)^{2}+0.06^{2}+0.10^{2}$

Calculate power

$0.0064+0.0025+0.0256+0.0289+0.0036+0.01$

Add terms

$0.077$

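Comparing the sums of squared residuals for Equation I, $SSR_{1},$ and Equation II, $SSR_{2},$ shows that Equation II is the better fit.

$SSR_{2}=0.077<SSR_{1}=0.2605$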
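The two sums can be double-checked with a short Python sketch. It uses the standard `decimal` module so that values such as $0.1$ are handled exactly. The helper name `ssr` is an assumption made for this illustration.

```python
from decimal import Decimal

games = [1, 2, 3, 4, 5, 6]
times = [Decimal(t) for t in ("9.87", "9.85", "9.69", "9.63", "9.81", "9.80")]

def ssr(slope, intercept):
    """Sum of squared residuals for the line y = slope * x + intercept."""
    total = Decimal(0)
    for x, y in zip(games, times):
        predicted = Decimal(slope) * x + Decimal(intercept)
        total += (y - predicted) ** 2
    return total

ssr_1 = ssr("-0.1", "10")   # Equation I
ssr_2 = ssr("-0.05", "10")  # Equation II
print(ssr_2 < ssr_1)  # True
```

Passing the coefficients as strings keeps them exact. Plain floats would also work, but could introduce tiny rounding differences in the last digits.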

b Recall the residuals for the equations found in the previous part.

$x$ | Residual for $y=-0.1x+10$ | Residual for $y=-0.05x+10$ |
---|---|---|
$1$ | $-0.03$ | $-0.08$ |
$2$ | $0.05$ | $-0.05$ |
$3$ | $-0.01$ | $-0.16$ |
$4$ | $0.03$ | $-0.17$ |
$5$ | $0.31$ | $0.06$ |
$6$ | $0.40$ | $0.10$ |

The points $(x,residual)$ for each equation will be graphed on a scatter plot.

As can be seen, the residuals for Equation II stay close to the $x-$axis, while Equation I has two residuals that lie far from it. Therefore, the sum of the squared residuals for Equation II has a lower value.

Example

Maya is researching used cars similar to the one her older sister drives. She found some data showing the mileages $x,$ in thousands of miles, and the selling prices $y,$ in thousands of dollars, of several used cars near her.

$x$ | $23$ | $12$ | $18$ | $30$ | $6$ | $26$ |
---|---|---|---|---|---|---|
$y$ | $15$ | $15$ | $17$ | $12$ | $19$ | $15$ |

**Example Equation:** $y=-0.25x+20$

Start by making a scatter plot for the given data set. Draw a line that passes through two data points and write its equation. Then, calculate the residuals for the equation.

First, the given data will be shown on a graph. Then, a line that passes close to the points will be drawn.
Next, using $m$ and either point, the equation of the line can be written in the point-slope form. Use $m=-\frac{1}{3}$ and $(x_{1},y_{1})=(4,20).$
Finally, the residuals for this equation can be calculated. To do so, the difference between the $y-$value of a data point and the corresponding $y-$value predicted by the line of fit will be calculated for each data point.

Next, the sum of squared residuals can be found by adding the squares of the numbers in the last column of the table.
The sum of squared residuals for the equation is $16.$

Now, the sum of squared residuals for this equation can be calculated.
As a result, the sum of squared residuals for $y=-0.25x+20$ satisfies the condition.

It appears that the line passes through the points $(4,20)$ and $(16,16).$ Knowing two points on the line, the slope of the line and its equation can be found. To find the slope, the points need to be substituted into the Slope Formula.

$m=\dfrac{y_{2}-y_{1}}{x_{2}-x_{1}}$

Substitute $(4,20)$ & $(16,16)$

$m=\dfrac{16-20}{16-4}$

Subtract term

$m=\dfrac{-4}{12}$

$\dfrac{a}{b}=\dfrac{a/4}{b/4}$

$m=\dfrac{-1}{3}$

Put minus sign in front of fraction

$m=-\dfrac{1}{3}$

$y-y_{1}=m(x-x_{1}) \quad\Rightarrow\quad y-20=-\dfrac{1}{3}(x-4)$

To write the equation in slope-intercept form, $y$ will be isolated.

$y-20=-\dfrac{1}{3}(x-4)$

Distribute $-\dfrac{1}{3}$

$y-20=-\dfrac{1}{3}x+\dfrac{4}{3}$

$\text{LHS}+20=\text{RHS}+20$

$y=-\dfrac{1}{3}x+\dfrac{4}{3}+20$

$a=\dfrac{3\cdot a}{3}$

$y=-\dfrac{1}{3}x+\dfrac{4}{3}+\dfrac{60}{3}$

Add fractions

$y=-\dfrac{1}{3}x+\dfrac{64}{3}$
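The slope and $y-$intercept found above can be reproduced with exact fractions. This is a minimal sketch using Python's standard `fractions` module. The function name `line_through` is hypothetical.

```python
from fractions import Fraction

def line_through(p1, p2):
    """Return (slope, intercept) of the line through two points as exact fractions."""
    (x1, y1), (x2, y2) = p1, p2
    slope = Fraction(y2 - y1, x2 - x1)   # Slope Formula
    intercept = y1 - slope * x1          # solve y1 = slope * x1 + intercept
    return slope, intercept

m, b = line_through((4, 20), (16, 16))
print(m, b)  # -1/3 64/3
```

Working with `Fraction` instead of floats avoids rounding $-\frac{1}{3}$ and keeps the result in the same form as the hand calculation.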

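$x$ | $y$ (Actual) | $y$ Predicted by the equation | Residual |
---|---|---|---|
$6$ | $19$ | $y=-\frac{1}{3}(6)+\frac{64}{3}=\frac{58}{3}$ | $19−\frac{58}{3}=-\frac{1}{3}$ |
$12$ | $15$ | $y=-\frac{1}{3}(12)+\frac{64}{3}=\frac{52}{3}$ | $15−\frac{52}{3}=-\frac{7}{3}$ |
$18$ | $17$ | $y=-\frac{1}{3}(18)+\frac{64}{3}=\frac{46}{3}$ | $17−\frac{46}{3}=\frac{5}{3}$ |
$23$ | $15$ | $y=-\frac{1}{3}(23)+\frac{64}{3}=\frac{41}{3}$ | $15−\frac{41}{3}=\frac{4}{3}$ |
$26$ | $15$ | $y=-\frac{1}{3}(26)+\frac{64}{3}=\frac{38}{3}$ | $15−\frac{38}{3}=\frac{7}{3}$ |
$30$ | $12$ | $y=-\frac{1}{3}(30)+\frac{64}{3}=\frac{34}{3}$ | $12−\frac{34}{3}=\frac{2}{3}$ |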

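$\left(-\frac{1}{3}\right)^{2}+\left(-\frac{7}{3}\right)^{2}+\left(\frac{5}{3}\right)^{2}+\left(\frac{4}{3}\right)^{2}+\left(\frac{7}{3}\right)^{2}+\left(\frac{2}{3}\right)^{2}$

$(-a)^{2}=a^{2}$

$\left(\frac{1}{3}\right)^{2}+\left(\frac{7}{3}\right)^{2}+\left(\frac{5}{3}\right)^{2}+\left(\frac{4}{3}\right)^{2}+\left(\frac{7}{3}\right)^{2}+\left(\frac{2}{3}\right)^{2}$

$\left(\frac{a}{b}\right)^{m}=\frac{a^{m}}{b^{m}}$

$\frac{1^{2}}{3^{2}}+\frac{7^{2}}{3^{2}}+\frac{5^{2}}{3^{2}}+\frac{4^{2}}{3^{2}}+\frac{7^{2}}{3^{2}}+\frac{2^{2}}{3^{2}}$

Calculate power

$\frac{1}{9}+\frac{49}{9}+\frac{25}{9}+\frac{16}{9}+\frac{49}{9}+\frac{4}{9}$

Add fractions

$\frac{144}{9}$

Calculate quotient

$16$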

Equation | Sum of Squared Residuals |
---|---|
$y=-\frac{1}{3}x+\frac{64}{3}$ | $16$ |

This equation does not satisfy the condition that Maya wants. Another equation can be found by slightly increasing the slope of the above equation and slightly decreasing its $y-$intercept. Use the applet below to find an equation. For example, $y=-0.25x+20$ can be a good candidate.
The residuals for this equation can be found as follows.

$x$ | $y$ (Actual) | $y$ Predicted by the equation | Residual |
---|---|---|---|
$6$ | $19$ | $y=-0.25(6)+20=18.5$ | $19−18.5=0.5$ |
$12$ | $15$ | $y=-0.25(12)+20=17$ | $15−17=-2$ |
$18$ | $17$ | $y=-0.25(18)+20=15.5$ | $17−15.5=1.5$ |
$23$ | $15$ | $y=-0.25(23)+20=14.25$ | $15−14.25=0.75$ |
$26$ | $15$ | $y=-0.25(26)+20=13.5$ | $15−13.5=1.5$ |
$30$ | $12$ | $y=-0.25(30)+20=12.5$ | $12−12.5=-0.5$ |

$(0.5)^{2}+(-2)^{2}+(1.5)^{2}+(0.75)^{2}+(1.5)^{2}+(-0.5)^{2}$

$(-a)^{2}=a^{2}$

$(0.5)^{2}+(2)^{2}+(1.5)^{2}+(0.75)^{2}+(1.5)^{2}+(0.5)^{2}$

Calculate power

$0.25+4+2.25+0.5625+2.25+0.25$

Add terms

$9.5625$

Equation | Sum of Squared Residuals |
---|---|
$y=-\frac{1}{3}x+\frac{64}{3}$ | $16>10$ |
$y=-0.25x+20$ | $9.5625<10$ |
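Both sums can be verified, and other candidate lines tried, with a short sketch like the one below. It uses exact fractions from Python's standard library. The helper name `ssr` and the variable names are assumptions for illustration, and the threshold of $10$ is taken from the comparison above.

```python
from fractions import Fraction

# Maya's data: mileage (thousands of miles) and price (thousands of dollars)
mileages = [6, 12, 18, 23, 26, 30]
prices = [19, 15, 17, 15, 15, 12]

def ssr(slope, intercept):
    """Sum of squared residuals for the line y = slope * x + intercept."""
    return sum((y - (slope * x + intercept)) ** 2
               for x, y in zip(mileages, prices))

first = ssr(Fraction(-1, 3), Fraction(64, 3))  # line through (4, 20) and (16, 16)
second = ssr(Fraction(-1, 4), Fraction(20))    # y = -0.25x + 20

print(first, float(second))     # 16 9.5625
print(first < 10, second < 10)  # False True
```

Calling `ssr` with other slopes and intercepts gives a quick way to look for further lines whose sum stays below $10.$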

Note that there are numerous equations that satisfy Maya's condition; only one of them is shown here. The following applet shows the equation of a line of fit and the sum of squared residuals. Use it to see how the sum changes as the equation changes.

Pop Quiz

The applet generates bivariate data and two equations that model the data. To see the coordinates of a data point, move the cursor over it. Use the sum of squared residuals to determine the better fit for the data.

Closure

In this lesson, lines of fit and their residuals have been analyzed. When a line models a data set well, the sum of the squared residuals for the line is relatively small. Move the points and the line in the graph to see how the residuals and the sum of their squares change.

The question then arises as to whether there is a line whose sum of squared residuals is less than the sum of the squared residuals of any other line of fit. In the next lesson, the *line of best fit* will be defined and its equation will be derived.
