
This lesson investigates how characteristics like the center and spread of data distributions help compare data sets. The learner will be able to make meaning of real-life data sets on several topics like animals, weather patterns, sports, and even a part of the U.S. stock market!

### Catch-Up and Review

**Here are a few recommended readings before getting started with this lesson.**

- These are measures of the center of a data set.
- These are measures of the spread of a data set.

$12.7, 4.7, 14.4, 12.2, 12.1, 8.1, 6.8, 16.6, 15.7, 13.3, 5.5, 12.6, 11.5, 11.1, 8.8$

a) Find the mean and give the answer as a decimal rounded to one decimal place.

Answer: $11.1$

b) Find the median and give the answer as a decimal rounded to one decimal place.

Answer: $12.1$

c) Find the range and give the answer as a decimal rounded to one decimal place.

Answer: $11.9$

d) Find the interquartile range and give the answer as a decimal rounded to one decimal place.

Answer: $5.2$

e) Find the mean absolute deviation and give the answer as a decimal rounded to two decimal places.

Answer: $2.86$
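The five review statistics can be cross-checked with a short script. The following is a minimal Python sketch, not part of the original lesson. The helper names `quartiles` and `mean_absolute_deviation` are made up for illustration, and the quartiles are assumed to be the medians of the lower and upper halves of the sorted data (other quartile conventions can give slightly different values).

```python
from statistics import mean, median

# Data set from the review exercise
data = [12.7, 4.7, 14.4, 12.2, 12.1, 8.1, 6.8, 16.6, 15.7,
        13.3, 5.5, 12.6, 11.5, 11.1, 8.8]


def quartiles(values):
    """Return (Q1, Q3) as the medians of the lower and upper halves.

    With an odd number of values, the overall median is left out of
    both halves. This is an assumed convention, not the only one.
    """
    ordered = sorted(values)
    half = len(ordered) // 2
    return median(ordered[:half]), median(ordered[-half:])


def mean_absolute_deviation(values):
    """Average distance of the values from their mean."""
    center = mean(values)
    return sum(abs(x - center) for x in values) / len(values)


q1, q3 = quartiles(data)
summary = {
    "mean": round(mean(data), 1),
    "median": round(median(data), 1),
    "range": round(max(data) - min(data), 1),
    "iqr": round(q3 - q1, 1),
    "mad": round(mean_absolute_deviation(data), 2),
}
print(summary)
```

Rounding is applied only at the end so that intermediate values, such as the unrounded mean used inside the mean absolute deviation, stay as precise as possible.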

Challenge

In $1917,$ Pekka Brofeld published the measurements of various fish species caught in a lake near Tampere, Finland. Part of the findings is represented by a data set about the length and height of four of the fish species caught. Move the slider to see the complete data set for these four fish.

Pictured are the four species and their Finnish names. Note that the four drawings are not drawn to the same scale.

Match the Finnish name with the Latin name.

- Latin names: Abramis bjorkna, Esox lucius, Leuciscus rutilus, Osmerus eperlanus
- Finnish names: Pasuri, Hauki, Särki, Norssi

The answer may not be so clear. This lesson will explain how analyzing a data set using various methods can help identify pairs, such as those in the challenge, correctly. To do so, measures of center, like the median and mean, will be presented and applied to various data sets throughout the lesson.

Example

The following box plots show the distribution of the heights (in feet and inches) of the players on the Ohio State Buckeyes men's basketball and football teams in the $2020–2021$ season.

Considering the chart, match each box plot (Team A and Team B) with the correct team (football or basketball).

Of the two sports, which tends to place more importance on a player's height? Identify the maximum and the median heights of each team.

The box plots show that the range of heights is similar for both teams, but Team B, on average, has taller players.

- The tallest player on Team B is $6$ feet $10$ inches tall, and the median height is $6$ feet $6$ inches.
- The tallest player on Team A is $6$ feet $8$ inches tall, and the median height is $6$ feet $2$ inches.

Height tends to be more advantageous in basketball than in football. Therefore, it is reasonable to conclude from the box plots that Team B is the basketball team and Team A is the football team.

The same data set can also be represented using histograms, which show the distribution of the heights of the players on the two teams.

Example

The table below shows the average monthly high temperatures across three small towns.

One town is located in the State of Alaska, another in Florida, and the other in Nebraska.

Analyzing the data set and map, try to match each town (Nehawka, Noma, and Mekoryuk) with its state (Nebraska, Florida, or Alaska). Note that, generally, northern states tend to be colder than southern states.

Think about the relationship between the location and the climate of Alaska, Nebraska, and Florida. Then, consider each month's average high temperatures as shown in the data set. Which town is the warmest? Which town is the coldest?

Investigating the given data set and map, the following observations can be made.

- When comparing each town's average monthly high temperatures, the weather is coldest in Mekoryuk and warmest in Noma.
- Next, observing the map, Florida is the farthest south of these three states, so it is likely, on average, to be the warmest. Therefore, it makes sense to match Florida with the town with the highest average monthly temperatures in the table.
- Again observing the map, Alaska is the farthest north of the three states, so it is likely, on average, to be the coldest. Therefore, it makes sense to match Alaska with the town with the lowest average monthly temperatures in the table.

Considering each observation from the data set and map, it is likely that Noma is in Florida, Mekoryuk is in Alaska, and Nehawka is in Nebraska.

Pop Quiz

The following applet shows the histograms of two data sets. Move the slider to investigate the observations separately.

Discussion

When two data sets have similar centers, investigating the spread of each data set can be useful in highlighting their differences. One way of measuring spread is to calculate the average of each data value's distance from the mean. This measure is called the mean absolute deviation.

$\dfrac{|x_{1}-\bar{x}|+|x_{2}-\bar{x}|+\cdots+|x_{n}-\bar{x}|}{n}$

Due to the difficulty of making calculations using the absolute value, a commonly used alternate approach to measuring the spread is to calculate the standard deviation. In the examples to follow, measures of spread like the range and the standard deviation will be used to compare data sets.

Concept

## Standard Deviation

The standard deviation is a measure of spread of a data set that measures how much the data elements differ from the mean. The standard deviation, often represented by the Greek letter $\sigma$ (sigma), is calculated by taking the square root of the variance of the data set. Let $x_{1},x_{2},\ldots,x_{n}$ be the data values in a set and $\bar{x}$ their mean.

$\sigma=\sqrt{\dfrac{(x_{1}-\bar{x})^{2}+(x_{2}-\bar{x})^{2}+\cdots+(x_{n}-\bar{x})^{2}}{n}}$

The applet below calculates the standard deviation for the data set on the number line. Move the points around to change the data.

As shown, finding the standard deviation involves calculating the average of the squared differences between each data point and the mean, and then taking the square root of that average. The sum of squares can also be written in sigma notation.

$\sigma=\sqrt{\dfrac{\sum_{i=1}^{n}(x_{i}-\bar{x})^{2}}{n}}$

Standard deviation is sensitive to outliers because of the squaring of differences. It is commonly used when analyzing a data set that exhibits a normal distribution.

### Extra

Comparison of Standard Deviation and Mean Absolute Deviation

Graphing calculators can find the standard deviation, but they do not show the *mean absolute deviation*. It is interesting to compare the values of these two measures. The standard deviation is greater than or equal to the mean absolute deviation. The proof of this relationship is based on the following inequality, which is true for any two positive numbers.

$\dfrac{a+b}{2}\leq\sqrt{\dfrac{a^{2}+b^{2}}{2}}$

The proof of the inequality involving the means of more than two numbers is similar. The inequality can be proved as follows.
$0\leq(a-b)^{2}$

Expand the square: $(a-b)^{2}=a^{2}-2ab+b^{2}$

$0\leq a^{2}-2ab+b^{2}$

Add $a^{2}+2ab+b^{2}$ to both sides

$a^{2}+2ab+b^{2}\leq 2a^{2}+2b^{2}$

Divide both sides by $4$

$\dfrac{a^{2}+2ab+b^{2}}{4}\leq\dfrac{2a^{2}+2b^{2}}{4}$

Factor out $2$

$\dfrac{a^{2}+2ab+b^{2}}{4}\leq\dfrac{2(a^{2}+b^{2})}{4}$

Factor the perfect square: $a^{2}+2ab+b^{2}=(a+b)^{2}$

$\dfrac{(a+b)^{2}}{4}\leq\dfrac{2(a^{2}+b^{2})}{4}$

Simplify quotient

$\dfrac{(a+b)^{2}}{4}\leq\dfrac{a^{2}+b^{2}}{2}$

Take the square root of both sides, which keeps the inequality because both sides are positive

$\dfrac{a+b}{2}\leq\sqrt{\dfrac{a^{2}+b^{2}}{2}}$
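The relationship between the two measures of spread can also be checked numerically. The following Python sketch is an illustration only: the `sample` list is hypothetical data chosen so that the arithmetic comes out evenly, and the function names are made up for this example. It uses the population form of the standard deviation (dividing by $n$), which matches the formula in this lesson.

```python
from math import sqrt
from statistics import mean


def mean_absolute_deviation(values):
    """Average of the distances |x_i - mean|."""
    center = mean(values)
    return sum(abs(x - center) for x in values) / len(values)


def standard_deviation(values):
    """Square root of the average squared distance from the mean."""
    center = mean(values)
    return sqrt(sum((x - center) ** 2 for x in values) / len(values))


# Hypothetical data set, not taken from the lesson
sample = [2, 4, 4, 4, 5, 5, 7, 9]

sd = standard_deviation(sample)
mad = mean_absolute_deviation(sample)
print(f"standard deviation = {sd}, mean absolute deviation = {mad}")

# Two-number version of the inequality with a = 3 and b = 5
a, b = 3, 5
arithmetic_mean = (a + b) / 2
quadratic_mean = sqrt((a ** 2 + b ** 2) / 2)
print(arithmetic_mean <= quadratic_mean)
```

For this sample the mean is $5,$ the standard deviation is $2,$ and the mean absolute deviation is $1.5,$ so the standard deviation is the larger of the two, as the inequality predicts.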

Example

The table below shows the average monthly low temperatures of two cities — Kansas City and Seattle. The two cities given are not necessarily in order.

According to the data set, City A and City B have annual average low temperatures around $45^{\circ}\text{F}$ and $43^{\circ}\text{F},$ respectively. Referencing the map below, Seattle is located much farther north than Kansas City. Typically, northern states are, on average, colder than southern states. Nevertheless, Seattle experiences less variation in temperature across the seasons due to the ocean's tempering effect on the climate.

Use the ranges and standard deviations of the data set, along with the given geographical information, to determine which of City A and City B is Seattle and which is Kansas City.

Think of the climate similarities and differences of coastal areas compared to inland areas. Find the range and standard deviation of the temperatures.

Using the range and standard deviation — measures of spread — will help compare the two cities' average low temperatures. A graphing calculator can be used to find the standard deviation.

| | Range | Standard Deviation |
|---|---|---|
| City A | $55-36=19$ | $6.7$ |
| City B | $66-18=48$ | $16.4$ |

These measures of spread show that the temperature throughout the year changes much less in City A than in City B. Based on that analysis, and considering the information given about the tempering effect of the ocean, it is reasonable to conclude that City A is Seattle and City B is Kansas City. What a cool conclusion to make.

Example

In the US stock market, a measure of how much a stock price fluctuates during a certain period of time is called historical volatility. The following data set from the year $2020$ contains information about the daily closing stock price (in dollars) of two companies.

| | Low | Mean | High | Standard Deviation |
|---|---|---|---|---|
| APDN | $\$2.52$ | $\$6.89$ | $\$15.21$ | $\$2.24$ |
| DSS | $\$4.04$ | $\$6.90$ | $\$10.89$ | $\$1.69$ |

Which stock price was less volatile in $2020,$ APDN or DSS?

Which part of the table gives information about the spread of the stock price?

The numbers in the given table can be interpreted into the following sentences.

- The two companies' mean stock prices were very close for the year, separated by merely one cent!
- The range for APDN is the high of $\$15.21$ minus the low of $\$2.52,$ which is $\$12.69.$ In comparison, the DSS range is the high of $\$10.89$ minus the low of $\$4.04,$ which is $\$6.85.$
- The standard deviation for DSS, $\$1.69,$ is smaller than the standard deviation for APDN, $\$2.24.$

Both the range and standard deviation are smaller for DSS. These interpretations indicate that DSS's stock price fluctuated less over the year than the stock price of APDN. Therefore, it can be concluded that the stock price of DSS was less volatile in $2020.$
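The comparison above can be reproduced from the four summary values in the table. This is a small Python sketch for illustration. The dictionary layout and the helper name `price_range` are assumptions made for this example, and treating the smaller standard deviation as the sign of lower volatility simply follows the reasoning of this lesson rather than a formal financial definition.

```python
# Summary statistics copied from the table (closing prices in dollars, 2020)
stocks = {
    "APDN": {"low": 2.52, "mean": 6.89, "high": 15.21, "sd": 2.24},
    "DSS": {"low": 4.04, "mean": 6.90, "high": 10.89, "sd": 1.69},
}


def price_range(stats):
    """Range = high minus low, rounded to whole cents."""
    return round(stats["high"] - stats["low"], 2)


ranges = {name: price_range(stats) for name, stats in stocks.items()}

# The stock with the smaller standard deviation fluctuated less
less_volatile = min(stocks, key=lambda name: stocks[name]["sd"])

print(ranges)
print(less_volatile)
```

Both measures point the same way here: DSS has the smaller range and the smaller standard deviation.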

The graph below shows the change of the closing stock prices of the two companies during $2020$ along with the ranges and the means.

The histogram below shows the distribution of the stock prices.

Pop Quiz

The following applet shows the histograms of two data sets. Move the slider to investigate the data sets separately. Then, answer the given questions based on visual observations.

Sometimes, instead of calculating the numerical statistics, visual inspection of the data distribution can tell the difference between data sets. This skill can come in handy when reading magazine articles and posts online, for example.

Example

Consider the following two histograms where neither the labels nor scales are specified.

The two histograms represent different distributions, and both have $26$ columns.

- One histogram shows the distribution of the numbers drawn in the Powerball lottery game since the latest version was introduced in $2015.$ In this version of the game, the randomly drawn Powerball number is between $1$ and $26.$
- The other histogram shows the distribution of AMC $8$ (American Mathematics Competition) results in $2020.$ In this competition, participants answered $25$ multiple-choice questions, so a student could get between $0$ and $25$ answers correct. The histogram shows how many questions the participants answered correctly.

Match each histogram (Histogram A and Histogram B) with the data it represents (AMC $8$ or the Powerball lottery).

In a lottery, all numbers are drawn with equal probability.

In a mathematics competition, only a few students answer all questions correctly. Still, a lot of students will be able to answer at least some of the questions correctly. Consequently, it is likely that the histogram is shaped like a mountain, with a peak in the center and low ends. The shape of Histogram A reflects this behavior.

In a lottery, all numbers are drawn with equal probability, so in the long run, it can be expected that there is little difference between the frequencies. The shape of Histogram B reflects this.

There is even more fascinating information to be discovered from the shapes of the histograms.

The height of the lone tall bar furthest to the left in Histogram A shows that in the AMC $8$ competition, there were plenty of participants in $2020$ who did not answer a single question correctly! It is much more likely, however, that these participants registered but did not attend the competition.

Histogram A's peak shows that in $2020,$ on average, students in the AMC $8$ competition answered fewer than half of the questions correctly.

The fluctuation of the bar heights in Histogram B shows that although an even distribution of the numbers is expected on the Powerball draw, some numbers historically came out fewer times.

The bar corresponding to $24$ is more than twice as high as the bar corresponding to $16.$ However, this does not mean that $24$ is twice as likely to come out in a draw. Nor does this mean that players should now play $16$ because it will eventually catch up. The data is historical; it does not have any effect on the next draw.
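The uneven bar heights in a fair lottery can be illustrated with a simulation. The Python sketch below is a hedged illustration, not the actual Powerball history: the number of draws ($600$) and the random seed are arbitrary assumptions, and `simulate_draws` is a made-up helper name. Every number from $1$ to $26$ is equally likely on each draw, yet the simulated frequencies still differ from one another.

```python
import random
from collections import Counter


def simulate_draws(num_draws, seed=0):
    """Simulate independent, equally likely draws of a number from 1 to 26."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    return Counter(rng.randint(1, 26) for _ in range(num_draws))


# Assumed number of draws, chosen only for illustration
counts = simulate_draws(600)

# Frequencies for every number, including any that were never drawn
frequencies = [counts[number] for number in range(1, 27)]

print("expected per number:", 600 / 26)
print("highest bar:", max(frequencies))
print("lowest bar:", min(frequencies))
```

Changing the seed changes which numbers end up with the tallest and shortest bars, which is another way of seeing that past frequencies say nothing about the next draw.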

Closure

At the beginning of the lesson, a data set of fish species was presented. The task was to match the Latin and Finnish names of four fish species by analyzing and comparing the fish drawings with a data set of fish lengths and heights.

Since the fish drawings are not equal in scale, inspecting the length and height of the fish directly from the images is not helpful by itself. Using the data set, however, the ratio of length to height for each species can be computed to provide information about the fish shape. Then it can be compared to an analysis of the drawings.

The mean of the length to height ratios for each species should provide a good understanding of the shape of the species. The mean can be found using a graphing calculator, spreadsheet software, or by hand. To do so by hand, add the values in each column and divide by the number of data points. The results, in increasing order, are summarized in the following table.

| | Mean Length to Height Ratio |
|---|---|
| Abramis bjorkna | $2.55$ |
| Leuciscus rutilus | $3.75$ |
| Osmerus eperlanus | $5.95$ |
| Esox lucius | $6.33$ |

Next, the actual drawings can be used to find their length to height ratios. This measurement, however, is in pixels instead of centimeters. Most image software on a standard computer can show these measurements. Here, they are given. Recall that the drawings use the Finnish names.

External credits: Couch, Jonathan (Freshwater and Marine Image Bank)

External credits: Seeley, H. G. (Freshwater and Marine Image Bank)

The results, in increasing order, can be summarized as follows.

| | Length to Height Ratio (Images) |
|---|---|
| Pasuri | $\dfrac{361}{120}\approx 3.01$ |
| Särki | $\dfrac{393}{106}\approx 3.71$ |
| Hauki | $\dfrac{358}{59}\approx 6.07$ |
| Norssi | $\dfrac{358}{55}\approx 6.51$ |

The numbers in the two tables do not match exactly, which makes sense given that they were obtained using different units and the images do not match in scale. Still, in both tables, two species have a ratio above $5$ and two species have a ratio below $4.$ That means the following distinction can be made.

| | Latin Name (Data Set) | Finnish Name (Images) |
|---|---|---|
| Longer Fishes | Osmerus eperlanus and Esox lucius | Hauki and Norssi |
| Taller Fishes | Abramis bjorkna and Leuciscus rutilus | Pasuri and Särki |

Within each of these two groups, the ratios are too close in value to reliably distinguish between the species. Therefore, using the tables, matching the names in corresponding order seems like a natural method to reach a conclusion.

| Latin Name | | Finnish Name | Correct? |
|---|---|---|---|
| Abramis bjorkna | ⟷ | Pasuri | ✓ |
| Leuciscus rutilus | ⟷ | Särki | ✓ |
| Osmerus eperlanus | ⟷ | Hauki | × |
| Esox lucius | ⟷ | Norssi | × |

These pairings, however, are not entirely correct. While the taller fishes are paired correctly, the longer fishes are not matched correctly. The correct matches are shown in the table below, which just for reference also includes the English names.

| Latin Name | Finnish Name | English Name |
|---|---|---|
| Abramis bjorkna | Pasuri | Bream |
| Leuciscus rutilus | Särki | Roach |
| Osmerus eperlanus | Norssi | Smelt |
| Esox lucius | Hauki | Pike |

This example shows that while data contains valuable information, it needs to be treated carefully. Conclusions based on the evaluation of some data are not always accurate. Therefore, statisticians will typically use additional measures to calculate how confident they can be in the conclusions gathered from the methods used in this lesson.
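The matching strategy used in this closure, pairing the species in order of their length to height ratios, can be written out as a short Python sketch. It is an illustration under stated assumptions: the variable names are made up, the pixel measurements are read as (length, height) pairs from the image ratios given above, and the mean ratios and correct pairings are copied from the tables.

```python
# Mean length to height ratios from the data set (Latin names)
data_ratios = {
    "Abramis bjorkna": 2.55,
    "Leuciscus rutilus": 3.75,
    "Osmerus eperlanus": 5.95,
    "Esox lucius": 6.33,
}

# (length, height) of each drawing in pixels (Finnish names)
image_pixels = {
    "Pasuri": (361, 120),
    "Särki": (393, 106),
    "Hauki": (358, 59),
    "Norssi": (358, 55),
}

# Correct pairings, as listed in the final table
correct = {
    "Abramis bjorkna": "Pasuri",
    "Leuciscus rutilus": "Särki",
    "Osmerus eperlanus": "Norssi",
    "Esox lucius": "Hauki",
}

image_ratios = {name: length / height
                for name, (length, height) in image_pixels.items()}

# Pair the names in increasing order of ratio
latin_sorted = sorted(data_ratios, key=data_ratios.get)
finnish_sorted = sorted(image_ratios, key=image_ratios.get)
guess = dict(zip(latin_sorted, finnish_sorted))

num_correct = sum(guess[latin] == correct[latin] for latin in guess)
for latin, finnish in guess.items():
    print(latin, "<->", finnish, "ok" if correct[latin] == finnish else "wrong")
print(num_correct, "of", len(guess), "pairs matched correctly")
```

Only two of the four rank-order pairings agree with the correct table, which mirrors the point of the closure: a reasonable-looking method applied to close values can still lead to a wrong conclusion.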
