
Analyzing and comparing data is as important as collecting it. This lesson covers basic data analysis concepts. It starts with finding a central value and then moves on to measuring how spread out the data points are. Working through the lesson builds a good understanding of these statistical measures.
### Catch-Up and Review

**Here is a recommended reading before getting started with this lesson.**

- What is Statistics?

Challenge

Emily and Ignacio love learning about animals. They believe they can make meaningful discoveries by studying data about any animal, beginning with cats. They choose to create a data set, consisting of seven data points, showing the lifespan of cats in their neighborhood. They surveyed their neighbors to get this information.

Lifespan of Cats (in years) | | | |
---|---|---|---|
$15$ | $11$ | $14$ | $15$ |
$14$ | $17$ | $13$ | |

Answer the following questions using this data set.

a What is the average lifespan of a cat?


b Which number, if any, occurs most frequently?


c Rearrange the data from least to greatest. What number is in the middle of this sorted data set?


Discussion

A data set is a collection of values that provides information. These values can be presented in various ways such as in numbers or categories. The values are typically gathered through measurements, surveys, or experiments. Consider a data set that consists of the heights of a group of actors.

Actor | Height |
---|---|
Madzia | $5$ ft $4$ in. |
Magda | $5$ ft $2$ in. |
Ignacio | $6$ ft $1.6$ in. |
Henrik | $5$ ft $10$ in. |
Ali | $6$ ft $1$ in. |
Diego | $5$ ft $2$ in. |
Miłosz | $5$ ft $2$ in. |
Paulina | $5$ ft $3$ in. |
Aybuke | $5$ ft $7$ in. |
Mateusz | $6$ ft $1.2$ in. |
Gamze | $5$ ft $3$ in. |
Marcin | $5$ ft $7$ in. |
Marcial | $5$ ft $8$ in. |
Heichi | $5$ ft $5$ in. |
Arkadiusz | $5$ ft $6$ in. |
Enrique | $5$ ft $10.5$ in. |
Aleksandra | $5$ ft $4$ in. |
Mateusz | $5$ ft $9$ in. |
Jordan | $5$ ft $5$ in. |
Paula | $5$ ft $2$ in. |
MacKenzie | $5$ ft $6$ in. |
Joe | $6$ ft $1$ in. |
Flavio | $5$ ft $10$ in. |
Jeremy | $5$ ft $4$ in. |
Umut | $6$ ft $1$ in. |

Number of Observations: $24$

Number of Variables: $2$

The actual number or category associated with each data point is called a data value.

Discussion

The mean, or the **average**, of a numerical data set is one of the measures of center. It is defined as the sum of all of the data values in a set divided by the number of values in the set.

$\text{Mean}=\dfrac{\text{Sum of Values}}{\text{Number of Values}}$
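As a quick check of the formula, the following Python sketch computes the mean of the cat lifespans from the Challenge in two ways: directly from the definition and with the standard library. The variable names are chosen here for illustration only.

```python
from statistics import mean

# Lifespans (in years) of the seven cats from the Challenge data set.
cat_lifespans = [15, 11, 14, 15, 14, 17, 13]

# Mean = sum of values / number of values.
manual_mean = sum(cat_lifespans) / len(cat_lifespans)

# statistics.mean applies the same definition.
library_mean = mean(cat_lifespans)

print(round(manual_mean, 1))   # 14.1
print(round(library_mean, 1))  # 14.1
```

The sum of the values is $99$ and there are $7$ values, so the mean is about $14.1$ years.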


Discussion

The median is a measure of center that lies in the middle of a numerical data set when the data set is written in numerical order. When the data set has an odd number of data points, the median is the value in the middle.

However, when the data set has an even number of data points, the median is the average of the two middle numbers.
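The two cases can be sketched in a few lines of Python. The function below is a minimal illustration, not a reference implementation. The first call uses the cat lifespans from the Challenge, while the four-value list in the second call is hypothetical and included only to show the even case.

```python
def median(values):
    """Return the median of a non-empty list of numbers."""
    ordered = sorted(values)          # write the data in numerical order
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]           # odd count: the middle value
    # even count: the average of the two middle values
    return (ordered[mid - 1] + ordered[mid]) / 2

print(median([15, 11, 14, 15, 14, 17, 13]))  # 14 (odd number of values)
print(median([3, 1, 4, 2]))                  # 2.5 (even number of values)
```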

Discussion

The mode is a measure of center that shows the most common value in a data set. Modes can be used for both numerical and categorical data.

A data set can have more than one mode if two or more data values are equally common. However, if all values in the set only occur once, then the data set *does not* have a mode.
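A small Python sketch can make these three cases concrete. The helper name `modes` and the two short example lists are hypothetical; the first call uses the cat lifespans from the Challenge. The function assumes a non-empty list and returns an empty list when every value occurs exactly once, following the convention stated above.

```python
from collections import Counter

def modes(values):
    """Return all modes of a non-empty list, or [] if no value repeats."""
    counts = Counter(values)
    highest = max(counts.values())
    if highest == 1:
        return []                     # every value occurs once: no mode
    return sorted(v for v, c in counts.items() if c == highest)

print(modes([15, 11, 14, 15, 14, 17, 13]))  # [14, 15] (two modes)
print(modes(["cat", "dog", "cat"]))         # ['cat'] (categorical data)
print(modes([1, 2, 3]))                     # [] (no mode)
```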

Discussion

A measure of center, or a **measure of central tendency**, is a statistic that summarizes a data set by finding a central value. The most common measures of center are the mean, median, and mode.


Example

Ignacio volunteers at a dog shelter. He asks Emily to help him study a data set he made concerning the lifespan of some of the dogs. The information they gather will help the shelter! This time, the data set consists of eight data points rather than seven.

Lifespan of Dogs (in years) | | | |
---|---|---|---|
$10$ | $21$ | $16$ | $15$ |
$13$ | $15$ | $17$ | $11$ |

a What is the mean of the data set?


b What is the median of the data set?


c What is the mode of the data set?

a The mean of the data set is the sum of the data values divided by the number of data values.

b Order the data from least to greatest. What number is in the middle?

c The mode of a data set is the value that occurs most frequently.

a The mean of a data set is calculated by finding the sum of all values in the set and then dividing by the number of values in the set. In this case, there are $8$ values in the set.

$10,21,16,15,13,15,17,11 $

Add all values and divide the sum by $8.$

$\text{Mean}=\dfrac{\text{Sum of Values}}{\text{Number of Values}}$

Substitute values

$\text{Mean}=\dfrac{10+21+16+15+13+15+17+11}{8}$

Add terms

$\text{Mean}=\dfrac{118}{8}$

Calculate quotient

$\text{Mean}=14.75$

b Start by ordering the values from least to greatest.

Unordered Data Set: $10,21,16,15,13,15,17,11$

Ordered Data Set: $10,11,13,15,15,16,17,21$

The number of values matters when determining the median.

- For a set with an odd number of values, the median is the middle value.
- For a set with an even number of values, the median is the mean of the two middle values.

This data set has $8$ values, so the two middle values are the fourth and fifth values of the ordered set.

Ordered Data Set: $10,11,13,15,15,16,17,21$

Therefore, the median of the data set is the mean of $15$ and $15.$ The median of this data set is $15$ years.

c Remember that the mode of a data set is the value or values that occur most often. Take another look at the given data set.

Ordered Data Set: $10,11,13,15,15,16,17,21$

As seen, $15$ occurs two times and the rest of the numbers occur only once. This means that the mode of the data set is $15$ years. Note that while the mean, median, and mode are close in this instance, they may vary in other cases.

Discussion

Just as measures of center summarize a data set with one central value, there are measures that describe with a single number how much the values in a data set differ from each other. These measures summarize the spread of the data.

Concept

Range is a measure of spread that measures the difference between the maximum and minimum values of the data set.
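For instance, the range of the cat lifespans from the Challenge can be computed as in the sketch below. The function name `data_range` is a hypothetical choice made for this illustration.

```python
def data_range(values):
    """Range = maximum value - minimum value."""
    return max(values) - min(values)

# Lifespans (in years) of the seven cats from the Challenge data set.
cat_lifespans = [15, 11, 14, 15, 14, 17, 13]

print(data_range(cat_lifespans))  # 6, since 17 - 11 = 6
```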

Discussion

Quartiles are three values that divide a data set into four equal parts. The quartiles are denoted as $Q_{1},$ $Q_{2},$ and $Q_{3}.$ The second quartile $Q_{2},$ also known as the median, divides the ordered data set into two halves.

$\overbrace{a,\ b,\ c}^{\text{Lower half}},\ \underset{Q_{2}}{d},\ \overbrace{e,\ f,\ g}^{\text{Upper half}}$

The median of the lower half is the first quartile $Q_{1},$ while the median of the upper half is the third quartile $Q_{3}.$

$\overbrace{a,\ \underset{Q_{1}}{b},\ c}^{\text{Lower half}},\ \underset{Q_{2}}{d},\ \overbrace{e,\ \underset{Q_{3}}{f},\ g}^{\text{Upper half}}$

The first quartile is also called the lower quartile, and the third quartile is also called the upper quartile. To find the quartiles of a data set, the values must first be written in numerical order.

Discussion

The interquartile range, or **IQR**, of a data set is a measure of spread that measures the difference between $Q_{3}$ and $Q_{1},$ the upper and lower quartiles.

$\text{IQR}=Q_{3}-Q_{1}$
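The quartiles and the IQR can be sketched in Python using the median-of-halves approach described in this lesson, where the middle value is left out of both halves when the number of values is odd. The function names are hypothetical, and the code assumes at least two data values. Library routines such as `statistics.quantiles` use interpolation rules that can give slightly different quartiles, so this sketch deliberately does not rely on them.

```python
def median(ordered):
    """Median of a list that is already in numerical order."""
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

def quartiles(values):
    """Return (Q1, Q2, Q3) using the median-of-halves method."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    # For an odd count, skip the middle value so it is in neither half.
    upper = ordered[half + 1:] if n % 2 == 1 else ordered[half:]
    return median(lower), median(ordered), median(upper)

def iqr(values):
    q1, _, q3 = quartiles(values)
    return q3 - q1

# Lifespans (in years) of the seven cats from the Challenge data set.
cat_lifespans = [15, 11, 14, 15, 14, 17, 13]

print(quartiles(cat_lifespans))  # (13, 14, 15)
print(iqr(cat_lifespans))        # 2
```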


Discussion

The interquartile range (IQR) of a data set is found by first identifying the three quartiles and then calculating the difference between the third and the first quartile. Consider the following data set.

$1,3,4,4,5,6,6,8,8,10,10,11 $

The interquartile range of the data set can be found by following these four steps.

1. Identify the Median

First, identify the median of the given data set. Since the number of values is even, the median is the mean of the two middle values.

The median of the data is $6.$

2. Identify the Lower and the Upper Half of the Data Set

The median divides the data into two halves, a lower half and an upper half. For this data, the lower half includes the first six values and the upper half includes the following six.

When there is an odd number of values in the data set, the middle value is excluded from both the lower and upper sets.

3. Find the First and the Third Quartile

Find the first and the third quartile. The first quartile, $Q_{1},$ is the median of the lower set, while the third, $Q_{3},$ is the median of the upper set. Here, both quartiles are found the same way the median was found.

4. Calculate the Interquartile Range

The interquartile range is calculated by subtracting the first quartile, $Q_{1},$ from the third, $Q_{3}.$ For the given data set, the first quartile is $4$ and the third quartile is $9.$

$\text{IQR}=Q_{3}-Q_{1}=9-4=5$

The interquartile range of the given data set is $5.$

Example

Ignacio and Emily enjoyed learning about cats and dogs so much that they now want to compare the spread of one data set with the spread of another. They collected a few more data points and compiled a data set consisting of nine data points for the weights of cats.

Weights of Cats (lb): $9,8,11,7,10,7,11,8,12$

Then, they collected a data set of ten data points for the weights of dogs.

Weights of Dogs (lb): $11,18,29,32,32,35,37,44,55,79$

a Which type of pet has a larger weight range: dogs or cats?

b Find the interquartile range of each data set.

### Hint

a The range of the data set is the difference between the greatest and least data values.

b The interquartile range is the distance between the first and the third quartiles of the data set.

### Solution

a The range is one of the measures of spread. It is the difference between the maximum and minimum values of the data set. The range of each data set will be calculated individually.

### Range for the Weights of Cats

The least and greatest values can be identified without sorting the data values. Note that they can be listed in order if desired.

Weights of Cats (lb): $9,8,11,7,10,7,11,8,12$

The least value is $7$ and the greatest value is $12.$ The difference between these values is $12-7=5.$

$\text{Range}=12-7=5\text{ lb}$

The range for the weights of cats is $5$ pounds.

### Range for the Weights of Dogs

Apply the same procedure of identifying the greatest and least values for the data set of dogs.

Weights of Dogs (lb): $11,18,29,32,32,35,37,44,55,79$

The least value is $11$ and the greatest value is $79.$ The difference between these values is $79-11=68.$

$\text{Range}=79-11=68\text{ lb}$

Dogs have a weight range of $68$ pounds. This far exceeds the $5$-pound range for cats.

b The interquartile range of each data set will be calculated individually.

### Interquartile Range of Cat Weights

Here, it is necessary to order the values from least to greatest. Then identify the median of the given data set. Since the number of values is an odd number, the median is the middle value.

Ordered Weights of Cats (lb): $7,7,8,8,9,10,11,11,12$

The median of the data is $9.$ Both the lower and upper halves contain four data values. Therefore, there are two middle values in each half. The median of each half is the mean of the two middle values.

The first quartile is $7.5,$ and the third quartile is $11.$ The difference between the third quartile and the first quartile is the interquartile range. The interquartile range of cat weights is $3.5$ pounds.

### Interquartile Range of Dog Weights

In this case, the data values are already ordered from least to greatest and the number of values is an even number. This means that the median is the mean of the two middle values.

The median of the data is $33.5.$ Both the lower and upper halves contain five data values. Therefore, there is only one middle value in each half.

The first quartile is $29,$ and the third quartile is $44.$ The difference between the third quartile and the first quartile is the interquartile range. The interquartile range of dog weights is $15$ pounds.

Discussion

A five-number summary of a data set consists of the following *five* values.

- Minimum value
- First quartile $Q_{1}$
- Median, or second quartile $Q_{2}$
- Third quartile $Q_{3}$
- Maximum value

These values provide a summary of the central tendency and spread of the data set. The five-number summary is useful for understanding the variability in a data set. When the data set is written in numerical order, the median divides the data set into two halves. The median of the lower half is the first quartile $Q_{1}$ and the median of the upper half is the third quartile $Q_{3}.$
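As an illustration, the sketch below builds the five-number summary of the dog lifespans from the earlier example. It assumes the median-of-halves method described in this lesson and at least two data values; the function names are hypothetical.

```python
def median(ordered):
    """Median of a list that is already in numerical order."""
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum)."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    # For an odd count, the middle value belongs to neither half.
    upper = ordered[half + 1:] if n % 2 == 1 else ordered[half:]
    return ordered[0], median(lower), median(ordered), median(upper), ordered[-1]

# Lifespans (in years) of the eight dogs from the earlier example.
dog_lifespans = [10, 21, 16, 15, 13, 15, 17, 11]

print(five_number_summary(dog_lifespans))  # (10, 12.0, 15.0, 16.5, 21)
```

Here the ordered data are $10,11,13,15,15,16,17,21,$ so the lower half is $10,11,13,15$ and the upper half is $15,16,17,21.$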

Discussion

An outlier is a data point that is significantly different from the other values in the data set. It can be significantly larger or significantly smaller than the others.

Categorical data sometimes also have unusual elements; these can be called outliers as well.

However, it is best to use the term outlier when applying a mathematical process to identify it. This helps distinguish an intuitive judgment from a more formal one.

### What Does Significantly Different Mean?

For numerical data, the following definition is one of several approaches that can be used.

- A data value is an outlier, that is, *significantly different* from the other values, if it is farther away from the closest quartile than $1.5$ times the interquartile range.

The factor of $1.5$ was suggested by the esteemed American mathematician John Tukey.
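This rule can be sketched in Python as follows. The code flags any value below $Q_{1}-1.5\,\text{IQR}$ or above $Q_{3}+1.5\,\text{IQR},$ with quartiles found by the median-of-halves method from this lesson. The function names are hypothetical, and the data are the dog weights collected by Ignacio and Emily earlier.

```python
def median(ordered):
    """Median of a list that is already in numerical order."""
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

def fences(values):
    """Return the (lower, upper) limits beyond which a value is an outlier."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    q1 = median(ordered[:half])
    q3 = median(ordered[half + 1:] if n % 2 == 1 else ordered[half:])
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

def find_outliers(values):
    low, high = fences(values)
    return [v for v in values if v < low or v > high]

# Weights (in pounds) of the ten dogs from the earlier example.
dog_weights = [11, 18, 29, 32, 32, 35, 37, 44, 55, 79]

print(fences(dog_weights))         # (6.5, 66.5)
print(find_outliers(dog_weights))  # [79]
```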

Example

Ignacio is relaxing, enjoying reviewing some data. Wait a minute! There is something unusual about a data value in the data set for dogs.

Weights of Dogs (lb): $11,18,29,32,32,35,37,44,55,79$

a Identify the outlier in the data set.

b Find the range and interquartile range of the data set without the outlier.

c Which measure does the outlier affect more?

### Hint

a Is there a value that is larger or smaller than most values? Check if there is any value less than $Q_{1}-1.5\,\text{IQR}$ or greater than $Q_{3}+1.5\,\text{IQR}.$

b The range of a data set is the difference between the greatest and smallest data values. The interquartile range of a data set is the distance between the first and the third quartiles of the data set.

c Compare the range and interquartile range of the data set with and without the outlier.

### Solution

a In the given data set, all values seem to be around the same number, except $79.$ This value seems to be *significantly different* from other values. Therefore, it is likely to be an outlier of the data set.

Weights of Dogs (lb): $11,18,29,32,32,35,37,44,55,79$

To confirm that, check whether it is farther away from the closest quartile than $1.5$ times the interquartile range. The first quartile of this data is $29,$ and the third quartile is $44.$ The interquartile range of this data set is $15.$ Now calculate $Q_{3}+1.5\,\text{IQR}.$

$Q_{3}+1.5\,\text{IQR}$

Substitute $Q_{3}=44$ and $\text{IQR}=15$

$44+1.5\cdot 15$

Multiply

$44+22.5$

Add terms

$66.5$

This means that any value greater than $66.5$ is an outlier. Therefore, the value $79$ is an outlier.

b Exclude the outlier found in Part A from the data set.

Weights of Dogs Without Outlier (lb): $11,18,29,32,32,35,37,44,55$

### Finding Range

To find its range, subtract the smallest value from the greatest.

$\text{Range}=55-11=44$

### Finding Interquartile Range

After excluding the outlier, the number of values decreased by one. There are nine values now, so the median is the middle value.

The median of the data is $32.$ Both the lower and upper halves contain four data values. Therefore, there are two middle values in each half. The median of each half is the mean of the two middle values.

The first quartile is $23.5,$ and the third quartile is $40.5.$ The difference between the third quartile and the first quartile is the interquartile range.

$\text{IQR}=Q_{3}-Q_{1}$

Substitute $Q_{1}=23.5$ and $Q_{3}=40.5$

$\text{IQR}=40.5-23.5$

Subtract term

$\text{IQR}=17$

The interquartile range of the data when the outlier is taken out of the data set is $17.$

c Consider the data once again, with and without the outlier.

Weights of Dogs (lb): $11,18,29,32,32,35,37,44,55,79$

Weights of Dogs Without Outlier (lb): $11,18,29,32,32,35,37,44,55$

Summarize the results found in the previous parts.

 | Range | IQR |
---|---|---|
With Outlier | $68$ | $15$ |
Without Outlier | $44$ | $17$ |

After removing the outlier from the data, the range decreased from $68$ to $44,$ while the IQR increased from $15$ to $17.$ This example shows that outliers have a bigger impact on the range of values than on the IQR.
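The comparison can be reproduced with a short Python sketch. It assumes the median-of-halves quartile method used in this lesson; the function and variable names are hypothetical.

```python
def median(ordered):
    """Median of a list that is already in numerical order."""
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

def data_range(values):
    return max(values) - min(values)

def iqr(values):
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    q1 = median(ordered[:half])
    q3 = median(ordered[half + 1:] if n % 2 == 1 else ordered[half:])
    return q3 - q1

with_outlier = [11, 18, 29, 32, 32, 35, 37, 44, 55, 79]
without_outlier = [w for w in with_outlier if w != 79]

# How much each measure of spread changes when the outlier is removed.
range_change = abs(data_range(with_outlier) - data_range(without_outlier))
iqr_change = abs(iqr(with_outlier) - iqr(without_outlier))

print(range_change)  # 24
print(iqr_change)    # 2.0
```

Removing the single value $79$ shifts the range by $24$ pounds but the IQR by only $2$ pounds, which matches the table above.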

Pop Quiz

Measures of spread, such as the range and interquartile range, indicate how much data values vary, while outliers are values that significantly deviate from the rest. Practice calculating these measures for the given data.