Unlocking the Secrets: Spread of Data and Outliers

Name	Height	Height in inches
Jahmani Hot Shot Swanson	4'5''	12 × 4 + 5 = 53
Jonte Too Tall Hall	5'2''	12 × 5 + 2 = 62
Donald Ducky Moore	6'0''	12 × 6 + 0 = 72
Solomon Bam Bam Bamiro	6'5''	12 × 6 + 5 = 77
Sean Elevator Williams	6'10''	12 × 6 + 10 = 82
Paul Tiny Sturgess	7'8''	12 × 7 + 8 = 92
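
The conversion in the last column of the table can be sketched in a few lines of Python. The helper name `height_to_inches` is an assumption made for this illustration, and the parsing assumes heights are written in the same feet-and-inches style as in the table.

```python
def height_to_inches(height: str) -> int:
    """Convert a height such as "6'10''" to inches, using 12 inches per foot."""
    # Split at the first apostrophe: feet on the left, inches on the right.
    feet, _, rest = height.partition("'")
    # Drop the trailing inch marks and any stray spaces.
    inches = rest.strip("'\" ")
    return 12 * int(feet) + int(inches or 0)


heights = ["4'5''", "5'2''", "6'0''", "6'5''", "6'10''", "7'8''"]
inches = [height_to_inches(h) for h in heights]
print(inches)  # [53, 62, 72, 77, 82, 92]
```

Once every height is in a single unit, the values can be compared and summarized directly.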

	Mean	Median	Standard Deviation	Interquartile Range
Data A	20.15	20	5.13	7
Data B	25.36	28	8.54	10
Data C	20.58	21	9.92	19
Data D	7.26	6	7.52	3
Data E	21.05	22	11.57	20

Data's Distribution	Some Observations	Preferred Statistics
Symmetric Distribution	Both the mean and the median are close to the center.	If the histogram has one peak, then both mean and median describe a typical data value. In this case, the preferred statistic is the mean, since it also considers the actual data values, not just the order.
Skewed Distribution or With Outliers	The extreme values can distort the mean and increase the standard deviation.	In this case, the preferred statistic to describe a typical data value is the median, since it only considers the order of the data. Furthermore, it is less sensitive to extreme values.

Possible Reason	Example
It can be a result of a data recording error. If this is obvious, then this data can be removed or modified.	Suppose a scientist records the length of some leaves from a tree in centimeters, measured to the nearest millimeter. In that case, a typical data entry has one decimal place. Yet, if the value 104 appears in the data, then it is likely that it was mistakenly recorded and meant to be recorded as 10.4.
The nature of the data is such that unusual entries can occur. In this case, this entry should also be considered. Still, usually, the median better describes a typical data value than the mean.	When looking at a data set of the heights of Star Wars characters, one should expect to see a few extremely low and high values.
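
To see how much a single recording error can matter, here is a minimal sketch in Python. The leaf lengths are hypothetical values chosen for illustration, not data from the lesson. The second list repeats the first with 10.4 mistyped as 104.

```python
from statistics import mean, median

# Hypothetical leaf lengths in centimeters (illustrative values only).
correct = [9.8, 10.1, 10.4, 10.6, 11.0]
# The same measurements with 10.4 mistakenly recorded as 104.
with_typo = [9.8, 10.1, 104, 10.6, 11.0]

mean_correct, median_correct = mean(correct), median(correct)
mean_typo, median_typo = mean(with_typo), median(with_typo)

print("mean:  ", mean_correct, "->", mean_typo)
print("median:", median_correct, "->", median_typo)
```

In this sketch the mean almost triples, while the median only moves from 10.4 to 10.6. That is the sense in which the median is less sensitive to extreme values.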

Consider the following four histograms.

Match the box plots to the correct histogram.

Let's start with histogram A. If a histogram is skewed right, the corresponding box plot will have more observations on its left. Notice the outlier on the right-hand side of the histogram. This corresponds to a value lying separately from the box plot. These observations suggest that A can be paired with II.

Examining histogram B, we notice that it is symmetrical around the middle. Since we have fewer observations at the edges of the histogram and more observations clustered around the middle, we get a shorter box plot. This corresponds to III.

In C, the observations are pushed off to the sides. When most observations are far away from the median, the box will be stretched. With this information, we can pair C with IV.

In D, observations are evenly distributed. When observations are evenly distributed, the box plot's four sections have roughly the same length, so we can pair D with I.

A restaurant wants to better understand how much its customers tip. On a given day, the manager records the tips from 20 random customers and organizes the observations in ascending order.

$0, $0, $0, $1, $2, $4, $4, $4, $4, $5, $5, $6, $6, $7, $8, $8, $9, $10, $12, $40

What is the median?

What is the lower quartile?

Which is the better measure of center for this distribution, the mean or the median?

Which is the better measure of dispersion, the interquartile range or the standard deviation?

The manager sees that the average tip is $7, but he realizes that this may not best represent the actual situation for his servers. Based on the data he gathered, he wonders if this average might be higher than the usual tip his servers receive. What could be the reason for this high average? What would be a more appropriate mean?

Since the data set is already written in ascending order, we can find the median immediately by determining the middle value. The data set has an even number of observations, 20, so the median is the average of the 10th and 11th values.

Observation(s)	Data
1st - 10th	0, 0, 0, 1, 2, 4, 4, 4, 4, 5
Median	(5 + 5)/2 = 5
11th - 20th	5, 6, 6, 7, 8, 8, 9, 10, 12, 40

Therefore, the median tip is $5.
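
The middle-value rule can be checked with a short Python sketch. The function name `median_of_sorted` is an assumption made for this illustration, and it expects a list that is already in ascending order.

```python
def median_of_sorted(values):
    """Median of a sorted list: the middle value, or the mean of the two middle values."""
    n = len(values)
    mid = n // 2
    if n % 2 == 1:
        # Odd number of observations: a single middle value exists.
        return values[mid]
    # Even number of observations: average the two middle values.
    return (values[mid - 1] + values[mid]) / 2


tips = [0, 0, 0, 1, 2, 4, 4, 4, 4, 5, 5, 6, 6, 7, 8, 8, 9, 10, 12, 40]
tip_median = median_of_sorted(tips)
print(tip_median)  # 5.0
```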

To find the lower quartile, we divide the observations below the median into two equal halves. Notice that we again have an even number of observations, this time 10. Therefore, the lower quartile is the average of the 5th and 6th observations.

Observation(s)	Data
1st - 5th	0, 0, 0, 1, 2
Lower Quartile	(2 + 4)/2 = 3
6th - 10th	4, 4, 4, 4, 5
Median	5
11th - 20th	5, 6, 6, 7, 8, 8, 9, 10, 12, 40

Consequently, the lower quartile is $3.
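
The same halving idea can be written as a small Python sketch. The function name `quartiles` is an assumption for this illustration. It follows the convention used here, where each quartile is the median of one half of the ordered data. Other textbooks and software use slightly different conventions, so their results may differ.

```python
from statistics import median


def quartiles(sorted_values):
    """Return (Q1, Q3) as the medians of the lower and upper halves.

    For an odd number of observations the middle value is left out of both halves.
    """
    n = len(sorted_values)
    lower_half = sorted_values[: n // 2]
    upper_half = sorted_values[(n + 1) // 2 :]
    return median(lower_half), median(upper_half)


tips = [0, 0, 0, 1, 2, 4, 4, 4, 4, 5, 5, 6, 6, 7, 8, 8, 9, 10, 12, 40]
q1, q3 = quartiles(tips)
print(q1, q3)  # 3.0 8.0
```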

Depending on the distribution of the data set, the mean or the median could be a better measure of the center. We can use a histogram of the data to determine which is better. Let's first organize the data into intervals. Each interval includes its left endpoint but not its right endpoint.

Interval	Observations	Count
0-5	0, 0, 0, 1, 2, 4, 4, 4, 4	9
5-10	5, 5, 6, 6, 7, 8, 8, 9	8
10-15	10, 12	2
15-20		0
20-25		0
25-30		0
30-35		0
35-40		0
40-45	40	1

Now we have all the information we need to draw the histogram.
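
The counts in the interval table can be reproduced with a short Python sketch. It assumes intervals of width 5 that include their left endpoint, which matches how 5 and 10 are placed in the table.

```python
from collections import Counter

tips = [0, 0, 0, 1, 2, 4, 4, 4, 4, 5, 5, 6, 6, 7, 8, 8, 9, 10, 12, 40]
width = 5

# Map each tip to the left endpoint of its interval, then count per interval.
counts = Counter((tip // width) * width for tip in tips)

for start in range(0, 45, width):
    print(f"{start}-{start + width}: {counts.get(start, 0)}")
```

Most observations fall in the first two intervals, and the single value in 40-45 sits far from the rest, which is what a right-skewed histogram with an outlier looks like.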

Observing the histogram, we see that it is right-skewed and has an outlier at 40. Therefore, the center is best described by its median at $5 because it is not affected by the outlier.

Like the mean, the standard deviation takes every observation into account, so the outlier inflates it. The interquartile range (IQR), on the other hand, is the difference between the upper quartile (Q_3) and the lower quartile (Q_1).

IQR = Q_3 - Q_1

Because the quartiles depend only on the middle part of the ordered data, the interquartile range does not depend on how extreme the outlier is. Therefore, the interquartile range is the better measure of dispersion.
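
A minimal Python sketch can make this contrast concrete. It makes two assumptions that are not taken from the lesson: the population standard deviation (`statistics.pstdev`) is used, and the outlier is replaced by a hypothetical, even more extreme tip of $400 to see which measure reacts.

```python
from statistics import median, pstdev


def iqr(sorted_values):
    """Interquartile range, with quartiles taken as medians of the two halves."""
    n = len(sorted_values)
    lower_half = sorted_values[: n // 2]
    upper_half = sorted_values[(n + 1) // 2 :]
    return median(upper_half) - median(lower_half)


tips = [0, 0, 0, 1, 2, 4, 4, 4, 4, 5, 5, 6, 6, 7, 8, 8, 9, 10, 12, 40]
# Hypothetical variant: the largest tip is 400 instead of 40.
more_extreme = tips[:-1] + [400]

print("IQR:", iqr(tips), "->", iqr(more_extreme))
print("Standard deviation:", round(pstdev(tips), 2), "->", round(pstdev(more_extreme), 2))
```

The interquartile range stays at 5 in both cases, while the standard deviation grows with the size of the outlier.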

As we found in Part D, we have an outlier in the data set that significantly increases the mean. If we remove the outlier, the mean would fall to something more representative for the population as a whole.

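The 20 tips add up to $135, so the mean is 135/20 = 6.75, or about $7. Without the $40 tip, the remaining 19 tips add up to $95, so the mean becomes 95/19 = 5. When we remove the outlier, the mean changes from about $7 to $5. The manager was clever to notice this!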
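
The effect of removing the outlier can be verified with a few lines of Python. This is only a check of the arithmetic, with the $40 tip treated as the outlier.

```python
tips = [0, 0, 0, 1, 2, 4, 4, 4, 4, 5, 5, 6, 6, 7, 8, 8, 9, 10, 12, 40]

# Mean of all 20 tips.
mean_all = sum(tips) / len(tips)

# Mean after dropping the $40 outlier.
without_outlier = [tip for tip in tips if tip != 40]
mean_without = sum(without_outlier) / len(without_outlier)

print(mean_all)      # 6.75
print(mean_without)  # 5.0
```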

A teacher needs to arrange their students for a school play. To do so, they measure the students' heights and make the following table.

Name	Height (cm)
Heichi	153
Maya	144
Zain	174
Claudia	156
Mark	176
Davontay	163

Which measure better describes the observations, the median or the mean?

What is the mean?

What is the median?

There are certain conditions that would make the mean unrepresentative of a population. If the data set has a skew to one side or if there is an outlier present, the mean will not accurately represent the data set. To determine if this is the case, we will plot the observations on a number line.

As we can see, the observations are spread fairly evenly and none of them appears to be an outlier. Therefore, we can use either the median or the mean to accurately describe the data set.

To calculate the mean, we divide the sum of the observations by the number of observations.

Mean = (153 + 144 + 174 + 156 + 176 + 163)/6 = 966/6 = 161

The mean height of the students is 161 centimeters.

To find the median, the data set first has to be written in ascending order.

1st: 144, 2nd: 153, 3rd: 156, 4th: 163, 5th: 174, 6th: 176

The median is given by the value in the middle of the data set. However, the data set has an even number of observations. Therefore, the median is the average of the 3rd and 4th observations.

Observations	Data
1st - 3rd	144, 153, 156
Median	(156 + 163)/2 = 159.5
4th - 6th	163, 174, 176

Consequently, the median height is 159.5 centimeters.
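
Both results for the students' heights can be double-checked with a short Python sketch that uses the standard `statistics` module. The variable names are chosen for this illustration.

```python
from statistics import median

heights_cm = {
    "Heichi": 153,
    "Maya": 144,
    "Zain": 174,
    "Claudia": 156,
    "Mark": 176,
    "Davontay": 163,
}

# Sort the observations so the middle values are easy to read off.
ordered = sorted(heights_cm.values())
mean_height = sum(ordered) / len(ordered)
median_height = median(ordered)

print(ordered)        # [144, 153, 156, 163, 174, 176]
print(mean_height)    # 161.0
print(median_height)  # 159.5
```

The mean and the median differ by only 1.5 centimeters, which is consistent with a data set that has no outliers or strong skew.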
