Here are a few recommended readings before getting started with this lesson.
Try a few practice exercises before beginning the lesson. Consider the following data set. For each practice exercise, give the answer rounded to two decimal places.
The dot plot below shows the distribution of a data set.
An outlier is a data point that is significantly different from the other values in the data set. It can be significantly larger or significantly smaller than the others.
Categorical data sometimes also have unusual elements; these can be called outliers as well.
However, it is best to use the term outlier when applying a mathematical process to identify it. This approach helps differentiate between an intuitive and a more formal approach.
What "significantly different" means can depend on many things. For numerical data, the following definition is one of several approaches that can be used. A data value can be considered an outlier if it is farther away from the closest quartile than a certain multiple of the interquartile range.
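The definition above can be sketched in code. This is a minimal illustration, not the lesson's own method. The multiple k = 1.5 is a widely used convention assumed here, and the function name `find_outliers` and the sample heights are hypothetical. Quartiles come from Python's `statistics.quantiles` with its default method, so other quartile conventions may give slightly different cutoffs.

```python
import statistics

def find_outliers(values, k=1.5):
    """Return the values lying more than k * IQR beyond the nearest quartile."""
    # quantiles(..., n=4) returns the three cut points Q1, median, Q3.
    q1, _median, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower_fence = q1 - k * iqr  # anything below this is flagged
    upper_fence = q3 + k * iqr  # anything above this is flagged
    return [v for v in values if v < lower_fence or v > upper_fence]

# Hypothetical heights in inches (an assumption, not the Globetrotters data).
sample_heights = [53, 70, 72, 74, 75, 76, 77, 78, 79, 80, 92]
print(find_outliers(sample_heights))
```

Choosing a larger multiple k widens the accepted range, so fewer values are flagged.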
The box plot below shows the distribution of the heights (in inches) of all the players who have ever played for the Harlem Globetrotters basketball team. The heights of six of the players are indicated on the number line with dots.
Select all the outliers from the given list.
Compare the length of the whiskers to the width of the box.
Since the scale on the box plot is given in inches, first the given height data should be converted to inches.
Name | Height | Height in inches |
---|---|---|
Jahmani "Hot Shot" Swanson | 4′5′′ | 12⋅4+5=53 |
Jonte "Too Tall" Hall | 5′2′′ | 12⋅5+2=62 |
Donald "Ducky" Moore | 6′0′′ | 12⋅6+0=72 |
Solomon "Bam Bam" Bamiro | 6′5′′ | 12⋅6+5=77 |
Sean "Elevator" Williams | 6′10′′ | 12⋅6+10=82 |
Paul "Tiny" Sturgess | 7′8′′ | 12⋅7+8=92 |
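The conversion used in the table can be written as a short sketch. The helper name `to_inches` and the dictionary layout are hypothetical choices made for illustration. The heights themselves are the six listed above.

```python
def to_inches(feet, inches):
    """Convert a height given in feet and inches to inches only."""
    return 12 * feet + inches

# (feet, inches) pairs taken from the table, keyed by last name.
heights = {
    "Swanson": (4, 5),
    "Hall": (5, 2),
    "Moore": (6, 0),
    "Bamiro": (6, 5),
    "Williams": (6, 10),
    "Sturgess": (7, 8),
}

converted = {name: to_inches(ft, inch) for name, (ft, inch) in heights.items()}
print(converted)
```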
Heights in the boxed section of the chart can be considered typical; they are not outliers. Heights close to the box are not typical; still, they are not extreme.
"Bam Bam" Bamiro's height is the median. This height is not an outlier.
"Ducky" Moore's height is a bit less than the first quartile, but not far away. This height is not an outlier.
"Elevator" Williams's height is a bit more than the third quartile, but not far away. This height is not an outlier.
The heights of "Hot Shot" Swanson and Jonte "Too Tall" Hall are much less than the first quartile. These heights are outliers.
"Tiny" Sturgess's height is much more than the third quartile. This height is an outlier.
Sometimes box plots are drawn in a way that highlights outliers. In this case, the whiskers are only drawn to the last data value that is not considered an outlier, and the outliers are indicated separately.
The box plot in this example was drawn based on the heights of 696 players who have played for the Harlem Globetrotters over the years. Here is a list of all of the players who are classified as unusually short or tall compared to all of the other players.
"Hot Shot" Swanson (4′5′′)
"X-Over" Tompkins (4′6′′)
"Too Tall" Hall (5′2′′)
"Torch" George (5′3′′)
"Pee Wee" Henry (5′3′′)
"Tiny" Sturgess (7′8′′)
In the last box plot, the height 5′3′′ was classified as an outlier, but the height 5′4′′ was not. The reason for this is that a graphing calculator was used, and it applied its own methodology. The histogram can give more details than a box plot and can indicate a different approach to classifying outliers.
In this context, it can be argued that only the heights below 60 inches are considered to be outliers on the low end of the data.
Identifying outliers is not a strict process. Context can modify what the generally accepted numerical method indicates.
Outlier: 4.88
Mean With Outlier: 5.45
Mean Without Outlier: 5.47
Note that one data value appears to be much less than the average of the others.
All values are above 5 except one, which is 4.88. This observation can be confirmed by placing all values on a number line.
It is interesting to see what the original publication says about the mean and how this mean compares to the value accepted today.
Consider the five data sets illustrated by the histograms and the corresponding box plots.
The histogram for Data B is skewed to the left. The outliers to the left of the distribution bring the mean a bit to the left of a typical data value. Note that the box plot shows that the median is close to the peak of the distribution.
The histogram for Data D is skewed to the right, with only a few extreme outliers on the right of the distribution. These outliers bring the mean a bit to the right of a typical data value. Note that the box plot shows that the median is close to the peak of the distribution.
For Data E, there is no clear peak in the histogram. This characteristic represents a uniform distribution where all values of the range are expected to appear with approximately the same frequency. Therefore, neither the mean nor the median of Data E is a good indicator of a typical data value.
The table below contains the mean, the median, the standard deviation, and the interquartile range for all five distributions.
Data Set | Mean | Median | Standard Deviation | Interquartile Range |
---|---|---|---|---|
Data A | 20.15 | 20 | 5.13 | 7 |
Data B | 25.36 | 28 | 8.54 | 10 |
Data C | 20.58 | 21 | 9.92 | 19 |
Data D | 7.26 | 6 | 7.52 | 3 |
Data E | 21.05 | 22 | 11.57 | 20 |
The following observations can be summarized from the previous example.
Data's Distribution | Some Observations | Preferred Statistics |
---|---|---|
Symmetric Distribution | Both the mean and the median are close to the center. | If the histogram has one peak, then both mean and median describe a typical data value. In this case, the preferred statistic is the mean, since it also considers the actual data values, not just the order. |
Skewed Distribution or With Outliers | The extreme values can distort the mean and increase the standard deviation. | In this case, the preferred statistic to describe a typical data value is the median, since it only considers the order of the data. Furthermore, it is less sensitive to extreme values. |
When analyzing the center of the data, it is often very useful to investigate the measure of spread too. Here are two examples of how different statistics can be paired.
There are several statistics used to describe a data set: mean, median, standard deviation, and interquartile range. Some are more useful than others depending on the case. Analyze the shape of the following histogram of a data set. Then, select the most appropriate pair of statistics that would best describe it. Try out a few!
As previously noted, outliers are unusual values in a data set. Outliers can appear for several reasons.
Possible Reason | Example |
---|---|
It can be a result of a data recording error. If this is obvious, then this data can be removed or modified. | Suppose a scientist records the length of some leaves from a tree in centimeters, measured to the nearest millimeter. In that case, a typical data entry has one decimal place. Yet, if the value 104 appears in the data, then it is likely that it was mistakenly recorded and meant to be recorded as 10.4. |
The nature of the data is such that unusual entries can occur. In this case, this entry should also be considered. Still, usually, the median better describes a typical data value than the mean. | When looking at a data set of the heights of Star Wars characters, one should expect to see a few extremely low and high values. |
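The recording-error case in the table can be illustrated with a small sketch. The five leaf lengths are hypothetical and chosen only for illustration. The value 104 stands in for a mistyped 10.4, as in the example above.

```python
import statistics

# Hypothetical leaf lengths in centimeters; 104 is the suspected typo for 10.4.
recorded = [9.8, 10.1, 104, 10.6, 11.0]
corrected = [10.4 if x == 104 else x for x in recorded]

# Compare how much each statistic moves once the entry is corrected.
for label, data in (("recorded", recorded), ("corrected", corrected)):
    mean = round(statistics.mean(data), 2)
    median = statistics.median(data)
    print(f"{label}: mean = {mean}, median = {median}")
```

With these made-up values, the single wrong entry pulls the mean from about 10.38 up to about 29.1, while the median only moves from 10.4 to 10.6. This matches the earlier observation that the median is less sensitive to extreme values.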
As a concluding remark, it should be mentioned why statisticians are so interested in investigating data. They can use the statistics of a sample to estimate the parameters of a population. Consider the following example.
The mean of the employees' ages is calculated by dividing the sum of their ages by the number of employees.
$\text{Mean} = \dfrac{\text{Sum of values}}{\text{Number of values}}$
We know that the mean age is 24 for the 5 employees. With this information, we can determine the sum of their ages: $24 \cdot 5 = 120$.
When the new woman is hired, the sum of the employees' ages increases by 48 and the number of people increases by 1. Let's calculate the new sum and number of values.
New sum of values: $120 + 48 = 168$
New number of values: $5 + 1 = 6$
Now, we have enough information to determine the new mean of the ages.
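Substituting the new sum and the new number of values into the mean formula gives the following worked calculation.

```latex
\text{New mean} = \frac{\text{New sum of values}}{\text{New number of values}} = \frac{168}{6} = 28
```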
The new mean age is 28 years, which means that the mean increased by 28-24=4 years.
Let's assume that the sum of the values before we add 7 is $S$. If the number of values in the data set is $n$, we can write an expression for the mean of the values.
$\text{Mean}_{\text{Old}} = \dfrac{S}{n}$
Next, if we add the value 7 to the data set, we get a new sum of $S+7$ and the new number of values is $n+1$. With this information, we can write a second expression describing the mean of the new data set.
$\text{Mean}_{\text{New}} = \dfrac{S+7}{n+1}$
We know that the mean before and after did not change. Therefore, we can set these expressions equal to each other.
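One way to carry out this step is sketched below, using the same symbols: S for the old sum and n for the old number of values. Cross-multiplying is valid because n and n+1 are both positive.

```latex
\begin{aligned}
\frac{S}{n} &= \frac{S+7}{n+1} \\
S(n+1) &= n(S+7) \\
Sn + S &= Sn + 7n \\
S &= 7n \\
\frac{S}{n} &= 7
\end{aligned}
```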
Consequently, the mean before adding the new observation was 7.
Jordan has recorded the temperature outside her house over the course of a week. Unfortunately, two of the temperatures are covered by a coffee stain. Tadeo looks at the paper and says that they have enough information to figure out the missing values.
Examining the note, we see a box plot at the bottom. Since Jordan measured the temperatures over a week, she wrote 7 observations — an odd number. This means the lower quartile, median, and upper quartile must all be observations from the data set.
If we look at the note, we see that a few values from the box plot are also written in the list — the minimum value, the lower quartile, and the upper quartile.
We know that the median is represented by a value in the data set, but we cannot find it among the listed observations. Therefore, it must be one of the observations covered by the coffee stain. Also, because Jordan noted the observation on Monday as an outlier and 33°F is not an outlier, the temperature on Tuesday must have been 33°F.
In order to find the value of the outlier, we will use the fact that the mean of the 7 values is 35. If we call the value of the outlier x, we can write the following equation.
$35 = \dfrac{x+33+34+37+30+28+32}{7}$
Let's solve this equation for x.
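The steps can be written out as follows: add the six known temperatures, multiply both sides by 7, and then isolate x.

```latex
\begin{aligned}
35 &= \frac{x+33+34+37+30+28+32}{7} \\
35 &= \frac{x+194}{7} \\
245 &= x + 194 \\
x &= 51
\end{aligned}
```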
The temperature on Monday was 51°F. We can see that it is indeed an outlier of the data set.
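This conclusion can be checked numerically with a short sketch. The multiple 1.5 of the interquartile range is an assumed convention, since the exercise does not state which rule Jordan used. Quartiles are computed with the default method of Python's `statistics.quantiles`, which for 7 values returns observations from the data set.

```python
import statistics

# The seven temperatures in °F, with Monday's value x = 51 listed first.
temps = [51, 33, 34, 37, 30, 28, 32]

mean = statistics.mean(temps)
q1, median, q3 = statistics.quantiles(temps, n=4)

# Assumed rule: a value is an outlier if it lies more than 1.5 * IQR above Q3.
upper_fence = q3 + 1.5 * (q3 - q1)
is_outlier = temps[0] > upper_fence

print(mean, q1, median, q3, upper_fence, is_outlier)
```

Under this assumption, the quartiles are 30 and 37, the upper cutoff is 47.5, and 51 lies above it.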