Try a few practice exercises before beginning the lesson. Consider the following data set. For each practice exercise, give the answer rounded to two decimal places.
The dot plot below shows the distribution of a data set.
An outlier is a data point that is significantly different from the other values in the data set. It can be significantly larger or significantly smaller than the others.
Categorical data sometimes also have unusual elements; these can be called outliers as well.
However, it is best to use the term outlier when applying a mathematical process to identify it. This approach helps differentiate between an intuitive and a more formal approach.
What "significantly different" means can depend on many things. For numerical data, the following definition is one of several approaches that can be used. A data value can be considered an outlier if it is farther away from the closest quartile than a certain multiple of the interquartile range.
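The definition above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the multiple `k = 1.5` is a common convention that the lesson itself does not fix, the quartiles are taken as the medians of the lower and upper halves of the ordered data, and the sample values and function names are hypothetical.

```python
from statistics import median

def quartiles(values):
    """Return (Q1, Q3) as the medians of the lower and upper halves."""
    data = sorted(values)
    if len(data) < 2:
        raise ValueError("need at least two values")
    half = len(data) // 2          # the middle value is left out when n is odd
    return median(data[:half]), median(data[-half:])

def find_outliers(values, k=1.5):
    """Values farther than k * IQR from the closest quartile (k is an assumption)."""
    q1, q3 = quartiles(values)
    iqr = q3 - q1
    return [v for v in values if v < q1 - k * iqr or v > q3 + k * iqr]

# Hypothetical sample, not taken from the lesson.
sample = [2, 3, 3, 4, 5, 5, 6, 7, 20]
print(quartiles(sample))      # quartiles of the sample
print(find_outliers(sample))  # values flagged as outliers
```

A different multiple, or a different quartile convention, can classify borderline values differently.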
The box plot below shows the distribution of the heights (in inches) of all the players who have ever played for the Harlem Globetrotters basketball team. The heights of six of the players are indicated on the number line with dots.
Select all the outliers from the given list.
Compare the length of the whiskers to the width of the box.
Since the scale on the box plot is given in inches, first the given height data should be converted to inches.
Name | Height | Height in inches |
---|---|---|
Jahmani "Hot Shot" Swanson | 4′5′′ | 12⋅4+5=53 |
Jonte "Too Tall" Hall | 5′2′′ | 12⋅5+2=62 |
Donald "Ducky" Moore | 6′0′′ | 12⋅6+0=72 |
Solomon "Bam Bam" Bamiro | 6′5′′ | 12⋅6+5=77 |
Sean "Elevator" Williams | 6′10′′ | 12⋅6+10=82 |
Paul "Tiny" Sturgess | 7′8′′ | 12⋅7+8=92 |
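The conversion used in the table multiplies the feet by 12 and adds the remaining inches. A minimal sketch, where the helper name `to_inches` is hypothetical:

```python
def to_inches(feet, inches):
    """Convert a height given in feet and inches to inches only."""
    return 12 * feet + inches

# Heights from the table, as (feet, inches) pairs.
heights = {
    "Swanson": (4, 5),
    "Hall": (5, 2),
    "Moore": (6, 0),
    "Bamiro": (6, 5),
    "Williams": (6, 10),
    "Sturgess": (7, 8),
}

for name, (feet, inches) in heights.items():
    print(name, to_inches(feet, inches))
```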
Heights in the boxed section of the chart can be considered typical; they are not outliers. Heights close to the box are not typical; still, they are not extreme.
"Bam Bam" Bamiro's height is the median. This height is not an outlier.
"Ducky" Moore's height is a bit less than the first quartile, but not far away. This height is not an outlier.
"Elevator" Williams's height is a bit more than the third quartile, but not far away. This height is not an outlier.
The heights of "Hot Shot" Swanson and Jonte "Too Tall" Hall are much less than the first quartile. These heights are outliers.
"Tiny" Sturgess's height is much more than the third quartile. This height is an outlier.
Sometimes box plots are drawn in a way that highlights outliers. In this case, the whiskers are only drawn to the last data value that is not considered an outlier, and the outliers are indicated separately.
The box plot in this example was drawn from the heights of the 696 players who have played for the Harlem Globetrotters over the years. Here is a list of all of the players who are classified as unusually short or tall compared to all of the other players.
"Hot Shot" Swanson (4′5′′)
"X-Over" Tompkins (4′6′′)
"Too Tall" Hall (5′2′′)
"Torch" George (5′3′′)
"Pee Wee" Henry (5′3′′)
"Tiny" Sturgess (7′8′′)
In the last box plot, the height 5′3′′ was classified as an outlier, but the height 5′4′′ was not. The reason for this is that a graphing calculator was used, and it applied its own methodology. The histogram can give more details than a box plot and can indicate a different approach to classifying outliers.
In this context, it can be argued that only the heights below 60 inches are considered to be outliers on the low end of the data.
Identifying outliers is not a strict process. Context can modify what the generally accepted numerical method indicates.
Outlier: 4.88
Mean With Outlier: 5.45
Mean Without Outlier: 5.47
Note that one data value appears to be much less than the average of the others.
All values are above 5 except one, which is 4.88. This observation can be confirmed by placing all values on a number line.
It is interesting to see what the original publication says about the mean and how this mean compares to the value accepted today.
Consider the five data sets illustrated by the histograms and the corresponding box plots.
The histogram for Data B is skewed to the left. The outliers to the left of the distribution bring the mean a bit to the left of a typical data value. Note that the box plot shows that the median is close to the peak of the distribution.
The histogram for Data D is skewed to the right, with only a few extreme outliers on the right of the distribution. These outliers bring the mean a bit to the right of a typical data value. Note that the box plot shows that the median is close to the peak of the distribution.
For Data E, there is no clear peak in the histogram. This characteristic represents a uniform distribution where all values of the range are expected to appear with approximately the same frequency. Therefore, neither the mean nor the median of Data E is a good indicator of a typical data value.
The table below contains the mean, the median, the standard deviation, and the interquartile range for all five distributions.
Data Set | Mean | Median | Standard Deviation | Interquartile Range |
---|---|---|---|---|
Data A | 20.15 | 20 | 5.13 | 7 |
Data B | 25.36 | 28 | 8.54 | 10 |
Data C | 20.58 | 21 | 9.92 | 19 |
Data D | 7.26 | 6 | 7.52 | 3 |
Data E | 21.05 | 22 | 11.57 | 20 |
The following observations can be summarized from the previous example.
Data's Distribution | Some Observations | Preferred Statistics |
---|---|---|
Symmetric Distribution | Both the mean and the median are close to the center. | If the histogram has one peak, then both mean and median describe a typical data value. In this case, the preferred statistic is the mean, since it also considers the actual data values, not just the order. |
Skewed Distribution or With Outliers | The extreme values can distort the mean and increase the standard deviation. | In this case, the preferred statistic to describe a typical data value is the median, since it only considers the order of the data. Furthermore, it is less sensitive to extreme values. |
When analyzing the center of the data, it is often very useful to investigate the measure of spread too. Here are two examples of how different statistics can be paired.
There are several statistics used to describe a data set: mean, median, standard deviation, and interquartile range. Some are more useful than others depending on the case. Analyze the shape of the following histogram of a data set. Then, select the most appropriate pair of statistics that would best describe it. Try out a few!
As it was previously noted, outliers are characterized as unusual values in a data set. Outliers can appear for several reasons.
Possible Reason | Example |
---|---|
It can be a result of a data recording error. If this is obvious, then this data can be removed or modified. | Suppose a scientist records the length of some leaves from a tree in centimeters, measured to the nearest millimeter. In that case, a typical data entry has one decimal place. Yet, if the value 104 appears in the data, then it is likely that it was mistakenly recorded and meant to be recorded as 10.4. |
The nature of the data is such that unusual entries can occur. In this case, this entry should also be considered. Still, usually, the median better describes a typical data value than the mean. | When looking at a data set of the heights of Star Wars characters, one should expect to see a few extremely low and high values. |
As a concluding remark, it should be mentioned why statisticians are so interested in investigating data. They can use the statistics of a sample to estimate the parameters of a population. Consider the following example.
Consider the following four histograms.
Match the box plots to the correct histogram.
Let's start with histogram A. If a histogram is skewed right, the corresponding box plot will have more observations on its left. Notice the outlier on the right-hand side of the histogram. This corresponds to a value lying separately from the box plot. These observations suggest that A can be paired with II.
Examining histogram B, we notice that it is symmetrical around the middle. Since we have fewer observations at the edges of the histogram and more observations clustered around the middle, we get a shorter box plot. This corresponds to III.
In C, the observations are pushed off to the sides. When most observations are far away from the median, the box will be stretched. With this information, we can pair C with IV.
In D, observations are evenly distributed. When observations are evenly distributed, the box plot's four sections have roughly the same length, so we can pair D with I.
"high" average? What would be a more appropriate mean?
Since the data set is already written in ascending order, we can find the median immediately by determining the middle value. The data set has an even number of observations, 20, so the median will be the average of the 10th and 11th values.
Observation(s) | Data |
---|---|
1st - 10th | 0, 0, 0, 1, 2, 4, 4, 4, 4, 5 |
Median | (5+5)/2=5 |
11th - 20th | 5, 6, 6, 7, 8, 8, 9, 10, 12, 40 |
Therefore, the median tip is $5.
To find the lower quartile, we must divide the observations that are below the median into two equal halves. Notice that we again have an even number of observations, this time 10. Therefore, the lower quartile is the average of the 5th and 6th observations.
Observation(s) | Data |
---|---|
1st - 5th | 0, 0, 0, 1, 2 |
Lower Quartile | (2+4)/2=3 |
6th - 10th | 4, 4, 4, 4, 5 |
Median | 5 |
11th - 20th | 5, 6, 6, 7, 8, 8, 9, 10, 12, 40 |
Consequently, the lower quartile is $3.
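The median and lower quartile can be double-checked with Python's `statistics` module. This is a sketch that follows the same convention as the solution, taking the lower quartile as the median of the lower half of the ordered data. The variable names are illustrative.

```python
from statistics import median

# The 20 tips in ascending order, in dollars.
tips = [0, 0, 0, 1, 2, 4, 4, 4, 4, 5,
        5, 6, 6, 7, 8, 8, 9, 10, 12, 40]

half = len(tips) // 2
tip_median = median(tips)             # average of the 10th and 11th values
lower_quartile = median(tips[:half])  # average of the 5th and 6th values

print(tip_median, lower_quartile)
```

Other quartile conventions exist, and statistical software may return a slightly different value for the lower quartile.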
Depending on the distribution of the data set, the mean or the median could be a better measure of the center. We can use a histogram of the data to determine which is better. Let's first organize the data into intervals. Each interval includes its lower endpoint but not its upper endpoint.
Interval | Observations | Count |
---|---|---|
0-5 | 0, 0, 0, 1, 2, 4, 4, 4, 4 | 9 |
5-10 | 5, 5, 6, 6, 7, 8, 8, 9 | 8 |
10-15 | 10, 12 | 2 |
15-20 | | 0 |
20-25 | | 0 |
25-30 | | 0 |
30-35 | | 0 |
35-40 | | 0 |
40-45 | 40 | 1 |
Now we have all the information we need to draw the histogram.
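The interval counts can be reproduced with a short tally. The sketch below assumes intervals of width 5 that include their lower endpoint and exclude their upper endpoint, which matches how the values 5, 10, and 40 are placed in the table.

```python
from collections import Counter

# The 20 tips in ascending order, in dollars.
tips = [0, 0, 0, 1, 2, 4, 4, 4, 4, 5,
        5, 6, 6, 7, 8, 8, 9, 10, 12, 40]

WIDTH = 5  # interval width

# Key each tip by the lower endpoint of its interval.
counts = Counter((tip // WIDTH) * WIDTH for tip in tips)

for start in range(0, 45, WIDTH):
    print(f"{start}-{start + WIDTH}: {counts[start]}")
```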
Observing the histogram, we see that it is right-skewed and has an outlier at 40. Therefore, the center is best described by its median at $5 because it is not affected by the outlier.
Just like the mean, the standard deviation takes all observations into account, so the outlier inflates it. The interquartile range (IQR), by contrast, is the difference between the upper quartile (Q_3) and the lower quartile (Q_1): IQR=Q_3-Q_1. It depends only on the middle half of the ordered data, so the size of the outlier does not affect it. Therefore, the interquartile range is the better measure of spread here.
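One way to see this difference is to compare the tip data with a hypothetical variant in which the largest tip is $13 instead of $40. The sketch below makes two assumptions: it uses the population standard deviation (`pstdev`), and it computes quartiles as the medians of the two halves of the ordered data. The variant data set is made up purely for the comparison.

```python
from statistics import median, pstdev

def iqr(values):
    """Interquartile range using the medians of the lower and upper halves."""
    data = sorted(values)
    half = len(data) // 2
    return median(data[-half:]) - median(data[:half])

# The 20 tips in ascending order, in dollars.
tips = [0, 0, 0, 1, 2, 4, 4, 4, 4, 5,
        5, 6, 6, 7, 8, 8, 9, 10, 12, 40]

# Hypothetical variant: the $40 tip replaced by a less extreme $13 tip.
tamer_tips = tips[:-1] + [13]

print("IQR:", iqr(tips), "vs", iqr(tamer_tips))
print("Standard deviation:", pstdev(tips), "vs", pstdev(tamer_tips))
```

The interquartile range is the same for both lists, while the standard deviation is considerably larger for the list containing $40.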
As we found in Part D, we have an outlier in the data set that significantly increases the mean. If we remove the outlier, the mean would fall to something more representative for the population as a whole.
When we remove the outlier, the mean changes from $7 to $5. The manager was clever to notice this!
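The change in the mean can be checked directly. In this sketch the variable names are illustrative, and the mean with the outlier is rounded to the nearest dollar as in the solution.

```python
from statistics import mean

# The 20 tips in ascending order, in dollars.
tips = [0, 0, 0, 1, 2, 4, 4, 4, 4, 5,
        5, 6, 6, 7, 8, 8, 9, 10, 12, 40]

# Drop the single outlier of $40.
tips_without_outlier = [tip for tip in tips if tip != 40]

mean_with = mean(tips)                     # 135 / 20
mean_without = mean(tips_without_outlier)  # 95 / 19

print(round(mean_with), mean_without)
```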
There are certain conditions that would make the mean unrepresentative of a population. If the data set has a skew to one side or if there is an outlier present, the mean will not accurately represent the data set. To determine if this is the case, we will plot the observations on a number line.
As we can see, the observations are evenly distributed and there are no clear observations that seem to be outliers. Therefore, we can use either the median or the mean to accurately describe the data set.
To calculate the mean, we have to divide the sum of the observations by the number of observations.
The mean height of the students is 161 centimeters.
To find the median, the data set first has to be written in ascending order: 144, 153, 156, 163, 174, 176. The median is given by the value in the middle of the data set. However, the data set has an even number of observations. Therefore, the median will be the average of the 3rd and 4th observations.
Observations | Data |
---|---|
1st - 3rd | 144, 153, 156 |
Median | (156+163)/2=159.5 |
4th - 6th | 163, 174, 176 |
Consequently, the median height is 159.5 centimeters.
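Both results for the heights can be verified with Python's `statistics` module. The list below uses the six heights in the ascending order given in the solution. The variable names are illustrative.

```python
from statistics import mean, median

# Student heights in centimeters, in ascending order.
heights_cm = [144, 153, 156, 163, 174, 176]

mean_height = mean(heights_cm)      # sum of the heights divided by 6
median_height = median(heights_cm)  # average of the 3rd and 4th values

print(mean_height, median_height)
```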