Here are a few recommended readings before getting started with this lesson.
Try a few practice exercises before beginning the lesson. Consider the following data set. For each practice exercise, give the answer rounded to two decimal places.
The dot plot below shows the distribution of a data set.
An outlier is a data point that is significantly different from the other values in the data set. It can be significantly larger or significantly smaller than the others.
Categorical data sometimes also have unusual elements; these can be called outliers as well.
However, it is best to use the term outlier when applying a mathematical process to identify it. This approach helps differentiate between an intuitive and a more formal approach.
What "significantly different" means can depend on many things. For numerical data, the following definition is one of several approaches that can be used. A data value can be considered an outlier if it is farther from the nearest quartile than a certain multiple of the interquartile range.
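The definition above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the multiple `k = 1.5` is a common convention rather than a value given in this lesson, the function name `find_outliers` and the sample data are made up, and the quartiles are computed with one of several accepted methods, so other tools may place the boundaries slightly differently.

```python
import statistics

def find_outliers(values, k=1.5):
    """Return the values farther than k * IQR from the nearest quartile.

    k = 1.5 is an assumed, commonly used multiple; the lesson only says
    "a certain multiple".
    """
    # quantiles(..., n=4) returns the three cut points Q1, Q2 (median), Q3.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - k * iqr
    upper_fence = q3 + k * iqr
    return [v for v in values if v < lower_fence or v > upper_fence]

# Hypothetical data, made up for illustration only.
sample = [10, 12, 12, 13, 14, 15, 15, 16, 40]
print(find_outliers(sample))
```

For this sample, the quartiles are 12 and 15, so the interquartile range is 3 and any value below 7.5 or above 19.5 is flagged.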
The box plot below shows the distribution of the heights (in inches) of all the players who have ever played for the Harlem Globetrotters basketball team. The heights of six of the players are indicated on the number line with dots.
Select all the outliers from the given list.
Compare the length of the whiskers to the width of the box.
Since the scale on the box plot is given in inches, first the given height data should be converted to inches.
Name | Height | Height in inches |
---|---|---|
Jahmani "Hot Shot" Swanson | 4′5′′ | 12⋅4+5=53 |
Jonte "Too Tall" Hall | 5′2′′ | 12⋅5+2=62 |
Donald "Ducky" Moore | 6′0′′ | 12⋅6+0=72 |
Solomon "Bam Bam" Bamiro | 6′5′′ | 12⋅6+5=77 |
Sean "Elevator" Williams | 6′10′′ | 12⋅6+10=82 |
Paul "Tiny" Sturgess | 7′8′′ | 12⋅7+8=92 |
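The same conversion can be written as a short Python sketch. The function name `to_inches` and the dictionary layout are illustrative choices, not part of the lesson; the heights are the six listed in the table.

```python
def to_inches(feet, inches):
    """Convert a height given in feet and inches to inches only."""
    return 12 * feet + inches

# Heights as (feet, inches), taken from the six players in the table.
players = {
    "Swanson": (4, 5),
    "Hall": (5, 2),
    "Moore": (6, 0),
    "Bamiro": (6, 5),
    "Williams": (6, 10),
    "Sturgess": (7, 8),
}

heights_in_inches = {
    name: to_inches(ft, inch) for name, (ft, inch) in players.items()
}

for name, total in heights_in_inches.items():
    print(f"{name}: {total} in")
```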
Heights in the boxed section of the chart can be considered typical; they are not outliers. Heights close to the box are not typical; still, they are not extreme.
The height of "Bam Bam" Bamiro is the median height. This height is not an outlier.
The height of "Ducky" Moore is a bit less than the first quartile, but not far away. This height is not an outlier.
The height of "Elevator" Williams is a bit more than the third quartile, but not far away. This height is not an outlier.
The heights of "Hot Shot" Swanson and "Too Tall" Hall are much less than the first quartile. These heights are outliers.
The height of "Tiny" Sturgess is much more than the third quartile. This height is an outlier.
Sometimes box plots are drawn in a way that highlights outliers. In this case, the whiskers are only drawn to the last data value that is not considered an outlier, and the outliers are indicated separately.
The box plot in this example was drawn based on the heights of 696 players who have played for the Harlem Globetrotters over the years. Here is a list of all of the players who are classified as unusually short or tall compared to all of the other players.
"Hot Shot" Swanson (4′5′′)
"X-Over" Tompkins (4′6′′)
"Too Tall" Hall (5′2′′)
"Torch" George (5′3′′)
"Pee Wee" Henry (5′3′′)
"Tiny" Sturgess (7′8′′)
In the last box plot, the height 5′3′′ was classified as an outlier, but the height 5′4′′ was not. The reason for this is that a graphing calculator was used, and it applied its own methodology. The histogram can give more details than a box plot and can indicate a different approach to classifying outliers.
In this context, it can be argued that only the heights below 60 inches should be considered outliers on the low end of the data.
Identifying outliers is not a strict process. Context can modify what the generally accepted numerical method indicates.
Outlier: 4.88
Mean With Outlier: 5.45
Mean Without Outlier: 5.47
Note that one data value appears to be much less than the average of the others.
All values are above 5 except one, which is 4.88. This observation can be confirmed by placing all values on a number line.
Investigating this plot reveals that most values are between 5.3 and 5.7, while a few are below 5.3 and a few are above 5.7. The one value that is furthest away from the middle group is 4.88.
It is interesting to see what the original publication says about the mean and how this mean compares to the value accepted today.
Consider the five data sets illustrated by the histograms and the corresponding box plots.
The histogram for Data B is skewed to the left. The outliers to the left of the distribution bring the mean a bit to the left of a typical data value. Note that the box plot shows that the median is close to the peak of the distribution.
The histogram for Data D is skewed to the right, with only a few extreme outliers on the right of the distribution. These outliers bring the mean a bit to the right of a typical data value. Note that the box plot shows that the median is close to the peak of the distribution.
For Data E, there is no clear peak in the histogram. This characteristic represents a uniform distribution, where all values in the range are expected to appear with approximately the same frequency. Therefore, neither the mean nor the median of Data E is a good indicator of a typical data value.
The table below contains the mean, the median, the standard deviation, and the interquartile range for all five distributions.
Data Set | Mean | Median | Standard Deviation | Interquartile Range |
---|---|---|---|---|
Data A | 20.15 | 20 | 5.13 | 7 |
Data B | 25.36 | 28 | 8.54 | 10 |
Data C | 20.58 | 21 | 9.92 | 19 |
Data D | 7.26 | 6 | 7.52 | 3 |
Data E | 21.05 | 22 | 11.57 | 20 |
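The four statistics in the table can be computed with Python's standard `statistics` module. The sketch below makes several assumptions: the function name `summarize` and the sample data are made up, the standard deviation is the sample version (the lesson does not say which version was used), and the quartiles follow the module's "inclusive" method, which can differ slightly from the method used by a graphing calculator.

```python
import statistics

def summarize(values):
    """Return the mean, median, sample standard deviation, and IQR."""
    # quantiles(..., n=4) returns Q1, the median, and Q3.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),  # sample standard deviation (assumed)
        "iqr": q3 - q1,
    }

# Hypothetical data, made up for illustration only.
summary = summarize([1, 2, 3, 4, 5])
for name, value in summary.items():
    print(f"{name}: {value:.2f}")
```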
The following observations can be summarized from the previous example.
Data's Distribution | Some Observations | Preferred Statistics |
---|---|---|
Symmetric Distribution | Both the mean and the median are close to the center. | If the histogram has one peak, then both mean and median describe a typical data value. In this case, the preferred statistic is the mean, since it also considers the actual data values, not just the order. |
Skewed Distribution or With Outliers | The extreme values can distort the mean and increase the standard deviation. | In this case, the preferred statistic to describe a typical data value is the median, since it only considers the order of the data. Furthermore, it is less sensitive to extreme values. |
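The difference in sensitivity between the mean and the median can be seen in a small Python sketch. The two lists are hypothetical and were chosen only to make the effect easy to see.

```python
import statistics

# Hypothetical data, made up for illustration only.
typical = [20, 21, 22, 23, 24]
with_outlier = typical + [90]  # one extreme value added

mean_before = statistics.mean(typical)
mean_after = statistics.mean(with_outlier)
median_before = statistics.median(typical)
median_after = statistics.median(with_outlier)

print(f"mean:   {mean_before:.2f} -> {mean_after:.2f}")
print(f"median: {median_before:.2f} -> {median_after:.2f}")
```

Adding the single value 90 moves the mean from 22 to about 33.33, while the median only moves from 22 to 22.5.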
When analyzing the center of the data, it is often very useful to investigate the measure of spread too. Here are two examples of how different statistics can be paired.
There are several statistics used to describe a data set: mean, median, standard deviation, and interquartile range. Some are more useful than others depending on the case. Analyze the shape of the following histogram of a data set. Then, select the most appropriate pair of statistics that would best describe it. Try out a few!
As previously noted, outliers are unusual values in a data set. Outliers can appear for several reasons.
Possible Reason | Example |
---|---|
It can be a result of a data recording error. If this is obvious, then this data can be removed or modified. | Suppose a scientist records the length of some leaves from a tree in centimeters, measured to the nearest millimeter. In that case, a typical data entry has one decimal place. Yet, if the value 104 appears in the data, then it is likely that it was mistakenly recorded and meant to be recorded as 10.4. |
The nature of the data is such that unusual entries can occur. In this case, this entry should also be considered. Still, usually, the median better describes a typical data value than the mean. | When looking at a data set of the heights of Star Wars characters, one should expect to see a few extremely low and high values. |
As a concluding remark, it should be mentioned why statisticians are so interested in investigating data. They can use the statistics of a sample to estimate the parameters of a population. Consider the following example.