

This lesson investigates the effect of extreme data values on various statistical measures and graphic representations. Detailed explanations and problem-solving are supported by interactive graphs and real-world examples.

### Catch-Up and Review

Here are a few recommended readings before getting started with this lesson.

Try a few practice exercises before beginning the lesson. Consider the following data set.

For each practice exercise give the answer rounded to two decimal places.

a Find the mean.
b Find the first quartile.
c Find the median.
d Find the third quartile.
e Find the standard deviation.
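
The five statistics asked for above can be checked with a short script. The following is a minimal sketch, not the lesson's own method: the data set is hypothetical, the helper `quartiles` is an illustrative name, the quartiles are assumed to follow the median-of-halves convention, and the standard deviation is assumed to be the population standard deviation.

```python
import statistics


def quartiles(values):
    """Return (Q1, median, Q3) using the median-of-halves convention (an assumption)."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]          # values below the median position
    upper = ordered[half + n % 2:]  # values above the median position
    return (statistics.median(lower),
            statistics.median(ordered),
            statistics.median(upper))


# Hypothetical data set, used only for illustration.
data = [2, 4, 4, 5, 7, 8, 9, 11, 13]

mean = statistics.mean(data)
q1, median, q3 = quartiles(data)
std_dev = statistics.pstdev(data)  # population standard deviation (an assumption)

print(round(mean, 2), q1, median, q3, round(std_dev, 2))
```

Other quartile conventions exist and can give slightly different values for the same data.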

## Effects of One Data Value's Change on a Data Set

The dot plot below shows the distribution of a data set.

• Move the blue dot on the number line to explore how the change of one data value affects the median, the mean, and the standard deviation.
• Move the slider to explore how the effect of the change in one data value depends on the size of the data set.

Before extreme data points can be investigated, they should first be properly defined.
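
The exploration with the dot plot can also be imitated numerically. The sketch below uses hypothetical data and the illustrative helper names `summarize` and `replace_last`. It replaces one value with a much larger one in a small and in a larger data set, then compares the median, the mean, and the population standard deviation.

```python
import statistics


def summarize(values):
    """Return (median, mean, population standard deviation)."""
    return (statistics.median(values),
            statistics.mean(values),
            statistics.pstdev(values))


def replace_last(values, new_value):
    """Return a copy of the data with its last value replaced."""
    return list(values[:-1]) + [new_value]


# Hypothetical data: the larger set repeats the smaller one four times.
small = [3, 4, 5, 6, 7]
large = small * 4

for data in (small, large):
    before = summarize(data)
    after = summarize(replace_last(data, 27))
    print(len(data), "values | before:", before, "| after:", after)
```

In this sketch the median does not change, while the mean and the standard deviation do. The shift in the mean is smaller for the larger data set.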

## Outlier

An outlier is a data point that is significantly different from the other values in the data set. It can be significantly larger or significantly smaller than the others. Categorical data sometimes also have unusual elements; these can be called outliers as well. However, it is best to use the term outlier when applying a mathematical process to identify it. This approach helps differentiate between an intuitive and a more formal approach.

### Extra

Explaining "Significantly Different"

What significantly different means can depend on many things. For numerical data, the following definition is one of the several approaches that can be used. A data value can be considered an outlier if it is farther away from the closest quartile than a certain multiple of the interquartile range.

• The multiplier 1.5 is commonly used to identify outliers. This value was suggested by the esteemed American mathematician John Tukey.
• The multiplier 3 is sometimes used to identify extreme outliers.
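
The quartile-based rule described above can be sketched as follows. This is an illustration under stated assumptions: the data set is hypothetical, the function names are made up for this example, the quartiles follow the median-of-halves convention, and the default multiplier of 1.5 is the value commonly attributed to Tukey.

```python
import statistics


def first_and_third_quartile(values):
    """Return (Q1, Q3) using the median-of-halves convention (an assumption)."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    return (statistics.median(ordered[:half]),
            statistics.median(ordered[half + n % 2:]))


def find_outliers(values, multiplier=1.5):
    """Return the values farther than multiplier * IQR from the closest quartile."""
    q1, q3 = first_and_third_quartile(values)
    iqr = q3 - q1
    low_fence = q1 - multiplier * iqr
    high_fence = q3 + multiplier * iqr
    return [v for v in values if v < low_fence or v > high_fence]


# Hypothetical data set, used only for illustration.
data = [50, 60, 61, 62, 63, 64, 65, 66, 67, 68, 90]

print(find_outliers(data))                # outliers with the multiplier 1.5
print(find_outliers(data, multiplier=3))  # extreme outliers with the multiplier 3
```
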

The diagram below shows a box and whisker plot of a data set. Move the slider to see which data point is an outlier according to the description above.

## Identifying Outliers in a Data Set

The box plot below shows the distribution of the heights (in inches) of all the players who have ever played for the Harlem Globetrotters basketball team. The heights of six of the players are indicated on the number line with dots. Select all the outliers from the given list.

### Hint

Compare the length of the whiskers to the width of the box.

### Solution

Since the scale on the box plot is given in inches, first the given height data should be converted to inches.

| Name | Height | Height in inches |
| --- | --- | --- |
| Jahmani Hot Shot Swanson | | |
| Jonte Too Tall Hall | | |
| Donald Ducky Moore | | |
| Solomon Bam Bam Bamiro | | |
| Sean Elevator Williams | | |
| Paul Tiny Sturgess | | |

Heights in the boxed section of the chart can be considered typical; they are not outliers. Heights close to the box are not typical; still, they are not extreme.

• The height of Solomon Bam Bam Bamiro is the median height. This height is not an outlier.
• The height of Donald Ducky Moore is a bit less than the first quartile but not far away. This height is not an outlier.
• The height of Sean Elevator Williams is a bit more than the third quartile, but not far away. This height is not an outlier.
• The heights of Jahmani Hot Shot Swanson and Jonte Too Tall Hall are much less than the first quartile. These heights are outliers.
• The height of Paul Tiny Sturgess is much more than the third quartile. This height is an outlier.

### Extra

Sometimes box plots are drawn in a way that highlights outliers. In this case, the whiskers are only drawn to the last data value that is not considered an outlier, and the outliers are indicated separately. The box plot in this example was drawn based on the heights of the players who have played for the Harlem Globetrotters over the years. Here is a list of all of the players who are classified as unusually short or tall compared to the other players.

• Jahmani Hot Shot Swanson ()
• Justion X-Over Tompkins ()
• Jonte Too Tall Hall ()
• Cherelle Torch George ()
• Jimmy Pee Wee Henry ()
• Dut Mayar ()
• Paul Tiny Sturgess ()

In the last box plot, the height was classified as an outlier, but the height was not. The reason for this is that a graphing calculator was used, and it applied its own methodology. The histogram can give more details than a box plot and can indicate a different approach to classifying outliers. In this context, it can be argued that only the heights below inches are considered to be outliers on the low end of the data. Identifying outliers is not a strict process. Context can modify what the generally accepted numerical method indicates.

## Analyzing the Effect of Outliers on the Mean

In 1798, Henry Cavendish published the results of experiments aimed at determining the density of the Earth. The table below shows his measurements relative to the density of water.
Identify an outlier and calculate the mean of the data with and without including the outlier.

Outlier:
Mean With Outlier:
Mean Without Outlier:

### Hint

Note that one data value appears to be much less than the average of the others.

### Solution

All values are above except one, that is This observation can be confirmed by placing all values on a number line. Investigating this plot reveals that most values are between and while a few are below and a few are above The one value that is furthest away from the middle group is
Calculators can find the mean of the data. At the time Cavendish published this experiment, electronic calculators were not in existence. By hand, the mean was calculated by dividing the sum of all the values by
To find the mean without including the outlier, can be subtracted from the sum and the result divided by
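
The arithmetic described above can be sketched in a few lines. The measurements below are hypothetical and are not Cavendish's data. The function name and the rule for picking the outlier (the value farthest from the median) are assumptions made for illustration only.

```python
import statistics


def mean_with_and_without(values, outlier):
    """Return the mean of all values and the mean after removing one outlier."""
    total = sum(values)
    count = len(values)
    mean_with = total / count
    # Subtract the outlier from the sum and divide by one fewer value.
    mean_without = (total - outlier) / (count - 1)
    return mean_with, mean_without


# Hypothetical measurements, used only for illustration.
measurements = [5.5, 5.6, 4.9, 5.4, 5.6]

# Illustrative rule: treat the value farthest from the median as the outlier.
middle = statistics.median(measurements)
outlier = max(measurements, key=lambda v: abs(v - middle))

mean_with, mean_without = mean_with_and_without(measurements, outlier)
print(outlier, round(mean_with, 3), round(mean_without, 3))
```

Removing a value that lies below the rest raises the mean, as this sketch shows.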

### Extra

It is interesting to see what the original publication says about the mean and how this mean compares to the value accepted today.

• In his paper Cavendish claimed that the mean is That, however, does not match either of the calculated values. Some sources note that if is replaced by then the mean is as claimed by Cavendish. Is it possible that Cavendish made this adjustment?
• In actuality, during the experiment, Cavendish modified the equipment a bit after the first six results. If these results are removed, the mean of the remaining data values is
• The density of the Earth measured by advanced modern techniques is How amazing that Cavendish was so close to this value more than years ago.

## Statistics of Data with Different Histogram Shapes

Consider the five data sets illustrated by the histograms and the corresponding box plots.

a For three of the data sets, the mean and the median are approximately the same. Determine which three.
b For two of the data sets, the mean and median do not describe a typical value of the data set well. Determine which two.

### Hint

a Look for symmetric or skewed distributions.
b Look for the distribution where the median represented in the box plot is not close to the peak of the histogram.

### Solution

a Note that, for approximately symmetric distributions, both the mean and the median are close to the midpoint of the range. Data A, C, and E have approximately symmetric histograms and box plots. Therefore, for these data sets, the mean is close to the median.

The histogram for Data B is skewed to the left. The outliers to the left of the distribution bring the mean a bit to the left of a typical data value. Note that the box plot shows that the median is close to the peak of the distribution.

The histogram for Data D is skewed to the right, with only a few extreme outliers on the right of the distribution. These outliers bring the mean a bit to the right of a typical data value. Note that the box plot shows that the median is close to the peak of the distribution.

b For Data C, there are two peaks in the histogram. This characteristic represents a type of distribution called a bimodal distribution. Correspondingly, both the mean and the median are close to the center of the range, between the peaks. Therefore, neither the mean nor the median of Data C is a good indicator of a typical data value.

For Data E, there is no clear peak in the histogram. This characteristic represents a uniform distribution, where all values of the range are expected to appear with approximately the same frequency. Therefore, neither the mean nor the median of Data E is a good indicator of a typical data value.

### Extra

The table below contains the mean, the median, the standard deviation, and the interquartile range for all five distributions.

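| | Mean | Median | Standard Deviation | Interquartile Range |
| --- | --- | --- | --- | --- |
| Data A | | | | |
| Data B | | | | |
| Data C | | | | |
| Data D | | | | |
| Data E | | | | |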

## Connection Between Different Data Distributions and Best Statistics

The following observations can be summarized from the previous example.

| Data's Distribution | Some Observations | Preferred Statistics |
| --- | --- | --- |
| Symmetric Distribution | Both the mean and the median are close to the center. If the histogram has one peak, then both mean and median describe a typical data value. | In this case, the preferred statistic is the mean, since it also considers the actual data values, not just the order. |
| Skewed Distribution or With Outliers | The extreme values can distort the mean and increase the standard deviation. | In this case, the preferred statistic to describe a typical data value is the median, since it only considers the order of the data. Furthermore, it is less sensitive to extreme values. |

When analyzing the center of the data, it is often very useful to investigate the measure of spread too. Here are two examples of how different statistics can be paired.

• The mean is paired with the standard deviation, since both formulas use the actual data values.
• The median is paired with the interquartile range, since both of these are worked out using the order of the data.
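
The two pairings can be compared on a small example. The sketch below uses hypothetical data and the illustrative function name `describe`. It assumes the median-of-halves convention for quartiles and the population standard deviation, and it computes both pairs for a data set with and without one extreme value.

```python
import statistics


def describe(values):
    """Return the two pairings: (mean, standard deviation) and (median, IQR)."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    q1 = statistics.median(ordered[:half])           # median of the lower half
    q3 = statistics.median(ordered[half + n % 2:])   # median of the upper half
    return {
        "mean": statistics.mean(ordered),
        "std_dev": statistics.pstdev(ordered),
        "median": statistics.median(ordered),
        "iqr": q3 - q1,
    }


# Hypothetical data: the second set adds one extreme value to the first.
typical = [1, 2, 2, 3, 3, 3, 4, 4, 5]
with_extreme = typical + [40]

print(describe(typical))
print(describe(with_extreme))
```

In this sketch the extreme value changes the mean and the standard deviation noticeably, while the median and the interquartile range stay the same.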

## Choosing the Best Statistics for Data Sets

There are several statistics used to describe a data set: mean, median, standard deviation, and interquartile range. Some are more useful than others depending on the case. Analyze the shape of the following histogram of a data set. Then, select the most appropriate pair of statistics that would best describe it. Try out a few!

## Why Do Outliers Appear?

As previously noted, outliers are unusual values in a data set. They can appear for several reasons.

| Possible Reason | Example |
| --- | --- |
| It can be a result of a data recording error. If this is obvious, then this data can be removed or modified. | Suppose a scientist records the length of some leaves from a tree in centimeters, measured to the nearest millimeter. In that case, a typical data entry has one decimal place. Yet, if the value appears in the data, then it is likely that it was mistakenly recorded and meant to be recorded as |
| The nature of the data is such that unusual entries can occur. In this case, this entry should also be considered. Still, usually, the median better describes a typical data value than the mean. | When looking at a data set of the heights of Star Wars characters, one should expect to see a few extremely low and high values. |

### Extra

As a concluding remark, it should be mentioned why statisticians are so interested in investigating data. They can use the statistics of a sample to estimate the parameters of a population. Consider the following example.

• Suppose a biologist would like to find the average wingspan of a certain species of bird. They can catch a few birds and measure their wingspan. Based on that sample, they can then make estimates. Although the true average wingspan will not be known, the biologist will have a reasonable estimate.

The following applet illustrates this process. It generates a set of numbers, which can be seen as the population, then chooses a sample. The population is illustrated using the red dot plot. The sample is shown in blue. The applet shows the true mean of all the numbers and the mean of the numbers in the sample. Are these means close to each other? Does the difference depend on the size of the sample?

Remember, scientists usually do not know the population data. Scientists need to work with the information the sample gives them. Sounds extremely intriguing!
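
The sampling process can also be imitated with a short simulation. The sketch below is not the applet's implementation. The population of random integers, the sample sizes, the number of trials, and the names `sample_mean_error` and `average_error` are all assumptions chosen for illustration.

```python
import random
import statistics


def sample_mean_error(population, population_mean, sample_size, rng):
    """Return how far the mean of one random sample is from the population mean."""
    sample = rng.sample(population, sample_size)
    return abs(statistics.mean(sample) - population_mean)


rng = random.Random(0)  # fixed seed so that repeated runs give the same numbers

# Hypothetical population: 1000 random integers between 0 and 100.
population = [rng.randint(0, 100) for _ in range(1000)]
population_mean = statistics.mean(population)

# Average the error over many samples for each sample size.
average_error = {}
for size in (5, 50, 500):
    errors = [sample_mean_error(population, population_mean, size, rng)
              for _ in range(200)]
    average_error[size] = statistics.mean(errors)
    print(size, round(average_error[size], 2))
```

On average, larger samples are expected to give means closer to the population mean, although any single sample can still miss it.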
