Analyzing and comparing data is as important as collecting it. This lesson covers basic data analysis concepts. It starts with finding a central value and then moves on to measuring how spread out the data points are. A good understanding of statistical measures will be achieved through this lesson.

Catch-Up and Review

Here is a recommended reading to review before getting started with this lesson.

  • What is Statistics?
Challenge

Study a Data Set About The Lifespan of Cats

Emily and Ignacio love learning about animals. They believe they can make meaningful discoveries by studying data about any animal, beginning with cats. They choose to create a data set, consisting of seven data points, showing the lifespan of cats in their neighborhood. They surveyed their neighbors to get this information.

Lifespan of Cats (in years)

Answer the following questions using this data set.

a What is the average lifespan of a cat?
b Which number, if any, occurs most frequently?
c Rearrange the data from least to greatest. What number is in the middle of this sorted data set?
Discussion

What is a Data Set?

A data set is a collection of values that provides information. These values can be presented in various ways such as in numbers or categories. The values are typically gathered through measurements, surveys, or experiments. Consider a data set that consists of the heights of a group of actors.

Actor Height
Madzia
Magda
Ignacio
Henrik
Ali
Diego
Miłosz
Paulina
Aybuke
Mateusz
Gamze
Marcin
Marcial
Heichi
Arkadiusz
Enrique
Aleksandra
Mateusz
Jordan
Paula
MacKenzie
Joe
Flavio
Jeremy
Umut
A single value in a data set, such as an individual actor's height, is called an observation or data point. In this table, each observation corresponds to the height of one actor, so there are as many observations as there are actors listed. Each observation contains two variables, the actor's name and height.
The actual number or category associated with each data point is called a data value. Data values are the specific pieces of information contained within a data point. Data sets can be represented using charts, tables, or different types of graphs. For example, the average temperature of a city for each month of a year can be plotted on a line graph.
This lesson will focus on analyzing data sets resulting from the observation of a single variable.
Discussion

What is the Average of a Data Set?

The mean, or the average, of a numerical data set is one of the measures of center. It is defined as the sum of all of the data values in a set divided by the number of values in the set.

The following applet calculates the mean of the data set on the number line. Points can be moved to change the data values.
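
The definition of the mean can also be sketched in a few lines of Python. The function name `mean` and the sample values are made up for illustration and are not taken from the lesson's applet.

```python
def mean(values):
    """Return the sum of the data values divided by the number of values."""
    if not values:
        raise ValueError("the mean of an empty data set is undefined")
    return sum(values) / len(values)


# Hypothetical data set, not taken from the lesson.
sample = [2, 4, 4, 6, 9]
print(mean(sample))  # 25 / 5 = 5.0
```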

Discussion

What is the Median of a Data Set?

The median is a measure of center that lies in the middle of a numerical data set when the data set is written in numerical order. When the data set has an odd number of data points, the median is the value in the middle.
Random data set with 9 elements, the median is marked
However, when the data set has an even number of data points, the median is the average of the two middle numbers.
Random data set with 10 elements, the median is marked
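
The odd and even cases described above can be sketched in Python as follows. The function name `median` and both sample data sets are hypothetical and only serve as an illustration.

```python
def median(values):
    """Return the middle value of the ordered data,
    or the mean of the two middle values when the count is even."""
    if not values:
        raise ValueError("the median of an empty data set is undefined")
    ordered = sorted(values)  # the data must be in numerical order first
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]  # odd count: a single middle value
    return (ordered[mid - 1] + ordered[mid]) / 2  # even count: average the two middle values


# Hypothetical data sets, not taken from the lesson.
print(median([7, 1, 5, 3, 9]))      # ordered: 1 3 5 7 9, so the median is 5
print(median([7, 1, 5, 3, 9, 11]))  # ordered: 1 3 5 7 9 11, so the median is (5 + 7) / 2 = 6.0
```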
Discussion

What is the Mode of a Data Set?

The mode is a measure of center that shows the most common value in a data set. Modes can be used for both numerical and categorical data.
Random data set with 11 elements, the mode is marked
A data set can have more than one mode if two or more data values are equally common. However, if all values in the set only occur once, then the data set does not have a mode.
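
A minimal Python sketch of this idea is shown below. The function name `modes` and the sample data are assumptions made for illustration. Following the convention stated above, the function returns an empty list when every value occurs only once.

```python
from collections import Counter


def modes(values):
    """Return all most common values, or an empty list if no value repeats."""
    counts = Counter(values)  # maps each value to how often it occurs
    if not counts:
        return []
    highest = max(counts.values())
    if highest == 1:
        return []  # every value occurs once, so there is no mode
    return sorted(value for value, count in counts.items() if count == highest)


# Hypothetical data sets, not taken from the lesson.
print(modes([3, 5, 5, 8, 8, 9]))       # [5, 8]: two modes
print(modes([1, 2, 3]))                # []: no mode
print(modes(["cat", "dog", "cat"]))    # ['cat']: works for categorical data too
```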
Discussion

Summarizing a Data Set With a Single Number

A measure of center, or a measure of central tendency, is a statistic that summarizes a data set by finding a central value. The most common measures of center are the mean, median, and mode.
Interactive applet where points of the dot plot can be moved around.
Move the points around in the dot plot to generate new data. The applet identifies the mean, median, and mode of the data set.
Example

Studying Data about the Lifespan of Dogs

Ignacio volunteers at a dog shelter. He asks Emily to help him study a data set he made concerning the lifespan of some of the dogs. The information they gather will help the shelter! This time, the data set consists of eight data points rather than seven.

Lifespan of Dogs (in years)
a What is the mean of the data set?
b What is the median of the data set?
c What is the mode of the data set?

Hint

a The mean of the data set is the sum of the data values divided by the number of data values.
b Order the data from least to greatest. What number is in the middle?
c The mode of a data set is the value that occurs most frequently.

Solution

a The mean of a data set is calculated by finding the sum of all values in the set and then dividing by the number of values in the set. In this case, there are eight values in the set.
Add all values and divide the sum by 8.
The average lifespan of the dogs studied in the survey is years.
b Start by ordering the values from least to greatest.
The number of values matters when determining the median.
  • For a set with an odd number of values, the median is the middle value.
  • For a set with an even number of values, the median is the mean of the two middle values.
In this case, the data consists of an even number of values. The values in the middle are and
Therefore, the median of the data set is the mean of and
The median of this data set is years.
c Remember that the mode of a data set is the value or values that occur most often. Take another look at the given data set.
As seen, occurs two times and the rest of the numbers occur only once. This means that the mode of the data set is years. Note that while the mean, median, and mode are close in this instance, they may vary in other cases.
Pop Quiz

Practice Finding Measures of the Center

Measures such as the mean, median, and mode are essential for understanding the central tendency of a data set. Find the indicated measure of the given data set. If the answer is not an integer number, round it to one decimal place.

Discussion

Measures of Spread of Data Sets

Similar to the measures of center, there are measures that use a single number to describe how much the values in a data set differ from each other. These measures summarize the spread of the data.

Concept

Range

Range is a measure of spread that measures the difference between the maximum and minimum values of the data set.

Random data sets with maximum, minimum highlighted and the range calculated.
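
As a quick illustration, the range can be computed in Python as sketched below. The name `data_range` and the sample values are hypothetical.

```python
def data_range(values):
    """Return the difference between the maximum and minimum values."""
    if not values:
        raise ValueError("the range of an empty data set is undefined")
    return max(values) - min(values)


# Hypothetical data set, not taken from the lesson.
print(data_range([12, 7, 15, 9, 10]))  # 15 - 7 = 8
```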
Discussion

Quartiles

Quartiles are three values that divide a data set into four equal parts. The quartiles are denoted as Q1, Q2, and Q3. The second quartile Q2, also known as the median, divides the ordered data set into two halves.
The median of the lower half is the first quartile Q1, while the median of the upper half is the third quartile Q3.
The first quartile is also called the lower quartile, and the third quartile is also called the upper quartile. To find the quartiles of a data set, the values must first be written in numerical order.
Example of how three quartiles can be identified in a set
Discussion

Interquartile Range

The interquartile range, or IQR, of a data set is a measure of spread that measures the difference between Q3 and Q1, the upper and lower quartiles.

IQR = Q3 - Q1

The following applet shows how to find the IQR of different data sets.

Applet that calculates the interquartile range of a data set
Discussion

Finding the Interquartile Range

The interquartile range (IQR) of a data set is found by first identifying the three quartiles and then calculating the difference between the third and the first quartile. Consider the following data set.
The interquartile range of the data set can be found by following these four steps.
1. Identify the Median

First, identify the median of the given data set. Since the number of values is even, the median is the mean of the two middle values.

The median of the data is

2. Identify the Lower and the Upper Half of the Data Set

The median divides the data into two halves, a lower half and an upper half. For this data, the lower half includes the first six values and the upper half includes the following six.

When there is an odd number of values in the data set, the middle value is excluded from both the lower and upper sets.

3. Find the First and the Third Quartile

Find the first and the third quartile. The first quartile, Q1, is the median of the lower set, while the third, Q3, is the median of the upper set. Here, both quartiles are found the same way the median was found.

4. Calculate the Interquartile Range
The interquartile range is calculated by subtracting the first quartile, Q1, from the third, Q3. For the given data set, the first quartile is and the third quartile is
The interquartile range of the given data set is
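
The four steps above can be sketched in Python as follows. The function names and the twelve sample values are assumptions made for illustration and are not the lesson's data. The sketch follows the convention described in the steps, where the middle value is excluded from both halves when the number of values is odd. Other tools may use different quartile conventions and can return slightly different results.

```python
def median(ordered):
    """Median of a list that is already in numerical order."""
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


def quartiles(values):
    """Return (Q1, Q2, Q3) using the median-of-halves method."""
    if len(values) < 2:
        raise ValueError("at least two values are needed")
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]       # lower half of the data
    upper = ordered[n - half:]   # upper half; skips the middle value when n is odd
    return median(lower), median(ordered), median(upper)


def interquartile_range(values):
    """Return Q3 - Q1."""
    q1, _, q3 = quartiles(values)
    return q3 - q1


# Hypothetical data set with twelve values, not taken from the lesson.
data = [1, 3, 4, 6, 7, 8, 10, 12, 13, 15, 18, 20]
print(quartiles(data))            # (5.0, 9.0, 14.0)
print(interquartile_range(data))  # 14.0 - 5.0 = 9.0
```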
Example

Comparing the Weights of Cats and Dogs

Ignacio and Emily enjoyed learning about cats and dogs so much that they now want to compare the spread of one data set with the spread of another. They collected a few more data points and compiled a data set consisting of nine data points for the weights of cats.
Then, they collected a data set of ten data points for the weights of dogs.
a Which type of pet has a larger weight range: dogs or cats?
b Find the interquartile range of each data set.

Hint

a The range of the data set is the difference between the greatest and least data values.
b The interquartile range is the distance between the first and the third quartiles of the data set.

Solution

a The range is one of the measures of spread. It is the difference between the maximum and minimum values of the data set. The range of each data set will be calculated individually.

Range for the Weights of Cats

The least and greatest values can be identified without sorting the data values. Note that they can be listed in order if desired.
The least value is and the greatest value is The difference between these values is
The range for the weights of cats is pounds.

Range for the Weights of Dogs

Apply the same procedure of identifying the greatest and least values for the data set of dogs.
The least value is and the greatest value is The difference between these values is
Dogs have a weight range of pounds. This far exceeds the pound range for cats.
b The interquartile range of each data set will be calculated individually.

Interquartile Range of Cat Weights

Here, it is necessary to order the values from least to greatest. Then identify the median of the given data set. Since the number of values is an odd number, the median is the middle value.

The median of the data is Both the lower and upper halves contain four data values. Therefore, there are two middle values in each half. The median of each half is the mean of the two middle values.

The first quartile is and the third quartile is The difference between the third quartile and the first quartile is the interquartile range.
The interquartile range of cat weights is pounds.

Interquartile Range of Dog Weights

In this case, the data values are ordered from least to greatest and the number of values is an even number. This means that the median is the mean of the two middle values.

The median of the data is Both the lower and upper halves contain five data values. Therefore, there is only one middle value in each half.

The first quartile is and the third quartile is The difference between the third quartile and the first quartile is the interquartile range.
The interquartile range of dog weights is pounds.
Discussion

Five-Number Summary

A five-number summary of a data set consists of the following five values.

  1. Minimum value
  2. First quartile
  3. Median, or second quartile
  4. Third quartile
  5. Maximum value

These values provide a summary of the central tendency and spread of the data set. The five-number summary is useful for understanding the variability in a data set. When the data set is written in numerical order, the median divides the data set into two halves. The median of the lower half is the first quartile Q1, and the median of the upper half is the third quartile Q3.

Five-number summary is applied to different data sets
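
A five-number summary can be sketched in Python as shown below. The function name `five_number_summary`, the dictionary keys, and the nine sample values are hypothetical. The quartiles are found as the medians of the lower and upper halves, with the middle value left out of both halves when the count is odd.

```python
from statistics import median


def five_number_summary(values):
    """Return the minimum, Q1, median, Q3, and maximum of a data set."""
    if len(values) < 2:
        raise ValueError("at least two values are needed")
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    return {
        "minimum": ordered[0],
        "first quartile": median(ordered[:half]),       # median of the lower half
        "median": median(ordered),
        "third quartile": median(ordered[n - half:]),   # median of the upper half
        "maximum": ordered[-1],
    }


# Hypothetical data set with nine values, not taken from the lesson.
summary = five_number_summary([2, 4, 5, 7, 8, 10, 13, 14, 21])
for name, value in summary.items():
    print(name, value)
```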
Discussion

Outliers

An outlier is a data point that is significantly different from the other values in the data set. It can be significantly larger or significantly smaller than the others.

Several human figures in line where one is much higher than the others.

Categorical data sometimes also have unusual elements; these can be called outliers as well.

Several human figures in line where one is of different color.
However, it is best to use the term outlier when applying a mathematical process to identify it. This approach helps differentiate between an intuitive and a more formal approach.

What Does Significantly Different Mean?

For numerical data, the following definition is one of the several approaches that can be used.

  • A data value is an outlier (significantly different from the other values) if it is farther away from the closest quartile than 1.5 times the interquartile range.

This multiplier was suggested by the esteemed American mathematician John Tukey. Move the slider in the following applet to see which data point is an outlier.

Identifying outliers
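
The rule above can be sketched in Python as follows. The function name `tukey_outliers`, the parameter `k`, and the ten sample values are assumptions made for illustration. A value is flagged when it lies more than `k` times the interquartile range below the first quartile or above the third quartile, with `k` set to the commonly used 1.5 by default.

```python
from statistics import median


def tukey_outliers(values, k=1.5):
    """Return the values lying more than k * IQR beyond the nearest quartile."""
    if len(values) < 2:
        raise ValueError("at least two values are needed")
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    q1 = median(ordered[:half])        # median of the lower half
    q3 = median(ordered[n - half:])    # median of the upper half
    iqr = q3 - q1
    low_fence = q1 - k * iqr
    high_fence = q3 + k * iqr
    return [v for v in ordered if v < low_fence or v > high_fence]


# Hypothetical data set, not taken from the lesson.
data = [10, 12, 12, 13, 14, 15, 15, 16, 17, 40]
# Q1 = 12, Q3 = 16, IQR = 4, so the fences are 12 - 6 = 6 and 16 + 6 = 22.
print(tukey_outliers(data))  # [40]
```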
Example

An Unusual Value in the Data

Ignacio is relaxing, enjoying reviewing some data. Wait a minute! There is something unusual about a data value in the data set for dogs.
a Identify the outlier in the data set.
b Find the range and interquartile range of the data set without the outlier.
c Which measure does the outlier affect more?

Hint

a Is there a value that is larger or smaller than most values? Check if there is any value less than Q1 - 1.5 · IQR or greater than Q3 + 1.5 · IQR.
b The range of a data set is the difference between the greatest and smallest data values. The interquartile range of a data set is the distance between the first and the third quartiles of the data set.
c Compare the range and interquartile range of the data set with and without the outlier.

Solution

a In the given data set, all values seem to be around the same number, except This value seems to be significantly different from other values. Therefore, it is likely to be an outlier of the data set.
To confirm that, check if it is farther away from the closest quartile than 1.5 times the interquartile range. The first quartile of this data is and the third quartile is
The interquartile range of this data set is Now calculate
This means that any value greater than is an outlier. Therefore, the value is an outlier.
b Exclude the outlier found in Part A from the data set.

Finding Range

To find its range, subtract the smallest value from the greatest.

Finding Interquartile Range

After excluding the outlier, the number of values decreased by one. There are nine values now, so the median is the middle value.

The median of the data is Both the lower and upper halves contain four data values. Therefore, there are two middle values in each half. The median of each half is the mean of the two middle values.

The first quartile is and the third quartile is The difference between the third quartile and the first quartile is the interquartile range.
The interquartile range of the data when the outlier is taken out of the data set is
c Consider the data once again, with and without outliers.
Summarize the results found in the previous parts.
Range IQR
With Outliers
Without Outliers

After removing the outlier from the data, the range decreased from to while the IQR increased from to This example shows that outliers have a bigger impact on the range of values than on the IQR.

Pop Quiz

Practice Finding Measures of Spread

Measures of spread, such as the range and interquartile range, indicate how much data values vary, while outliers are values that deviate significantly from the rest. Practice calculating these measures for the given data.

Discussion

Mean Absolute Deviation

The mean absolute deviation (MAD) is a measure of the spread of a data set that measures how much the data elements differ from the mean. The mean absolute deviation is the average distance between each data value and the mean.

Calculating the MAD involves determining the absolute difference between every data point and the mean, followed by averaging these absolute differences. The applet below calculates the mean absolute deviation for the data set on the number line. Move the points around to change the data.
Applet to calculate the mean absolute deviation
A large MAD value indicates that data points deviate considerably from the mean — that is, there is significant variation within the data set.
Discussion

Finding the Mean Absolute Deviation

The mean absolute deviation is a measure that describes the average absolute difference between the data points in a data set and the mean of the data set. It is calculated by finding the absolute difference between each data point and the mean, then taking the average of those absolute differences. Consider for example the following data set.
The values are the scores of students on a math test. The mean absolute deviation of the data set can be found by following these three steps.
1. Calculate the Mean
The mean of a data set is the sum of all values in the set divided by the number of values.
The mean of the data is
2. Calculate the Distance Between Each Data Point and the Mean

Next, calculate the absolute value of the differences between each data value and the mean.

Data Value Absolute Value of Difference
3. Calculate the Average of the Distances Found in Step 2
Find the average of the absolute values of the differences between each data value and the mean.
The mean absolute deviation for the given data set is This means that the average distance each data value is from the mean is points. In other words, on average, the students' test scores deviate from the mean of by points.
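
The three steps above can be sketched in Python as follows. The function name `mean_absolute_deviation` and the five test scores are hypothetical and are not the lesson's data.

```python
def mean_absolute_deviation(values):
    """Return the average distance between each data value and the mean."""
    if not values:
        raise ValueError("the MAD of an empty data set is undefined")
    center = sum(values) / len(values)               # Step 1: the mean
    distances = [abs(v - center) for v in values]    # Step 2: distance of each value from the mean
    return sum(distances) / len(distances)           # Step 3: average of the distances


# Hypothetical test scores, not taken from the lesson.
scores = [70, 80, 85, 90, 100]
# Mean: 425 / 5 = 85. Distances: 15, 5, 0, 5, 15. Their average: 40 / 5 = 8.
print(mean_absolute_deviation(scores))  # 8.0
```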
Example

Studying the Mean Absolute Deviation of Cat Heights

Ignacio and Emily are researching the variation in the heights of cats from the mean. They are most interested in calculating the mean absolute deviation of the cats' heights to better understand how the sizes of the cats vary.
Find the mean absolute deviation of the cats' heights. Round the answer to one decimal place.

Hint

Start by finding the mean. Then, calculate the distances between the mean and each data value. Finally, find the mean of these distances.

Solution

Begin by recalling what the mean absolute deviation is.

Mean Absolute Deviation

An average of how much data values differ from the mean.

To find the mean absolute deviation, these steps can be followed.

  1. Find the mean of the data
  2. Find the distance between each data value and the mean
  3. Find the average of the distances found in Step 2
Start by calculating the mean. The given data set consists of the heights of cats, measured in inches.
The mean is the sum of all values divided by the number of values.
The mean of the data set is inches. Now, move on to finding the distances between the data values and the mean.
Data Value Absolute Value of Difference
Finally, add the values found in the table and then divide the sum by the number of values.
The mean absolute deviation is about inches. This is the average distance of each data value from the mean. On average, the heights of cats deviate from the mean of inches by about inches.
Discussion

Standard Deviation

The standard deviation is a measure of spread of a data set that measures how much the data values differ from the mean. The Greek letter σ, read as sigma, is commonly used to denote the standard deviation. In many data sets, particularly roughly bell-shaped ones, most of the values fall within one standard deviation of the mean.
For example, if the mean of a data set is and the standard deviation is then most of the values fall between and as and

Standard deviation shows the variation of data from the mean.

  • If the standard deviation is small, it means the values in the data set are close to the mean.
  • If the standard deviation is large, it means the values are spread out over a wider range.
Calculating standard deviation is not the focus of this lesson. However, the following applet illustrates how to calculate the standard deviation for five data points.
Applet that calculates the standard deviation of a set of five numbers
As illustrated, the standard deviation is the square root of the average of the squared differences between each value in the data set and the mean of the data set.
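
The description above can be sketched in Python as follows. The function name `population_std` and the five sample values are assumptions made for illustration. The sketch divides by the number of values, as described, which is often called the population standard deviation. Some tools divide by one less than the number of values instead, which gives a slightly larger result.

```python
import math


def population_std(values):
    """Square root of the average squared difference from the mean."""
    if not values:
        raise ValueError("the standard deviation of an empty data set is undefined")
    center = sum(values) / len(values)
    squared_differences = [(v - center) ** 2 for v in values]
    return math.sqrt(sum(squared_differences) / len(squared_differences))


# Hypothetical data set with five values, not taken from the lesson's applet.
data = [2, 4, 5, 6, 8]
# Mean: 25 / 5 = 5. Squared differences: 9, 1, 0, 1, 9. Their average: 20 / 5 = 4.
print(population_std(data))  # sqrt(4) = 2.0
```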
Example

Data Values Beyond One Standard Deviation

Finally, Ignacio and Emily are examining the variation in the heights of dogs. Here are the values they are analyzing.
They found that the standard deviation of the heights is inches. Which heights are not within one standard deviation from the mean?

Hint

First find the mean of the given data set. Then, find the range of the values that are within one standard deviation from the mean.

Solution

Start by calculating the mean. The given data set consists of the heights of dogs, measured in inches.
The mean is the sum of all values divided by the total number of values.
The mean of the data set is inches. Now, find the range of values that are within one standard deviation of the mean! To do that, subtract one standard deviation from the mean and add one standard deviation to the mean — record both numbers.
The data values that are between and inches are within one standard deviation of the mean.

The values that are less than are and and the values that are greater than are and That means the heights outside the range of one standard deviation from the mean are and inches.
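
The same check can be sketched in Python as shown below. The function name `outside_one_std`, the sample heights, and the standard deviation of 3.2 are hypothetical and are not the lesson's values. As in the example, the standard deviation is treated as given rather than computed.

```python
def outside_one_std(values, std):
    """Return the values that are more than one standard deviation from the mean."""
    if not values:
        raise ValueError("the data set is empty")
    center = sum(values) / len(values)
    low = center - std    # mean minus one standard deviation
    high = center + std   # mean plus one standard deviation
    return [v for v in values if v < low or v > high]


# Hypothetical heights in inches with an assumed standard deviation of 3.2.
heights = [10, 14, 15, 16, 20]
# Mean: 75 / 5 = 15, so values between 11.8 and 18.2 are within one standard deviation.
print(outside_one_std(heights, 3.2))  # [10, 20]
```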

Closure

Finding The Measures of Center of a Data Set

In this lesson, the measures of center and measures of spread were discussed.

Measures of Center: Mean, Mode, Median
Measures of Spread: Range, Interquartile Range, Mean Absolute Deviation, Standard Deviation
With the definitions of each concept in hand, the challenge presented at the beginning of the lesson can now be solved. The challenge is to find the measures of center for the lifespan of cats.
Remember that the mean of a data set is calculated by finding the sum of all values in the set and then dividing by the number of values in the set. Move the points to calculate the mean of the data set.
The average lifespan of cats is about years. To find the mode and median of the data set, the values are ordered from least to greatest.
Since the data set has an odd number of values, the median is the middle value, years. Furthermore, this data set has two modes, and years, as both values appear twice in the data set.