Analyzing and comparing data is as important as collecting it. This lesson covers basic data analysis concepts. It starts with finding a central value and then moves on to measuring how spread out the data points are. A good understanding of statistical measures will be achieved through this lesson.

Catch-Up and Review

Here is a recommended reading before getting started with this lesson.

  • What is Statistics?
Challenge

Study a Data Set About The Lifespan of Cats

Emily and Ignacio love learning about animals. They believe they can make meaningful discoveries by studying data about any animal, beginning with cats. They choose to create a data set, consisting of seven data points, showing the lifespan of cats in their neighborhood. They surveyed their neighbors to get this information.

Lifespan of Cats (in years)

Answer the following questions using this data set.

a What is the average lifespan of a cat?
b Which number, if any, occurs most frequently?
c Rearrange the data from least to greatest. What number is in the middle of this sorted data set?
Discussion

What is a Data Set?

A data set is a collection of values that provides information. These values can be presented in various ways such as in numbers or categories. The values are typically gathered through measurements, surveys, or experiments. Consider a data set that consists of the heights of a group of actors.

Actor Height
Madzia
Magda
Ignacio
Henrik
Ali
Diego
Miłosz
Paulina
Aybuke
Mateusz
Gamze
Marcin
Marcial
Heichi
Arkadiusz
Enrique
Aleksandra
Mateusz
Jordan
Paula
MacKenzie
Joe
Flavio
Jeremy
Umut
A single value in a data set, such as an individual actor's height, is called an observation or data point. In this table, each observation corresponds to the height of an actor, meaning that there are 25 observations. Each observation contains two variables, the actor's name and height.
The actual number or category associated with each data point is called a data value. Data values are the specific pieces of information contained within a data point. Data sets can be represented using charts, tables, or different types of graphs. For example, the average temperature of a city for each month of a year can be plotted on a line graph.
This lesson will focus on analyzing data sets resulting from the observation of a single variable.
Discussion

What is the Average of a Data Set?

The mean, or the average, of a numerical data set is one of the measures of center. It is defined as the sum of all of the data values in a set divided by the number of values in the set.
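As a minimal sketch of this definition, the following Python snippet computes the mean of a small data set by hand and checks it against `statistics.mean` from the standard library. The data values are made up for illustration and are not taken from the lesson.

```python
from statistics import mean

# Hypothetical data values, chosen only for illustration.
data = [4, 8, 15, 16, 23, 42]

# Mean = sum of the data values divided by the number of values.
manual_mean = sum(data) / len(data)

print(manual_mean)                 # 18.0
print(manual_mean == mean(data))   # True
```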

The following applet calculates the mean of the data set on the number line. Points can be moved to change the data values.

Discussion

What is the Median of a Data Set?

The median is a measure of center that lies in the middle of a numerical data set when the data set is written in numerical order. When the data set has an odd number of data points, the median is the value in the middle.
Random data set with 9 elements, the median is marked
However, when the data set has an even number of data points, the median is the average of the two middle numbers.
Random data set with 10 elements, the median is marked
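The two cases described above can be sketched in a few lines of Python. The function name `median` and both data sets are hypothetical, chosen only to show one odd-sized and one even-sized example.

```python
def median(values):
    """Return the median of a non-empty list of numbers."""
    ordered = sorted(values)      # the data must be in numerical order
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]       # odd count: the middle value
    # even count: the mean of the two middle values
    return (ordered[mid - 1] + ordered[mid]) / 2

odd_data = [7, 3, 9, 1, 5]        # hypothetical values
even_data = [7, 3, 9, 1, 5, 11]   # hypothetical values

print(median(odd_data))    # 5
print(median(even_data))   # 6.0
```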
Discussion

What is the Mode of a Data Set?

The mode is a measure of center that shows the most common value in a data set. Modes can be used for both numerical and categorical data.
Random data set with 11 elements, the mode is marked
A data set can have more than one mode if two or more data values are equally common. However, if all values in the set only occur once, then the data set does not have a mode.
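The following sketch, assuming a hypothetical helper named `modes`, illustrates the three situations described above: a single mode, several modes, and no mode. It uses `collections.Counter` to count how often each value occurs, and it works for categorical data as well.

```python
from collections import Counter

def modes(values):
    """Return all modes of a non-empty list, or [] if every value occurs once."""
    counts = Counter(values)
    highest = max(counts.values())
    if highest == 1:
        return []                 # no value repeats, so there is no mode
    return [value for value, count in counts.items() if count == highest]

# All data below is made up for illustration.
print(modes([2, 3, 3, 5, 7]))          # [3]
print(modes([1, 1, 4, 4, 6]))          # [1, 4]
print(modes([1, 2, 3]))                # []
print(modes(["red", "blue", "red"]))   # ['red']
```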
Discussion

Summarizing a Data Set With a Single Number

A measure of center, or a measure of central tendency, is a statistic that summarizes a data set by finding a central value. The most common measures of center are the mean, median, and mode.
Interactive applet where points of the dot plot can be moved around.
Move the points around in the dot plot to generate new data. The applet identifies the mean, median, and mode of the data set.
Example

Studying Data about the Lifespan of Dogs

Ignacio volunteers at a dog shelter. He asks Emily to help him study a data set he made concerning the lifespan of some of the dogs. The information they gather will help the shelter! This time, the data set consists of eight data points rather than seven.

Lifespan of Dogs (in years)
a What is the mean of the data set?
b What is the median of the data set?
c What is the mode of the data set?

Hint

a The mean of the data set is the sum of the data values divided by the number of data values.
b Order the data from least to greatest. What number is in the middle?
c The mode of a data set is the value that occurs most frequently.

Solution

a The mean of a data set is calculated by finding the sum of all values in the set and then dividing by the number of values in the set. In this case, there are eight values in the set.
Add all values and divide the sum by eight.
The average lifespan of the dogs studied in the survey is years.
b Start by ordering the values from least to greatest.
The number of values matters when determining the median.
  • For a set with an odd number of values, the median is the middle value.
  • For a set with an even number of values, the median is the mean of the two middle values.
In this case, the data consists of an even number of values. The values in the middle are and
Therefore, the median of the data set is the mean of and
The median of this data set is years.
c Remember that the mode of a data set is the value or values that occur most often. Take another look at the given data set.
As seen, occurs two times and the rest of the numbers occur only once. This means that the mode of the data set is years. Note that while the mean, median, and mode are close in this instance, they may vary in other cases.
Pop Quiz

Practice Finding Measures of the Center

Measures such as the mean, median, and mode are essential for understanding the central tendency of a data set. Find the indicated measure of the given data set. If the answer is not an integer, round it to one decimal place.

Discussion

Measures of Spread of Data Sets

Just as a measure of center summarizes a data set with one number, a measure of spread describes with a single number how much the values in a data set differ from each other.

Concept

Range

Range is a measure of spread that measures the difference between the maximum and minimum values of the data set.

Random data sets with maximum, minimum highlighted and the range calculated.
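As a small illustration of this definition, the sketch below computes the range with Python's built-in `max` and `min`. The helper name `data_range` and the values are hypothetical. Note that the data does not need to be sorted first.

```python
def data_range(values):
    """Return the difference between the greatest and least values."""
    return max(values) - min(values)

values = [12, 7, 19, 4, 15]   # hypothetical, unsorted data

print(data_range(values))     # 15
```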
Discussion

Quartiles

Quartiles are three values that divide a data set into four equal parts. The quartiles are denoted as $Q_1$, $Q_2$, and $Q_3$. The second quartile $Q_2$, also known as the median, divides the ordered data set into two halves.
The median of the lower half is the first quartile $Q_1$, while the median of the upper half is the third quartile $Q_3$.
The first quartile is also called the lower quartile, and the third quartile is also called the upper quartile. To find the quartiles of a data set, the values must first be written in numerical order.
Example of how three quartiles can be identified in a set
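One way to sketch this procedure in Python is shown below. The function names and the data are hypothetical. The sketch assumes the convention that, for an odd number of values, the middle value belongs to neither half. Other conventions exist, so library functions such as `statistics.quantiles` may return slightly different quartiles.

```python
def median(ordered):
    """Median of an already sorted, non-empty list."""
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

def quartiles(values):
    """Return (Q1, Q2, Q3) using the median-of-halves method."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]             # values before the median position
    upper = ordered[half + n % 2:]     # skips the middle value when n is odd
    return median(lower), median(ordered), median(upper)

data = [6, 1, 9, 3, 12, 7, 4, 10, 15]  # hypothetical data with nine values

print(quartiles(data))   # (3.5, 7, 11.0)
```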
Discussion

Interquartile Range

The interquartile range, or IQR, of a data set is a measure of spread that measures the difference between $Q_3$ and $Q_1$, the upper and lower quartiles.

$\text{IQR} = Q_3 - Q_1$

The following applet shows how to find the IQR of different data sets.

Applet that calculates the interquartile range of a data set
Discussion

Finding the Interquartile Range

The interquartile range (IQR) of a data set is found by first identifying the three quartiles and then calculating the difference between the third and the first quartile. Consider the following data set.
The interquartile range of the data set can be found by following these four steps.
1. Identify the Median

First, identify the median of the given data set. Since the number of values is even, the median is the mean of the two middle values.

The median of the data is

2. Identify the Lower and the Upper Half of the Data Set

The median divides the data into two halves, a lower half and an upper half. For this data, the lower half includes the first six values and the upper half includes the following six.

When there is an odd number of values in the data set, the middle value is excluded from both the lower and upper sets.

3. Find the First and the Third Quartile

Find the first and the third quartile. The first quartile, $Q_1$, is the median of the lower set, while the third, $Q_3$, is the median of the upper set. Here, both quartiles are found the same way the median was found.

4. Calculate the Interquartile Range
The interquartile range is calculated by subtracting the first quartile, $Q_1$, from the third, $Q_3$. For the given data set, the first quartile is and the third quartile is
The interquartile range of the given data set is
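The four steps above can be traced in code. The following sketch follows them one by one on a made-up data set of twelve values. It is not the lesson's data set, and the helper name `median` is hypothetical.

```python
def median(ordered):
    """Median of an already sorted, non-empty list."""
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

# Hypothetical data set with twelve values.
data = [14, 3, 8, 21, 5, 17, 10, 12, 1, 19, 7, 24]
ordered = sorted(data)

# Step 1: identify the median (even count, so the mean of the two middle values).
q2 = median(ordered)

# Step 2: split the ordered data into a lower and an upper half.
# With an odd count, the middle value would be left out of both halves.
n = len(ordered)
half = n // 2
lower = ordered[:half]
upper = ordered[half + n % 2:]

# Step 3: the quartiles are the medians of the two halves.
q1 = median(lower)
q3 = median(upper)

# Step 4: subtract the first quartile from the third.
iqr = q3 - q1

print(q1, q2, q3, iqr)   # 6.0 11.0 18.0 12.0
```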
Example

Comparing the Weights of Cats and Dogs

Ignacio and Emily enjoyed learning about cats and dogs so much that they now want to compare the spread of one data set with the spread of another. They collected a few more data points and compiled a data set consisting of nine data points for the weights of cats.
Then, they collected a data set of ten data points for the weights of dogs.
a Which type of pet has a larger weight range: dogs or cats?
b Find the interquartile range of each data set.

Hint

a The range of the data set is the difference between the greatest and least data values.
b The interquartile range is the distance between the first and the third quartiles of the data set.

Solution

a The range is one of the measures of spread. It is the difference between the maximum and minimum values of the data set. The range of each data set will be calculated individually.

Range for the Weights of Cats

The least and greatest values can be identified without sorting the data values. Note that they can be listed in order if desired.
The least value is and the greatest value is The difference between these values is
The range for the weights of cats is pounds.

Range for the Weights of Dogs

Apply the same procedure of identifying the greatest and least values for the data set of dogs.
The least value is and the greatest value is The difference between these values is
Dogs have a weight range of pounds. This far exceeds the pound range for cats.
b The interquartile range of each data set will be calculated individually.

Interquartile Range of Cat Weights

Here, it is necessary to order the values from least to greatest. Then identify the median of the given data set. Since the number of values is an odd number, the median is the middle value.

The median of the data is Both the lower and upper halves contain four data values. Therefore, there are two middle values in each half. The median of each half is the mean of the two middle values.

The first quartile is and the third quartile is The difference between the third quartile and the first quartile is the interquartile range.
The interquartile range of cat weights is pounds.

Interquartile Range of Dog Weights

In this case, the data values are ordered from least to greatest and the number of values is an even number. This means that the median is the mean of the two middle values.

The median of the data is Both the lower and upper halves contain five data values. Therefore, there is only one middle value in each half.

The first quartile is and the third quartile is The difference between the third quartile and the first quartile is the interquartile range.
The interquartile range of dog weights is pounds.
Discussion

Five-Number Summary

A five-number summary of a data set consists of the following five values.

  1. Minimum value
  2. First quartile
  3. Median, or second quartile
  4. Third quartile
  5. Maximum value

These values provide a summary of the central tendency and spread of the data set. The five-number summary is useful for understanding the variability in a data set. When the data set is written in numerical order, the median divides the data set into two halves. The median of the lower half is the first quartile $Q_1$ and the median of the upper half is the third quartile $Q_3$.

Five-number summary is applied to different data sets
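As an illustration, the sketch below collects the five values into a tuple. The function names and the data are hypothetical, and the quartiles are computed as medians of the lower and upper halves, as described above.

```python
def median(ordered):
    """Median of an already sorted, non-empty list."""
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum)."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    q1 = median(ordered[:half])
    q3 = median(ordered[half + n % 2:])   # skips the middle value when n is odd
    return ordered[0], q1, median(ordered), q3, ordered[-1]

data = [5, 2, 9, 4, 7, 11, 3, 8]          # hypothetical data with eight values

print(five_number_summary(data))   # (2, 3.5, 6.0, 8.5, 11)
```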
Discussion

Outliers

An outlier is a data point that is significantly different from the other values in the data set. It can be significantly larger or significantly smaller than the others.

Several human figures in line where one is much higher than the others.

Categorical data sometimes also have unusual elements; these can be called outliers as well.

Several human figures in line where one is of different color.
However, it is best to use the term outlier when applying a mathematical process to identify it. This approach helps differentiate between an intuitive and a more formal approach.

What Does Significantly Different Mean?

For numerical data, the following definition is one of the several approaches that can be used.

  • A data value is an outlier, that is, significantly different from the other values, if it is farther away from the closest quartile than 1.5 times the interquartile range.

Such a value was suggested by the esteemed American mathematician John Tukey. Move the slider in the following applet to see which data point is an outlier.

Identifying outliers
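A minimal sketch of this test in Python is given below. It assumes the commonly used multiplier of 1.5 attributed to Tukey, exposed here as a hypothetical parameter `k`, and it uses made-up data. A value is flagged when it lies below $Q_1 - k \cdot \text{IQR}$ or above $Q_3 + k \cdot \text{IQR}$.

```python
def median(ordered):
    """Median of an already sorted, non-empty list."""
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

def find_outliers(values, k=1.5):
    """Return the values farther than k * IQR from the closest quartile."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    q1 = median(ordered[:half])
    q3 = median(ordered[half + n % 2:])
    iqr = q3 - q1
    low_fence = q1 - k * iqr      # anything below this is an outlier
    high_fence = q3 + k * iqr     # anything above this is an outlier
    return [v for v in ordered if v < low_fence or v > high_fence]

data = [10, 12, 11, 13, 12, 14, 11, 13, 12, 40]   # hypothetical data

print(find_outliers(data))   # [40]
```

For this data, the quartiles are 11 and 13, so the IQR is 2 and the fences sit at 8 and 16. Only 40 falls outside them.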
Example

An Unusual Value in the Data

Ignacio is relaxing, enjoying reviewing some data. Wait a minute! There is something unusual about a data value in the data set for dogs.
a Identify the outlier in the data set.
b Find the range and interquartile range of the data set without the outlier.
c Which measure does the outlier affect more?

Hint

a Is there a value that is larger or smaller than most values? Check if there is any value less than $Q_1 - 1.5 \cdot \text{IQR}$ or greater than $Q_3 + 1.5 \cdot \text{IQR}$.
b The range of a data set is the difference between the greatest and smallest data values. The interquartile range of a data set is the distance between the first and the third quartiles of the data set.
c Compare the range and interquartile range of the data set with and without the outlier.

Solution

a In the given data set, all values seem to be around the same number, except This value seems to be significantly different from other values. Therefore, it is likely to be an outlier of the data set.
To confirm that, check if it is farther away from the closest quartile than 1.5 times the interquartile range. The first quartile of this data is and the third quartile is
The interquartile range of this data set is Now calculate
This means that any value greater than is an outlier. Therefore, the value is an outlier.
b Exclude the outlier found in Part A from the data set.

Finding Range

To find its range, subtract the smallest value from the greatest.

Finding Interquartile Range

After excluding the outlier, the number of values decreased by one. There are nine values now, so the median is the middle value.

The median of the data is Both the lower and upper halves contain four data values. Therefore, there are two middle values in each half. The median of each half is the mean of the two middle values.

The first quartile is and the third quartile is The difference between the third quartile and the first quartile is the interquartile range.
The interquartile range of the data when the outlier is taken out of the data set is
c Consider the data once again, with and without outliers.
Summarize the results found in the previous parts.
Range IQR
With Outliers
Without Outliers

After removing the outlier from the data, the range decreased from to while the IQR increased from to This example shows that outliers have a bigger impact on the range of values than on the IQR.
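The same comparison can be tried on other data. The sketch below uses a made-up data set, not the one from this example, and a hypothetical helper `range_and_iqr` to compute both measures with and without an extreme value.

```python
def median(ordered):
    """Median of an already sorted, non-empty list."""
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

def range_and_iqr(values):
    """Return (range, IQR) using the median-of-halves quartiles."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    q1 = median(ordered[:half])
    q3 = median(ordered[half + n % 2:])
    return ordered[-1] - ordered[0], q3 - q1

with_outlier = [10, 12, 11, 13, 12, 14, 11, 13, 12, 40]   # hypothetical data
without_outlier = [v for v in with_outlier if v != 40]

print(range_and_iqr(with_outlier))      # (30, 2)
print(range_and_iqr(without_outlier))   # (4, 2.0)
```

In this made-up data, removing the extreme value shrinks the range from 30 to 4 while the IQR stays at 2, which is consistent with the observation that the range is far more sensitive to outliers.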

Pop Quiz

Practice Finding Measures of Spread

Measures of spread, such as the range and interquartile range, indicate how much data values vary, while outliers are values that significantly deviate from the rest. Practice calculating these measures for the given data.