A data set has little real meaning until it is shown in a data display to visualize trends and patterns. There are lots of data displays that can be used to explore data. This lesson will show how to use histograms and box plots to determine the frequency distribution of the data and which measures are used to describe the center and variation.

Challenge

Giving Meaning to a Data Set

Tadeo and Ramsha are foodies and love to explore the restaurants that pop up in their neighborhood. They have recorded data about the average price of the main dishes in each restaurant using a table of values.

Average Main Dish Price (Dollars)

They would like to draw some conclusions from this data. However, they are not entirely sure how to proceed with this task. Find the following information to help these curious connoisseurs!

a Consider the following histograms.
Four possible histograms. Only one correctly represents the data.
Which histogram best describes the data?
b Select the option that describes the distribution of the data.
c Which measures of center and variation best describe the data?
Discussion

Frequency Distribution

A frequency distribution, sometimes called a histogram distribution, is a representation that displays the number of observations within a given interval. It is used to show the empirical or theoretical frequency of occurrence of each possible value in a data set, often recorded in a frequency table. Frequency distributions of categorical data are typically presented using a bar graph.

Bar graph

In the case of numerical data, the graphical representation of a frequency distribution is called a histogram.

Histogram
Depending on how a data set is distributed, its histogram can have different shapes. The most common types of distributions are symmetric frequency distribution and skewed frequency distribution.
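The step from raw numerical data to a frequency table can be sketched in a few lines of Python. This is only an illustration: the data values, the helper name `frequency_table`, and the choice of intervals of width 10 are assumptions made for the example and are not part of the lesson.

```python
from collections import Counter

def frequency_table(values, start, width, bins):
    """Count how many whole-number values fall in each interval of equal width."""
    # Map each value to the index of the interval that contains it.
    counts = Counter((v - start) // width for v in values)
    table = []
    for i in range(bins):
        lo = start + i * width
        hi = lo + width - 1  # inclusive upper label for whole-number data
        table.append((f"{lo}-{hi}", counts.get(i, 0)))
    return table

# Hypothetical data set (not taken from the lesson).
sample = [3, 7, 8, 12, 14, 15, 15, 18, 21, 22, 22, 27, 31, 36]
table = frequency_table(sample, start=0, width=10, bins=4)

# Print a simple text histogram: one '#' per observation.
for label, count in table:
    print(f"{label:>6} | {'#' * count}")
```

Each row of the printed output plays the role of one bar of a histogram, so the shape of the distribution can already be read from the lengths of the rows.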
Concept

Symmetric Frequency Distribution

A symmetric frequency distribution is a distribution in which the data are evenly distributed around the mean and the bars on each side of the middle bar are approximately the same height.

Symmetric Distribution
The mean and median are approximately equal in a symmetric frequency distribution. Therefore, the measures of center and spread that best describe a symmetric distribution are the mean and the standard deviation, respectively.
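As a quick check of this idea, the following Python sketch computes the mean, the median, and the standard deviation of a small symmetric data set. The values are made up for illustration, and the population standard deviation (`statistics.pstdev`) is used as an assumption; a course may use the sample standard deviation instead.

```python
import statistics

# Hypothetical symmetric data set: values mirror each other around 6.
symmetric = [4, 5, 5, 6, 6, 6, 7, 7, 8]

mean = statistics.mean(symmetric)
median = statistics.median(symmetric)
spread = statistics.pstdev(symmetric)  # population standard deviation

print("mean:", mean)
print("median:", median)
print("standard deviation:", round(spread, 2))
```

For this data set the mean and the median coincide, which is what a symmetric distribution suggests.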
Concept

Skewed Frequency Distribution

A skewed frequency distribution is a distribution in which the data is not spread evenly. Instead, most of the data is clustered at one end, and a tail extends toward the other end. In this case, the mean and the median are generally not equal. A skewed distribution is neither symmetric nor normal. In general, there are two types of skewed frequency distributions.

Skewed Distribution | Description
Skewed Left / Negatively Skewed | The distribution has a long left tail and the median is greater than the mean.
Skewed Right / Positively Skewed | The distribution has a long right tail and the median is less than the mean.
The difference between normal and skewed distributions can be visualized in the following applet.
normal and skewed distribution
The measures of center and spread that best describe a skewed distribution are the median and the five-number summary, respectively. The median is preferred because it is less affected by outliers, while the mean will fall in the direction of the tail of the distribution.
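The effect of a long tail on the mean can be illustrated with a short Python sketch. The data set below is hypothetical and was chosen so that one large value forms a right tail; the comparison shows how much each measure of center moves when that value is removed.

```python
import statistics

# Hypothetical right-skewed data set: most values are small, one is very large.
skewed_right = [2, 3, 3, 4, 4, 5, 6, 9, 15, 29]

mean = statistics.mean(skewed_right)
median = statistics.median(skewed_right)

# Remove the largest value and recompute both measures.
without_outlier = skewed_right[:-1]
mean_shift = mean - statistics.mean(without_outlier)
median_shift = median - statistics.median(without_outlier)

print("mean:", mean, "median:", median)
print("change in mean:", round(mean_shift, 2))
print("change in median:", median_shift)
```

In this example the mean lies well to the right of the median, and removing the largest value changes the mean far more than it changes the median.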
Example

Finding the Distribution of Runs Scored in Cricket

Tadeo and Ramsha are having fun learning about frequency distributions. Last weekend, two of the most exciting cricket games this season took place. Tadeo and Ramsha recorded the runs scored by the players in each match. The data for Games 1 and 2 are shown in the table.

Game 1 Game 2
a Observe the following histograms.
Four possible histograms. Only one correctly represents the data.
Which histogram represents the data of Game 1?
b Which of the following statements are true about the data of Game 1?
c Consider the following histograms.
Four possible histograms. Only one correctly represents the data.
Which histogram represents the data of Game 2?
d Which of the following statements are true about the data of Game 2?

Hint

a Begin by making a frequency table of the data.
b Observe where the tail of the distribution extends.
c Make a histogram of the data using a frequency table.
d Which measures of center and variation should be used when the distribution of the data is skewed?

Solution

a Notice that the given histograms consist of seven intervals. Considering this information, create a frequency table with seven intervals to display the data for Game 1.
Cricket Runs Scored in Game 1
Number of Runs Scored Frequency
The data can be displayed in a histogram by using this frequency table. The horizontal axis will be the Number of Runs Scored and the vertical axis the Frequency. Then the bars will be drawn to represent the frequency of each interval.
Histogram of the data of Game 1
Note that this corresponds to option A.
b The tail of the histogram extends to the right and most of the data is on the left. Therefore, the distribution of the data is skewed right. In a skewed distribution, the median and five-number summary best describe the center and variation of the data, respectively. As such, two statements apply to the data set of Game 1.
  • The data is skewed right.
  • The median and the five-number summary best describe the data.
c As in Part A, to identify which histogram describes the data of Game 2, a frequency table of the data will be created, this time using eight intervals.
Cricket Runs Scored in Game 2
Number of Runs Scored Frequency
Using this table, the histogram of the data can now be created.
Histogram of the data of Game 2
Note that this corresponds to option D.
d In this case, the tail of the histogram extends to the left and most of the data is on the right. Therefore, the data is skewed left. Additionally, the median and five-number summary best describe the center and variation of the data, respectively.
  • The data is skewed left.
  • The median and the five-number summary best describe the data.
Example

Analyzing the Retirement Ages of NFL Players

Tadeo and Ramsha are amazed by how data is presented everywhere and how knowing the distribution of the data helps to interpret that data. Besides cricket, they also love watching NFL games and are fans of Peyton Manning, the famous quarterback who retired at age Now they want to analyze the retirement ages of NFL players by collecting some data.

Retirement Age of NFL Players
Age Frequency

Based on this frequency table, Tadeo and Ramsha created a histogram in order to draw some conclusions about the data.

a Consider the following histograms.
Four possible histograms. Only one correctly represents the data.
Which of these histograms could represent the given data set?
b Which measures of center and variation best represent the data?
c Which of the following statements is most likely true about the retirement age of NFL players?

Hint

a Use the frequency table to display the data in a histogram.
b Which measures of center and variation best describe a symmetric distribution?
c Which interval has the highest bar?

Solution

a The data can be displayed in a histogram by using the given frequency table. The vertical axis will be the Frequency and the horizontal axis the Age. Next, bars will be plotted to represent the frequency of data points falling in each interval.
Histogram of the retirement ages of NFL players
Note that this histogram corresponds to option B.
b The data on the left side of the distribution is nearly a mirror image of the data on the right side of the distribution, which means that the distribution is symmetric.
A symmetric histogram displaying the retirement ages of NFL players. The horizontal axis represents the players' ages in two-year intervals, ranging from 23-24 to 43-44. The vertical axis shows the frequency, ranging from 0 to 140 players, with increments of 20. A vertical line marks the center of the distribution, highlighting its symmetry.
In a symmetric distribution, the mean and standard deviation best describe the center and variation of the data.
c In a symmetric frequency distribution, data are distributed evenly around the mean and the bars on each side of the middle bar are about the same height.
Histogram of the retirement ages of NFL players

It can be seen that the mean of the distribution is in the interval of This means that a typical NFL player is much more likely to retire at around age or

Discussion

Uniform and Bimodal Distributions

There are special distributions that are less common than skewed and symmetric distributions. These distributions may appear in situations such as an experiment where each event has the same probability or a sample taken from two separate populations. These are the uniform and bimodal distributions.

Concept

Uniform Frequency Distribution

A uniform frequency distribution, sometimes called a flat distribution, is a type of distribution where all the bars are about the same height. This type of distribution arises in scenarios where all the possible outcomes are equally likely. A uniform distribution is also symmetric.

Uniform Distribution
As an example, the possible outcomes of rolling a fair six-sided die are 1, 2, 3, 4, 5, and 6, and they each have an equal probability of occurring. The following applet simulates rolling a die 100 times and records the frequency of each outcome.
Simulation of rolling a die 100 times
It should be noted that even though each outcome is theoretically equally likely, the frequencies of the outcomes can actually be unequal when collecting data from a real experiment.
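A die-rolling experiment of this kind can also be simulated with a short Python sketch. The function name `roll_die` and the seed value are assumptions made for illustration; changing the seed gives a different set of frequencies.

```python
import random
from collections import Counter

def roll_die(times, seed=None):
    """Simulate rolling a fair six-sided die and count each outcome."""
    rng = random.Random(seed)  # a seed makes the experiment repeatable
    rolls = Counter(rng.randint(1, 6) for _ in range(times))
    # Report every face, even one that never came up.
    return {face: rolls.get(face, 0) for face in range(1, 7)}

frequencies = roll_die(100, seed=1)

# Text histogram: one '#' per roll that showed the given face.
for face, count in frequencies.items():
    print(face, "#" * count)
```

The bars are usually similar in height but rarely identical, which matches the remark that experimental frequencies can differ from the theoretical ones.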
Concept

Bimodal Distribution

A bimodal distribution is a data distribution with a range of values near two individual values or two intervals, separating the data into two clusters. This causes the histogram of the data to have two peaks. The mean and the median of a bimodal distribution are near the center of the distribution.

bimodal distribution histogram

The given distribution indicates that the sampling was likely made from two different populations. The term bimodal refers to the peaks of the distribution, which differs from the mode when intervals are used to make the data display. It is worth mentioning that a bimodal distribution whose bars are about the same height on each side of the peaks is also symmetric.

Example

Consider a histogram that shows the attendance per hour at a local restaurant.

Restaurant Attendance
The peaks represent typical lunch and dinner hours. Since the histogram has two distinct peaks, it has a bimodal distribution. Traffic patterns, heights, and test scores are other examples that can show a bimodal distribution. Although histograms are mainly used to show bimodality, other representations such as dot plots and stem-and-leaf plots can also show bimodality.
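Counting the peaks of a histogram can be sketched in Python as follows. The rule used here, that a peak is a bar strictly taller than both of its neighbors, is a simplifying assumption, and the hourly attendance values are hypothetical. With noisy data, small bumps would also be counted, so the result should be read together with the graph.

```python
def count_peaks(frequencies):
    """Count bars that are strictly taller than both neighboring bars."""
    # Pad with zeros so the first and last bars also have two neighbors.
    padded = [0] + list(frequencies) + [0]
    return sum(
        1
        for i in range(1, len(padded) - 1)
        if padded[i - 1] < padded[i] > padded[i + 1]
    )

# Hypothetical attendance per hour at a restaurant.
attendance = [5, 12, 30, 14, 8, 10, 28, 35, 16, 6]
peaks = count_peaks(attendance)
print("number of peaks:", peaks)
```

Two peaks suggest a bimodal distribution, while a single peak suggests a unimodal one.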
Example

Frequency Distributions of Data From Different Sources

Tadeo and Ramsha want to explore different types of data sets to see what kind of distributions they can draw. The first data set was collected from an experiment by spinning a spinner with six equal sections 1000 times.
A spinner with six equal sections colored red, blue, yellow, pink, green, and orange.
Ramsha recorded the results of the experiment in a frequency table.
Color Frequency
Red
Blue
Yellow
Pink
Green
Orange

Tadeo collected the other data set from a survey about the exam scores of their classmates.

Exam Score Frequency

They now have two data sets and would like to display the data in histograms to analyze them.

a Consider the following histograms.
Four possible histograms. Only one correctly represents the data.
Select the histogram that represents the data set for the spinner experiment.
b Select the option(s) that best describe the distribution of the data of the spinner.
c Consider the following histograms representing test scores.
Four possible histograms. Only one correctly represents the data.
Which histogram represents the given data for the exam scores?
d Select the option(s) that best describe the distribution of the data about the exam scores.

Hint

a Begin by drawing a histogram of the data.
b How different are the bars of the histogram of the data?
c Use the frequency table to draw a histogram of the data.
d How many peaks does the histogram have? Draw a vertical line through the middle of the distribution.

Solution

a The data set of the spinner experiment can be displayed in a histogram. Label the horizontal axis Color and the vertical axis Frequency. Then draw the bars to represent the frequency of each color outcome.
Histogram of the results obtained from spinning a spinner 1000 times.
Notice that this corresponds to option B.
b Looking at the histogram, it can be seen that the bars are all approximately the same height.
Comparing the heights of the bars of the frequency histogram
This means that the data has a uniform distribution. Moreover, a uniform distribution is also symmetric. Therefore, it can be said that the data follows a symmetric and uniform distribution.
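The phrase "approximately the same height" can be made concrete with a small Python sketch. The counts below are hypothetical, since they are not the frequencies from the table, and the 10% tolerance is an arbitrary assumption chosen for illustration.

```python
def is_roughly_uniform(frequencies, tolerance=0.10):
    """Check whether every bar is within a given fraction of the average height."""
    average = sum(frequencies) / len(frequencies)
    return all(abs(f - average) <= tolerance * average for f in frequencies)

# Hypothetical counts for six colors after 1000 spins.
spinner_counts = [168, 171, 160, 165, 170, 166]

print(is_roughly_uniform(spinner_counts))   # bars are close to the average
print(is_roughly_uniform([5, 12, 30, 14]))  # bars differ a lot
```

A stricter or looser tolerance would change the verdict for borderline data, so the check supports, rather than replaces, looking at the histogram.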
c Similarly to Part A, the data modeling the class's exam scores can be displayed in a histogram. The horizontal axis will be the Test Score and the vertical axis will be the Frequency.
Histogram of the exam scores in a classroom.
The histogram of the exam scores has been drawn. It corresponds to the graph given in option C.
d Notice that the histogram has two peaks. Additionally, these two peaks split data into two clusters. This means that the data follows a bimodal distribution. Moreover, suppose a vertical line is drawn around the halfway line of the distribution. The data on the left is an approximate mirror image of the data on the right.
A histogram that shows the exam scores in a classroom, displaying a bimodal distribution with two distinct peaks. The vertical axis represents the frequency of scores, ranging from 0 to 12 in increments of 2. The horizontal axis shows the score ranges, from 70-71 to 94-95. A central line highlights the presence of two similar-sized clusters of scores.

This means that the distribution is bimodal and symmetric. A possible explanation for this data is that it comes from two groups, one group of students who did not study for the exam (the first peak on the left) and one group that did study for the exam (the second peak on the right).
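The mirror-image test can be sketched in Python by comparing each bar with the bar in the mirrored position. The frequencies below are hypothetical, not the class data, and the tolerance of 15% of the tallest bar is an assumption made for illustration.

```python
def is_roughly_symmetric(frequencies, tolerance=0.15):
    """Compare each bar with its mirror image across the middle of the histogram."""
    tallest = max(frequencies)
    mirrored = list(reversed(frequencies))
    return all(
        abs(left - right) <= tolerance * tallest
        for left, right in zip(frequencies, mirrored)
    )

# Hypothetical exam score frequencies with two similar clusters.
exam_frequencies = [2, 6, 11, 7, 3, 2, 3, 8, 10, 6, 2]

print(is_roughly_symmetric(exam_frequencies))  # two clusters mirror each other
print(is_roughly_symmetric([1, 2, 4, 9, 12]))  # data piled up on the right
```

The check only looks at bar heights, so it treats a bimodal histogram with two similar clusters as symmetric, in line with the reasoning of this solution.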

Discussion

Box-and-Whisker Plots as Distributions

A box plot is another data display that allows one to see the shape of a frequency distribution. The length of the whiskers and the position of the median tell whether the distribution is skewed or symmetric.

  • Skewed Left: The left whisker is longer than the right and the median is closer to the right whisker.
  • Symmetric Distribution: The left whisker is about the same length as the right. The plot to the left of the median is an approximate mirror image of the plot on the right.
  • Skewed Right: The right whisker is longer than the left and the median is closer to the left whisker.
The following applet shows each of these three frequency distributions using box plots.
normal and skewed distribution using box plots
Keep in mind that to make a box plot, the five-number summary of the data must first be found. This means that box plots can only represent quantitative data.
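One way to compute a five-number summary is sketched below in Python. It assumes the quartiles are found as the medians of the lower and upper halves of the ordered data, leaving out the middle value when the number of data points is odd. Other conventions exist and can give slightly different quartiles. The ages are hypothetical.

```python
import statistics

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) using medians of the halves."""
    data = sorted(values)
    half = len(data) // 2
    lower = data[:half]
    # For an odd number of values, the middle value belongs to neither half.
    upper = data[half:] if len(data) % 2 == 0 else data[half + 1:]
    return (
        data[0],
        statistics.median(lower),
        statistics.median(data),
        statistics.median(upper),
        data[-1],
    )

# Hypothetical ages (not the lesson's data).
ages = [34, 8, 21, 45, 52, 29, 61, 38, 47, 55]
summary = five_number_summary(ages)
print(summary)
```

Because there are ten values, the median is the mean of the two middle values of the ordered list, and each quartile is the middle value of a half with five values.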
Example

Analyzing the Ages of Customers at an Italian Restaurant

During their fantastic journey exploring the restaurants in their neighborhood, Tadeo and Ramsha found a fabulous Italian restaurant. While eating their food, they observed the people who entered the restaurant.


The two are curious about the average age of people eating at this restaurant. Therefore, they decide to collect data on the ages of people who enter the restaurant during a typical day.

Ages of People Who Enter the Italian Restaurant on a Typical Day

They now want to draw some conclusions from this data set by displaying it in a box plot.

a Find the five-number summary and match each description with its corresponding value.
b Consider the following box plots.
Four box plots that possibly represent the given data set.
Which of the box plots represents the given data set?
c Which measures of center and variation best represent the data?
d Which of the following statements is most likely true about the people who enter the Italian restaurant?

Hint

a Begin by ordering the data values from least to greatest.
b Use the five-number summary to make the box plot.
c Decide whether the data is skewed or symmetric.
d Each whisker represents 25% of the data. The box represents 50% of the data.

Solution

a To find the five-number summary of the data set, begin by ordering the data values from least to greatest.
Notice that the minimum value is and the maximum value is Additionally, the number of data points is an even number. Therefore, the median of the data set will be given by the mean of the middle numbers and
Finally, the first quartile, or the median of the lower half, is and the third quartile is The following table summarizes this information. Each description is matched with its corresponding value.
Five Number Summary
Minimum Value
First Quartile
Median
Third Quartile
Maximum Value
b The five-number summary found in Part A can be used to draw the box plot of the data set. Start by drawing a number line that includes the minimum and maximum values of the data. Next, graph points above the number line for the five-number summary.
Points above a number line representing the five-number summary of the data set.

Now, draw a box from the first quartile to the third quartile. Then, draw a line through the median and the whiskers from the box to the minimum and maximum values.

The box plot of the data about the ages of attendants of an Italian Restaurant.

Notice that this plot corresponds to option A.

c In the box plot drawn in the previous part, notice that the left whisker is longer than the right and that the median is closer to the right whisker than it is to the left.
Comparison of the distance of the median from the whiskers.
This means that the data is skewed left. In a skewed distribution, the median and the five-number summary best describe the center and spread of the data, respectively.
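The whisker comparison can be sketched as a small Python function. It is a simplification that looks only at the whisker lengths, not at the position of the median, and the 20% tolerance for calling two whiskers "about the same length" is an arbitrary assumption. The five-number summaries used below are hypothetical.

```python
def box_plot_shape(minimum, q1, median, q3, maximum, tolerance=0.2):
    """Classify a distribution by comparing the whisker lengths of its box plot."""
    left = q1 - minimum    # length of the left whisker
    right = maximum - q3   # length of the right whisker
    longest = max(left, right)
    if longest == 0 or abs(left - right) <= tolerance * longest:
        return "symmetric"
    return "skewed left" if left > right else "skewed right"

# Hypothetical five-number summaries.
print(box_plot_shape(8, 29, 41.5, 52, 61))   # long left whisker
print(box_plot_shape(10, 20, 30, 40, 50))    # whiskers of equal length
print(box_plot_shape(5, 8, 10, 20, 60))      # long right whisker
```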
d In a box plot, each whisker represents 25% of the data and the box represents the middle 50% of the data. With this in mind, the following facts about the distribution of the data set can be determined.
  • of the people who enter the Italian restaurant in a regular day are between and years old.
  • of the people who enter the Italian restaurant in a regular day are between and years old.
  • of the people who enter the Italian restaurant in a regular day are between and years old.

Therefore, the option that states that of the people who enter the Italian restaurant in a regular day are between and years old is the right one.

Example

Comparing Data From Two Groups

In addition to exploring restaurants, Tadeo and Ramsha spend a lot of time playing video games together. While hanging out one weekend, they discussed whether boys or girls spend more time on video games on weekends. To investigate this situation, the two collected some data from their classmates at North High School and displayed it in a double histogram.
Double histogram of hours spent playing video games on weekends by gender.
a Select the statements that are true about the data set of responses from the girls.
b Which of the following statements are true about the data set of responses from the boys?
c If one student is randomly selected from each group, which is more likely to spend more time on video games on weekends?

Hint

a Begin by identifying the distribution shape of the data.
b Identify the shape of the distribution.
c Use the distribution of each data set to identify a typical value.

Solution

a The shape of the distribution will be described to determine which of the given statements are true about the data set showing how much time the girls spend playing video games.
Histogram of hours spent playing video games on weekends by girls.
Notice that the left side of the distribution is almost a mirror image of the data on the right side of the distribution. This means that the distribution is symmetric and that the mean and standard deviation best describe the center and variation of the data.
  • The distribution is symmetric.
  • The mean and standard deviation best describe the center and variation of the data.

This gives the two correct statements about the girls' data set.

b Now, consider the histogram of the data set that represents the responses given by the boys.
Histogram of hours spent playing video games on weekends by boys.
Notice that in this case, the tail of the distribution extends to the left and that most of the data is on the right side of the histogram. Therefore, the distribution of the data is skewed left, so the five-number summary best describes the center and spread of the data.
  • The distribution is skewed left.
  • The five-number summary best describes the center and spread of the data.
c It was previously identified that the mean best describes the center of the data set for the girls and the median for the data set for the boys.
Appropriate Measures of Center
Girls' Data Set Boys' Data Set
Mean Median

To identify who is more likely to spend more time on video games, compare these measures of center. Since the girls' distribution is symmetric, its mean is probably in the interval, the center of the distribution.

Mean of the girls' data set.

Conversely, because the median of a skewed distribution lies closer to the peak of the distribution than the mean does, it is probably in the interval.

A skewed-left histogram displays the frequency of weekend hours spent by boys playing video games, with the vertical axis showing the frequency and the horizontal axis showing the hours.
Comparing these values, notice that the median of the boys is greater than the mean of the girls.
This means it is more likely that a boy spends more time playing video games on the weekends than a girl does.
Example

Using Box Plots to Interpret Results From a Survey

Ramsha and Tadeo took a survey of their classmates about the number of hours spent playing video games on the weekends. These results made Ramsha more interested in finding other everyday situations with remarkable differences due to gender. She decided to search the web for similar examples.

A young woman with medium skin tone is looking impressed while reviewing a survey on a computer screen. A thought bubble above her head indicates she is finding the results surprising or insightful.

During her investigation, Ramsha found a peculiar table showing the results of a survey comparing the amount of money that men and women usually spend on clothes per month.

Women Men
Survey Size
Minimum
Maximum
First Quartile
Median
Third Quartile
Mean
Standard Deviation
a Consider the following double box plots.
Four box plots that possibly represent the given data set.
Which of the given double box plots shows the results of the survey Ramsha found online?
b Which statement is true about the given data?
c How many of the women surveyed are expected to spend between and on clothes per month?

Hint

a Use the five-number summary of each data set to make a double box plot.
b Begin by identifying each data set distribution.
c Each whisker represents 25% of the data. The box represents 50% of the data.

Solution

a This data can be represented with a double box plot to identify which of the given graphs accurately represents the data set. First, draw a number line that includes the minimum and maximum values of each gender's data set. Next, plot points above the number line for the given values of the five-number summary.
The five-number summary of both data sets plotted above a number line.

Next, draw the box for each plot using the first and third quartiles. Finally, draw a line through the median and the whiskers from the box to the minimum and maximum values of each data set.

A double box plot showing the distribution of monthly spending for two groups: women and men. The number line below the chart shows the range of spending values. The five-number summary is presented from left to right as Minimum (Min), First Quartile (Q1), Median, Third Quartile (Q3), and Maximum (Max) for both genders.

Notice that this corresponds to the box plot in option D.

b In order to identify which of the given statements is correct, compare the center and spread of the data sets. Note that for the women's data set, the right whisker is longer than the left one and that the median is closer to the left whisker. This means that the data is skewed right, and the median best describes the center of the data.
Conversely, for the men's data set, the whiskers are approximately equal and the median falls in the middle of the box. Therefore, this data is modeled by a symmetric distribution, and the mean best describes the center of the data.
Notice that the median amount of money spent by women on clothes each month is almost twice the mean amount of money spent on clothes by men. Recall that the interquartile range of a data set is the difference between the third and first quartiles. Using this information, compare the standard deviation and interquartile range of the data sets.
Standard Deviation Interquartile Range
Women
Men

Both the standard deviation and the interquartile range are greater for women. This means that there is more variability in the amount of money spent by women.

c To calculate how many of the women surveyed are expected to spend between and on clothes per month, consider the structure of a box plot. Each whisker represents 25% of the data, and the box represents the middle 50%. With this information in mind, the following statements are true.
  • of the women surveyed are expected to spend between and
  • of the women surveyed are expected to spend between and
  • of the women surveyed are expected to spend between and
This means that the of the survey size needs to be calculated in order to determine the number of women who are expected to spend between and on clothes. Recall that women participated in the survey.
Therefore, out of the women surveyed are expected to spend between and on clothes per month.
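The arithmetic behind this kind of estimate can be sketched in Python. The sketch assumes that each of the four regions between consecutive values of the five-number summary holds about 25% of the data. The survey size of 200 and the marker names are hypothetical and are not the values from the table.

```python
# Position of each five-number-summary marker, from left to right.
POSITIONS = {"min": 0, "q1": 1, "median": 2, "q3": 3, "max": 4}

def share_between(start, end):
    """Approximate share of the data between two markers of a box plot."""
    # Each region between consecutive markers holds about 25% of the data.
    return 0.25 * (POSITIONS[end] - POSITIONS[start])

def expected_count(survey_size, start, end):
    """Expected number of data points between two markers."""
    return round(survey_size * share_between(start, end))

# Hypothetical survey of 200 people.
print(share_between("q1", "max"))        # box plus right whisker
print(expected_count(200, "q1", "max"))  # people expected in that range
```

For example, the range from the first quartile to the maximum covers the box and the right whisker, which is about 75% of the data.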
Closure

Finding Insights and Drawing Conclusions From Histograms

All the pieces needed to analyze data using histograms have been covered. This method of displaying data makes it easier to find the data distribution and determine the best measures of center and variation to describe the data set. Recall the data Tadeo and Ramsha recorded about the main dishes of the restaurants in their neighborhood at the beginning of the lesson.

Average Main Dish Price (Dollars)

The two students wanted to draw some insights and conclusions based on this data. However, they were not entirely sure how to proceed with this task. Find the following information to help these curious connoisseurs!

a Consider the following histograms.
Four possible histograms. Only one correctly represents the data.
Which histogram best describes the data?
b Select the option that describes the distribution of the data.
c Which measures of center and variation will best describe the data?

Hint

a Begin by making a frequency table of the data.
b Is there a tail to the distribution? Which way does it extend?
c Which measures of center and variation should be used when the distribution of the data is skewed?

Solution

a Note that the given histograms consist of six intervals. With this in mind, make a frequency table using six intervals, starting with
Average Main Dish Price (Dollars)
Price Range Frequency
Next, use this frequency table to display the data in a histogram. The horizontal axis will be the Price Range and the vertical axis the Frequency. Then, draw the bars to represent the frequency of each interval.
A histogram showing the distribution of the average main dish price.
Notice that this corresponds to option C.
b It can be seen in the histogram that the tail of the distribution extends to the right and that most of the data is on the left.
Highlighting the tail of the distribution.
Therefore, the distribution is skewed right.
c Consider the appropriate measures of center and variation for a skewed and a symmetric distribution.
Distribution | Measure of Center | Measure of Variation
Symmetric | Mean | Standard deviation
Skewed | Median | Five-number summary

Because in this situation the distribution of the data is skewed, the median and the five-number summary best describe the center and variation of the data, respectively.

