Consider four different situations that contain a data set.
Situation 1 | A company wants to compare the monthly sales performance of its Product A over the course of a year. |
---|---|
Situation 2 | A survey is conducted to gather data on the distribution of ages among participants in a community event. |
Situation 3 | An analysis is conducted to compare the distribution of exam scores in a class. |
Situation 4 | A study tracks the temperature variations over the course of a week in a particular city. |
Tadeo bought a new scientific magazine, Imagineer, which includes a lot of different graphs and math-related discussions on real-life topics. The first article discussed the results of a recent math competition among high-school students.
The article illustrated the grades that the participants obtained on a scale from 0 to 10. Interpret the dot plot.
Range: 8
Step 1 | Determine the center (median) by finding the middle data point. |
---|---|
Step 2 | Find the maximum and minimum values on the graph. Use these values to calculate the spread (range) of the data. |
Step 3 | Analyze the overall shape of the graph. Note any other features of interest on the graph. |
Complete each step one at a time.
First, the total number of data points should be found. To do so, begin by counting all the dots in the given dot plot.
There are 15 data points. This means that 15 participants in the math competition obtained a grade. The median of the data set is, therefore, the 8th data point, as it divides the set into halves. Find its value on the dot plot.
The 8th data point is 6, which means that the median or the center of the data set is 6. This measure indicates the middle value of all the grades obtained by the participants.
Now, find the maximum and minimum values on the dot plot.
The minimum grade obtained by a participant is 2 and the maximum grade is 10. To find the range, calculate the difference between these values: 10 − 2 = 8. The range of the grades is 8.
The next step is to analyze the overall shape of the graph.
The overall shape of this graph appears to be the bell shape of a normal distribution, meaning the grades are overall normally distributed, and the plot is symmetric in shape.
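The center and spread found above can be checked with a short calculation. The following is a minimal Python sketch that uses only the facts read from the dot plot in the text (15 dots, a minimum of 2, and a maximum of 10). The variable names are illustrative and not part of the lesson.

```python
# Facts read from the dot plot, as stated in the text above.
n_points = 15        # total number of dots
minimum_grade = 2
maximum_grade = 10

# For an odd number of ordered data points, the median is the value
# at position (n + 1) / 2 when counting from the smallest value.
median_position = (n_points + 1) // 2

# The range is the difference between the largest and smallest values.
grade_range = maximum_grade - minimum_grade

print(median_position)  # 8
print(grade_range)      # 8
```

The position only says which dot to look at. The value of the median still has to be read from the plot itself.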
Step 1 | Draw a horizontal line to begin the dot plot. Title the dot plot based on the problem, and label the plot with the categories/numbers. When labeling the line with numbers, the numbers must be sequential and in a consecutive order. |
---|---|
Step 2 | Determine the frequency for each piece of data provided in the problem. |
Step 3 | Place dots over each category or number on the horizontal line that corresponds to the frequency for each piece of data as depicted in the table. |
Complete each step one at a time.
Grades From Last Year.
The next step is to determine the frequency of each grade obtained by the participants from last year's math competition. To do so, count how many times each grade from 1 to 9 appears and write that number in a table next to the grade.
Grade | Frequency |
---|---|
1 | 1 |
2 | 0 |
3 | 1 |
4 | 2 |
5 | 0 |
6 | 3 |
7 | 3 |
8 | 2 |
9 | 1 |
Lastly, place dots over each grade from 1 to 9 the number of times it appears in the data set. Use the values from the frequency table.
This way the dot plot for the given data set of values was formed.
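The counting step can also be sketched in code. The example below is an illustration in Python, assuming a list of grades reconstructed from the frequency table above (the order of the grades in the list is an assumption). The helper names `frequency_table` and `text_dot_plot` are hypothetical, and the plot is drawn sideways, with one row per grade, to keep the code short.

```python
from collections import Counter

# Grades reconstructed from the frequency table; the ordering is assumed.
grades = [1, 3, 4, 4, 6, 6, 6, 7, 7, 7, 8, 8, 9]

def frequency_table(data, low, high):
    """Count how often each whole number from low to high appears."""
    counts = Counter(data)
    return {value: counts.get(value, 0) for value in range(low, high + 1)}

def text_dot_plot(table):
    """Draw one row per value, with one 'o' for each occurrence."""
    return "\n".join(f"{value:>2} | {'o' * count}" for value, count in table.items())

table = frequency_table(grades, 1, 9)
print(text_dot_plot(table))
```

Grades with a frequency of 0, such as 2 and 5, still get a row so that the numbers on the axis stay consecutive.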
A frequency distribution, sometimes called a histogram distribution, is a representation that displays the number of observations within a given interval. It is used to show the empirical or theoretical frequency of occurrence of each possible value in a data set, often recorded in a frequency table. Frequency distributions of categorical data are typically presented using a bar graph.
The most common types of distributions are symmetric frequency distribution and skewed frequency distribution.
A histogram is a graphical illustration of a frequency distribution of a data set that contains numerical data. Histograms have several defining characteristics.
The next step is to make a frequency table showing how many data points lie in each interval.
Interval | Data Points | Frequency |
---|---|---|
1−10 | 4, 8 | 2 |
11−20 | 11, 11, 13, 15, 17, 19 | 6 |
21−30 | 21, 25, 26 | 3 |
31−40 | 37 | 1 |
From the frequency table, the histogram can be constructed by drawing a bar over each interval with a height corresponding to the found frequency.
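Sorting the data points into intervals can be sketched as follows. This is a minimal Python example that uses the data points from the frequency table above and assumes whole-number data with equal-width intervals. The function name `bin_counts` and its parameters are hypothetical.

```python
def bin_counts(data, start, width, n_bins):
    """Count how many whole-number data points fall in each interval.

    The first interval covers start to start + width - 1, the next one
    covers the following `width` numbers, and so on.
    """
    counts = [0] * n_bins
    for value in data:
        index = (value - start) // width
        if 0 <= index < n_bins:
            counts[index] += 1
    return counts

data = [4, 8, 11, 11, 13, 15, 17, 19, 21, 25, 26, 37]
counts = bin_counts(data, start=1, width=10, n_bins=4)

# Text version of the histogram: one row per interval.
for i, count in enumerate(counts):
    low = 1 + 10 * i
    print(f"{low}-{low + 9}: {'#' * count}")
```

Each row of `#` symbols plays the role of a bar, so the longest row marks the interval with the highest frequency.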
Another article in Imagineer focuses on the upcoming opening of a new screen room in the local cinema Movieton.
The theater has had 2 screens for the last 10 years. The bar graph in the article illustrates the distribution of ticket sales for a fiscal week in the year 2023.
Interpret the bar graph by describing its shape, center, and any extreme values if they exist. Use the bar graph to determine what day tends to have the most ticket sales, and what the average amount of ticket sales is on that day.
Dependent Variable: The number of tickets sold
Distribution: Left-Skewed
Step 1 | Identify the independent and dependent variable. |
---|---|
Step 2 | List the frequency in each bin. |
Step 3 | Interpret the data and describe the bar graph's shape. Use the interpretation to answer any questions about the data. |
Complete each step one at a time.
First, the independent and dependent variables need to be identified. The horizontal line lists days of a week while the vertical line represents the number of tickets sold.
The article analyzes how many tickets are sold on different days of a week and which day has the most sales. This means that the independent variable is the days of the week and the dependent variable is the number of tickets sold on each day.
Next, the frequency in each bin should be listed and interpreted. Use the values given in the bar graph to indicate the height of each bin. Remember that the vertical line represents the number of tickets sold on each day.
Day | Frequency |
---|---|
Monday | 64 tickets were sold. |
Tuesday | 70 tickets were sold. |
Wednesday | 62 tickets were sold. |
Thursday | 137 tickets were sold. |
Friday | 295 tickets were sold. |
Saturday | 342 tickets were sold. |
Sunday | 260 tickets were sold. |
Lastly, interpret the data and describe the bar graph's shape. The bar graph shows that the distribution of ticket sales is left-skewed.
Friday and Saturday are the days with the most tickets sold, 295 and 342 respectively. The largest number of tickets tends to be sold on Saturday, when 342 tickets were sold.
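The reading of the bar graph can be double-checked with a few lines of code. The sketch below, written in Python, uses the ticket numbers listed in the table above. The share of sales from Friday to Sunday is an extra illustration, not a step from the lesson, that shows why most of the data sits on the right side of the graph.

```python
ticket_sales = {
    "Monday": 64,
    "Tuesday": 70,
    "Wednesday": 62,
    "Thursday": 137,
    "Friday": 295,
    "Saturday": 342,
    "Sunday": 260,
}

# The day whose bar is the tallest.
busiest_day = max(ticket_sales, key=ticket_sales.get)

# Share of the weekly sales that falls on Friday, Saturday, and Sunday.
total = sum(ticket_sales.values())
late_week = sum(ticket_sales[day] for day in ("Friday", "Saturday", "Sunday"))
late_week_share = round(late_week / total, 2)

print(busiest_day, ticket_sales[busiest_day])  # Saturday 342
print(late_week_share)                         # 0.73
```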
Step 1 | Choose the number of intervals |
---|---|
Step 2 | Determine the size of the intervals |
Step 3 | Make a frequency table |
Step 4 | Draw a histogram |
Complete each step one at a time.
The next step is to make a frequency table showing how many data points lie in each interval.
Interval | Data Points | Frequency |
---|---|---|
1−20 | 7, 9, 11, 13, 15, 17, 18, 19, 20 | 9 |
21−40 | 21, 22, 23, 24, 24, 24, 25, 26, 27, 28, 30, 32, 35, 37, 39 | 15 |
41−60 | 42, 45, 49, 55, 60 | 5 |
61−80 | 70 | 1 |
From the frequency table, the histogram can be constructed by drawing a bar over each interval with a height corresponding to the found frequency.
A box plot or box and whisker plot can be used to illustrate the distribution of a data set. A box plot has three parts.
A box plot is a scaled figure, usually presented above a number line. The set of numbers used to draw the box plot is called the five-number summary of the data set. Each of the five numbers is labeled accordingly.
A box plot provides a visual illustration of the distribution of a data set. Each segment of the chart contains one quarter, or 25% of the data, and the center 50% of the data lies inside the box. The further apart the segments are, the greater the spread is for that quarter of the data.
The first and third quartiles are marked as the left and right sides of the box plot. The box plot can be completed by drawing a box between the quartiles.
The next article in Imagineer focused on an analysis of air quality data collected from different urban and rural areas around the world. It included a box plot to visualize variations in pollution levels.
The box plot presented in the article is based on the scores from 0 to 100, where 100 is the greatest level of pollution, given to different areas by experts according to various factors of air pollution. Interpret the box plot.
Maximum: 98
Median: 63.5
First Quartile: 39
Now, consider the given box plot.
By comparing the general box plot with this one, the minimum, maximum, median, and first and third quartiles can be determined.
Concept | Value | Meaning |
---|---|---|
Minimum | 24 | The least level of pollution in the analyzed areas is 24 out of 100. |
Maximum | 98 | The greatest level of pollution in the analyzed areas is 98 out of 100. |
Median | 63.5 | Half of the analyzed areas have a level of pollution of 63.5 or less out of 100. |
First Quartile | 39 | 25% of the analyzed areas have the level of pollution at 39 or less. |
Third Quartile | 83 | 25% of the analyzed areas have the level of pollution at 83 or more. |
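The five values in the table also give two measures of spread. The following Python sketch computes the range and the interquartile range from the values read off the box plot. The dictionary keys are illustrative names.

```python
# Five-number summary read from the box plot in the article.
summary = {
    "minimum": 24,
    "first_quartile": 39,
    "median": 63.5,
    "third_quartile": 83,
    "maximum": 98,
}

# Range: spread of the whole data set.
data_range = summary["maximum"] - summary["minimum"]

# Interquartile range: spread of the middle 50% of the data (the box).
interquartile_range = summary["third_quartile"] - summary["first_quartile"]

print(data_range)           # 74
print(interquartile_range)  # 44
```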
Step 1 | Order the data set from least to greatest value. Find the minimum and maximum. |
---|---|
Step 2 | Determine the median. |
Step 3 | Determine the first and third quartiles. |
Step 4 | Draw a box plot. |
Complete each step one at a time.
Next, mark the median as a vertical line segment in the range above the number line. Remember that the line for the median falls inside the box.
The first and third quartiles are marked as the left and right sides of the box plot. The box plot can be completed by drawing a box between the quartiles.
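The first three steps can be sketched in code. The example below is a minimal Python illustration that assumes one common convention for quartiles: the first and third quartiles are the medians of the lower and upper halves, and the median itself is left out of both halves when the number of data points is odd. The function name and the sample data are hypothetical and do not come from the lesson.

```python
from statistics import median

def five_number_summary(data):
    """Return the five values needed to draw a box plot.

    Assumes at least two data points. Quartiles are the medians of the
    lower and upper halves of the ordered data.
    """
    ordered = sorted(data)              # Step 1: order the data
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    # Skip the middle value when n is odd so both halves have equal size.
    upper = ordered[half + 1:] if n % 2 else ordered[half:]
    return {
        "minimum": ordered[0],
        "first_quartile": median(lower),   # Step 3
        "median": median(ordered),         # Step 2
        "third_quartile": median(upper),   # Step 3
        "maximum": ordered[-1],
    }

sample = [12, 5, 7, 2, 10, 4, 8]  # hypothetical data set
print(five_number_summary(sample))
```

Other quartile conventions exist, so software tools may report slightly different quartiles for the same data.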
The distribution of a data set shows the arrangement of data values. Here are a few concepts that can be used to describe a distribution.
Concept | Definition |
---|---|
Cluster | Data that is grouped closely together. |
Gap | The numbers that have no data value. |
Peak | The most frequently occurring values, or mode. |
Symmetry | The left side of the distribution looks like the right side. |
Consider a distribution displayed with the following dot plot.
Since the data is evenly distributed between the left and right side, it is a symmetric distribution. It has a cluster of several data values within the interval 5−9. There are gaps at 4 and 10 because there are no data values. The value 7 is a peak because it is the most frequently occurring value.
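These concepts can be checked mechanically. The sketch below, in Python, uses hypothetical frequencies chosen to match the description above (gaps at 4 and 10, a peak at 7), since the actual counts of the dot plot are not listed in the text. The function name `describe_distribution` is also an assumption.

```python
def describe_distribution(frequencies):
    """Find gaps, peaks, and mirror symmetry in a dot plot.

    `frequencies` maps each whole-number value to its number of dots.
    """
    low, high = min(frequencies), max(frequencies)
    axis = range(low, high + 1)
    counts = [frequencies.get(value, 0) for value in axis]
    top = max(counts)
    return {
        # Gap: a number on the axis with no data value.
        "gaps": [v for v, c in zip(axis, counts) if c == 0],
        # Peak: the most frequently occurring value or values.
        "peaks": [v for v, c in zip(axis, counts) if c == top],
        # Symmetry: the left side mirrors the right side exactly.
        "symmetric": counts == counts[::-1],
    }

# Hypothetical counts consistent with the description in the text.
dots = {3: 1, 5: 1, 6: 2, 7: 3, 8: 2, 9: 1, 11: 1}
print(describe_distribution(dots))
```

The symmetry test here is strict. Real data is usually called symmetric when the two sides look roughly alike, even if the counts do not match exactly.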
There are different measures of center and spread available for describing a data distribution. For example, measures of center are mean and median, while measures of spread are interquartile range and mean absolute deviation. To determine which measures to use, consider the following diagram.
Note that if there is an outlier in the data distribution, the distribution is usually not symmetric.
Tadeo got especially interested in the article about the Internet usage among teenagers. Curious to learn more, he decided to check out the website referenced in the article.
Describe the shape of a distribution. Choose the appropriate measures to describe the center and spread of the distribution.
The website also had a dot plot for the number of text messages sent by different teenagers in one day.
Choose the appropriate measures to describe the center and spread of the distribution. Describe the shape of the distribution.
Measure of Center: Mean
Measure of Center: Median
The data is evenly distributed between the left and right side, so it is a symmetric distribution. It only has one cluster of data values within the interval 1−8 and there are no gaps. The value 5 is a peak because it is the most frequently occurring value.
Next, decide which measures of center and spread are the most appropriate based on the symmetry of the distribution. Remember what measures can be used in this situation.
Symmetric Distribution? | Measure of Center | Measure of Spread |
---|---|---|
Yes | Mean | Mean absolute deviation |
No | Median | Interquartile range |
Since this distribution is symmetric, it is best to use the mean as a measure of center and the mean absolute deviation as the measure of spread.
Here, the left side of the data is different from the right side, so the distribution is not symmetric. Also, there are two clusters of data values within the intervals 17−20 and 22−24 separated by a gap at 21. The peak of the data set is at 23.
The most appropriate measures of center and spread can be determined by looking at the symmetry of the distribution. When the distribution is not symmetric, as in this case, it is best to use the median as a measure of center and the interquartile range as the measure of spread.
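The rule from the table, together with the less familiar of the two spread measures, can be sketched in Python. The function names and the sample data are hypothetical. The mean absolute deviation is computed as the average distance of the data values from the mean.

```python
from statistics import mean

def recommended_measures(is_symmetric):
    """Return (measure of center, measure of spread) as in the table above."""
    if is_symmetric:
        return ("mean", "mean absolute deviation")
    return ("median", "interquartile range")

def mean_absolute_deviation(data):
    """Average distance between each data value and the mean."""
    center = mean(data)
    return mean([abs(value - center) for value in data])

sample = [1, 2, 3, 4, 5]  # hypothetical symmetric data set

print(recommended_measures(True))       # ('mean', 'mean absolute deviation')
print(recommended_measures(False))      # ('median', 'interquartile range')
print(mean_absolute_deviation(sample))  # 1.2
```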
A line graph is used to show how a set of data changes over a period of time. To make a line graph, a scale and interval should be chosen. Then the pairs of data should be graphed and a line connecting the points should be drawn. Consider a table of values that represents the growth of a plant over several weeks.
Plant Growth | |||||
---|---|---|---|---|---|
Week | 1 | 2 | 3 | 4 | 5 |
Height (in) | 1.5 | 2.3 | 4 | 6.2 | 8 |
The height data includes values from 1.5 to 8, so a scale from 0 to 10 inches with an interval of 1 inch is reasonable. The horizontal axis can represent time in weeks and the vertical axis can represent the plant height in inches. Now, the points can be plotted on a coordinate plane and connected.
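The slant of each segment of the line graph corresponds to the change in height from one week to the next. As an illustration, the Python sketch below computes these changes from the plant growth table. The variable names are assumptions, and the changes are rounded to one decimal place to match the precision of the table.

```python
weeks = [1, 2, 3, 4, 5]
heights_in = [1.5, 2.3, 4, 6.2, 8]

# Change in height between each pair of consecutive weeks.
changes = [
    round(later - earlier, 1)
    for earlier, later in zip(heights_in, heights_in[1:])
]

# Every segment rising means an upward trend; every segment falling
# means a downward trend.
if all(change > 0 for change in changes):
    trend = "upward"
elif all(change < 0 for change in changes):
    trend = "downward"
else:
    trend = "mixed"

print(changes)  # [0.8, 1.7, 2.2, 1.8]
print(trend)    # upward
```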
By observing the upward and downward slant of lines connecting the points, the trends in the data can be described and future events can be predicted.
Make a line graph using the table of values.
Interpret the line graph. In about how many hours will Tadeo and his parents reach the grandparents?
Hour | Distance Traveled (mi) |
---|---|
1 | 70 |
2 | 135 |
3 | 203 |
4 | 278 |
5 | 348 |
The distance data includes values from 70 to 348, so a scale from 0 to 420 miles with an interval of 70 miles is reasonable. The horizontal axis can represent time in hours and the vertical axis can represent the distance traveled in miles. Now, the points can be plotted on a coordinate plane and connected.
This way the line graph representing the distance traveled by Tadeo's family to his grandparents was made.
Notice that the graph shows an upward slant of the line with a steady increase from hours 1 to 5. To predict in how many hours Tadeo will reach his grandparents, extend the graph to the point where the distance is 420 miles following the trend from the graph.
It can be predicted that Tadeo's family will reach their destination after about 6 hours of traveling.
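Extending the graph by eye can be approximated with a calculation. The Python sketch below assumes that the family keeps traveling at the average rate seen over the recorded hours. This is a simplification used for illustration and not the method of the lesson. The distance of 420 miles is taken from the text, and the variable names are hypothetical.

```python
hours = [1, 2, 3, 4, 5]
miles = [70, 135, 203, 278, 348]
target_miles = 420  # distance at which the graph is extended, from the text

# Average number of miles gained per hour between the first and last points.
average_rate = (miles[-1] - miles[0]) / (hours[-1] - hours[0])

# Continue from the last point at the same rate until the target is reached.
estimated_hours = hours[-1] + (target_miles - miles[-1]) / average_rate

print(average_rate)            # 69.5
print(round(estimated_hours))  # 6
```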
There are several different statistical displays that can be used to represent a data set. To determine which display to choose, the following facts can be considered.
Type of Display | Best Used to... |
---|---|
Bar Graph | ...show the number of items in specific categories. |
Box Plot | ...show measures of variation for a data set. |
Histogram | ...show frequency of data divided into equal intervals. |
Line Graph | ...show change over a period of time. |
Line Plot | ...show how many times each number occurs. |
Tadeo's grandparents are 83 and 85 years old. They told Tadeo numerous fascinating stories about their life. He was amazed that his grandparents got to live such long and wonderful lives. Later he started wondering about life expectancy in different countries, so he did a little investigation.
Country | Life Expectancy |
---|---|
United States | 76.3 |
Japan | 84.5 |
Germany | 80.9 |
Brazil | 77.3 |
China | 78.2 |
India | 68.3 |
Australia | 83.3 |
South Africa | 62.4 |
Select an appropriate display for the given data set. Justify the answer.
Construct the chosen display in Part A.
Type of Display | Best Used to |
---|---|
Bar Graph | It shows the number of items in specific categories. |
Box Plot | It shows measures of variation for a set of data. It is also useful for very large data sets. |
Histogram | It shows the frequency of data divided into equal intervals. |
Line Graph | It shows the change over a period of time. |
Line Plot | It shows how many times each number occurs. |
The given data set lists the countries and the corresponding life expectancy. After analyzing the available types of displays, a bar graph looks like the best choice because it compares one numerical value, the life expectancy, across separate categories, the countries.
Country | Life Expectancy |
---|---|
United States | 76.3 |
Japan | 84.5 |
Germany | 80.9 |
Brazil | 77.3 |
China | 78.2 |
India | 68.3 |
Australia | 83.3 |
South Africa | 62.4 |
The countries can be marked on the horizontal axis and the life expectancy can be marked on the vertical axis. Next, draw bars with the height equal to the life expectancy.
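A rough version of this bar graph can be produced with text. The following Python sketch uses the life expectancy values from the table and draws the bars sideways, one row per country. The function name `text_bar_chart` and the choice of one `#` per 2 years are assumptions made for illustration.

```python
life_expectancy = {
    "United States": 76.3,
    "Japan": 84.5,
    "Germany": 80.9,
    "Brazil": 77.3,
    "China": 78.2,
    "India": 68.3,
    "Australia": 83.3,
    "South Africa": 62.4,
}

def text_bar_chart(data, years_per_symbol=2):
    """Draw one row per category; bar length is proportional to the value."""
    width = max(len(name) for name in data)
    lines = []
    for name, value in data.items():
        bar = "#" * round(value / years_per_symbol)
        lines.append(f"{name:<{width}} | {bar} {value}")
    return "\n".join(lines)

print(text_bar_chart(life_expectancy))

# The longest and shortest bars.
highest = max(life_expectancy, key=life_expectancy.get)
lowest = min(life_expectancy, key=life_expectancy.get)
print(highest, lowest)  # Japan South Africa
```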
Recall the four situations mentioned earlier.
Situation 1 | A company wants to compare the monthly sales performance of its Product A over the course of a year. |
---|---|
Situation 2 | A survey is conducted to gather data on the distribution of ages among participants in a community event. |
Situation 3 | An analysis is conducted for the distribution of exam scores in a class. |
Situation 4 | A study tracks the temperature variations over the course of a week in a particular city. |
In the first situation, the monthly sales performance of Product A is compared. This means that the horizontal axis should show twelve months and the vertical axis should show the sales numbers for each month. This description fits the dot plot.
In the second situation, the data on the distribution of ages among participants in a community event is collected. This data can be illustrated by showing the ages in different intervals on the horizontal axis and the number of people in the corresponding age interval on the vertical axis. This situation can be presented by a histogram.
The third situation involves the analysis of exam scores distribution in a class. The two remaining statistical displays are a box plot and a line graph. Recall what they are best used for.
Type of Display | Best used to... |
---|---|
Box Plot | ...show measures of variation for a data set. |
Line Graph | ...show change over a period of time. |
Since the distribution of the exam scores should be illustrated and no period of time is involved, a box plot seems like the best fitting display for the situation.
The last situation includes tracking temperature variations over the course of a week. This data can be demonstrated with a line graph where the horizontal axis represents the days of the week and the vertical axis represents the temperature.