Tadeo and Ramsha are foodies and love to explore the restaurants that pop up in their neighborhood. They have recorded data about the average price of the main dishes in each restaurant using a table of values.
Average Main Dish Price (Dollars) | ||||
---|---|---|---|---|
10.12 | 9.29 | 8.29 | 9.78 | 10.69 |
9.68 | 12.09 | 8.94 | 10.81 | 8.62 |
11.39 | 12.62 | 8.71 | 10.74 | 10.52 |
10.77 | 10.15 | 9.18 | 8.45 | 9.52 |
11.89 | 9.77 | 9.44 | 13.24 | 11.01 |
10.62 | 9.38 | 12.15 | 9.68 | 9.60 |
10.32 | 11.31 | 11.41 | 8.62 | 9.27 |
10.96 | 9.18 | 10.28 | 10.71 | 10.02 |
They would like to draw some conclusions from this data. However, they are not entirely sure how to proceed with this task. Find the following information to help these curious connoisseurs!
A frequency distribution, sometimes called a histogram distribution, is a representation that displays the number of observations within a given interval. It is used to show the empirical or theoretical frequency of occurrence of each possible value in a data set, often recorded in a frequency table. Frequency distributions of categorical data are typically presented using a bar graph.
In the case of numerical data, the graphical representation of a frequency distribution is called a histogram.
Depending on how a data set is distributed, its histogram can have different shapes. The most common types of distributions are the symmetric frequency distribution and the skewed frequency distribution.
A symmetric frequency distribution is a distribution in which the data are evenly distributed around the mean and the bars on each side of the middle bar are approximately the same height.
The mean and median are approximately equal in a symmetric frequency distribution. Therefore, the measures of center and spread that best describe a symmetric distribution are the mean and the standard deviation, respectively.
A skewed frequency distribution is a distribution in which the data are not spread evenly; instead, they are clustered at one end. In this case, the mean and the median are not equal, because the values in the long tail pull the mean away from the median. A skewed distribution is neither symmetric nor normal. In general, there are two types of skewed frequency distributions.
Skewed Distribution | Description |
---|---|
Skewed Left / Negatively Skewed | The distribution has a long left tail and the median is greater than the mean. |
Skewed Right / Positively Skewed | The distribution has a long right tail and the median is less than the mean. |
Tadeo and Ramsha are having fun learning about frequency distributions. Last weekend, two of the most exciting cricket games this season took place. Tadeo and Ramsha recorded the runs scored by the 22 players in each match. The data for Games 1 and 2 are shown in the table.
Game 1 | Game 2 | ||||||
---|---|---|---|---|---|---|---|
32 | 21 | 27 | 46 | 114 | 87 | 96 | 92 |
9 | 16 | 19 | 19 | 101 | 111 | 80 | 106 |
40 | 28 | 42 | 36 | 85 | 112 | 117 | 94 |
11 | 38 | 23 | 28 | 62 | 43 | 106 | 66 |
8 | 18 | 26 | 59 | 104 | 51 | 76 | 91 |
62 | 40 | | | 111 | 78 | | |
Cricket Runs Scored in Game 1 | |
---|---|
Number of Runs Scored | Frequency |
0−9 | 2 |
10−19 | 5 |
20−29 | 6 |
30−39 | 3 |
40−49 | 4 |
50−59 | 1 |
60−69 | 1 |
To draw the histogram for Game 1, let the horizontal axis be the Number of Runs Scored and the vertical axis the Frequency. Then the bars will be drawn to represent the frequency of each interval.
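As a minimal sketch, the Game 1 frequency table can be rebuilt in Python by grouping each score into an interval of width 10. The list `game1` is transcribed from the data table on the assumption that the first four columns belong to Game 1; the helper name `frequency_table` is hypothetical.

```python
from collections import Counter

# Runs scored by the 22 players in Game 1 (transcribed from the data table).
game1 = [32, 21, 27, 46, 9, 16, 19, 19, 40, 28, 42,
         36, 11, 38, 23, 28, 8, 18, 26, 59, 62, 40]

def frequency_table(values, width=10):
    """Count how many values fall in each interval of the given width."""
    # Map every value to the lower bound of its interval, e.g. 27 -> 20.
    counts = Counter((v // width) * width for v in values)
    # Walk through all intervals so that empty ones still appear with 0.
    starts = range(min(counts), max(counts) + width, width)
    return {f"{s}-{s + width - 1}": counts.get(s, 0) for s in starts}

game1_table = frequency_table(game1)
for interval, freq in game1_table.items():
    print(interval, freq)
```

Each printed pair corresponds to one bar of the histogram: the interval gives the position on the horizontal axis and the count gives the height of the bar.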
Cricket Runs Scored in Game 2 | |
---|---|
Number of Runs Scored | Frequency |
40−49 | 1 |
50−59 | 1 |
60−69 | 2 |
70−79 | 2 |
80−89 | 3 |
90−99 | 4 |
100−109 | 4 |
110−119 | 5 |
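One rough way to check the direction of the skew without drawing anything is to compare the mean and the median of each game, as described in the table of skewed distributions. The sketch below assumes the first four columns of the data table are Game 1 and the last four are Game 2; `skew_direction` is a hypothetical helper, and the comparison is a rule of thumb rather than a formal test.

```python
from statistics import mean, median

game1 = [32, 21, 27, 46, 9, 16, 19, 19, 40, 28, 42,
         36, 11, 38, 23, 28, 8, 18, 26, 59, 62, 40]
game2 = [114, 87, 96, 92, 101, 111, 80, 106, 85, 112, 117,
         94, 62, 43, 106, 66, 104, 51, 76, 91, 111, 78]

def skew_direction(values):
    """Rule of thumb: the mean is pulled toward the long tail."""
    m, md = mean(values), median(values)
    if m > md:
        return "right"   # long right tail, median less than mean
    if m < md:
        return "left"    # long left tail, median greater than mean
    return "none"

for name, data in (("Game 1", game1), ("Game 2", game2)):
    print(name, round(mean(data), 2), median(data), skew_direction(data))
```

For these data the comparison suggests that Game 1 is skewed right and Game 2 is skewed left, which agrees with where the tall bars sit in the two frequency tables.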
Tadeo and Ramsha are amazed by how data is presented everywhere and how knowing the distribution of the data helps to interpret that data. Besides cricket, they also love watching NFL games and are fans of Peyton Manning, the famous quarterback who retired at age 40. Now they want to analyze the retirement ages of NFL players by collecting some data.
Retirement Age of NFL Players | |
---|---|
Age | Frequency |
25−26 | 33 |
27−28 | 67 |
29−30 | 93 |
31−32 | 109 |
33−34 | 127 |
35−36 | 114 |
37−38 | 80 |
39−40 | 59 |
41−42 | 43 |
Based on this frequency table, Tadeo and Ramsha created a histogram in order to draw some conclusions about the data.
The vertical axis will be the Frequency and the horizontal axis the Age. Next, bars will be plotted to represent the frequency of data points falling in each interval.
It can be seen that the mean of the distribution is in the interval of 33−34, which is also the tallest bar. This means that a typical NFL player is most likely to retire at around age 33 or 34.
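Because only the frequency table is available, the mean can only be estimated. A common approach, sketched below, is to treat every player in an interval as if they retired at the midpoint of that interval. The variable names are hypothetical, and the result is an approximation of the true mean.

```python
# Retirement-age intervals and their frequencies, copied from the table.
nfl = {(25, 26): 33, (27, 28): 67, (29, 30): 93, (31, 32): 109,
       (33, 34): 127, (35, 36): 114, (37, 38): 80, (39, 40): 59,
       (41, 42): 43}

total_players = sum(nfl.values())

# Weight each interval midpoint by the number of players in that interval.
weighted_sum = sum(((low + high) / 2) * freq for (low, high), freq in nfl.items())
grouped_mean = weighted_sum / total_players

# The interval with the highest frequency is the tallest bar of the histogram.
modal_interval = max(nfl, key=nfl.get)

print(total_players, round(grouped_mean, 2), modal_interval)
```

The estimated mean is about 33.5, which falls inside the 33−34 interval.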
There are special distributions that are less common than skewed and symmetric distributions. These distributions may appear in situations such as an experiment where each outcome has the same probability or a sample taken from two separate populations. These are the uniform and bimodal distributions.
A uniform frequency distribution, sometimes called a flat distribution, is a type of distribution where all the bars are about the same height. This type of distribution arises in scenarios where all the possible outcomes are equally likely. A uniform distribution is also symmetric.
As an example, the possible outcomes of rolling a fair six-sided die are 1, 2, 3, 4, 5, and 6, and they each have an equal probability of occurring. The following applet simulates rolling a die 100 times and records the frequency of each outcome.
A bimodal distribution is a data distribution in which the values cluster near two individual values or two intervals, separating the data into two clusters. This causes the histogram of the data to have two peaks. The mean and the median of a bimodal distribution are near the center of the distribution.
The given distribution indicates that the sampling was likely made from two different populations. The term bimodal refers to the peaks of the distribution, which differs from the mode when intervals are used to make the data display. It is worth mentioning that a bimodal distribution whose bars are about the same height on each side of the peaks is also symmetric.
Consider a histogram that shows the attendance per hour at a local restaurant.
The peaks represent typical lunch and dinner hours. Since the histogram has two distinct peaks, it has a bimodal distribution. Traffic patterns, heights, and test scores are other examples that can show a bimodal distribution. Although histograms are mainly used to show bimodality, other representations such as dot plots and stem-and-leaf plots can also show it.
Color | Frequency |
---|---|
Red | 164 |
Blue | 168 |
Yellow | 168 |
Pink | 166 |
Green | 165 |
Orange | 169 |
Tadeo collected the other data set from a survey about the exam scores of their classmates.
Exam Score | Frequency |
---|---|
70−71 | 3 |
72−73 | 8 |
74−75 | 11 |
76−77 | 9 |
78−79 | 4 |
80−81 | 2 |
82−83 | 2 |
84−85 | 4 |
86−87 | 8 |
88−89 | 11 |
90−91 | 9 |
92−93 | 7 |
94−95 | 2 |
They now have two data sets and would like to display the data in histograms to analyze them.
How many peaks does the histogram have? Draw a vertical line through the middle of the distribution.
For the first data set, the horizontal axis will be the Color and the vertical axis the Frequency. Then draw the bars to represent the frequency of each color outcome.
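Whether the bars are "about the same height" can also be checked numerically. The sketch below compares each color frequency with the height every bar would have in a perfectly uniform distribution. The 5% `tolerance` is an arbitrary assumption chosen for illustration, not a standard cutoff.

```python
# Color frequencies copied from the table.
colors = {"Red": 164, "Blue": 168, "Yellow": 168,
          "Pink": 166, "Green": 165, "Orange": 169}

total = sum(colors.values())
expected = total / len(colors)  # bar height under a perfectly uniform distribution

# Largest relative gap between an observed bar and the expected height.
max_relative_gap = max(abs(freq - expected) / expected for freq in colors.values())

tolerance = 0.05  # assumed threshold for "about the same height"
looks_uniform = max_relative_gap <= tolerance

print(total, round(expected, 1), round(max_relative_gap, 3), looks_uniform)
```

Here no bar differs from the expected height by more than about 2%, which supports describing the color data as uniformly distributed.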
For the second data set, the horizontal axis will be the Exam Score and the vertical axis will be the Frequency.
The histogram of the exam scores has two peaks. Additionally, these two peaks split the data into two clusters. This means that the data follows a bimodal distribution. Moreover, suppose a vertical line is drawn through the middle of the distribution. The data on the left is an approximate mirror image of the data on the right.
This means that the distribution is bimodal and symmetric. A possible explanation for this data is that it comes from two groups, one group of students who did not study for the exam (the first peak on the left) and one group that did study for the exam (the second peak on the right).
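The peaks of the exam score histogram can be located directly from the frequency table. In this minimal sketch, a bar counts as a peak when it is strictly taller than both of its neighbors. The helper `find_peaks` is hypothetical, and this simple rule can report extra peaks on noisier data.

```python
# Exam score intervals and frequencies, copied from the table.
exam_scores = [("70-71", 3), ("72-73", 8), ("74-75", 11), ("76-77", 9),
               ("78-79", 4), ("80-81", 2), ("82-83", 2), ("84-85", 4),
               ("86-87", 8), ("88-89", 11), ("90-91", 9), ("92-93", 7),
               ("94-95", 2)]

def find_peaks(freqs):
    """Return the positions of bars taller than both neighboring bars."""
    return [i for i in range(1, len(freqs) - 1)
            if freqs[i - 1] < freqs[i] > freqs[i + 1]]

frequencies = [freq for _, freq in exam_scores]
peak_intervals = [exam_scores[i][0] for i in find_peaks(frequencies)]

print(peak_intervals)
```

Two intervals are reported, one in each cluster, which matches the bimodal shape described above.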
A box plot is another data display that allows one to see the shape of a frequency distribution. The length of the whiskers and the position of the median tell whether the distribution is skewed or symmetric.
During their fantastic journey exploring the restaurants in their neighborhood, Tadeo and Ramsha found a fabulous Italian restaurant. While eating their food, they observed the people who entered the restaurant.
The two are curious about the average age of people eating at this restaurant. Therefore, they decide to collect data on the ages of people who enter the restaurant during a typical day.
Ages of People Who Enter the Italian Restaurant on a Typical Day | |||||
---|---|---|---|---|---|
15 | 53 | 55 | 60 | 38 | 56 |
62 | 14 | 44 | 24 | 32 | 10 |
42 | 54 | 47 | 67 | 60 | 50 |
61 | 30 | 30 | 62 | 62 | 65 |
56 | 52 | 35 | 25 | 34 | 32 |
They now want to draw some conclusions from this data set by displaying it in a box plot.
Five Number Summary | |
---|---|
Minimum Value | 10 |
First Quartile | 32 |
Median | 48.5 |
Third Quartile | 60 |
Maximum Value | 67 |
Now, draw a box from the first quartile to the third quartile. Then, draw a line through the median and the whiskers from the box to the minimum and maximum values.
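The five-number summary can be reproduced with a short Python sketch. It assumes the convention in which the quartiles are the medians of the lower and upper halves of the sorted data, which is the convention that matches the values in the table. Library functions such as `statistics.quantiles` interpolate differently and may return slightly different quartiles. The function name `five_number_summary` is hypothetical.

```python
from statistics import median

# Ages of the people who entered the restaurant, copied from the table.
ages = [15, 53, 55, 60, 38, 56, 62, 14, 44, 24, 32, 10, 42, 54, 47,
        67, 60, 50, 61, 30, 30, 62, 62, 65, 56, 52, 35, 25, 34, 32]

def five_number_summary(values):
    """Minimum, first quartile, median, third quartile, and maximum."""
    ordered = sorted(values)
    half = len(ordered) // 2
    lower = ordered[:half]
    # With an odd number of values, the middle value belongs to neither half.
    upper = ordered[half:] if len(ordered) % 2 == 0 else ordered[half + 1:]
    return (ordered[0], median(lower), median(ordered), median(upper), ordered[-1])

summary = five_number_summary(ages)
print(summary)
```

The box of the box plot spans the second and fourth values of the tuple, so it contains the middle 50% of the ages.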
Notice that this plot corresponds to option A.
Therefore, the option stating that 50% of the people who enter the Italian restaurant on a typical day are between 32 and 60 years old is the right one.
This gives the two correct statements about the girls' data set.
Appropriate Measures of Center | |
---|---|
Girls' Data Set | Boys' Data Set |
Mean | Median |
To identify who is more likely to spend more time on video games, compare these measures of center. Since the girls' distribution is symmetric, its mean is probably in the 6.1−8 interval, the center of the distribution.
Conversely, because the boys' distribution is skewed, its median lies to the right of the center and closer to the peak of the distribution, so it is probably in the 8.1−10 interval.
Comparing these values, notice that the median of the boys is greater than the mean of the girls.
Ramsha and Tadeo took a survey of their classmates about the number of hours spent playing video games on the weekends. These results made Ramsha more interested in finding other everyday situations with remarkable differences due to gender. She decided to search the web for similar examples.
During her investigation, Ramsha found a peculiar table showing the results of a survey comparing the amount of money that men and women usually spend on clothes per month.
Statistic | Women | Men |
---|---|---|
Survey Size | 100 | 100 |
Minimum | $18 | $8 |
Maximum | $60 | $28 |
1st Quartile | $30 | $14 |
Median | $34 | $18 |
3rd Quartile | $40 | $22 |
Mean | $36 | $18 |
Standard Deviation | $8 | $4 |
Next, draw the box for each plot using the first and third quartiles. Finally, draw a line through the median and the whiskers from the box to the minimum and maximum values of each data set.
Notice that this corresponds to the box plot in option D.
Group | Standard Deviation | Interquartile Range |
---|---|---|
Women | $8 | 40−30=$10 |
Men | $4 | 22−14=$8 |
Both the standard deviation and the interquartile range are greater for women. This means that there is more variability in the amount of money spent by women.
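The interquartile range is the third quartile minus the first quartile, while the maximum minus the minimum gives the range. The short sketch below computes both from the survey table so the two measures are not confused. The dictionary layout and key names are assumptions made for illustration.

```python
# Summary statistics in dollars, copied from the survey table.
survey = {
    "Women": {"min": 18, "q1": 30, "median": 34, "q3": 40, "max": 60},
    "Men":   {"min": 8,  "q1": 14, "median": 18, "q3": 22, "max": 28},
}

spread = {}
for group, stats in survey.items():
    spread[group] = {
        "range": stats["max"] - stats["min"],  # full extent of the data
        "iqr": stats["q3"] - stats["q1"],      # width of the middle 50%
    }

for group, values in spread.items():
    print(group, values)
```

Both measures are larger for women than for men, so either one leads to the same conclusion about variability.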
All the pieces needed to analyze data using histograms have been covered. This method of displaying data makes it easier to identify the distribution of the data and determine the best measures of center and variation to describe the data set. Recall the data Tadeo and Ramsha recorded about the main dishes of the restaurants in their neighborhood at the beginning of the lesson.
Average Main Dish Price (Dollars) | ||||
---|---|---|---|---|
10.12 | 9.29 | 8.29 | 9.78 | 10.69 |
9.68 | 12.09 | 8.94 | 10.81 | 8.62 |
11.39 | 12.62 | 8.71 | 10.74 | 10.52 |
10.77 | 10.15 | 9.18 | 8.45 | 9.52 |
11.89 | 9.77 | 9.44 | 13.24 | 11.01 |
10.62 | 9.38 | 12.15 | 9.68 | 9.60 |
10.32 | 11.31 | 11.41 | 8.62 | 9.27 |
10.96 | 9.18 | 10.28 | 10.71 | 10.02 |
The two students wanted to draw some insights and conclusions based on this data. However, they were not entirely sure how to proceed with this task. Find the following information to help these curious connoisseurs!
Average Main Dish Price (Dollars) | |
---|---|
Price Range | Frequency |
8.00−8.99 | 6 |
9.00−9.99 | 12 |
10.00−10.99 | 13 |
11.00−11.99 | 5 |
12.00−12.99 | 3 |
13.00−13.99 | 1 |
To draw the histogram, let the horizontal axis be the Price Range and the vertical axis the Frequency. Then, draw the bars to represent the frequency of each interval.
Distribution | Measure of Center | Measure of Variation |
---|---|---|
Symmetric | Mean | Standard deviation |
Skewed | Median | Five-number summary |
Because in this situation the distribution of the data is skewed, the median and the five-number summary best describe the center and variation of the data, respectively.