Most of the data is grouped next to the mean value, which is 9. Therefore, a man who wears a size 9.5 shoe is more likely to be randomly selected than a man who wears a size 11.5 shoe. When a data set is distributed this way and the domain of the distribution is continuous — not discrete — it is said that the data is normally distributed. This lesson explores this distribution.
Here are a few recommended readings before getting started with this lesson.
Kevin has a summer internship at a tech company in his town. The daily number of calls that the company receives is normally distributed with a mean of 2240 calls and a standard deviation of 150 calls. The graph represents the distribution of the data.
Looking to make improvements in the company, Kevin's boss is interested in knowing the answers to the next couple of questions.
When dealing with probability distributions, one type stands out above the rest because it appears in many real-life scenarios, such as people's heights, shoe sizes, birth weights, average grades, IQ levels, and many other qualities. Because of this regularity, this type of distribution is called the normal distribution.
A normal distribution is a type of probability distribution where the mean, the median, and the mode are all equal to each other. The graph that represents a normal distribution is called a normal curve and it is a continuous, bell-shaped curve that is symmetric with respect to the mean μ of the data set.
This type of distribution is the most common continuous probability distribution that can be observed in real life. When a normal distribution has a mean of 0 and standard deviation of 1, it is called a standard normal distribution.
The total area under the normal curve is 100%, or 1. Because of this, the area under the normal curve in a certain interval represents the percentage of data within that interval or the probability of randomly selecting a value that belongs to that interval. The Empirical Rule can be used to determine the area under the normal curve at specific intervals. It is also worth noting that not all data sets are normally distributed. If the mean and median are not equal, then the data set is skewed.
In statistics, the Empirical Rule, also known as the 68–95–99.7 rule, is a shorthand used to remember the percentage of values that lie within certain intervals in a normal distribution. The rule states the following three facts.
Empirical Rule.
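The three percentages in the rule's name can be checked numerically. The following is a minimal sketch that uses `NormalDist` from Python's standard `statistics` module. The helper name `area_within` is an assumption made for illustration and is not part of the lesson.

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1.
standard = NormalDist(mu=0, sigma=1)

def area_within(k):
    """Area under the standard normal curve between -k and k."""
    return standard.cdf(k) - standard.cdf(-k)

# Print the share of data within 1, 2, and 3 standard deviations of the mean.
for k in (1, 2, 3):
    print(f"within {k} standard deviation(s): {100 * area_within(k):.2f}%")
```

The printed values land close to 68%, 95%, and 99.7%, which is where the rule's name comes from.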
In his spare time, Kevin works with the Less Chat, More Talk campaign to encourage people to share with their loved ones in person instead of through screens. He wants to give away T-shirts with a cool logo outside a shopping mall to help spread this message.
Kevin is in charge of preparing the men's T-shirts, but he does not know how many of each size he should order. To figure it out, he searched the City Hall website and he found that the heights of the men in the city are normally distributed with a mean of 183 centimeters and a standard deviation of 5 centimeters. Along with this information, there was also a graph.
Due to the symmetry of the normal curve, 2.5% of the data fall to the left of 173 and 2.5% of the data fall to the right of 193. Consequently, 2.5% of the men surveyed are shorter than 173 centimeters.
According to the graph, 13.5% of the surveyed men are between 188 and 193 centimeters tall.
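To find the number of men that belong to this range, multiply the corresponding percentage by the total number of men that participated in the survey. The calculation rewrites the percentage with $a\% = \frac{a}{100}$, multiplies using $\frac{a}{c} \cdot b = \frac{a \cdot b}{c}$, and then calculates the quotient.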
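The percentage-to-count step can be sketched as follows. The survey size is not stated in this part of the lesson, so `hypothetical_total` is a made-up number used only to illustrate the arithmetic. The function name `count_in_interval` is likewise an assumption.

```python
def count_in_interval(percent, total):
    """Convert a percentage of a group into a head count: (percent / 100) * total."""
    return percent / 100 * total

# 13.5% is the share of men between 188 and 193 centimeters read from the graph.
# The survey size below is hypothetical and only illustrates the calculation.
hypothetical_total = 2000
men_in_range = count_in_interval(13.5, hypothetical_total)
print(round(men_in_range))
```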
First, draw a horizontal axis and mark the mean of the data in the middle. In this case, the mean is 10.
Find more labels to write on the axis such that each interval is one standard deviation long. In this case, the intervals must be 2 units long. To accomplish this, add and subtract multiples of the standard deviation to and from the mean.
Labels to the Left of the Mean | Labels to the Right of the Mean |
---|---|
10−1⋅2=8 | 10+1⋅2=12 |
10−2⋅2=6 | 10+2⋅2=14 |
10−3⋅2=4 | 10+3⋅2=16 |
Adding three labels to each side of the mean is enough.
Lastly, draw a bell-shaped curve with its peak at the mean. Remember, the curve is symmetric with respect to the mean. In this case, the peak occurs at 10.
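The axis labels from the table can also be generated programmatically. This is a small sketch, with the hypothetical helper `axis_labels` returning the mean plus and minus whole multiples of the standard deviation.

```python
def axis_labels(mean, sd, k=3):
    """Labels at mean - k*sd, ..., mean, ..., mean + k*sd, one standard deviation apart."""
    return [mean + i * sd for i in range(-k, k + 1)]

# Mean 10 and standard deviation 2, as in the example above.
labels = axis_labels(10, 2)
print(labels)  # [4, 6, 8, 10, 12, 14, 16]
```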
While reading some statistics about the people in the city, Kevin was surprised to learn that the weights of newborns are also normally distributed. He found the following information given by the local hospital.
Next, draw the normal curve — a bell-shaped curve that is symmetric with respect to the mean, where it has its peak.
According to the Empirical Rule, the percentages below the curve are distributed as follows.
The percentages in every interval can be labeled by using the symmetry of the curve. This will complete the diagram of the distribution.
The height of people is usually normally distributed. For example, the average height of a woman in the United States is about 162.5 centimeters. Assuming a standard deviation of 2.5 centimeters, the graph of this distribution looks as follows.
The Empirical Rule is used to determine the percentage of data that falls between any two labels on the axis. However, what if the endpoints of the interval are different from the labels? For example, what is the percentage of women that are shorter than 166 centimeters?
To find such a percentage, the first step is converting the data value into its corresponding z-score.
The z-score, also known as the z-value, represents the number of standard deviations that a given value x is from the mean of a data set. The following formula can be used to convert any x-value into its corresponding z-score.
$z = \frac{x - \mu}{\sigma}$
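As a quick illustration of the formula, the sketch below converts the height of 166 centimeters from the example into a z-score. The function name `z_score` is chosen here for illustration.

```python
def z_score(x, mean, sd):
    """Number of standard deviations that x lies from the mean."""
    return (x - mean) / sd

# Women's heights: mean 162.5 cm and an assumed standard deviation of 2.5 cm.
z = z_score(166, 162.5, 2.5)
print(round(z, 2))  # 1.4
```

A positive result means the value lies above the mean, and a negative result means it lies below.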
Consider a standard normal distribution and a randomly chosen z-score. The area below the normal curve that is to the left of this z-score can be calculated using a standard normal table. For example, consider z=0.6.
The percentage of data that is less than or equal to z can be determined following three steps.
In the left column of the standard normal table, locate the whole part of the z-score. Since z=0.6 is positive, look at the four bottom rows. Because the whole part of z is 0, shade the fifth row.
z | .0 | .1 | .2 | .3 | .4 | .5 | .6 | .7 | .8 | .9 |
---|---|---|---|---|---|---|---|---|---|---|
-3 | .00135 | .00097 | .00069 | .00048 | .00034 | .00023 | .00016 | .00011 | .00007 | .00005 |
-2 | .02275 | .01786 | .01390 | .01072 | .00820 | .00621 | .00466 | .00347 | .00256 | .00187 |
-1 | .15866 | .13567 | .11507 | .09680 | .08076 | .06681 | .05480 | .04457 | .03593 | .02872 |
-0 | .50000 | .46017 | .42074 | .38209 | .34458 | .30854 | .27425 | .24196 | .21186 | .18406 |
0 | .50000 | .53983 | .57926 | .61791 | .65542 | .69146 | .72575 | .75804 | .78814 | .81594 |
1 | .84134 | .86433 | .88493 | .90320 | .91924 | .93319 | .94520 | .95543 | .96407 | .97128 |
2 | .97725 | .98214 | .98610 | .98928 | .99180 | .99379 | .99534 | .99653 | .99744 | .99813 |
3 | .99865 | .99903 | .99931 | .99952 | .99966 | .99977 | .99984 | .99989 | .99993 | .99995 |
The probability that corresponds to a z-score for which the integer part is 0 appears in the shaded row.
In the top row of the standard normal table, locate the decimal part of the z-score. Here, the decimal part is 6. Consequently, shade the seventh column.
z | .0 | .1 | .2 | .3 | .4 | .5 | .6 | .7 | .8 | .9 |
---|---|---|---|---|---|---|---|---|---|---|
-3 | .00135 | .00097 | .00069 | .00048 | .00034 | .00023 | .00016 | .00011 | .00007 | .00005 |
-2 | .02275 | .01786 | .01390 | .01072 | .00820 | .00621 | .00466 | .00347 | .00256 | .00187 |
-1 | .15866 | .13567 | .11507 | .09680 | .08076 | .06681 | .05480 | .04457 | .03593 | .02872 |
-0 | .50000 | .46017 | .42074 | .38209 | .34458 | .30854 | .27425 | .24196 | .21186 | .18406 |
0 | .50000 | .53983 | .57926 | .61791 | .65542 | .69146 | .72575 | .75804 | .78814 | .81594 |
1 | .84134 | .86433 | .88493 | .90320 | .91924 | .93319 | .94520 | .95543 | .96407 | .97128 |
2 | .97725 | .98214 | .98610 | .98928 | .99180 | .99379 | .99534 | .99653 | .99744 | .99813 |
3 | .99865 | .99903 | .99931 | .99952 | .99966 | .99977 | .99984 | .99989 | .99993 | .99995 |
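The value where the shaded row and column meet can be cross-checked in software. The sketch below uses `NormalDist().cdf` from Python's standard library in place of the printed table. The helper name `area_left` is an assumption.

```python
from statistics import NormalDist

def area_left(z):
    """Area under the standard normal curve to the left of z."""
    return NormalDist().cdf(z)

# Row "0", column ".6" of the table corresponds to z = 0.6.
area = area_left(0.6)
print(round(area, 5))
```

The result should agree with the table entry .72575 up to rounding.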
Other areas can also be found using the same standard normal table.
To find the area below the normal curve and between two z-scores, subtract the area to the left of the smaller z-score from the area to the left of the greater z-score.
The area to the right of a z-score is the complement of the area to the left of the same z-score.
Since the area under the normal curve represents a probability, by the Complement Rule, these two probabilities add up to 1.
$P(z > z_1) = 1 - P(z \le z_1)$
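Both rules, subtracting two left areas and taking the complement, can be sketched in a few lines. The function names `area_between` and `area_right` are illustrative assumptions, and the z-values are arbitrary examples.

```python
from statistics import NormalDist

standard = NormalDist()  # mean 0, standard deviation 1

def area_between(z_low, z_high):
    """Left area of the greater z-score minus left area of the smaller one."""
    return standard.cdf(z_high) - standard.cdf(z_low)

def area_right(z):
    """Complement of the area to the left of z."""
    return 1 - standard.cdf(z)

between = area_between(-1.0, 1.0)  # about 68%, consistent with the Empirical Rule
right = area_right(0.6)            # 1 minus the table entry for z = 0.6
print(round(between, 5), round(right, 5))
```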
For a height of 166 centimeters, the z-score is $z = \frac{166 - 162.5}{2.5} = 1.4$. According to the standard normal table, the probability that a randomly selected value is less than or equal to 1.4 is 0.91924. Therefore, about 91.92% of women are 166 centimeters tall or shorter.
Kevin has become a stats fan. He has recorded the time it takes him to commute to his internship over the past few days. He observes that the times are normally distributed with a mean of 17 minutes and a standard deviation of 2.5 minutes.
Find the following probabilities and write them in decimal form rounded to two decimal places.
The probability that Kevin spends less than 14 minutes getting to work tomorrow is represented by the area below the curve that is to the left of 14.
Since 14 is not a label on the axis, the Empirical Rule cannot be used. Therefore, to find the area, first convert x=14 into its corresponding z-score.
$z = \frac{x - \mu}{\sigma} = \frac{14 - 17}{2.5} = \frac{-3}{2.5} = -\frac{3}{2.5} = -1.2$
z | .0 | .1 | .2 | .3 | .4 | .5 | .6 | .7 | .8 | .9 |
---|---|---|---|---|---|---|---|---|---|---|
-3 | .00135 | .00097 | .00069 | .00048 | .00034 | .00023 | .00016 | .00011 | .00007 | .00005 |
-2 | .02275 | .01786 | .01390 | .01072 | .00820 | .00621 | .00466 | .00347 | .00256 | .00187 |
-1 | .15866 | .13567 | .11507 | .09680 | .08076 | .06681 | .05480 | .04457 | .03593 | .02872 |
-0 | .50000 | .46017 | .42074 | .38209 | .34458 | .30854 | .27425 | .24196 | .21186 | .18406 |
0 | .50000 | .53983 | .57926 | .61791 | .65542 | .69146 | .72575 | .75804 | .78814 | .81594 |
1 | .84134 | .86433 | .88493 | .90320 | .91924 | .93319 | .94520 | .95543 | .96407 | .97128 |
2 | .97725 | .98214 | .98610 | .98928 | .99180 | .99379 | .99534 | .99653 | .99744 | .99813 |
3 | .99865 | .99903 | .99931 | .99952 | .99966 | .99977 | .99984 | .99989 | .99993 | .99995 |
According to the table, the probability that tomorrow Kevin will spend less than 14 minutes traveling to work is about 0.12.
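The same probability can be obtained without the table. The sketch below models the commute times with `NormalDist(mu=17, sigma=2.5)` and evaluates the cumulative probability at 14 minutes. The variable names are chosen for illustration.

```python
from statistics import NormalDist

# Commute times: mean 17 minutes, standard deviation 2.5 minutes.
commute = NormalDist(mu=17, sigma=2.5)

# z-score of 14 minutes, matching the hand calculation.
z = (14 - commute.mean) / commute.stdev

# Area to the left of 14 under the commute-time curve.
p_less_than_14 = commute.cdf(14)
print(round(z, 1), round(p_less_than_14, 2))  # -1.2 0.12
```

Evaluating the cumulative probability at x=14 on the original distribution gives the same area as looking up z=-1.2 on the standard normal curve.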
Therefore, both values will need to be converted into their corresponding z-scores first. Recall that μ=17 and σ=2.5!
The conversion uses $z = \frac{x - \mu}{\sigma}$.
x-value | Substitute | Simplify |
---|---|---|
16 | $z = \frac{16 - 17}{2.5}$ | $z = -0.4$ |
19 | $z = \frac{19 - 17}{2.5}$ | $z = 0.8$ |
z | .0 | .1 | .2 | .3 | .4 | .5 | .6 | .7 | .8 | .9 |
---|---|---|---|---|---|---|---|---|---|---|
-3 | .00135 | .00097 | .00069 | .00048 | .00034 | .00023 | .00016 | .00011 | .00007 | .00005 |
-2 | .02275 | .01786 | .01390 | .01072 | .00820 | .00621 | .00466 | .00347 | .00256 | .00187 |
-1 | .15866 | .13567 | .11507 | .09680 | .08076 | .06681 | .05480 | .04457 | .03593 | .02872 |
-0 | .50000 | .46017 | .42074 | .38209 | .34458 | .30854 | .27425 | .24196 | .21186 | .18406 |
0 | .50000 | .53983 | .57926 | .61791 | .65542 | .69146 | .72575 | .75804 | .78814 | .81594 |
1 | .84134 | .86433 | .88493 | .90320 | .91924 | .93319 | .94520 | .95543 | .96407 | .97128 |
2 | .97725 | .98214 | .98610 | .98928 | .99180 | .99379 | .99534 | .99653 | .99744 | .99813 |
3 | .99865 | .99903 | .99931 | .99952 | .99966 | .99977 | .99984 | .99989 | .99993 | .99995 |
According to the table, $P(z < 0.8) = 0.78814$ and $P(z < -0.4) = 0.34458$. Subtracting gives $P(-0.4 < z < 0.8) = 0.78814 - 0.34458 = 0.44356 \approx 0.44$.
Since 19 is not a label on the axis, the Empirical Rule cannot be used. Therefore, z-scores must be used to find the area. In Part B it was determined that the z-score that corresponds to 19 is 0.8.
Probability of Kevin Being Late | Probability of Kevin Being on Time |
---|---|
P(z>0.8) | P(z≤0.8) |
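By the Complement Rule, $P(z > 0.8) + P(z \le 0.8) = 1$. Substituting $P(z \le 0.8) = 0.78814$ and subtracting $0.78814$ from both sides gives $P(z > 0.8) = 0.21186 \approx 0.21$.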
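The interval and complement calculations for Kevin's commute can be double-checked with a short sketch. It assumes, as the worked solution does, that being late corresponds to a commute longer than 19 minutes. The variable names are illustrative.

```python
from statistics import NormalDist

# Commute times: mean 17 minutes, standard deviation 2.5 minutes.
commute = NormalDist(mu=17, sigma=2.5)

# Probability of a commute between 16 and 19 minutes:
# left area at 19 minus left area at 16.
p_between = commute.cdf(19) - commute.cdf(16)

# Probability of a commute longer than 19 minutes:
# complement of the left area at 19.
p_late = 1 - commute.cdf(19)

print(round(p_between, 2), round(p_late, 2))  # 0.44 0.21
```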
The company Kevin is interning with plans to release a new smartphone. He goes with the research team to a stadium with a prototype to let different people use the phone in order to determine what features and design people like.
After comparing and contrasting size preference with the ages of the participants, Kevin realizes that the data is normally distributed. Additionally, he notices that the middle 46% of participants prefer a larger phone.
Due to the symmetry of the normal curve, the area to the left of z1 is equal to the area to the right of z2. Since these two areas together make up the remaining 100−46=54% of the data, each portion corresponds to 54÷2=27% of the data. For the moment, focus on the area to the left of z1.
According to the last graph, the probability that a randomly chosen value is less than z1 is 0.27. In other words, P(z<z1)=0.27. Now, look for the z-value that produces a probability of 0.27 on a standard normal table.
z | .0 | .1 | .2 | .3 | .4 | .5 | .6 | .7 | .8 | .9 |
---|---|---|---|---|---|---|---|---|---|---|
-3 | .00135 | .00097 | .00069 | .00048 | .00034 | .00023 | .00016 | .00011 | .00007 | .00005 |
-2 | .02275 | .01786 | .01390 | .01072 | .00820 | .00621 | .00466 | .00347 | .00256 | .00187 |
-1 | .15866 | .13567 | .11507 | .09680 | .08076 | .06681 | .05480 | .04457 | .03593 | .02872 |
-0 | .50000 | .46017 | .42074 | .38209 | .34458 | .30854 | .27425 | .24196 | .21186 | .18406 |
0 | .50000 | .53983 | .57926 | .61791 | .65542 | .69146 | .72575 | .75804 | .78814 | .81594 |
1 | .84134 | .86433 | .88493 | .90320 | .91924 | .93319 | .94520 | .95543 | .96407 | .97128 |
2 | .97725 | .98214 | .98610 | .98928 | .99180 | .99379 | .99534 | .99653 | .99744 | .99813 |
3 | .99865 | .99903 | .99931 | .99952 | .99966 | .99977 | .99984 | .99989 | .99993 | .99995 |
It is seen in the table that z1=-0.6. Again, due to symmetry, z2 is the opposite of z1. Therefore, z2=0.6.
Therefore, the limits of the middle 46% of the data are z=-0.6 and z=0.6.
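Reading the table backwards, from an area to a z-score, corresponds to evaluating the inverse of the cumulative distribution. The sketch below uses `NormalDist().inv_cdf` from Python's standard library for this. The variable names are assumptions.

```python
from statistics import NormalDist

standard = NormalDist()  # mean 0, standard deviation 1

middle = 0.46            # share of data in the middle of the curve
tail = (1 - middle) / 2  # area in each tail, 0.27

# z-score with 27% of the area to its left, and its mirror image.
z1 = standard.inv_cdf(tail)
z2 = standard.inv_cdf(1 - tail)
print(round(z1, 1), round(z2, 1))  # -0.6 0.6
```

The inverse function returns a value slightly beyond -0.6 because the table only lists z-scores to one decimal place. Rounded to one decimal, it matches the table lookup.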
Substitute values, simplify using $(-a)b = -ab$ and $a + (-b) = a - b$, and then subtract the terms.
Any normal distribution with mean μ and standard deviation σ can be converted into a standard normal distribution. For example, consider a normal distribution with μ=35 and σ=1.22. To standardize the distribution, all its values have to be converted into their corresponding z-scores.
Since the domain is continuous, the conversion cannot be manually done for all the values. However, for illustrative purposes, it will be performed for the data set {33, 34, 34, 35, 35, 35, 36, 36, 37}. Two steps will be followed.
First, shift all the values so that the mean of the new set is 0. To do this, subtract the mean 35 from each data value.
x | x−μ |
---|---|
33 | 33−35 |
34 | 34−35 |
34 | 34−35 |
35 | 35−35 |
35 | 35−35 |
35 | 35−35 |
36 | 36−35 |
36 | 36−35 |
37 | 37−35 |
Notice that translating the values does not change the standard deviation. The standard deviation of the new data set is still 1.22.
The initial data set has been converted into {-2, -1, -1, 0, 0, 0, 1, 1, 2}.
To obtain a data set with a standard deviation of 1, divide the values obtained in the previous step by the standard deviation of the set.
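x | $\frac{x - \mu}{\sigma}$ | z-Score |
---|---|---|
33 | $\frac{33 - 35}{1.22}$ | -1.64 |
34 | $\frac{34 - 35}{1.22}$ | -0.82 |
34 | $\frac{34 - 35}{1.22}$ | -0.82 |
35 | $\frac{35 - 35}{1.22}$ | 0 |
35 | $\frac{35 - 35}{1.22}$ | 0 |
35 | $\frac{35 - 35}{1.22}$ | 0 |
36 | $\frac{36 - 35}{1.22}$ | 0.82 |
36 | $\frac{36 - 35}{1.22}$ | 0.82 |
37 | $\frac{37 - 35}{1.22}$ | 1.64 |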
After the standardization, the new data set is {-1.64, -0.82, -0.82, 0, 0, 0, 0.82, 0.82, 1.64}. Here, the mean is 0 and the standard deviation is 1.
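The two standardization steps can be sketched with Python's `statistics` module. One assumption is worth flagging: the value 1.22 matches the sample standard deviation (`statistics.stdev`) of the data set rounded to two decimals, so the sketch rounds it the same way to reproduce the table.

```python
from statistics import mean, stdev

data = [33, 34, 34, 35, 35, 35, 36, 36, 37]

mu = mean(data)                # 35
sigma = round(stdev(data), 2)  # sample standard deviation rounded to 1.22 (assumption)

# Step 1: shift every value so that the mean becomes 0.
shifted = [x - mu for x in data]

# Step 2: divide by the standard deviation and round to two decimals.
z_scores = [round(s / sigma, 2) for s in shifted]

print(shifted)
print(z_scores)
```

Using the unrounded standard deviation would change the outer z-scores slightly in the second decimal place.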
Notice that the resulting curve has a similar shape and distribution of data values as the original.
Kevin's friend LaShay took the SAT and scored 640 points on the math section. Kevin took the ACT and scored 28.32 points in the math section.
Since these tests use different scales (the math section of the SAT is scored out of 800 points, while the math section of the ACT is scored out of 36 points), they wonder who did better. They looked at the stats for each test to find out.
Now Kevin's and LaShay's scores will be placed on the horizontal axis of their corresponding test. The score that is further to the right of the mean will tell who stood out the most compared to their class.
Unfortunately, it cannot be determined which score is further to the right of the mean just by looking at the graphs. Since z-scores tell the number of standard deviations above or below the mean that a value is, it is convenient to find the corresponding z-scores.
Student | Score | Mean | Standard Deviation | $z = \frac{x - \mu}{\sigma}$ | z-score |
---|---|---|---|---|---|
LaShay | 640 | 523 | 90 | $z = \frac{640 - 523}{90}$ | $z = 1.3$ |
Kevin | 28.32 | 21 | 6.1 | $z = \frac{28.32 - 21}{6.1}$ | $z = 1.2$ |
LaShay's z-score is greater than Kevin's z-score. This means that her score is further to the right of the mean. Consequently, LaShay excelled more in her class than Kevin did in his.
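The comparison can be reproduced with a short sketch. It uses the scores stated in the problem (640 for LaShay and 28.32 for Kevin) together with the means and standard deviations from the table. The function name `z_score` is an illustrative assumption.

```python
def z_score(x, mean, sd):
    """Number of standard deviations that x lies from the mean."""
    return (x - mean) / sd

# SAT math: mean 523, standard deviation 90. ACT math: mean 21, standard deviation 6.1.
z_lashay = z_score(640, 523, 90)
z_kevin = z_score(28.32, 21, 6.1)

print(round(z_lashay, 1), round(z_kevin, 1))  # 1.3 1.2
print("LaShay" if z_lashay > z_kevin else "Kevin")
```

Because z-scores are unit-free, scores from tests with different scales can be compared directly.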
z | .0 | .1 | .2 | .3 | .4 | .5 | .6 | .7 | .8 | .9 |
---|---|---|---|---|---|---|---|---|---|---|
-3 | .00135 | .00097 | .00069 | .00048 | .00034 | .00023 | .00016 | .00011 | .00007 | .00005 |
-2 | .02275 | .01786 | .01390 | .01072 | .00820 | .00621 | .00466 | .00347 | .00256 | .00187 |
-1 | .15866 | .13567 | .11507 | .09680 | .08076 | .06681 | .05480 | .04457 | .03593 | .02872 |
-0 | .50000 | .46017 | .42074 | .38209 | .34458 | .30854 | .27425 | .24196 | .21186 | .18406 |
0 | .50000 | .53983 | .57926 | .61791 | .65542 | .69146 | .72575 | .75804 | .78814 | .81594 |
1 | .84134 | .86433 |