Here are a few recommended readings before getting started with this lesson.
Emily and Ignacio love learning about animals. They believe they can make meaningful discoveries by studying data about any animal, beginning with cats. They choose to create a data set, consisting of seven data points, showing the lifespan of cats in their neighborhood. They surveyed their neighbors to get this information.
Lifespan of Cats (in years) | | | |
---|---|---|---|
15 | 11 | 14 | 15 |
14 | 17 | 13 | |
Answer the following questions using this data set.
A data set is a collection of values that provides information. These values can be presented in various ways such as in numbers or categories. The values are typically gathered through measurements, surveys, or experiments. Consider a data set that consists of the heights of a group of actors.
Actor | Height |
---|---|
Madzia | 5 ft 4 in. |
Magda | 5 ft 2 in. |
Ignacio | 6 ft 1.6 in. |
Henrik | 5 ft 10 in. |
Ali | 6 ft 1 in. |
Diego | 5 ft 2 in. |
Miłosz | 5 ft 2 in. |
Paulina | 5 ft 3 in. |
Aybuke | 5 ft 7 in. |
Mateusz | 6 ft 1.2 in. |
Gamze | 5 ft 3 in. |
Marcin | 5 ft 7 in. |
Marcial | 5 ft 8 in. |
Heichi | 5 ft 5 in. |
Arkadiusz | 5 ft 6 in. |
Enrique | 5 ft 10.5 in. |
Aleksandra | 5 ft 4 in. |
Mateusz | 5 ft 9 in. |
Jordan | 5 ft 5 in. |
Paula | 5 ft 2 in. |
MacKenzie | 5 ft 6 in. |
Joe | 6 ft 1 in. |
Flavio | 5 ft 10 in. |
Jeremy | 5 ft 4 in. |
Umut | 6 ft 1 in. |
The mean, or the average, of a numerical data set is one of the measures of center. It is defined as the sum of all of the data values in a set divided by the number of values in the set.
$\text{Mean} = \dfrac{\text{Sum of Values}}{\text{Number of Values}}$
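As a minimal sketch of this definition, the mean of the seven cat lifespans from the table at the start of the lesson can be computed directly. The variable names are chosen here for illustration only.

```python
# Lifespans of the seven cats from the opening table (in years).
cat_lifespans = [15, 11, 14, 15, 14, 17, 13]

total = sum(cat_lifespans)   # sum of values
count = len(cat_lifespans)   # number of values
mean = total / count         # sum of values divided by number of values

print(f"sum = {total}, count = {count}, mean = {mean:.2f}")
# sum = 99, count = 7, mean = 14.14
```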
Ignacio volunteers at a dog shelter. He asks Emily to help him study a data set he made concerning the lifespan of some of the dogs. The information they gather will help the shelter! This time, the data set consists of eight data points rather than seven.
Lifespan of Dogs (in years) | | | |
---|---|---|---|
10 | 21 | 16 | 15 |
13 | 15 | 17 | 11 |
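The mean lifespan of the dogs is found by substituting the values into the formula, adding the terms, and calculating the quotient.
Substitute values: $\text{Mean} = \dfrac{10 + 21 + 16 + 15 + 13 + 15 + 17 + 11}{8}$
Add terms: $\text{Mean} = \dfrac{118}{8}$
Calculate quotient: $\text{Mean} = 14.75$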
Just as measures of center describe a typical value with a single number, there are measures that describe, with a single number, how much the values in a data set differ from each other. These measures summarize the spread of the data.
The range is a measure of spread equal to the difference between the maximum and minimum values of a data set.
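For instance, the range of the eight dog lifespans given earlier can be sketched as follows. The variable names are illustrative.

```python
# Lifespans of the eight shelter dogs (in years).
dog_lifespans = [10, 21, 16, 15, 13, 15, 17, 11]

# Range = maximum value - minimum value.
data_range = max(dog_lifespans) - min(dog_lifespans)

print(data_range)  # 11
```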
The interquartile range, or IQR, of a data set is a measure of spread equal to the difference between the upper quartile Q3 and the lower quartile Q1.
IQR=Q3−Q1
The IQR of a data set can be found by following the steps below.
First, identify the median of the given data set. Since the number of values is even, the median is the mean of the two middle values.
The median of the data is 6.
The median divides the data into two halves, a lower half and an upper half. For this data, the lower half includes the first six values and the upper half includes the following six.
When there is an odd number of values in the data set, the middle value is excluded from both the lower and upper sets.
Find the first and the third quartile. The first quartile, Q1, is the median of the lower set, while the third, Q3, is the median of the upper set. Here, both quartiles are found the same way the median was found.
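The steps above can be sketched in a few lines of Python. The function name `quartiles` is hypothetical, and the sketch assumes the method described here: the halves are taken from the ordered data, and the middle value is left out of both halves when the number of values is odd. Statistical software often uses other quartile conventions, so library results may differ slightly.

```python
from statistics import median

def quartiles(values):
    """Return (Q1, median, Q3) using the median-of-halves method."""
    ordered = sorted(values)
    half = len(ordered) // 2
    lower = ordered[:half]                  # lower half (middle value excluded if n is odd)
    upper = ordered[len(ordered) - half:]   # upper half (middle value excluded if n is odd)
    return median(lower), median(ordered), median(upper)

dog_lifespans = [10, 21, 16, 15, 13, 15, 17, 11]   # even number of values
cat_lifespans = [15, 11, 14, 15, 14, 17, 13]       # odd number of values

q1, med, q3 = quartiles(dog_lifespans)
print(q1, med, q3, q3 - q1)   # 12.0 15.0 16.5 4.5

q1, med, q3 = quartiles(cat_lifespans)
print(q1, med, q3, q3 - q1)   # 13 14 15 2
```

The function needs at least two values, since each half must contain at least one value.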
Here, it is necessary to order the values from least to greatest. Then identify the median of the given data set. Since the number of values is an odd number, the median is the middle value.
The median of the data is 9. Both the lower and upper halves contain four data values. Therefore, there are two middle values in each half. The median of each half is the mean of the two middle values.
The first quartile is 7.5, and the third quartile is 11. The difference between the third quartile and the first quartile is the interquartile range. The interquartile range of cat weights is 3.5 pounds.
In this case, the data values are ordered from least to greatest and the number of values is an even number. This means that the median is the mean of the two middle values.
The median of the data is 33.5. Both the lower and upper halves contain five data values. Therefore, there is only one middle value in each half.
The first quartile is 29, and the third quartile is 44. The difference between the third quartile and the first quartile is the interquartile range. The interquartile range of dog weights is 15 pounds.
A five-number summary of a data set consists of the following five values: the minimum, the first quartile Q1, the median, the third quartile Q3, and the maximum.
These values provide a summary of the central tendency and spread of the data set. The five-number summary is useful for understanding the variability in a data set. When the data set is written in numerical order, the median divides the data set into two halves. The median of the lower half is the first quartile Q1 and the median of the upper half is the third quartile Q3.
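A minimal sketch of a five-number summary, taking the five values to be the minimum, Q1, the median, Q3, and the maximum, and finding the quartiles as the medians of the lower and upper halves. The function name `five_number_summary` is hypothetical.

```python
from statistics import median

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) for a numerical data set."""
    ordered = sorted(values)
    half = len(ordered) // 2
    q1 = median(ordered[:half])                   # median of the lower half
    q3 = median(ordered[len(ordered) - half:])    # median of the upper half
    return ordered[0], q1, median(ordered), q3, ordered[-1]

dog_lifespans = [10, 21, 16, 15, 13, 15, 17, 11]
print(five_number_summary(dog_lifespans))
# (10, 12.0, 15.0, 16.5, 21)
```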
An outlier is a data point that is significantly different from the other values in the data set. It can be significantly larger or significantly smaller than the others.
Categorical data sometimes also have unusual elements; these can be called outliers as well.
However, it is best to use the term outlier when applying a mathematical process to identify it. This approach helps differentiate between an intuitive and a more formal approach.
What does "significantly different" mean?
For numerical data, the following definition is one of several approaches that can be used.
A data value is an outlier if it is less than Q1 − 1.5 ⋅ IQR or greater than Q3 + 1.5 ⋅ IQR.
The multiplier 1.5 was suggested by the esteemed American mathematician John Tukey.
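The sketch below assumes the commonly used rule attributed to Tukey: a value is flagged as an outlier if it lies more than 1.5 times the IQR below Q1 or above Q3. The function name `tukey_outliers`, the parameter `k`, and the extra value 40 are hypothetical and used only for illustration. The value 40 is not part of the lesson's data.

```python
from statistics import median

def tukey_outliers(values, k=1.5):
    """Return the values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    ordered = sorted(values)
    half = len(ordered) // 2
    q1 = median(ordered[:half])                   # median of the lower half
    q3 = median(ordered[len(ordered) - half:])    # median of the upper half
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr        # lower and upper limits
    return [v for v in ordered if v < low or v > high]

dog_lifespans = [10, 21, 16, 15, 13, 15, 17, 11]

print(tukey_outliers(dog_lifespans))          # []
print(tukey_outliers(dog_lifespans + [40]))   # [40]  (40 is a made-up value)
```

With the original eight values the limits are 5.25 and 23.25, so no value is flagged. Adding the made-up value 40 moves the limits to 1.5 and 29.5, and 40 falls outside them.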
After excluding the outlier, the number of values decreased by one. There are nine values now, so the median is the middle value.
The median of the data is 32. Both the lower and upper halves contain four data values. Therefore, there are two middle values in each half. The median of each half is the mean of the two middle values.
The first quartile is 23.5, and the third quartile is 40.5. The difference between the third quartile and the first quartile is the interquartile range. The interquartile range of the data when the outlier is taken out of the data set is 17.
Data | Range | IQR |
---|---|---|
With Outliers | 68 | 15 |
Without Outliers | 44 | 17 |
After removing the outlier from the data, the range decreased from 68 to 44, while the IQR increased from 15 to 17. This example shows that outliers have a bigger impact on the range of values than on the IQR.
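The same comparison can be tried on other data. The sketch below uses the eight dog lifespans from earlier in the lesson plus a made-up extreme value of 40, which is an assumption for illustration and not the data behind the table above. The function name `spread` is hypothetical.

```python
from statistics import median

def spread(values):
    """Return (range, IQR), with quartiles taken as medians of the halves."""
    ordered = sorted(values)
    half = len(ordered) // 2
    q1 = median(ordered[:half])
    q3 = median(ordered[len(ordered) - half:])
    return ordered[-1] - ordered[0], q3 - q1

without_outlier = [10, 21, 16, 15, 13, 15, 17, 11]
with_outlier = without_outlier + [40]   # 40 is a made-up extreme value

print(spread(with_outlier))      # (30, 7.0)
print(spread(without_outlier))   # (11, 4.5)
```

In this sketch the range changes by 19 while the IQR changes by only 2.5, which agrees with the observation that an extreme value affects the range far more than the IQR.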
Measures of spread, such as the range and interquartile range, indicate how much the data values vary, while outliers are values that deviate significantly from the rest. Practice calculating these measures for the given data.