Categorical data, also called qualitative data, is data that can be split into groups. In other words, it is data belonging to one or more categories that have a fixed number of possible outcomes or values. An example of categorical data is the continent in which a country lies.
Every country on Earth is generally accepted to lie within one of seven continents, and can therefore be categorized as belonging to one continent or another.
A frequency table is a type of table that is used to present values and their frequencies for a particular data set. It lists the possible values or outcomes of the category, and then it lists how many times each value or outcome is observed. As an example, the results of asking a group of ten people about their favorite animals can be presented in a frequency table.
Preference | Frequency |
---|---|
Cats | 4 |
Dogs | 4 |
Quokkas | 1 |
Turtles | 1 |
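A table like this can be produced by counting raw answers. The following is a minimal sketch in Python; the `answers` list is hypothetical raw data, reconstructed only so that it agrees with the ten responses summarized above.

```python
from collections import Counter

# Hypothetical raw survey answers, reconstructed to match the table above.
answers = ["Cats"] * 4 + ["Dogs"] * 4 + ["Quokkas", "Turtles"]

# Counter maps each distinct value to the number of times it was observed.
frequency = Counter(answers)

print("Preference | Frequency |")
print("---|---|")
for preference, count in sorted(frequency.items()):
    print(f"{preference} | {count} |")
```

The frequencies add up to the number of people asked, which is a quick way to check that no answer was missed.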
To begin, we can set up the table. The right column should show frequency, so the left column is "number of pets." Let's add the groups to the left column.
Number of pets | Frequency |
---|---|
0 | |
1−2 | |
3−5 | |
6+ | |
We can now look at the data and count how many times an answer was given in each group. For instance, five of the classmates answered "0," and three gave an answer of 1 or 2. Filling in the entire table like this gives us the desired frequency table.
Number of pets | Frequency |
---|---|
0 | 5 |
1−2 | 3 |
3−5 | 4 |
6+ | 2 |
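The grouping step can also be sketched in code. In the example below, the function name `pet_group` and the `answers` list are assumptions made for illustration: the individual answers are not reproduced here, so the list is hypothetical and was chosen only to give the same group counts as the completed table.

```python
from collections import Counter


def pet_group(number_of_pets: int) -> str:
    """Return the table group that a single answer falls into."""
    if number_of_pets == 0:
        return "0"
    if number_of_pets <= 2:
        return "1-2"
    if number_of_pets <= 5:
        return "3-5"
    return "6+"


# Hypothetical answers, chosen only so that the group counts
# agree with the completed table: 5, 3, 4, and 2.
answers = [0, 0, 0, 0, 0, 1, 2, 2, 3, 3, 4, 5, 6, 8]

# Count how many answers fall into each group.
frequency = Counter(pet_group(n) for n in answers)

for group in ["0", "1-2", "3-5", "6+"]:
    print(f"{group} | {frequency[group]} |")
```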
A two-way frequency table, also known as a two-way table, is a table that displays categorical data that can be grouped into two categories. One of the categories is represented by the rows of the table, the other by the columns. For example, the table below shows the results of a survey in which 100 participants were asked if they have a driver's license and if they own a car.
Here, the two categories are car and driver's license, both with possible answers of yes and no.
The entries in the table are called joint frequencies. Two-way frequency tables often include the total of the rows and columns. These totals are called marginal frequencies.
The cell where the Total row and the Total column meet is equal to the sum of all joint frequencies and is called the grand total. In the case of the survey, the grand total is 100. From the table it can be read that, among other things, 43 people both have a driver's license and own a car. It can also be read that 33 people do not have a driver's license.
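As a rough illustration of how marginal frequencies and the grand total follow from the joint frequencies, here is a short Python sketch. It uses the survey counts given in the driver's license table later in this section; the dictionary layout and variable names are assumptions made for the example.

```python
# Joint frequencies from the survey, stored as joint[car][license].
joint = {
    "Yes": {"Yes": 43, "No": 4},
    "No": {"Yes": 24, "No": 29},
}

# Marginal frequencies: row totals (car) and column totals (driver's license).
car_totals = {car: sum(row.values()) for car, row in joint.items()}
license_totals = {
    answer: sum(row[answer] for row in joint.values())
    for answer in ("Yes", "No")
}

# The grand total is the same whether the rows or the columns are summed.
grand_total = sum(car_totals.values())
assert grand_total == sum(license_totals.values())

print(car_totals)      # {'Yes': 47, 'No': 53}
print(license_totals)  # {'Yes': 67, 'No': 33}
print(grand_total)     # 100
```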
Organizing data in a two-way frequency table can help with visualization, which in turn makes it easier to analyze and present the data. To draw a two-way frequency table, three steps must be followed.
Suppose that 53 people took part in an online survey, where they were asked whether they prefer top hats or berets. Out of the 18 males that participated, 12 of them prefer berets. Also, 15 of the females chose top hats as their preference. The steps listed above will be developed for this example.
First, the two categories of the table must be determined, after which the table can be drawn without frequencies. Here, the participants gave their hat preference and their gender, which are the two categories. Hat preference can be further divided into top hat and beret, and gender into female and male.
The total row and total column are included to write the marginal frequencies.
The given joint and marginal frequencies can now be added to the table.
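Only four of the nine frequencies are stated in the survey description. The remaining ones can presumably be found by addition and subtraction, since each row and each column must add up to its total. A minimal sketch of that arithmetic follows; the variable names are illustrative only.

```python
# Frequencies stated in the survey description.
grand_total = 53
males = 18
male_beret = 12
female_top_hat = 15

# Each missing entry follows from a row or column that has one gap left.
females = grand_total - males              # 53 - 18
male_top_hat = males - male_beret          # 18 - 12
female_beret = females - female_top_hat    # 35 - 15
top_hat_total = female_top_hat + male_top_hat
beret_total = female_beret + male_beret

# Sanity check: the hat totals must also add up to the grand total.
assert top_hat_total + beret_total == grand_total

print(females, male_top_hat, female_beret, top_hat_total, beret_total)
# 35 6 20 21 32
```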
In a two-way frequency table, a joint relative frequency is the ratio of a joint frequency to the grand total. Similarly, a marginal relative frequency is the ratio of a marginal frequency to the grand total. Consider an example two-way table.
Here, the grand total is 100. The joint and marginal frequencies can now be divided by 100 to obtain the joint and marginal relative frequencies.
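The division can be sketched as follows. This example assumes that the table in question is the driver's license survey used elsewhere in this section, whose grand total is also 100; the data layout and names are illustrative.

```python
# Assumed example data: the driver's license survey, stored as joint[car][license].
joint = {
    "Yes": {"Yes": 43, "No": 4},
    "No": {"Yes": 24, "No": 29},
}

grand_total = sum(sum(row.values()) for row in joint.values())

# Joint relative frequency: joint frequency divided by the grand total.
joint_relative = {
    car: {answer: count / grand_total for answer, count in row.items()}
    for car, row in joint.items()
}

# Marginal relative frequency: marginal frequency divided by the grand total.
car_relative = {car: sum(row.values()) / grand_total for car, row in joint.items()}
license_relative = {
    answer: sum(row[answer] for row in joint.values()) / grand_total
    for answer in ("Yes", "No")
}

print(joint_relative["Yes"]["Yes"])  # 0.43
print(car_relative)                  # {'Yes': 0.47, 'No': 0.53}
print(license_relative)              # {'Yes': 0.67, 'No': 0.33}
```

Under this assumption, the value 0.43 means that 43% of all participants both own a car and have a driver's license.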
A conditional relative frequency is the ratio of a joint frequency to either of its corresponding two marginal frequencies. Alternatively, it can be calculated using joint and marginal relative frequencies. As an example, the following data will be used.
Using the column totals, the left column of joint frequencies should be divided by 67, and the right column by 33. Since the column totals are used, the sum of the conditional relative frequencies of each column is 1.
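A short sketch of this calculation, assuming the data are the driver's license survey counts (whose column totals are 67 and 33). The names used are illustrative.

```python
# Assumed data: the driver's license survey, stored as joint[car][license].
joint = {
    "Yes": {"Yes": 43, "No": 4},
    "No": {"Yes": 24, "No": 29},
}

# Column totals (marginal frequencies for driver's license).
column_totals = {
    answer: sum(row[answer] for row in joint.values())
    for answer in ("Yes", "No")
}

# Divide every joint frequency by the total of its own column.
conditional = {
    car: {answer: row[answer] / column_totals[answer] for answer in column_totals}
    for car, row in joint.items()
}

for car, row in conditional.items():
    print(car, {answer: round(value, 2) for answer, value in row.items()})

# Each column of conditional relative frequencies adds up to 1.
for answer in column_totals:
    column_sum = sum(conditional[car][answer] for car in joint)
    assert abs(column_sum - 1.0) < 1e-9
```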
The resulting two-way frequency table can be interpreted to obtain the following information.
Studying the conditional relative frequencies of a two-way frequency table, it is possible to find potential associations in the data. As an example, the following survey results will be used.
Car | Driver's license: Yes | Driver's license: No | Total |
---|---|---|---|
Yes | 43 | 4 | 47 |
No | 24 | 29 | 53 |
Total | 67 | 33 | 100 |
Using the column totals, the conditional relative frequencies can be calculated.
Car | Driver's license: Yes | Driver's license: No |
---|---|---|
Yes | $\frac{43}{67} \approx 0.64$ | $\frac{4}{33} \approx 0.12$ |
No | $\frac{24}{67} \approx 0.36$ | $\frac{29}{33} \approx 0.88$ |
It can be seen that among people with a driver's license, having a car is common, and among those without a license, owning a car is uncommon. Thus, it can be reasoned that there is an association between having a driver's license and owning a car. Finding the conditional relative frequencies using the row totals instead gives a slightly different result.
Car | Driver's license: Yes | Driver's license: No |
---|---|---|
Yes | $\frac{43}{47} \approx 0.91$ | $\frac{4}{47} \approx 0.09$ |
No | $\frac{24}{53} \approx 0.45$ | $\frac{29}{53} \approx 0.55$ |
Here, it is shown that among car owners, almost everyone has a driver's license, but among those without a car, roughly half have a driver's license. This pattern is less pronounced, but it still shows that car ownership tends to go together with having a driver's license, which supports the association. In some cases, it is obvious that answers in one category might be the result of the other category, such as in the following example.
Age | Bed time: Before 9.30 a.m. | Bed time: After 9.30 a.m. |
---|---|---|
10-12 | 17 | 4 |
13-15 | 23 | 27 |
16-18 | 2 | 17 |
A person's bed time might be dependent on their age, but their age is not dependent on their bed time. Because of this, it is recommended to use the age totals when finding the conditional relative frequencies. This gives the distribution of bed time given a certain age span, which will clearly show any association.
Age | Bed time: Before 9.30 a.m. | Bed time: After 9.30 a.m. |
---|---|---|
10-12 | $\frac{17}{21} \approx 0.81$ | $\frac{4}{21} \approx 0.19$ |
13-15 | $\frac{23}{50} = 0.46$ | $\frac{27}{50} = 0.54$ |
16-18 | $\frac{2}{19} \approx 0.11$ | $\frac{17}{19} \approx 0.89$ |
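Conditioning on the row totals can be written as a small reusable function. The sketch below is one possible way to do it; the function name `conditional_by_row` and the keys `"before"` and `"after"` are assumptions made for the example.

```python
def conditional_by_row(table):
    """Divide each joint frequency by the total of its own row."""
    result = {}
    for row_label, row in table.items():
        row_total = sum(row.values())
        result[row_label] = {
            column: count / row_total for column, count in row.items()
        }
    return result


# Joint frequencies from the bed time table, one row per age span.
bed_time = {
    "10-12": {"before": 17, "after": 4},
    "13-15": {"before": 23, "after": 27},
    "16-18": {"before": 2, "after": 17},
}

for age, distribution in conditional_by_row(bed_time).items():
    print(age, {key: round(value, 2) for key, value in distribution.items()})
# 10-12 {'before': 0.81, 'after': 0.19}
# 13-15 {'before': 0.46, 'after': 0.54}
# 16-18 {'before': 0.11, 'after': 0.89}
```

Because each row is divided by its own total, every row of the result adds up to 1, so the age spans can be compared even though they contain different numbers of people.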
Eugenia is passionate about two things in particular: hot air balloons and forks. Recently, she ran an online survey in which people answered whether they have ever flown in a hot air balloon and how many forks they have, and she urged all her friends to share the link to it. She has now finally posted the results:
"Thank you, all 1105 participants. More than I predicted, 75 of you, have flown in a hot air balloon. Out of these 75, 44 have between eleven and twenty forks, and 22 have between six and ten forks. In total, 312 people have between six and ten forks, and 583 people have never flown in a hot air balloon and have between eleven and twenty forks."
Help her visualize the data by drawing a two-way frequency table including all joint and marginal frequencies. Then, draw a two-way table with joint relative and marginal relative frequencies. Finally, find and use the conditional relative frequencies to determine if there are any apparent associations in the data.
To begin, we'll establish the different categories for this data set. Based on Eugenia's questions, we can sort the data into two categories: "hot air balloon" and "forks." Next, we'll draw a two-way frequency table that organizes Eugenia's results.
Forks | Hot air balloon: Yes | Hot air balloon: No | Total |
---|---|---|---|
0-5 | | | |
6-10 | 22 | | 312 |
11-20 | 44 | 583 | |
Total | 75 | | 1105 |
Notice that the "Yes" column, the "11-20" row, the "Total" row, and the "6-10" row each have only one cell missing. Thus, we can complete each by reasoning.
Forks | Hot air balloon: Yes | Hot air balloon: No | Total |
---|---|---|---|
0-5 | 9 | | |
6-10 | 22 | 290 | 312 |
11-20 | 44 | 583 | 627 |
Total | 75 | 1030 | 1105 |
The remaining two cells can be found by reasoning in the same way. First, we'll find the number of people who have not ridden in a hot air balloon and own 0-5 forks, then we'll find the remaining total.
Forks | Hot air balloon: Yes | Hot air balloon: No | Total |
---|---|---|---|
0-5 | 9 | 157 | 166 |
6-10 | 22 | 290 | 312 |
11-20 | 44 | 583 | 627 |
Total | 75 | 1030 | 1105 |
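The reasoning used to fill in the table is plain addition and subtraction, which can be checked with a few lines of Python. The variable names below are illustrative; the starting values are the six numbers from Eugenia's post.

```python
# Values given in Eugenia's post.
grand_total = 1105
balloon_yes = 75
yes_6_10 = 22
yes_11_20 = 44
forks_6_10_total = 312
no_11_20 = 583

# First pass: rows and columns that are missing exactly one cell.
yes_0_5 = balloon_yes - yes_6_10 - yes_11_20       # "Yes" column
no_6_10 = forks_6_10_total - yes_6_10              # "6-10" row
forks_11_20_total = yes_11_20 + no_11_20           # "11-20" row
balloon_no = grand_total - balloon_yes             # "Total" row

# Second pass: the two cells that remain.
no_0_5 = balloon_no - no_6_10 - no_11_20           # "No" column
forks_0_5_total = grand_total - forks_6_10_total - forks_11_20_total

# Cross-check: the "0-5" row must add up to its own total.
assert yes_0_5 + no_0_5 == forks_0_5_total

print(yes_0_5, no_6_10, forks_11_20_total, balloon_no, no_0_5, forks_0_5_total)
# 9 290 627 1030 157 166
```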
Now that we have a complete two-way table, we can see the joint and marginal frequencies for Eugenia's data. To find the joint relative and marginal relative frequencies, we'll divide each frequency by the total number of participants, 1105.
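Forks | Hot air balloon: Yes | Hot air balloon: No | Total |
---|---|---|---|
0-5 | $\frac{9}{1105} \approx 0.01$ | $\frac{157}{1105} \approx 0.14$ | $\frac{166}{1105} \approx 0.15$ |
6-10 | $\frac{22}{1105} \approx 0.02$ | $\frac{290}{1105} \approx 0.26$ | $\frac{312}{1105} \approx 0.28$ |
11-20 | $\frac{44}{1105} \approx 0.04$ | $\frac{583}{1105} \approx 0.53$ | $\frac{627}{1105} \approx 0.57$ |
Total | $\frac{75}{1105} \approx 0.07$ | $\frac{1030}{1105} \approx 0.93$ | 1 |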
From the relative frequencies above, we can notice trends in Eugenia's data. For instance, only $7\%$ of participants have ridden in a hot air balloon, and $57\%$ own between 11 and 20 forks. Lastly, we can calculate the conditional relative frequencies using either the row or the column totals. Here, we'll arbitrarily use the column totals.
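Forks | Hot air balloon: Yes | Hot air balloon: No |
---|---|---|
0-5 | $\frac{9}{75} = 0.12$ | $\frac{157}{1030} \approx 0.15$ |
6-10 | $\frac{22}{75} \approx 0.29$ | $\frac{290}{1030} \approx 0.28$ |
11-20 | $\frac{44}{75} \approx 0.59$ | $\frac{583}{1030} \approx 0.57$ |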
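For both groups of people, those who have and have not ridden in a hot air balloon, few have between 0 and 5 forks, while more than half have between 11 and 20 forks. Because the two distributions are nearly identical, the data show no apparent association between having ridden in a hot air balloon and the number of forks a person owns.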
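One informal way to compare the two columns is to look at the largest gap between their conditional relative frequencies. The sketch below does this in Python; the gap is only a rough, illustrative comparison and not a formal statistical test, and the names used are assumptions made for the example.

```python
# Joint frequencies from the completed table, stored as joint[forks][balloon].
joint = {
    "0-5": {"Yes": 9, "No": 157},
    "6-10": {"Yes": 22, "No": 290},
    "11-20": {"Yes": 44, "No": 583},
}

# Column totals: how many people have and have not ridden in a balloon.
column_totals = {
    answer: sum(row[answer] for row in joint.values())
    for answer in ("Yes", "No")
}

# Conditional relative frequencies using the column totals.
conditional = {
    forks: {answer: row[answer] / column_totals[answer] for answer in column_totals}
    for forks, row in joint.items()
}

# Largest difference between the "Yes" and "No" distributions.
largest_gap = max(abs(row["Yes"] - row["No"]) for row in conditional.values())

print(round(largest_gap, 3))  # 0.032
```

A gap of about three percentage points is small compared with the driver's license survey, where the corresponding column distributions differ by more than fifty percentage points.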