Categorical Data

 
 

Statistical Tests for Categorical Data

1.   Contingency Tables

Contingency tables summarize the observed frequencies to describe the relationship between two categorical variables.

 

Let’s assume we are at a painting exhibition. The exhibition presents paintings from three periods: Early Renaissance, Late Renaissance, and Baroque. Each painting shows either Fruits, Flowers, or a mix of both. You would like to count how many paintings from each period show Fruits, how many show Flowers, and how many show a mix of both. This is what you get:

                     Fruits   Flowers   Mix of both
Early Renaissance      11         5          1
Late Renaissance        8         6          8
Baroque                 3        10         12

Values inside the table are called joint frequencies because they relate to both categorical variables: to a certain period (Early Renaissance, Late Renaissance, or Baroque) and to a certain object drawn (Fruits, Flowers, or a mix of both). Marginal frequencies are the sums of each row and each column of the table, shown in the Total row and column below. You can also see how many paintings were observed in total by looking at the bottom right corner of the table (64):

                     Fruits   Flowers   Mix of both   Total
Early Renaissance      11         5          1          17
Late Renaissance        8         6          8          22
Baroque                 3        10         12          25
Total                  22        21         21          64

In statistics, a contingency table is a type of table in a matrix format that displays the frequency distribution of the variables in terms of joint and marginal frequencies. Contingency tables are heavily used in survey research, business intelligence, engineering, and scientific research.
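The R sketch below is a minimal illustration (R is the language of the script linked in section 6; the object name paintings is our own choice) of how such a table can be built, with base R's addmargins() producing the marginal frequencies:

    # Joint frequencies: rows are periods, columns are objects drawn
    paintings <- matrix(
      c(11,  5,  1,
         8,  6,  8,
         3, 10, 12),
      nrow = 3, byrow = TRUE,
      dimnames = list(
        Period = c("Early Renaissance", "Late Renaissance", "Baroque"),
        Object = c("Fruits", "Flowers", "Mix of both")
      )
    )
    # addmargins() appends the row and column sums (the marginal
    # frequencies) plus the overall total, 64
    addmargins(paintings)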

 

2.   Test of Independence

A test of independence in contingency tables determines whether there is a relationship between two categorical variables. The variables are considered independent if there is no relationship between them.

 

Let’s imagine a random sample of 237 pupils who were asked whether they had ever gotten into trouble at school. The result is the table below:

          Trouble   No Trouble
Boys         46         71
Girls        37         83

There are two categorical variables: gender (Boys/Girls) and trouble status (Trouble/No Trouble). Many believe that at school, boys tend to get into trouble more than girls. Therefore, we would like to understand whether gender actually affects trouble status in school, so we test for independence. In other words, given the data collected above, we ask whether there is a relationship between the gender of a pupil and their trouble status. There are different methods to test for independence, such as Pearson's Chi-squared test or Fisher's exact test, which are described further below.
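For the calculations in the next section it helps to have this table in R as well. A minimal sketch (the object name pupils is our own choice; the later snippets build on it):

    # Observed joint frequencies: rows are gender, columns are trouble status
    pupils <- matrix(
      c(46, 71,
        37, 83),
      nrow = 2, byrow = TRUE,
      dimnames = list(
        Gender = c("Boys", "Girls"),
        Status = c("Trouble", "No Trouble")
      )
    )
    addmargins(pupils)  # marginal frequencies: 117, 120, 83, 154, total 237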

 

3.   Pearson's Chi-squared Test

Pearson's Chi-square Test (Chi-square test hereafter) is a method used to test whether there is a relationship between two categorical variables. In other words, the Chi-square test is a test for independence. Therefore, our hypothesis statements are going to be the following:

H₀: the two categorical variables are independent
H₁: the two categorical variables are dependent

Let’s get back to our example in section 2, where we had a random sample of 237 pupils who were asked whether they had ever gotten into trouble at school. The result is the table below:

          Trouble   No Trouble
Boys         46         71
Girls        37         83

Now, in section 1 we defined joint and marginal frequencies. We see the joint frequencies in the table above. Let’s add the marginal frequencies to this table as well (the Total row and column). This is our result:

          Trouble   No Trouble   Total
Boys         46         71        117
Girls        37         83        120
Total        83        154        237

Because the number of pupils varies by gender, it is hard to compare boys and girls using raw counts. Therefore, let’s standardize the joint frequencies by dividing the count in each cell by the overall total. In addition, let’s standardize the marginal frequencies by dividing each marginal frequency by the overall total (located in the bottom right corner: 237).

          Trouble    No Trouble    Total
Boys       46/237      71/237     117/237
Girls      37/237      83/237     120/237
Total      83/237     154/237     237/237

          Trouble   No Trouble   Total
Boys        0.19       0.30       0.49
Girls       0.16       0.35       0.51
Total       0.35       0.65       1

This way, joint frequencies become joint probabilities, or observed probabilities: each of them takes into account both categorical variables. For instance, 0.19 is the probability of a trouble-making boy, P(Boy, Trouble). Marginal frequencies become marginal probabilities: each of them takes into account only one of the categorical variables. For instance, 0.49 is the probability of being a boy, P(Boy), and 0.35 is the probability of being in trouble, P(Trouble).

To summarize the marginal probabilities from this table: P(Boy) = 0.49, P(Girl) = 0.51, P(Trouble) = 0.35, and P(No Trouble) = 0.65.
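In R, these observed joint and marginal probabilities can be obtained in one step from the pupils matrix defined earlier; a small sketch:

    probs <- prop.table(pupils)  # joint probabilities: each count divided by the overall total
    addmargins(probs)            # margins give P(Boy), P(Girl), P(Trouble), P(No Trouble)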

In the beginning, we outlined our hypothesis statements as follows:

H₀: the two categorical variables are independent
H₁: the two categorical variables are dependent

Probability theory states that if two events are independent, the following equation is satisfied:

P(X, Y) = P(X) * P(Y)

where X and Y are some events.

The Chi-square test is based on this property. Therefore, if H₀ is true, meaning that X and Y are independent, the following equation will be satisfied:

P(X, Y) = P(X) * P(Y)

On the left side of this equation we have the joint probability, and on the right side we have the two marginal probabilities.

 

First, the Chi-square test uses this property to calculate expected joint probabilities from the marginal probabilities. For instance, the expected probability for trouble-making boys is P(Boy) * P(Trouble) = 0.35 * 0.49 ≈ 0.17, and the expected probability for trouble-making girls is P(Girl) * P(Trouble) = 0.35 * 0.51 ≈ 0.18, etc. All the expected probabilities are shown in parentheses below:

          Trouble        No Trouble     Total
Boys       0.19 (0.17)    0.30 (0.32)    0.49
Girls      0.16 (0.18)    0.35 (0.33)    0.51
Total      0.35           0.65           1

In other words, when we calculate the expected probabilities, we calculate the probabilities we should expect if H₀ is true, that is, if the two variables are independent. This means that, if gender and trouble status are independent variables, our expected probability for well-behaved boys is 0.32.
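Continuing the R sketch, the expected joint probabilities under H₀ are simply an outer product of the two sets of marginal probabilities:

    p.gender <- rowSums(probs)               # P(Boy), P(Girl)
    p.status <- colSums(probs)               # P(Trouble), P(No Trouble)
    expected.p <- outer(p.gender, p.status)  # expected.p[i, j] = P(row i) * P(column j)
    round(expected.p, 2)                     # matches the values in parentheses above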

 

Secondly, we should measure the differences between the observed joint probabilities in our table and the expected probabilities we have just calculated. If the difference between the observed probabilities and the probabilities we would expect to see if the two categorical variables were independent is large, then our variables are most likely not independent. Similarly, if this difference is small, our two variables are most likely independent. The difference between these two sets of probabilities is captured by the χ² value, which is calculated using the formula (the sum runs over all cells of the table):

χ² = Σ (observed − expected)² / expected = N * Σ (observed.p − expected.p)² / expected.p

where

  • N = total count

  • observed = observed count for each cell

  • expected = expected count for each cell (to be discussed later)

  • observed.p = observed joint probability for each cell

  • expected.p = expected joint probability for each cell

So, we just plug in the values from the table above:

χ² = 237 * [ (0.19 − 0.17)²/0.17 + (0.30 − 0.32)²/0.32 + (0.16 − 0.18)²/0.18 + (0.35 − 0.33)²/0.33 ] ≈ 1.67
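The same plug-in computation in R, continuing the sketch (here the unrounded probabilities are used, so the result is about 1.88 rather than the hand-rounded 1.67):

    N <- sum(pupils)  # overall total, 237
    chi2 <- N * sum((probs - expected.p)^2 / expected.p)
    chi2              # approximately 1.88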

The last step is to compare the χ² value with the critical value from the χ² distribution table (denoted χ²_critical) to conclude whether you should reject H₀. The following procedure applies:

- If χ² ≥ χ²_critical: reject H₀, meaning that you have enough statistical evidence to conclude that the two variables are dependent.

- If χ² < χ²_critical: fail to reject H₀, meaning that you do not have enough statistical evidence to conclude that the two variables are dependent.

Let's look at the χ² distribution table and pick the right χ²_critical. To get the proper χ²_critical from the table, we have to know two things: the significance level and the degrees of freedom. In the table, the significance level (α) is on the top, and the degrees of freedom (df) are on the left side.

  1. The significance level is something that you choose yourself. Let’s use a significance level of 5%, so α = 0.05.

  2. Degrees of freedom in the Chi-squared test for independence are calculated using the formula df = (r − 1) * (c − 1), where r and c are the numbers of rows and columns in the contingency table. In our contingency table we have 2 rows and 2 columns, so df = (2 − 1) * (2 − 1) = 1.

With a significance level of 0.05 and one degree of freedom, the χ² distribution table gives χ²_critical = 3.8415. Now we have both the χ² and χ²_critical values and can compare the two to make a decision.

If χ² is small, this implies the observed counts (or probabilities) are close to the expected counts (or probabilities) under the assumption that the two variables are independent. So, any estimated χ² value below χ²_critical = 3.8415 means that we cannot conclude that the two variables are related.

As our χ² ≈ 1.67 < χ²_critical = 3.8415, we fail to reject H₀ and conclude that gender and trouble status are independent of each other.
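Both the critical value and the whole test are available in base R. In chisq.test(), correct = FALSE switches off Yates' continuity correction (applied by default to 2 x 2 tables) so that the statistic matches the hand calculation:

    qchisq(0.95, df = 1)                 # critical value 3.8415 for alpha = 0.05 and df = 1
    chisq.test(pupils, correct = FALSE)  # X-squared of about 1.88, p-value of about 0.17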

4.   Conclusion

So, here are the steps to perform a Chi-square test:

 

  1. Add marginal frequencies to a contingency table

  2. Translate joint and marginal frequencies into probabilities

  3. Estimate the expected probability for each cell

  4. Calculate the χ² value

  5. Compare χ² with χ²_critical and make a decision

5.   Remarks

  • The Pearson's Chi-squared test can be used to test for three scenarios: independence, goodness-of-fit, and homogeneity. The procedure for these three tests is similar. 

  • There are a few types of Chi-squared tests, with the Pearson's Chi-squared test being one of them.

  • Pearson's Chi-square test is an approximate test. It relies on the theoretical assumption that the sample size is large, and it approximates the distribution of the test statistic with the χ²-distribution. If we have small cell counts, we can use Fisher's exact test instead.

  • Instead of working with the expected probabilities, we can also work with the expected counts. The expected count is the expected joint frequency if the two variables are independent. To calculate the expected count, simply multiply the corresponding marginal probabilities together with the overall count. For example, the expected count of trouble-making boys will be

N * P(Boys) * P(Trouble) = (237)•(0.49)•(0.35) ≈ 40.65.

Following this, we can create a table similar to the one before:

                 Trouble        No Trouble     Marginal prob.
Boys             46 (40.65)     71 (75.48)     0.49
Girls            37 (42.30)     83 (78.57)     0.51
Marginal prob.   0.35           0.65           N = 237

We can then calculate the χ² test statistic using the first equation:

χ² = (46 − 40.65)²/40.65 + (71 − 75.48)²/75.48 + (37 − 42.30)²/42.30 + (83 − 78.57)²/78.57 ≈ 1.88

Theoretically, the test statistic calculated from this approach should match the one from the previous approach (here, 1.88 versus 1.67); the difference is simply due to rounding.
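The count-based version in R, again as a sketch (chisq.test() also returns these expected counts in its $expected component):

    expected.counts <- outer(rowSums(pupils), colSums(pupils)) / sum(pupils)
    expected.counts  # about 40.97, 76.03, 42.03, 77.97 without intermediate rounding
    sum((pupils - expected.counts)^2 / expected.counts)  # about 1.88, the same statistic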

6.   R Script

You can find Pearson's Chi-squared test implemented in R by following this link. Note that, even though the file is a Python notebook, it is actually an R script.