Categorical Data
Statistical Tests for Categorical Data
1. Contingency Tables
Contingency tables summarize the observed frequencies to describe the relationship between two categorical variables.
Let’s assume we are at a painting exhibition. This exhibition presents paintings from three periods: Early Renaissance, Late Renaissance, and Baroque. Each painting depicts either Fruits, or Flowers, or a mix of both. You would like to count how many paintings from each of these periods show Fruits, how many show Flowers, and how many show a mix of both. This is what you get:
|                   | Fruits | Flowers | Mix of both |
|-------------------|--------|---------|-------------|
| Early Renaissance | 11     | 5       | 1           |
| Late Renaissance  | 8      | 6       | 8           |
| Baroque           | 3      | 10      | 12          |
The values inside the table are called joint frequencies because they relate to both categorical variables: to a certain period (Early Renaissance, Late Renaissance, or Baroque) and to a certain object drawn (Fruits, Flowers, or a mix of both). Marginal frequencies are the sums of each row and each column of the table, shown below as an extra Total row and Total column. You can also see how many paintings you observed in total by looking at the bottom-right corner of the table (64):
|                   | Fruits | Flowers | Mix of both | Total |
|-------------------|--------|---------|-------------|-------|
| Early Renaissance | 11     | 5       | 1           | 17    |
| Late Renaissance  | 8      | 6       | 8           | 22    |
| Baroque           | 3      | 10      | 12          | 25    |
| Total             | 22     | 21      | 21          | 64    |
In statistics, a contingency table is a type of table in matrix format that displays the frequency distribution of the variables in terms of joint and marginal frequencies. Contingency tables are heavily used in survey research, business intelligence, engineering, and scientific research.
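As an illustration, the contingency table above can be reproduced programmatically. The sketch below assumes pandas is available; the variable names are our own, not from the text:

```python
import pandas as pd

# Joint frequencies from the exhibition example
# (rows: period, columns: object drawn).
table = pd.DataFrame(
    [[11, 5, 1],
     [8, 6, 8],
     [3, 10, 12]],
    index=["Early Renaissance", "Late Renaissance", "Baroque"],
    columns=["Fruits", "Flowers", "Mix of both"],
)

# Marginal frequencies: row sums, column sums, and the overall total
# in the bottom-right corner.
with_margins = table.copy()
with_margins["Total"] = with_margins.sum(axis=1)
with_margins.loc["Total"] = with_margins.sum(axis=0)
print(with_margins)
```

The bottom-right cell of the printed frame is the overall total of 64 observed paintings.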
2. Test of Independence
A test of independence in contingency tables determines whether there is any relationship between two categorical variables. Variables are considered independent of each other if there is no relationship between them.
Let’s imagine a random sample of 237 pupils who were asked if they ever got into trouble at school. The result is the table below:
|       | Trouble | No Trouble |
|-------|---------|------------|
| Boys  | 46      | 71         |
| Girls | 37      | 83         |
There are two categorical variables: Gender (Boys/Girls) and Trouble Status (Trouble/No Trouble). Many believe that, at school, boys tend to get into trouble more than girls. Therefore, we would like to understand whether gender actually affects trouble status at school, so we test for independence. In other words, given the data collected above, we ask ourselves whether there is a relationship between the gender of an individual and their trouble status. There are different methods to test for independence, such as Pearson's Chi-squared test or Fisher's exact test, which are described below.
3. Pearson's Chi-squared Test
Pearson's Chi-square test (Chi-square test hereafter) is a method used to test whether there is any relationship between two categorical variables. In other words, the Chi-square test is a test for independence. Therefore, our hypothesis statements are the following:

H0: the two categorical variables are independent (there is no relationship between them).
H1: the two categorical variables are dependent (there is a relationship between them).
Let’s get back to our example in section 2, where we had a random sample of 237 pupils who were asked if they ever got into trouble at school. The result is the table below:
|       | Trouble | No Trouble |
|-------|---------|------------|
| Boys  | 46      | 71         |
| Girls | 37      | 83         |
Now, in section 1 we defined joint and marginal frequencies. We see our joint frequencies in the table above. Let’s add the marginal frequencies to this table as well (shown as the Total row and column). This is our result:
|       | Trouble | No Trouble | Total |
|-------|---------|------------|-------|
| Boys  | 46      | 71         | 117   |
| Girls | 37      | 83         | 120   |
| Total | 83      | 154        | 237   |
Because the number of pupils varies per gender, it is hard to compare boys and girls directly. Therefore, let’s standardize the joint frequencies by dividing the count in each cell by the overall total. In addition, let’s standardize the marginal frequencies by dividing each marginal frequency by the overall total (located in the bottom-right corner: 237).
|       | Trouble | No Trouble | Total   |
|-------|---------|------------|---------|
| Boys  | 46/237  | 71/237     | 117/237 |
| Girls | 37/237  | 83/237     | 120/237 |
| Total | 83/237  | 154/237    | 237/237 |
|       | Trouble | No Trouble | Total |
|-------|---------|------------|-------|
| Boys  | 0.19    | 0.30       | 0.49  |
| Girls | 0.16    | 0.35       | 0.51  |
| Total | 0.35    | 0.65       | 1     |
This way, joint frequencies become joint probabilities, or observed probabilities: each takes into account both categorical variables. For instance, 0.19 is the probability of a trouble-making boy, P(Boy, Trouble). Marginal frequencies become marginal probabilities, each taking into account only one of the categorical variables. For instance, 0.49 is the probability of being a boy, P(Boy), and 0.35 is the probability of being in trouble, P(Trouble).
|       | Trouble           | No Trouble           |                |
|-------|-------------------|----------------------|----------------|
| Boys  | 0.19              | 0.30                 | P(Boy) = 0.49  |
| Girls | 0.16              | 0.35                 | P(Girl) = 0.51 |
|       | P(Trouble) = 0.35 | P(No Trouble) = 0.65 | 1              |
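The conversion from counts to joint and marginal probabilities can be sketched in a few lines. This assumes NumPy is available; the array layout and variable names are our own:

```python
import numpy as np

# Observed counts; rows: Boys, Girls; columns: Trouble, No Trouble.
counts = np.array([[46, 71],
                   [37, 83]])
n = counts.sum()              # overall total: 237

joint_p = counts / n          # observed joint probabilities
row_p = joint_p.sum(axis=1)   # marginal probabilities: P(Boy), P(Girl)
col_p = joint_p.sum(axis=0)   # marginal probabilities: P(Trouble), P(No Trouble)

# Rounded to two decimals, these reproduce the table above:
# joint -> Boys: 0.19, 0.30; Girls: 0.16, 0.35
# rows  -> 0.49, 0.51; columns -> 0.35, 0.65
print(joint_p.round(2), row_p.round(2), col_p.round(2))
```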
In the beginning, we outlined our hypothesis statements as follows:

H0: the two categorical variables are independent.
H1: the two categorical variables are dependent.
Probability theory states that if two events are independent, the following equation is satisfied:

P(X, Y) = P(X) * P(Y)

where X and Y are some events. The Chi-square test is based on this fact. Therefore, if H0 is true, meaning that X and Y are independent, this equation will be satisfied. On the left side of the equation we see a joint probability, and on the right side we see the product of two marginal probabilities.
First, the Chi-square test uses this fact to calculate expected (joint) probabilities from the marginal probabilities. For instance, the expected probability for trouble-making boys is P(Boy) * P(Trouble) = 0.49 * 0.35 ≈ 0.17, and the expected probability for trouble-making girls is P(Girl) * P(Trouble) = 0.51 * 0.35 ≈ 0.18, etc. All the expected probabilities are shown in brackets below:
|       | Trouble     | No Trouble  | Total |
|-------|-------------|-------------|-------|
| Boys  | 0.19 (0.17) | 0.30 (0.32) | 0.49  |
| Girls | 0.16 (0.18) | 0.35 (0.33) | 0.51  |
| Total | 0.35        | 0.65        | 1     |
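Under the independence assumption, the expected joint probabilities are just the outer product of the marginal probabilities. A minimal sketch, assuming NumPy and using the rounded marginals from the table above:

```python
import numpy as np

row_p = np.array([0.49, 0.51])   # P(Boy), P(Girl)
col_p = np.array([0.35, 0.65])   # P(Trouble), P(No Trouble)

# Under independence, each expected joint probability is the product
# of the corresponding marginal probabilities: P(X, Y) = P(X) * P(Y).
expected_p = np.outer(row_p, col_p)
print(expected_p.round(2))       # the bracketed values in the table
```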
In other words, when we calculate the expected probabilities, we calculate the probabilities we should expect to see if H0 is true, i.e. if the two variables are independent. That means that, if gender and trouble status are independent, our expected probability for well-behaved boys is 0.32.
Second, we measure the difference between the actual probabilities (the observed joint probabilities in our table) and the expected probabilities we have just calculated. If the difference between the observed probabilities and the probabilities we would expect under independence is large, the variables are most likely not independent. Similarly, if the difference is small, the two variables are most likely independent. This difference is summarized by the χ² value, which is calculated using the formula:

χ² = Σ (observed − expected)² / expected = N * Σ (observed.p − expected.p)² / expected.p
where
- N = total count
- observed = observed count for each cell
- expected = expected count for each cell (to be discussed later)
- observed.p = observed joint probability for each cell
- expected.p = expected joint probability for each cell
So, we just plug in the values from the table above:

χ² = 237 * [ (0.19 − 0.17)²/0.17 + (0.30 − 0.32)²/0.32 + (0.16 − 0.18)²/0.18 + (0.35 − 0.33)²/0.33 ] ≈ 1.67
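The same arithmetic can be checked in code, using the rounded probabilities from the table above (NumPy assumed):

```python
import numpy as np

n = 237
# Rounded observed and expected joint probabilities from the table above.
observed_p = np.array([[0.19, 0.30],
                       [0.16, 0.35]])
expected_p = np.array([[0.17, 0.32],
                       [0.18, 0.33]])

# chi-squared statistic: N * sum of (observed.p - expected.p)^2 / expected.p
chi2 = n * ((observed_p - expected_p) ** 2 / expected_p).sum()
print(round(chi2, 2))   # 1.67
```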
The last step is to compare the calculated χ² value with the critical value from the χ² distribution table (denoted as χ²c) to conclude whether you should reject H0. The following procedure applies:
- If χ² ≤ χ²c, do not reject H0: you do not have enough statistical evidence to conclude that the two variables are dependent.
- If χ² > χ²c, reject H0: you have enough statistical evidence to conclude that the two variables are dependent.
Let's look at the χ² distribution table and pick the right χ²c.
To get a proper χ²c from the table, we have to know two things: the significance level and the degrees of freedom. In the table, the significance level (α) is on top, and the degrees of freedom (df) are on the left side.
- The significance level is something you choose yourself. Let’s use a significance level of 5%, so α = 0.05.
- The degrees of freedom in the Chi-square test for independence are calculated using the formula df = (number of rows − 1) * (number of columns − 1). In our contingency table we have 2 rows and 2 columns, so df = (2 − 1) * (2 − 1) = 1.
With a significance level of 0.05 and one degree of freedom, we have χ²c = 3.8415 (first row, second column).
Now we have both the χ² and χ²c values and can compare the two to make a decision. If χ² is small, the observed counts (or probabilities) are close to the expected counts (or probabilities) under the assumption that the two variables are independent. So any estimated χ² value below our χ²c = 3.8415 means that the two variables are independent, i.e. have no relationship between them.
As our χ² ≈ 1.67 < χ²c = 3.8415, we fail to reject H0 and conclude that gender and trouble status are not associated with each other.
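Rather than reading χ²c from a printed table, it can be looked up with SciPy's chi-squared quantile function. A sketch, assuming SciPy is installed:

```python
# Looking up the critical value instead of using a printed table.
from scipy.stats import chi2

alpha = 0.05   # chosen significance level
df = 1         # (2 - 1) * (2 - 1) for a 2x2 table

critical = chi2.ppf(1 - alpha, df)
print(round(critical, 4))   # 3.8415
```

Equivalently, one can compute a p-value with `chi2.sf(statistic, df)` and compare it with α directly.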
4. Conclusion
So, here are the steps to perform a Chi-square test:
1. Add marginal frequencies to the contingency table.
2. Translate joint and marginal frequencies into probabilities.
3. Estimate the expected probability for each cell.
4. Calculate χ².
5. Compare χ² with χ²c and make a decision.
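The steps above can be cross-checked against SciPy's built-in implementation, `scipy.stats.chi2_contingency`, which computes the statistic, p-value, degrees of freedom, and expected counts directly from the raw table. Note that `correction=False` is needed here to obtain the plain Pearson statistic for a 2×2 table, since SciPy applies the Yates continuity correction by default when df = 1:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Observed counts; rows: Boys, Girls; columns: Trouble, No Trouble.
observed = np.array([[46, 71],
                     [37, 83]])

# correction=False disables the Yates continuity correction, giving
# the plain Pearson statistic described in this text.
stat, p_value, df, expected = chi2_contingency(observed, correction=False)

print(round(stat, 2))      # ≈ 1.87 (statistic from exact expected counts)
print(df)                  # 1
print(expected.round(2))   # exact expected counts per cell
```

Because `chi2_contingency` works with exact (unrounded) expected counts, its statistic of about 1.87 sits between the two rounded hand calculations in this text.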
5. Remarks
- The Pearson's Chi-squared test can be used to test for three scenarios: independence, goodness-of-fit, and homogeneity. The procedure for these three tests is similar.
- There are a few types of Chi-squared tests, with Pearson's Chi-squared test being one of them.
- Pearson's Chi-square test is an approximate test. It relies on the theoretical assumption that the sample size is large, and approximates the test statistic using the χ² distribution. If we have small cell counts, we can use Fisher's exact test instead.
- Instead of working with expected probabilities, we can also work with expected counts. The expected count is the expected joint frequency if the two variables are independent. To calculate an expected count, simply multiply the corresponding marginal probabilities together with the overall count. For example, the expected count of trouble-making boys is N * P(Boys) * P(Trouble) = (237)·(0.49)·(0.35) ≈ 40.65.
Following this, we can create a similar table as before:
|       | Trouble    | No Trouble | Total |
|-------|------------|------------|-------|
| Boys  | 46 (40.65) | 71 (75.48) | 0.49  |
| Girls | 37 (42.65) | 83 (78.57) | 0.51  |
| Total | 0.35       | 0.65       | 237   |
We can then calculate the test statistic using the first form of the equation:

χ² = Σ (observed − expected)² / expected ≈ 1.97

Theoretically, the test statistic calculated from this approach should match the previous approach. The small difference is simply due to rounding.
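The expected-count version of the calculation can be sketched as follows, using the rounded expected counts from the table above (NumPy assumed):

```python
import numpy as np

# Observed counts; rows: Boys, Girls; columns: Trouble, No Trouble.
observed = np.array([[46, 71],
                     [37, 83]])
# Rounded expected counts from the table above.
expected = np.array([[40.65, 75.48],
                     [42.65, 78.57]])

# chi-squared statistic: sum of (observed - expected)^2 / expected
chi2 = ((observed - expected) ** 2 / expected).sum()
print(round(chi2, 2))   # 1.97
```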
6. R Script
You can find Pearson's Chi-squared test implemented in R by following this link. Note that, even though the file is a Python notebook, it actually contains an R script.