# 1. Introduction: Sample vs Population

Imagine the company “Moogle” with 335,000 employees. You would like to know the average age (mean) of the employees in that company. Asking all 335,000 employees would take a lot of time. Instead, you randomly ask 40 of them, which should still get you approximately the same result, provided your sample is a good representation of the population.

A population mean (μ) is one of the parameters of a population and is therefore called a population parameter. In our example, it is the average age of all 335,000 employees. A sample mean (x̄) is called a sample statistic. In our example, it is the average age of the 40 employees we asked.

In this particular example, we try to “learn” the population mean (that comes from 335,000 employees) based on an estimate from our sample mean (that comes from 40 observations). In other words, we try to make inferences from our sample to a population. Hence the name Inferential Statistics - the study of approximating the parameters of a population through the statistics of a sample.
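As a quick illustration of this idea, the following sketch builds a hypothetical population of 335,000 ages and shows that a random sample of 40 already approximates the population mean. The age range of 20-65 is an assumption made up for this demo, not a figure from the example.

```python
import random
import statistics

random.seed(42)

# Hypothetical population: 335,000 employee ages (the 20-65 range is an assumption)
population = [random.randint(20, 65) for _ in range(335_000)]

# Population parameter: the mean age of everyone
population_mean = statistics.mean(population)

# Sample statistic: the mean age of 40 randomly chosen employees
sample = random.sample(population, 40)
sample_mean = statistics.mean(sample)

print(population_mean, sample_mean)  # the two should be close
```

Run it a few times with different seeds: the sample mean moves around, but it stays in the neighborhood of the population mean, which is exactly what lets us make inferences from 40 people instead of 335,000.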

# 2. Hypothesis Testing

What is hypothesis testing, and why do we need it in Inferential Statistics? Let’s motivate this with a basic example.

One day, while waiting for your bus to work, you notice a sign at the bus stop saying the average wait time for the arrival of a bus is 12 minutes. Being an experienced bus rider, you believe this claim is nonsense (you know, waiting for a bus to come is like waiting for the end of a work day), and you decide to dispute it. In particular, you want to show that the waiting time is longer than 12 minutes. Therefore, for the next 50 days, you (on purpose) miss a bus and record the time it takes for the next bus to arrive, and you get an average of 13.5 minutes. Now, is there enough evidence to claim that the advertised waiting time should be longer?

Oftentimes we encounter situations like these in real life, where we are given a claim about a certain phenomenon (for instance, a claim about a population mean, which is 12 minutes in our example) and we want to test it - that is, to test whether the claim is right or wrong. In order to show we are correct (or that the claim is wrong), we need to gather data and perform some statistical analysis. We call this procedure a hypothesis test.

When we do hypothesis testing, we have to set up two hypotheses about the population mean (μ): a null hypothesis and an alternative hypothesis:

• The Null Hypothesis (H₀) states that the population mean equals the status quo.

In the bus example above, the status quo is a 12-minute average bus waiting time, which is assumed to be true. Thus:

H₀: μ = 12

• The Alternative Hypothesis (H₁) states what we believe in, which varies depending on what we want to test. It takes one of three forms:

H₁: the population mean is larger than the status quo.

H₁: μ > 12

H₁: the population mean is smaller than the status quo.

H₁: μ < 12

H₁: the population mean does not equal the status quo.

H₁: μ ≠ 12

(In each case, the status quo in our bus example is the 12-minute average bus waiting time.)

Let’s summarize these three types of hypothesis statements in the table below. Remember: you have to choose only one of the three alternative hypotheses.

| Test type | Null hypothesis | Alternative hypothesis |
| --- | --- | --- |
| Upper-tailed | H₀: μ = μ₀ | H₁: μ > μ₀ |
| Lower-tailed | H₀: μ = μ₀ | H₁: μ < μ₀ |
| Two-tailed | H₀: μ = μ₀ | H₁: μ ≠ μ₀ |

where μ₀ is the value of our population mean under the status quo.

If we go back to our example, the sign shows that the waiting time is only 12 minutes, and we believe that the waiting time is larger than what the sign shows (or larger than our status quo). Thus, we have our hypothesis statements to be the following:

H₀: μ = 12
H₁: μ > 12

where H₀ represents our status quo, and H₁ represents our alternative version of what we believe to be true.

After we set up the hypothesis statements, we want to test whether we have enough statistical evidence to conclude that the waiting time for a bus is indeed higher than 12 minutes. So now it is time to choose the correct test for this example.

# 3. Choosing The Correct Test

In statistics, tests for numerical data are divided into three categories: tests for one sample, tests for two samples, and tests for more than two samples. In our example, we have only one sample - a sample that consists of 50 waiting times that you have measured. Thus, we need one-sample tests (look at our Roadmap to see the overall picture).

One-sample tests are generally divided into two types: tests for means and tests for proportions. When testing for means, we have two tests: the z-test and the t-test. Let’s understand which test is used when.

The z-test assumes a standard normal distribution (also called a Gaussian distribution) and is used when we know the population standard deviation. The t-test does not assume a standard normal distribution, though its distribution is quite similar to it, and is used when we do not know the population standard deviation. The proportion test, on the other hand, relies on a different set of assumptions (see the Roadmap). Here we will focus only on the z-test and t-test.

Both tests, however, rely on a normality assumption. If we use the z-test, we know the population standard deviation and already assume that our sample comes from a normal distribution. But when we do not know the population standard deviation (that is, when we use a t-test), we have to make sure the normality assumption is satisfied. This can be justified by having more than 30 observations in the sample, due to the Central Limit Theorem.

Below you will find a comparison of the two tests:

| | z-test | t-test |
| --- | --- | --- |
| Assumption | Normality; the population standard deviation is known. We cannot use the z-test if we cannot assume that the distribution we are sampling from is normal. | Normality; the population standard deviation is unknown. We cannot use the t-test if we cannot assume that the distribution we are sampling from is normal. |
| Why it works | Since we know the population standard deviation, this is a direct application of the Central Limit Theorem and the properties of the Normal distribution (given the sample size is large, say more than 30). If it is not possible to take more than 30 observations, we can check whether the observations follow a normal distribution by means of other tests, e.g. the Kolmogorov-Smirnov test. | Since we do not know the population standard deviation, we have to assume normality. This can be done by taking more than 30 observations (due to the Central Limit Theorem), or, if that is not possible, by checking whether the observations follow a normal distribution by means of other tests, e.g. the Kolmogorov-Smirnov test. |

If we fail to assume normality, we cannot use a t-test and have to switch to its non-parametric alternatives.

In our bus example, we do not know the standard deviation of the population our mean of 12 minutes comes from, so we will use a t-test. Remember: for the t-test, we also need to assume normality, or have at least 30 observations (we have 50).

# 4. T-test

In order to perform a t-test we need to:

1. Compute the t-statistic, t.

2. Find the critical value t_critical in a table, based on an α (significance) level.

3. Compare t with t_critical, and make a decision.

You do not need to fully understand these steps yet. Let’s go through and explain each of them.

4.1 Compute t-statistic

The first step of the t-test is to compute the t-statistic (t) using the formula:

t = (x̄ - μ₀) / (s / √n)

where

- x̄ is the sample mean,
- μ₀ is the hypothesized mean,
- n is the sample size (number of observations),
- s is the estimated sample standard deviation: since σ is unknown, it is approximated by the sample standard deviation s using the formula:

s = √( Σ(xᵢ - x̄)² / (n - 1) )

Let’s calculate the t-statistic for our bus example. Also, assume for now that we have calculated s to be 5 (we will estimate s in another example).

x̄ = 13.5
μ₀ = 12
n = 50 (observations)
s = 5

Now we can calculate t:

t = (13.5 - 12) / (5 / √50) = 1.5 / 0.707 ≈ 2.12
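The calculation above can be reproduced in a few lines of Python using only the standard library (the value s = 5 is the assumed standard deviation from the text, not an estimate from real data):

```python
import math

x_bar = 13.5   # sample mean (average of the 50 recorded waiting times)
mu_0 = 12.0    # hypothesized mean (the status quo from the sign)
n = 50         # sample size
s = 5.0        # sample standard deviation (assumed, as in the text)

# t = (x̄ - μ₀) / (s / √n)
t_stat = (x_bar - mu_0) / (s / math.sqrt(n))
print(round(t_stat, 2))  # ≈ 2.12
```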

4.2 Find the critical value t_critical

Now we need to find the right t_critical in the table. Technically, to find t_critical, we do not need step 4.1. Let’s look at the table to understand what we mean:

You can see that to find our t_critical, we need to know only two parameters: the so-called degrees of freedom, v, and our significance level (α). Let's understand what each of them means.

The values in the table are t_critical such that P(T > t_critical) = α, where T follows a t-distribution with the appropriate degrees of freedom.

• v (degrees of freedom)

v = n - 1, where n is the sample size

In our example, sample size = 50, so our degrees of freedom v = 50 - 1 = 49. Since in the table above we have to choose either 40 or 60, we always go for the lower number, so the row we need has v = 40.

However, most statistical software that computes the t-critical value can give us the exact critical value for v = 49.
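If SciPy is available, for instance, the exact critical value can be obtained from the t-distribution's quantile function (`scipy.stats.t.ppf`, the inverse CDF); a small sketch:

```python
from scipy import stats

alpha = 0.05

# Exact upper-tailed critical value for v = 49: P(T > t_critical) = alpha
t_critical_exact = stats.t.ppf(1 - alpha, df=49)

# The v = 40 value for comparison; it matches the table's 1.684
t_critical_table = stats.t.ppf(1 - alpha, df=40)

print(round(t_critical_exact, 3), round(t_critical_table, 3))
```

Note that the exact v = 49 value is slightly smaller than the v = 40 table value, which is why taking the lower table row is the conservative choice.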

• α (significance level)

The significance level (alpha) is something we determine ourselves. It is the probability of an error that we allow ourselves to make. More specifically, it is the probability we are willing to accept of rejecting a null hypothesis that we should not have rejected. For example, if we take alpha to be 1%, it means we allow only a 1% chance of making such a wrong decision when rejecting the null hypothesis. In more formal terms, we call this the probability of a Type I error.

Let’s provide an example: you would like to estimate the mean IQ score of a certain school. You randomly pick 20 pupils in the school canteen during lunch and give them an IQ test. It could happen that, for some reason, only intelligent pupils were eating at the canteen, or that purely by chance you selected only intelligent ones, even though the selection was random. Suppose you then calculate the sample IQ average (which, in this scenario, is quite high), and you now want to claim that the average IQ score of the school is high. You want to be able to make this claim with a certain confidence, i.e. minimizing the chance of making an error (such as accidentally picking only intelligent pupils for your sample). Therefore, we compare the chance of observing this phenomenon (which is extremely small, but still exists) with a threshold of a certain percent, say 1%, and this threshold is the significance level (more will be discussed later).

Normally, the default significance level is 5%. The smaller the significance level, the more critical the test is - the more statistical evidence is needed to support the alternative hypothesis - but the right level also depends on what we test (see Category Tests).

Let’s take the default value of 5% for our test. That means we are willing to accept a probability of 5% or less that our sample mean of 13.5 minutes occurred solely by chance.


According to the table, with degrees of freedom v = 40 and a significance level α = 0.05, our t_critical = 1.684.

4.3 Comparing t & t_critical

Now that we have both t and t_critical, let’s understand how to compare these two values to decide whether we have enough statistical evidence to reject our status quo of a 12-minute waiting time (the null hypothesis) and accept that it is actually higher than 12 minutes (the alternative hypothesis).

[Figure: the t-distribution, with t = 2.12 and t_critical = 1.684 marked on the x-axis, which runs from -4 to 4.]

This is our t-distribution, where the x-axis represents all the possible values of t. Let’s plot our t and t_critical on this x-axis. Now, let’s look at the area under the t-distribution’s entire bell curve. This area represents probability, and because probability cannot exceed 100%, the total area under the curve equals 1. Let’s look at the area beyond our t, shaded in red, and the area beyond our t_critical, shaded in blue. We already know that these two areas are certain probabilities, but what exactly are those probabilities?

The area corresponding to t_critical (shaded in blue) is the significance level that we agreed on and set in 4.2 to be 5%. You already know that this is the probability of an error that we allow ourselves in our sample results. Now let’s look at the area corresponding to t (shaded in red). This area is called the p-value. The p-value is something we do not set up beforehand, unlike the critical value, but calculate from our t-statistic. It is the actual probability of observing something as extreme as our sample results, assuming the null hypothesis is true. This means that if our p-value is small, the null hypothesis is likely to be false, since we observed a data set that would occur with only a small chance under the null distribution. More formally, the p-value should not be bigger than our significance level if we want to conclude that the alternative hypothesis is true and the null hypothesis (status quo) is false.

In case our p-value is bigger than our significance level - that is, if our estimated probability of an error is bigger than the allowed probability of an error that we set - we do not have enough statistical evidence to conclude that our alternative hypothesis is true (and reject our null hypothesis). However, this does not mean the null hypothesis must be true; we simply fail to reject the null hypothesis at this point. Just because we don’t see enough statistical evidence with this sample, it doesn’t mean we won’t with another sample. Moreover, it might be the case that the data points in a completely different direction from the current alternative hypothesis (see part 6), and we are simply testing in the wrong direction.

| Condition | Decision |
| --- | --- |
| p-value ≤ significance level | Reject H₀; accept H₁ |
| p-value > significance level | Fail to reject H₀ |

In our example, the p-value is lower than our significance level, so we can reject our status quo of 12 minutes (H₀) and accept our alternative hypothesis that the real waiting time is actually higher (H₁). Since our p-value (marked in red) and significance level (marked in blue) correspond to our t and t_critical respectively, we equivalently need t ≥ t_critical to reject H₀ and accept H₁.

| Condition | Decision |
| --- | --- |
| t ≥ t_critical | Reject H₀; accept H₁ |
| t < t_critical | Fail to reject H₀ |
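Putting steps 4.1-4.3 together, here is a sketch of the whole decision for the bus example (s = 5 is the assumed standard deviation from the text; SciPy supplies the t-distribution):

```python
import math
from scipy import stats

x_bar, mu_0, n, s = 13.5, 12.0, 50, 5.0   # bus example (s = 5 is assumed)
alpha = 0.05                               # significance level

t_stat = (x_bar - mu_0) / (s / math.sqrt(n))    # ≈ 2.12
t_critical = stats.t.ppf(1 - alpha, df=n - 1)   # upper-tailed critical value
p_value = stats.t.sf(t_stat, df=n - 1)          # P(T > t_stat), the red area

if p_value <= alpha:                            # equivalently: t_stat >= t_critical
    print("Reject H0; accept H1")
else:
    print("Fail to reject H0")
```

Note that the code uses the exact df = 49 critical value rather than the table's v = 40 row, so its t_critical is slightly smaller than 1.684; the decision is the same either way.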

# 5. T-distribution

The t-distribution is a distribution with only one parameter - the degrees of freedom (v) - which completely determines the shape of the distribution. Remember that the degrees of freedom in our one-sample t-test is n - 1. The larger v is, the closer the t-distribution gets to the standard normal distribution.
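One way to see the role of v numerically: as the degrees of freedom grow, the t-distribution's quantiles shrink toward those of the standard normal. A small SciPy sketch:

```python
from scipy import stats

# 97.5th-percentile quantile of the t-distribution for increasing degrees of freedom
for v in (1, 5, 30, 100):
    print(v, round(stats.t.ppf(0.975, df=v), 3))

# The standard normal quantile the values approach
print("normal", round(stats.norm.ppf(0.975), 3))  # 1.96
```

With v = 1 the quantile is far out in the tail (the distribution has heavy tails), while by v = 100 it is already close to the normal's 1.96.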

6. All Types of Hypothesis Testing

If we go back to our example, the sign showed that the waiting time is only 12 minutes, but we anticipated the waiting time to be longer. Thus, we believed that the actual waiting time is larger than what the sign shows (or larger than our status quo). We call this a one-tailed test, as we are interested in testing whether the mean is larger (also known as an upper-tailed test), with the following hypothesis:

H₁: μ > 12

Now imagine that the sign still shows a waiting time of 12 minutes, but we believe the waiting time to be 10.5 minutes (which is shorter). We also call this a one-tailed test, as we are interested in testing whether the mean is smaller (also known as a lower-tailed test), with the following hypothesis:

H₁: μ < 12

The steps to understand if we have enough statistical evidence to conclude that our alternative hypothesis is true would be exactly the same (steps 4.1 - 4.3), with one small adjustment on the critical value: the critical value should be negative.

Let’s now imagine that the sign still shows a waiting time of 12 minutes. Even though we anticipated the waiting time to be longer (just because it feels that way), suppose we are actually not sure whether the waiting time is larger or smaller than 12 minutes. What we believe is that it is definitely not 12 minutes (so either less or more). So, we would like to test whether the real waiting time differs from the 12 minutes the sign shows. We call this a two-tailed test, as we are interested in testing both whether it is larger and whether it is smaller, with the following hypothesis:

H₁: μ ≠ 12

The steps to understand if we have enough statistical evidence to conclude that our alternative hypothesis is true would be almost the same (steps 4.1 - 4.3) with one small adjustment on the critical value: the significance level needs to be divided by 2 when calculating the critical value for a two-tailed test.

More formally, when searching for t_critical in the table, we have to divide our agreed significance level by 2. For instance, if we agree on a significance level of 5%, we will search for the right t_critical based on DF = n - 1 and a significance level of α/2. Thus, we will find the right t_critical with DF = 49 (40 in the table) and α = 0.05/2 = 0.025, which is 2.021 in the table. In contrast, if we make a one-tailed test (whether upper-tailed or lower-tailed) with the same significance level of 5%, we will find the right t_critical with DF = 49 (40 in the table) and α = 0.05 to be 1.684, as we did in our example above.
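The α/2 adjustment is easy to check numerically; a SciPy sketch, with df = 40 to match the table row used above:

```python
from scipy import stats

alpha = 0.05
df = 40  # table row used in the text (the actual v is 49)

one_tailed = stats.t.ppf(1 - alpha, df=df)      # critical value at alpha
two_tailed = stats.t.ppf(1 - alpha / 2, df=df)  # critical value at alpha / 2

print(round(one_tailed, 3), round(two_tailed, 3))
```

The two values reproduce the table's 1.684 and 2.021: splitting α across both tails pushes the critical value further out.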

6.1 Hypothesis Summary

Now let's have an overview of the decision rules for the t-test.

| Test type | Reject H₀ if |
| --- | --- |
| One-tailed, upper-tailed | t ≥ t_critical |
| One-tailed, lower-tailed | t ≤ t_critical (t_critical is negative) |
| Two-tailed | \|t\| ≥ \|t_critical\| |

# 7. One-sample t-test in R

View/download a template of one-sample t-test located in a git repository here.
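For readers without R, the same one-sample, upper-tailed t-test can be run in Python with SciPy's `ttest_1samp`. The waiting times below are made-up illustrative data, not the 50 recordings from the example:

```python
from scipy import stats

# Hypothetical raw waiting times in minutes (illustrative data only)
waits = [13.1, 11.8, 14.2, 12.9, 15.0, 13.7, 12.4, 14.8, 13.3, 12.6]

# H0: mu = 12  vs  H1: mu > 12 (upper-tailed)
result = stats.ttest_1samp(waits, popmean=12, alternative="greater")
print(result.statistic, result.pvalue)
```

Here `ttest_1samp` computes both the t-statistic and the one-sided p-value in a single call, so steps 4.1-4.3 collapse into comparing `result.pvalue` with the chosen significance level.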