Imbalanced Data

Over-sampling & Under-sampling

1.   Introduction

Recently, oversampling the minority class observations has become a common approach to improve the quality of predictive modeling. By oversampling, models are sometimes better able to learn patterns that differentiate classes.

Apart from using different evaluation criteria, one can also work on getting a different dataset. There are two approaches to make a balanced dataset out of an imbalanced one:

 

  1. Under-sampling: resample the data set by decreasing the number of majority class observations, keeping the minority class untouched.

  2. Over-sampling: resample the data set by increasing the number of minority class observations, keeping the majority class untouched.

2.   Under-sampling

Under-sampling keeps all samples in the rare class and randomly selects an equal number of samples from the abundant class, so that a balanced new dataset can be retrieved for further modeling.

[Figure: Under-sampling the majority class — the original dataset is reduced to a balanced final dataset.]

One of the most common and simplest strategies to handle imbalanced data is to undersample the majority class. This is done by simply selecting n samples at random from the majority class, where n is the number of samples in the minority class, and using them during the training phase, after excluding the sample to be used for validation.
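
A minimal sketch of this idea, assuming a pandas DataFrame with a binary label column (the column name and helper function below are illustrative, not from the text):

```python
import pandas as pd


def random_undersample(df: pd.DataFrame, label_col: str = "label", seed: int = 42) -> pd.DataFrame:
    """Keep every minority-class row plus an equal-sized random sample of majority-class rows."""
    counts = df[label_col].value_counts()
    minority_label, majority_label = counts.idxmin(), counts.idxmax()

    minority_rows = df[df[label_col] == minority_label]
    majority_sample = (df[df[label_col] == majority_label]
                       .sample(n=len(minority_rows), random_state=seed))
    # Shuffle so the two classes are interleaved in the balanced result.
    return pd.concat([minority_rows, majority_sample]).sample(frac=1, random_state=seed)
```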

3.   Over-sampling

Over-sampling keeps all samples in the abundant class and increases the number of rare-class samples (for example by duplicating them), so that a balanced new dataset can be retrieved for further modeling.

[Figure: Over-sampling the minority class — the original dataset is expanded into a balanced final dataset.]

3.1.   Why do we need over-sampling?

The main motivation behind the need to preprocess imbalanced data before we feed them into a classifier is that typically classifiers (such as logistic regression or random forest) are more sensitive to detecting the majority class and less sensitive to the minority class. Thus, if we don't take care of the issue, the classification output will be biased, in many cases resulting in always predicting the majority class.
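
To see this bias concretely, one can compare overall accuracy with minority-class recall on a heavily skewed synthetic dataset; the sketch below is purely illustrative (the dataset, model and split are assumptions, not taken from the text):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, recall_score
from sklearn.model_selection import train_test_split

# Illustrative skewed dataset: roughly 99% majority class, 1% minority class.
X, y = make_classification(n_samples=10000, weights=[0.99, 0.01], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
pred = model.predict(X_te)

# Overall accuracy is dominated by the majority class; minority-class recall
# is the number to watch, and it tends to be much weaker on data this skewed.
print("accuracy:        ", accuracy_score(y_te, pred))
print("minority recall: ", recall_score(y_te, pred))
```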

 

Oversampling techniques try to balance the dataset by artificially increasing the number of rare samples. Rather than getting rid of abundant samples, new rare samples are generated using, e.g., repetition, bootstrapping or SMOTE (see the methods in the Machine Learning section). For instance, suppose 10 out of 1,000 (= 0.01) transactions are found to be fraudulent. We might want to increase those 10 fraudulent transactions to, e.g., 100 (= 0.1) so that a Random Forest does not effectively discard the rare class.

 

3.2.   Example

Let's say you have a total of 1,000 bank transactions, and 10 of them (i.e. 1%) are fraudulent.

In the case of under-sampling, we decide to keep only 90 normal transactions, which makes fraudulent transactions equal to 10% of the total sample size [10/(90+10)]. This way of under-sampling helps us balance the data set. However, it is possible that the discarded observations carried valuable information, so such an approach can also introduce bias.

In the case of random over-sampling, we randomly add more minority observations by copying some (or all) of those observations, or by replicating them multiple times. For example, we may increase the fraudulent transactions from 10 to 110, so that they now make up 10% of the total observations [110/(990+110)]. Thus we reach the same class proportion, but with no information loss as in the case of under-sampling. However, this approach is very prone to overfitting, as we have simply "copied" some observations. So the question is: how do we deal with it?
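
A minimal sketch of this random duplication, assuming a pandas DataFrame with a hypothetical is_fraud column and using the 10-to-110 target from the example above (the names are illustrative):

```python
import pandas as pd


def random_oversample(df: pd.DataFrame, label_col: str = "is_fraud",
                      target_minority: int = 110, seed: int = 42) -> pd.DataFrame:
    """Randomly duplicate minority rows until the minority class reaches `target_minority`."""
    minority_rows = df[df[label_col] == 1]
    # Sample with replacement, so the extra rows are exact copies of existing ones.
    extra = minority_rows.sample(n=target_minority - len(minority_rows),
                                 replace=True, random_state=seed)
    return pd.concat([df, extra]).sample(frac=1, random_state=seed)
```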

 

3.3.   Side-effects of oversampling

Because we simply "copied" some observations, the model is very prone to overfitting, as it is now partly trained on exact duplicates.

 

-> Oversampling the minority class can result in overfitting problems if we oversample before cross-validating.

 

The easiest way to oversample is to re-sample the minority class, i.e. to duplicate the entries or manufacture data that is exactly the same as what we already have. Now, if we do so before cross-validating, i.e. before we enter the leave-one-participant-out cross-validation loop, we will be training the classifier on N-1 entries, leaving 1 out, but including among the N-1 one or more instances that are exactly the same as the one being validated, thus defeating the purpose of cross-validation altogether. Let's have a look at this issue graphically:

 


[Figure: over-sampling before cross-validation — the original dataset is over-sampled first and only then split into training and validation sets over n iterations, so copies of a minority-class sample can end up in both sets.]

From left to right, we start with the original dataset, where we have a minority class with two samples. We duplicate those samples, and then we do cross-validation. At this point there will be iterations, such as the one shown, where the training and validation set contain the same sample, resulting in overfitting and misleading results. Here is how this should be done:

[Figure: the correct approach — at each of the n cross-validation iterations the validation set is split off from the original dataset first, and only the minority class of the remaining training set is over-sampled.]

First, we start cross-validating. This means that at each iteration we first exclude the sample to use as the validation set, and only then over-sample the remainder of the minority class (in orange). In this toy example I had only two minority samples, so I created three instances of each. The difference from before is that now we are clearly not using the same data for training and validation, so we will obtain more representative results. The same holds even if we use other cross-validation methods, such as k-fold cross-validation.
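
A minimal sketch of this order of operations with scikit-learn, assuming NumPy arrays X and y where the minority class is labelled 1 (the classifier, metric and fold count are illustrative choices, not prescribed by the text):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import StratifiedKFold


def cv_with_oversampling(X, y, n_splits=5, seed=42):
    """Cross-validate, over-sampling the minority class (label 1) inside each fold only."""
    rng = np.random.default_rng(seed)
    scores = []
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for train_idx, val_idx in skf.split(X, y):
        X_tr, y_tr = X[train_idx], y[train_idx]
        X_val, y_val = X[val_idx], y[val_idx]

        # Duplicate minority rows of the training fold only; the validation fold is untouched.
        minority = np.flatnonzero(y_tr == 1)
        majority = np.flatnonzero(y_tr == 0)
        extra = rng.choice(minority, size=len(majority) - len(minority), replace=True)
        keep = np.concatenate([np.arange(len(y_tr)), extra])

        model = LogisticRegression(max_iter=1000).fit(X_tr[keep], y_tr[keep])
        scores.append(f1_score(y_val, model.predict(X_val)))
    return float(np.mean(scores))
```

The key point is simply that the duplication happens after the split has set the validation indices aside.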

This was a simple example, and better methods can be used to oversample. One of the most common is the SMOTE technique, i.e. a method that, instead of simply duplicating entries, creates entries that are interpolations of the minority class, as well as under-samples the majority class. Normally, when we duplicate data points, the classifier becomes very confident about each specific point, treating a small boundary around it as the only region where the minority class is valid, instead of generalizing from it. SMOTE, however, effectively forces the decision region of the minority class to become more general, partially solving the generalization problem.
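
If the imbalanced-learn package is available, its pipeline applies the sampler only while fitting, i.e. only to the training portion of each cross-validation fold; a minimal sketch under that assumption (the synthetic dataset and classifier below are illustrative):

```python
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Illustrative skewed dataset: roughly 99% majority, 1% minority.
X, y = make_classification(n_samples=5000, weights=[0.99, 0.01], random_state=42)

pipeline = Pipeline([
    ("smote", SMOTE(random_state=42)),           # interpolates new minority samples
    ("clf", LogisticRegression(max_iter=1000)),  # any classifier could be used here
])

# The sampler step runs only on the training portion of each fold,
# so no synthetic point is ever interpolated from validation data.
scores = cross_val_score(pipeline, X, y, cv=5, scoring="f1")
print("mean F1 across folds:", scores.mean())
```

Using the pipeline rather than calling SMOTE on the full dataset up front is what keeps the cross-validation estimate honest.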