Handling Imbalanced Data

SMOTE

1.   Introduction

SMOTE stands for Synthetic Minority Oversampling Technique. This is a statistical technique for increasing the number of cases in your dataset in a balanced way. The module works by generating new instances from existing minority cases that you supply as input. SMOTE does not change the number of majority cases.

 

The new instances are not just copies of existing minority cases; instead, the algorithm takes samples of the feature space for each target class and its nearest neighbors and generates new examples that combine features of the target case with features of its neighbors. This approach increases the features available to each class and makes the samples more general.

2.   High-level Overview

SMOTE synthesizes new minority instances between existing (real) minority instances. Let’s look at the graph below. Red data points represent the minority class. Imagine that SMOTE draws lines between these existing minority instances like this:

SMOTE then randomly generates new, synthetic minority instances somewhere on these lines.

After synthesizing new minority instances, the imbalance shrinks from 4 red versus 13 green to 12 red versus 13 green data points.

4.   SMOTE Hyperparameters in Python:

Smote has two hyperparameters to specify in Python:

  1. K-neighbours (K): determine what vectors to use to generate new data

  2. Ratio (ratio): determines how many new data points to generate on those vectors to achieve a certain ratio.

SMOTE thinks from the perspective of existing minority instances and synthesizes new instances at some distance from them towards one of their neighbors. Which neighbors are considered for each existing minority instance?

 

At K = 1 only the closest neighbor of the same class is considered. Let’s take the bottom right, red data point. By drawing a concentric circle around the dot on the plot, we can easily see that the top left flower is its closest neighbor (Figure below). Thus, at K = 1 SMOTE synthesizes a new minority instance on the line between these two dots when the bottom right flower is considered.

At K = 2 both the closest and the second closest neighbors are considered. For each new synthesis (new data point), a new one is randomly chosen between them, meaning that it’s randomly put on the vector.

In general, one might say that SMOTE loops through the existing, real minority instance. At each loop iteration, one of the K closest minority class neighbors is chosen and a new minority instance is synthesized somewhere between the minority instance and that neighbor.

 

The ratio parameter is the ratio to use for resampling the data set. It answers the question of how many times SMOTE()should loop through the existing, real minority instances.

Warning: Increasing the number of cases using SMOTE is not guaranteed to produce more accurate models. You should try experimenting with different percentages, different feature sets, and different numbers of nearest neighbors to see how adding cases influences your model.

3.   How to configure SMOTE:

How to configure SMOTE:

  1. Import SMOTE module from imblearn.over_sampling.

  2. Apply SMOTE to your training set X_train, Y_train.

  3. Ensure that the column containing the label, or target class, is marked as such.

  4. The SMOTE module automatically identifies the minority class in the label column and then gets all examples for the minority class.

  5. Specify the SMOTE ratio.

  6. Use the Number of nearest neighbors option to determine the size of the feature space that the SMOTE algorithm uses when in building new cases. The nearest neighbor is a row of data (a case) that is very similar to some target case. The distance between any two cases is measured by combining the weighted vectors of all features.

    • By increasing the number of nearest neighbors, you get features from more cases.

    • By keeping the number of nearest neighbors low, you use features that are more like those in the original sample.

Run the experiment.
The output of the module is a dataset containing the original rows plus some number of added rows with minority cases.