Classification and Class Imbalance

Classifying data whose class distribution is far from uniform is a relatively common situation in some industries. More concretely, imbalanced classes refer to a classification problem where the classes are not equally represented. In some cases, class imbalance is not only common, it is expected. In the detection of fraudulent transactions, for example, the vast majority of transactions fall into the “Non-Fraud” class and only a very small minority into the “Fraud” class. This class imbalance clearly makes learning harder for the classification algorithm: it has few examples of the minority class to learn from, so it is biased towards the population of negatives and produces potentially less robust predictions than it would in the absence of imbalance.

Case of class imbalance, where the red dots represent the minority class.

Performance Metrics

Decision threshold metrics: Accuracy vs others

How should you measure the performance of an algorithm in these situations? The first thing to be wary of is accuracy. In the case of class imbalance, accuracy can indeed be misleading: with a two-class dataset where the first class represents 90% of the data, a classifier that predicts the first class for every example reaches 90% accuracy, yet it is useless in practice. Other metrics are more relevant in the case of class imbalance (a short scikit-learn sketch follows the list below):

  • Precision, to minimize the error rate among the examples the model predicts as positive
  • Recall, to detect as many positives as possible
  • The F1-score, to find a compromise between precision and recall, when false positives are as costly as false negatives
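As an illustration, here is a minimal sketch of how these metrics behave with scikit-learn on a made-up imbalanced example (the labels and predictions below are invented for illustration):

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Made-up labels with a 90% / 10% imbalance, and a classifier that always predicts the majority class
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1] * 10
y_pred = [0] * 100

print(accuracy_score(y_true, y_pred))                     # 0.9, looks good but is misleading
print(precision_score(y_true, y_pred, zero_division=0))   # 0.0
print(recall_score(y_true, y_pred))                       # 0.0, not a single positive is detected
print(f1_score(y_true, y_pred))                           # 0.0
```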

Other metrics are very effective and informative for the data scientist, but less interpretable than the previous ones. Among them:

  • Cohen’s Kappa, which measures the classifier’s performance by comparing it to that of a random classifier. In the context of imbalanced classes, it can be used by comparing the model to a baseline that classifies every example as belonging to the majority class.
  • The lift curve, used in marketing targeting for example. Lift measures the efficiency of a predictive model as the ratio between the results obtained with and without the model, for a given proportion of contacted targets (chosen by the machine learning algorithm vs. chosen at random).
  • The Brier score, which estimates the calibration of the probability distribution emitted by the algorithm. It is the mean squared error between the probability the algorithm assigns to the positive class and the actual outcome (0 or 1); see the formula and sketch below.
Brier Score
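For N examples, the Brier score is typically written as BS = (1/N) × Σ_i (p_i − o_i)², where p_i is the predicted probability of the positive class and o_i the observed outcome (0 or 1). A minimal sketch with scikit-learn's brier_score_loss (the probabilities are made up):

```python
from sklearn.metrics import brier_score_loss

y_true = [0, 0, 0, 1, 1]            # observed outcomes
y_prob = [0.1, 0.2, 0.3, 0.7, 0.9]  # predicted probabilities of the positive class

# Mean squared error between predicted probabilities and observed outcomes
print(brier_score_loss(y_true, y_prob))  # the lower, the better calibrated
```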

Model-Wide Metrics: ROC Curve vs. Precision-Recall Curve

The ROC curve is one of the most popular model-wide metrics (it evaluates the algorithm across several classification thresholds). In the context of class imbalance, however, the Precision-Recall curve should be preferred. The ROC curve is not sensitive to the imbalance rate: the false positive rate, on the x-axis of the ROC curve, stays stable when the number of negatives is high, and the true positive rate, on the y-axis, does not take the imbalance into account either. The Precision-Recall curve, which integrates the notion of imbalance through precision (y-axis) and recall (x-axis), is therefore more informative in this context. For model evaluation that is not specific to a particular field of use, the area under the Precision-Recall curve is the preferred metric.
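As a sketch, here is how the two areas under the curve can be compared with scikit-learn on a synthetic imbalanced dataset (the imbalance rate and the model are arbitrary choices for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, average_precision_score
from sklearn.model_selection import train_test_split

# Synthetic dataset with a 95% / 5% class imbalance
X, y = make_classification(n_samples=5000, weights=[0.95, 0.05], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

probs = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]

print("ROC AUC:", roc_auc_score(y_te, probs))             # often looks optimistic under imbalance
print("PR AUC :", average_precision_score(y_te, probs))   # area under the Precision-Recall curve
```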

Data Processing: Resampling in the context of class imbalance

To tackle class imbalance, two approaches are possible. It is possible to adapt the learning stage to this situation or to adapt the data processing stage of the data science project. It is this second approach that is studied most thoroughly in this article, through the resampling of the dataset.

Undersampling

Undersampling consists in rebalancing the dataset by decreasing the number of instances of the majority class.

Random Undersampling

Random Undersampling involves randomly drawing samples from the majority class, with or without replacement. However, it can increase the variance of the classifier and can potentially eliminate useful or important samples.
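A minimal sketch with the imbalanced-learn (imblearn) library [7] on a synthetic dataset (the imbalance rate is arbitrary):

```python
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.under_sampling import RandomUnderSampler

X, y = make_classification(n_samples=5000, weights=[0.95, 0.05], random_state=0)
print(Counter(y))  # roughly 95% majority class, 5% minority class

# Randomly drop majority-class instances until both classes have the same size
rus = RandomUnderSampler(random_state=0)
X_res, y_res = rus.fit_resample(X, y)
print(Counter(y_res))
```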

Tomek Links 

Tomek Links removes undesirable overlap between classes: a Tomek link is a pair of nearest neighbors, at minimum distance, belonging to opposite classes, and the majority-class member of each link is removed until all pairs of closest neighbors belong to the same class.

Tomek Links

You may also be interested in undersampling with the Edited Nearest-Neighbours and NearMiss (1, 2 and 3) methods, which can be parameterized to obtain stronger undersampling than with Tomek Links (see the sketch below).
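A minimal imblearn sketch of these undersampling methods (same kind of synthetic dataset as above):

```python
from sklearn.datasets import make_classification
from imblearn.under_sampling import TomekLinks, EditedNearestNeighbours, NearMiss

X, y = make_classification(n_samples=5000, weights=[0.95, 0.05], random_state=0)

# Remove the majority-class member of each Tomek link (a pair of nearest neighbors from opposite classes)
X_tl, y_tl = TomekLinks().fit_resample(X, y)

# Stronger undersampling alternatives mentioned above
X_enn, y_enn = EditedNearestNeighbours().fit_resample(X, y)
X_nm, y_nm = NearMiss(version=1).fit_resample(X, y)
```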

Area under the PR curve vs. imbalance rate, obtained with and without Tomek Links, using a random forest (rf) and a logistic regression (lr) for a varying rate of imbalance.
Dataset generated with make_classification from Scikit-learn.

When only considering the problem of class imbalance, without considering the other characteristics of the dataset, the impact of Tomek Links is small.

Oversampling

Oversampling consists in rebalancing the dataset by artificially increasing the number of instances of the minority class.

Random Oversampling

Random Oversampling involves duplicating randomly picked instances of the minority class.
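A minimal imblearn sketch; here the sampling_strategy parameter is used to only partially rebalance the classes (an arbitrary choice for illustration):

```python
from sklearn.datasets import make_classification
from imblearn.over_sampling import RandomOverSampler

X, y = make_classification(n_samples=5000, weights=[0.95, 0.05], random_state=0)

# Duplicate randomly picked minority instances until the minority class reaches
# half the size of the majority class
ros = RandomOverSampler(sampling_strategy=0.5, random_state=0)
X_res, y_res = ros.fit_resample(X, y)
```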

SMOTE — Synthetic Minority Over-sampling Technique

Rather than replicating minority-class observations, the Synthetic Minority Over-sampling Technique (SMOTE) creates a user-selected number of synthetic observations along the segments joining nearby instances of the minority class.

SMOTE illustration (red dots: synthesized instances; black dots: minority class; squares: majority class).
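A minimal sketch using imblearn's SMOTE inside a pipeline, so that synthetic instances are only generated on the training folds during cross-validation (the parameters are arbitrary illustrations):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline

X, y = make_classification(n_samples=5000, weights=[0.95, 0.05], random_state=0)

# SMOTE interpolates between each minority instance and its k nearest minority neighbors
pipe = Pipeline([
    ("smote", SMOTE(k_neighbors=5, random_state=0)),
    ("rf", RandomForestClassifier(random_state=0)),
])

# Evaluate with the area under the Precision-Recall curve, as recommended above
scores = cross_val_score(pipe, X, y, scoring="average_precision", cv=5)
print(scores.mean())
```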

You may also check the SMOTE variants available notably in the Python smote-variants package [8].

Area under the PR curve vs. imbalance rate, obtained with and without SMOTE, using a random forest (rf) and a logistic regression (lr) for a varying imbalance rate.
Dataset generated with make_classification from Scikit-learn.

When only considering the obstacle represented by the class imbalance, without considering the other characteristics of the dataset, we notice that SMOTE can have a positive, neutral, or negative impact on the performance of both algorithms.

It is then clear that resampling is no magic formula to tackle class imbalance. It is necessary to go further in the analysis of the dataset to understand when it is relevant.

As a last resort: cost-sensitive learning

Cost-sensitive learning shows a possible way of modifying the learning stage in the context of class imbalance.

In theory

When it is possible to obtain a good estimate of the cost of each type of error (false positive and false negative), it can be interesting to use cost-sensitive learning, which consists of associating a different cost with each type of error. It requires the definition of a cost matrix:
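In its standard binary form (a plausible reconstruction; the exact costs depend on the application), the cost matrix reads:

                         Predicted Negative    Predicted Positive
    Actual Negative           C(TN)                 C(FP)
    Actual Positive           C(FN)                 C(TP)

with typically C(TN) = C(TP) = 0 and, in a fraud-like setting, C(FN) much larger than C(FP).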

In that case, it will be wise to use a metric directly linked to this cost matrix. A cost function such as the one below could be used as a metric:
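A cost function of this kind (a plausible reconstruction, consistent with the cost matrix above) is simply the total cost of the model's decisions:

    Cost = C(FP) × FP + C(FN) × FN + C(TP) × TP + C(TN) × TN

where FP, FN, TP and TN are the numbers of false positives, false negatives, true positives and true negatives produced by the model.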

To be even finer, one can define a cost function for each type of error and a specific cost for each example. Alejandro Correa Bahnsen created a package, CostCla [9], for this purpose and called this technique “Example-Dependent Cost-Sensitive Learning”: each example i gets its own calculated cost. It requires a different cost matrix, called a cost function matrix:
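As an illustration only, here is a minimal NumPy sketch of an example-dependent cost in a fraud-like setting, where the cost of a false negative is the transaction amount. The column convention [C_FP, C_FN, C_TP, C_TN] follows the CostCla tutorials [9], but the amounts and the flat false-positive cost are made up:

```python
import numpy as np

amounts = np.array([10.0, 250.0, 40.0, 1200.0])  # made-up transaction amounts
y_true = np.array([0, 1, 0, 1])                   # 1 = fraud
y_pred = np.array([0, 0, 1, 1])                   # predictions from some model

# Example-dependent cost matrix: one row per example, columns [C_FP, C_FN, C_TP, C_TN].
# Missing a fraud (FN) costs the transaction amount; blocking a client (FP) costs a flat fee.
cost_mat = np.column_stack([
    np.full_like(amounts, 5.0),   # C_FP
    amounts,                      # C_FN
    np.zeros_like(amounts),       # C_TP
    np.zeros_like(amounts),       # C_TN
])

# Total cost of the predictions under this example-dependent cost function
cost = np.where(y_true == 1,
                np.where(y_pred == 1, cost_mat[:, 2], cost_mat[:, 1]),
                np.where(y_pred == 1, cost_mat[:, 0], cost_mat[:, 3])).sum()
print(cost)  # here: one false negative (250.0) + one false positive (5.0) = 255.0
```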


Cost-sensitive learning allows us to address the real problem directly. In fraud detection, for example, the aim is to lose as little money as possible, on the one hand by not letting too many frauds through and on the other hand by not being too strict with one’s clients. A cost function can therefore be defined for each type of error with the help of an expert (a credit analyst); by minimizing a loss function that includes these specific costs, one directly minimizes the bank’s losses.

In practice

In practice, cost-sensitive learning can be applied very simply with most machine learning libraries (scikit-learn, LightGBM, XGBoost…). In scikit-learn, most classification models have a “class_weight” parameter, which can be used to apply a cost inversely proportional to the class frequencies. Visually, this helps counterbalance the bias of the model against the minority class, as illustrated below, where the decision boundary of a linear-kernel SVM is shifted in favor of the minority class.

SVM with class weight
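A minimal sketch of this parameter with a linear-kernel SVM, mirroring the illustration (the dataset and weights are arbitrary):

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)

# Default SVM vs. SVM with costs inversely proportional to the class frequencies
svm_plain = SVC(kernel="linear").fit(X, y)
svm_balanced = SVC(kernel="linear", class_weight="balanced").fit(X, y)

# A custom cost per class is also possible, e.g. ten times more weight on the minority class
svm_custom = SVC(kernel="linear", class_weight={0: 1, 1: 10}).fit(X, y)
```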

Towards classification in general

Beyond Class Imbalance

The challenge in classification is to find a decision boundary between classes. This can be done through:

  1. Choosing or designing and then optimizing a classification algorithm
  2. Extensive data processing.

In practice, it is usually the second method that brings the most performance. In the context of class imbalance, it can be seen that the imbalance only increases the difficulty posed by the other characteristics of the dataset. An example is the separability coefficient (class_sep), controlled in dataset generation via make_classification from Scikit-learn, as in the sketch below.
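A minimal sketch of such dataset generation, varying the separability for a fixed imbalance rate (the values are arbitrary):

```python
from sklearn.datasets import make_classification

# Same 95% / 5% imbalance, two different degrees of class separability
X_easy, y_easy = make_classification(n_samples=5000, weights=[0.95, 0.05],
                                     class_sep=2.0, random_state=0)  # well-separated classes
X_hard, y_hard = make_classification(n_samples=5000, weights=[0.95, 0.05],
                                     class_sep=0.5, random_state=0)  # strongly overlapping classes
```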

The previous graphs show that:

  • Classification with a class imbalance is not necessarily difficult (2)
  • Class imbalance increases the difficulty posed by the other characteristics of the dataset (here, the separability coefficient of make_classification).

Indeed, the imbalance rate in (1) makes classification more difficult than in (3).

To go further on separability and the difficulty of classification, refer to articles [10], [11] and [12].

Then what?

Finally, a dataset with a class imbalance must be treated like any other: it is a matter of spending more time modeling the problem in order to obtain, among other things, maximum class separability. This can be done by:

  • Advanced feature engineering to generate discriminant variables (and increase separability)
  • Ensuring good data quality, to avoid creating noise with that feature engineering
  • Dimensionality reduction (discussed in another article)
  • A good definition of the target, to ensure that the right problem is being solved (and therefore that we have the right number of positives)
  • The use of adapted resampling
  • And lastly, cost-sensitive learning with relevant cost modeling.

The conclusion for the data scientist: A class imbalance requires spending more time modeling the problem and understanding the data set. So it will only make the challenge more interesting!
In the next article, we will see the geometrical and topological aspects of hard classification.

Thanks for reading!

References 

Metrics

[1] Lift: https://www3.nd.edu/~busiforc/Lift_chart.html

[2] Cohen’s Kappa: https://en.wikipedia.org/wiki/Cohen%27s_kappa

[3] Popular classification metrics: http://www.davidsbatista.net/blog/2018/08/19/NLP_Metrics/

[4] Scikit-learn guide on metrics: https://scikit-learn.org/stable/modules/model_evaluation.html

[5] Brier score: https://en.wikipedia.org/wiki/Brier_score

[6] Precision-Recall vs ROC curve: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4349800/

Library

[7] Imblearn: https://imbalanced-learn.readthedocs.io/en/stable/index.html

[8] smote-variants: https://pypi.org/project/smote-variants/0.3.1/

[9] Cost Sensitive Classification (CostCla): http://albahnsen.github.io/CostSensitiveClassification/Tutorials.html

On Hard classification

[10] On separability of classes in classification: http://gkeng.me/index.php/2019/07/17/on-separability-of-classes-in-classification/

[11] An instance-level analysis of data complexity: https://link.springer.com/content/pdf/10.1007%2Fs10994-013-5422-z.pdf

[12] How complex is your classification problem? : https://arxiv.org/abs/1808.03591

