On Classification and Class Imbalance


Classifying data whose class distribution departs strongly from the uniform distribution is a relatively common situation in some industries. More concretely, imbalanced classes refer to a classification problem where the classes are not equally represented. There are cases where class imbalance is not only common, it is expected. In fraud detection, for example, the vast majority of transactions belong to the “Non-Fraud” class and only a very small minority to the “Fraud” class. This imbalance clearly increases the difficulty of learning for the classification algorithm: it has few examples of the minority class to learn from. It is therefore biased towards the population of negatives and produces potentially less robust predictions than in the absence of imbalance.

Case of class imbalance, where the red dots represent the minority class.

Performance Metrics

Decision threshold metrics: Accuracy vs others

How should one measure the performance of an algorithm in these situations? The first thing to be wary of is accuracy. In the case of class imbalance, accuracy can be misleading: with a two-class dataset where the first class represents 90% of the data, a classifier that assigns every example to the first class reaches 90% accuracy, yet is useless in practice. Other metrics are more relevant in the case of class imbalance.

  • Precision, to minimize the error rate among the examples the model predicts as positive
  • Recall, to detect as many positives as possible
  • The F1-score, to find a compromise between precision and recall, when false positives are as costly as false negatives
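
These three metrics are available in scikit-learn; here is a minimal sketch on illustrative labels:

```python
from sklearn.metrics import precision_score, recall_score, f1_score

# Illustrative labels: 8 negatives, 2 positives.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 1, 0, 1, 0]

print(precision_score(y_true, y_pred))  # 0.5: one of the two predicted positives is correct
print(recall_score(y_true, y_pred))     # 0.5: one of the two actual positives is detected
print(f1_score(y_true, y_pred))         # 0.5: harmonic mean of precision and recall
```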

Other metrics are informative and efficient for the data scientist but less interpretable than the previous ones. Among them are:

  • Cohen’s Kappa metric, which is generally used to measure a classifier’s performance by comparing it to that of a random classifier. In the context of imbalanced classes, it is used by comparing the model to a baseline that classifies every example as belonging to the majority class.
  • The lift curve, used in marketing targeting for example. Lift measures the efficiency of a predictive model as the ratio between the results obtained with and without the model, for a given proportion of targets contacted (chosen by a machine learning algorithm vs. chosen at random).
  • The Brier score, which estimates the calibration of the probabilities emitted by the algorithm. It is the mean squared error between the probability predicted by the algorithm for the observed class and that class itself:

$$
BS = \frac{1}{N}\sum_{t=1}^{N}(f_t - o_t)^2
$$

where $f_t$ is the predicted probability of the positive class and $o_t \in \{0, 1\}$ the observed class for example $t$.

Model-wide metrics: ROC curve vs. Precision-Recall curve

The ROC curve is one of the most popular model-wide metrics (it evaluates the algorithm over a range of classification thresholds). In the context of class imbalance, however, the Precision-Recall curve should be preferred. Indeed, the ROC curve is not sensitive to the imbalance rate: the false positive rate, on the x-axis of the ROC curve, remains stable when the number of negatives is high, and the true positive rate, on its y-axis, does not account for the imbalance either. The Precision-Recall curve, which integrates the notion of imbalance via precision (with recall on the x-axis and precision on the y-axis), is therefore more informative in this context. For model evaluation that is not tied to a specific application domain, the area under the Precision-Recall curve is the preferred metric.
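
This area can be computed with scikit-learn’s average_precision_score; a minimal sketch on illustrative labels and scores:

```python
from sklearn.metrics import average_precision_score, precision_recall_curve

# Illustrative ground truth and predicted probabilities for the positive class.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
scores = [0.1, 0.2, 0.1, 0.3, 0.2, 0.1, 0.6, 0.2, 0.8, 0.4]

precision, recall, thresholds = precision_recall_curve(y_true, scores)
print(average_precision_score(y_true, scores))  # area under the PR curve
```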

Data Processing: Resampling in the context of class imbalance

Two approaches are possible to tackle class imbalance: adapting the learning stage to this situation, or adapting the data processing stage of the data science project. It is this second approach that is studied most thoroughly in this article, through the resampling of the dataset.

Undersampling

Undersampling consists in rebalancing the dataset by decreasing the number of instances of the majority class.

Random Undersampling

Random undersampling involves randomly drawing samples from the majority class, with or without replacement. However, it can increase the variance of the classifier and may discard useful or important samples.
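
A minimal sketch with the imbalanced-learn package [7], on a toy dataset with illustrative parameters:

```python
from imblearn.under_sampling import RandomUnderSampler
from sklearn.datasets import make_classification

# Toy dataset with a ~5% minority class.
X, y = make_classification(n_samples=1000, weights=[0.95, 0.05], random_state=0)

# Randomly drop majority-class instances until the classes are balanced.
X_res, y_res = RandomUnderSampler(random_state=0).fit_resample(X, y)
```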

Tomek Links 

A Tomek link is a pair of nearest-neighbor instances belonging to opposite classes. Tomek Links removes the undesirable overlap between classes by deleting the majority-class member of each link, until all pairs of nearest neighbors belong to the same class.

Tomek Links

You may also be interested in undersampling with the Edited Nearest-Neighbours and NearMiss (1, 2 and 3) methods, which can be parameterized to obtain a stronger undersampling than with Tomek Links.
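
Both Tomek Links and NearMiss are available in imbalanced-learn; a sketch reusing the toy X, y from above:

```python
from imblearn.under_sampling import TomekLinks, NearMiss

# Remove the majority-class member of every Tomek link.
X_tl, y_tl = TomekLinks().fit_resample(X, y)

# NearMiss-1 keeps the majority samples closest to the minority class.
X_nm, y_nm = NearMiss(version=1).fit_resample(X, y)
```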

Area under the PR curve vs. imbalance rate, obtained with and without Tomek Links using a random forest (RF) and a logistic regression (LR).
Dataset generated with make_classification from Scikit-learn.

When considering only the problem of class imbalance, without the other characteristics of the dataset, the impact of Tomek Links is small.

Oversampling

Oversampling consists in rebalancing the dataset by artificially increasing the number of instances of the minority class.

Random Oversampling

Random Oversampling involves duplicating randomly picked instances of the minority class.

SMOTE — Synthetic Minority Over-sampling Technique

Rather than replicating minority observations, Synthetic Minority Over-sampling Technique (SMOTE) creates a user-selected number of synthetic observations along the segments joining minority-class instances to their nearest minority-class neighbors.

red dots: synthesized instances
black dots: minority class
squares: majority class

You may also check out the SMOTE variants available, notably in the Python smote-variants package [8].
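
A minimal sketch of random oversampling and SMOTE with imbalanced-learn, reusing the toy X, y from above:

```python
from imblearn.over_sampling import RandomOverSampler, SMOTE

# Duplicate randomly picked minority instances.
X_ro, y_ro = RandomOverSampler(random_state=0).fit_resample(X, y)

# Synthesize new minority instances between nearest minority neighbors.
X_sm, y_sm = SMOTE(k_neighbors=5, random_state=0).fit_resample(X, y)
```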

Area under the PR curve vs. imbalance rate, obtained with and without SMOTE using a random forest (RF) and a logistic regression (LR).
Dataset generated with make_classification from Scikit-learn.

When considering only the obstacle represented by the class imbalance, without the other characteristics of the dataset, we notice that SMOTE can have a positive, neutral, or negative impact on the performance of both algorithms.

It is then clear that resampling is no magic formula for tackling class imbalance. It is necessary to go further in the analysis of the dataset to understand when it is relevant.

As a last resort: cost-sensitive learning

Cost-sensitive learning shows a possible way of modifying the learning stage in the context of class imbalance.

In theory

When it is possible to have a good estimate of the cost of each type of error (false positive and false negative), it can be interesting to use cost-sensitive learning, i.e. associating a different cost with each type of error. It requires the definition of a cost matrix:
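
In binary classification, the cost matrix takes the standard form, with one cost per outcome:

$$
\begin{array}{c|cc}
 & \text{Predicted negative} & \text{Predicted positive} \\ \hline
\text{Actual negative} & C_{TN} & C_{FP} \\
\text{Actual positive} & C_{FN} & C_{TP}
\end{array}
$$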

In that case, it will be wise to use a metric directly linked to this cost matrix. A cost function such as the one below could be used as a metric:
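
A standard total-cost metric built from this matrix sums the cost of each outcome over the dataset (often with $C_{TP} = C_{TN} = 0$):

$$
\text{Cost} = C_{TP}\cdot TP + C_{FP}\cdot FP + C_{TN}\cdot TN + C_{FN}\cdot FN
$$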

To be even finer, one can define a cost function for each type of error and have a specific cost for each example. Alejandro Correa Bahnsen created a package, CostCla [9], for this purpose and called this technique “Example-Dependent Cost-Sensitive Learning”: each example $i$ gets its own calculated cost. It requires a different cost matrix, called a cost function matrix:
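
In that formulation, the entries of the matrix become functions of the example $i$ (a standard form of example-dependent costs):

$$
\begin{array}{c|cc}
 & \text{Predicted negative} & \text{Predicted positive} \\ \hline
\text{Actual negative} & C_{TN_i} & C_{FP_i} \\
\text{Actual positive} & C_{FN_i} & C_{TP_i}
\end{array}
$$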


Cost-sensitive learning allows us to address the real problem directly. In fraud detection, for example, the aim is to lose as little money as possible, both by letting too many frauds through and by being too strict with one’s clients. A cost function for each type of error can then be defined with the help of an expert (a credit analyst); by minimizing a loss function that includes these specific costs, one directly minimizes the bank’s losses.

In practice

In practice, cost-sensitive learning can be applied very simply in most machine learning libraries (scikit-learn, LightGBM, XGBoost…). In scikit-learn, most classification models have a class_weight parameter, which is used to apply a cost inversely proportional to the class frequency. Visually, this balances the bias of the model towards the minority class, as illustrated below, where the decision boundary of a linear-kernel SVM is shifted towards the minority class.
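
A minimal sketch, with illustrative dataset parameters:

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Toy imbalanced dataset with ~5% positives.
X, y = make_classification(n_samples=1000, weights=[0.95, 0.05], random_state=0)

# class_weight="balanced" weighs each class inversely to its frequency,
# which shifts the decision boundary of the linear SVM.
clf = SVC(kernel="linear", class_weight="balanced").fit(X, y)
```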

SVM with class weight

Towards classification in general

Beyond Class Imbalance

The challenge in classification is to find a decision boundary between classes. This can be done through:

  1. Choosing or designing and then optimizing a classification algorithm
  2. Extensive data processing.

In practice, it is usually the second method that brings the most performance. In the context of class imbalance, it can be seen that this issue only compounds the difficulty posed by the other characteristics of the dataset. An example is the separability coefficient (class_sep) controlled during dataset generation via make_classification from Scikit-learn.
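
A minimal sketch of how class_sep changes the difficulty of a generated dataset (parameter values are illustrative):

```python
from sklearn.datasets import make_classification

# class_sep controls how far apart the class clusters are drawn:
# lower values increase overlap and make classification harder.
X_easy, y_easy = make_classification(weights=[0.99, 0.01], class_sep=2.0, random_state=0)
X_hard, y_hard = make_classification(weights=[0.99, 0.01], class_sep=0.5, random_state=0)
```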

The previous graphs show that:

  • Classification with a class imbalance is not necessarily difficult (graph 2)
  • Class imbalance amplifies the difficulty posed by the characteristics of the dataset (here the separability coefficient class_sep of make_classification)

Indeed, the imbalance rate in graph 1 makes classification more difficult than in graph 3.

To go further on separability and the difficulty of classification, refer to articles [10], [11] and [12].

Then what?

Finally, a dataset with a class imbalance must be treated like any other: it is a matter of spending more time modeling the problem to obtain, among other things, maximum class separability. This can be done through:

  • Advanced feature engineering to generate discriminant variables (and increase separability)
  • Ensuring good data quality, to avoid creating noise through feature engineering
  • Dimensionality reduction (discussed in another article)
  • A good definition of the target, to ensure that the right problem is being solved (and therefore that the number of positives is right)
  • The use of appropriate resampling
  • And lastly, cost-sensitive learning with relevant cost modeling

The conclusion for the data scientist: a class imbalance requires spending more time modeling the problem and understanding the dataset. It only makes the challenge more interesting!
In the next article, we will look at the geometrical and topological aspects of hard classification.

Thanks for reading!

References 

Metrics

[1] Lift: https://www3.nd.edu/~busiforc/Lift_chart.html

[2] Cohen’s Kappa: https://en.wikipedia.org/wiki/Cohen%27s_kappa

[3] Popular classification metrics: http://www.davidsbatista.net/blog/2018/08/19/NLP_Metrics/

[4] Scikit-learn guide on metrics: https://scikit-learn.org/stable/modules/model_evaluation.html

[5] Brier score: https://en.wikipedia.org/wiki/Brier_score

[6] Precision-Recall vs ROC curve: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4349800/

Libraries

[7] Imblearn: https://imbalanced-learn.readthedocs.io/en/stable/index.html

[8] smote-variants: https://pypi.org/project/smote-variants/0.3.1/

[9] Cost-Sensitive Classification – CostCla: http://albahnsen.github.io/CostSensitiveClassification/Tutorials.html

On Hard classification

[10] On separability of classes in classification: http://gkeng.me/index.php/2019/07/17/on-separability-of-classes-in-classification/

[11] An instance-level analysis of data complexity: https://link.springer.com/content/pdf/10.1007%2Fs10994-013-5422-z.pdf

[12] How complex is your classification problem?: https://arxiv.org/abs/1808.03591


Machine Learning resources, my curated list

During the honeymoon phase, the first MOOCs with Andrew Ng repeating “concretely” 10 times in a 6-minute video, machine learning seems pretty easy and intuitive. There are plenty of Medium articles and tutorials that we can read quickly, even in the Parisian metro, and understand what is explained.

But sooner or later, during an interview or with coworkers, we come to realize that there is far more to data science than reading blog articles or following well-designed MOOCs. Proper code versioning, clean code habits, advanced machine learning libraries and algorithms, dataviz, advanced probability and statistics… Whether on the theoretical or the practical side, there is a huge gap between readers and newcomers (or sometimes bloggers like me, haha) and professional practitioners, real (data) scientists. Last year, while still in university, I took a few hours of my free time to bridge that gap, at least its theoretical side. These are some of the resources I used. As I am always learning, I will add more resources as I discover them.

Course PDFs

Books

These are some books that I spent time reading. They require much more effort than the PDFs shared above.

  1. MLAPP, Machine Learning: A Probabilistic Perspective, by Kevin Murphy
  2. ESL, The Elements of Statistical Learning, by Hastie, Tibshirani, and Friedman
  3. Deep Learning, by Courville, Goodfellow, and Bengio

While The Elements of Statistical Learning was the first I read, I found it too verbose in some parts. Chapters 1 to 4 were really worth my time though (these notes helped me a lot!).
Machine Learning: A Probabilistic Perspective is my favorite; it is more concise and tries (and sometimes fails) to go straight to the point in every chapter. It is easy to miss some steps in the equations at times, but that is part of the learning process, haha!
Finally, I read the Deep Learning book “just for fun”. As deep learning is more experimental than theoretical, with a lot of trial and error, I did not want to spend too much time trying to understand the theory. Understanding the main architectures (MLP, ConvNet, RNN…), backpropagation, and why LSTM and GRU architectures solve the vanishing gradient problem was enough for me.

Other resources

Some other books or PDFs that I found interesting:

On Separability of classes in Classification

What is class separability?

Theoretical Minimum Error and Overlapping

In the example below we have two classes, $C_0$ and $C_1$. The points of class $C_0$ follow a normal distribution with variance 4; the points of class $C_1$ follow a normal distribution with variance 1. Class $C_0$ represents 90% of the dataset and class $C_1$ represents 10%.
The following image shows a dataset of 50 points together with the theoretical distributions of the two classes in the corresponding proportions. The overlap of the two classes is varied by changing the mean of class $C_1$.

Separability of two Gaussian curves

The theoretical minimum error probability is given by the area under the minimum of the two overlapping prior-weighted curves. It is given by the following expression:
$$
P(\text{error}) = \int_R P(\text{error}|x)\,P(x)\,dx = \int_R \min\big(P(x|C_0)P(C_0),\, P(x|C_1)P(C_1)\big)\,dx
$$
This probability could be used as a separability measure, because it quantifies the overlap between the distributions of classes $C_0$ and $C_1$. However, in practice we cannot compute this integral because we do not have the exact expression of the probability densities.
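
We can, however, estimate it numerically when the densities are known, as in the synthetic example above. A sketch with SciPy, where the mean of $C_1$ is an illustrative choice:

```python
import numpy as np
from scipy.stats import norm

# C0 ~ N(0, 4) with prior 0.9, C1 ~ N(3, 1) with prior 0.1.
x = np.linspace(-15, 15, 200001)
p0 = 0.9 * norm.pdf(x, loc=0, scale=2)  # variance 4 -> standard deviation 2
p1 = 0.1 * norm.pdf(x, loc=3, scale=1)

# Area under the minimum of the two prior-weighted densities.
print(np.trapz(np.minimum(p0, p1), x))
```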

Separability in the linear case

Another expression of class separability is given by Wikipedia in the linear case:

Let $X_0$ and $X_1$ be two sets of points in an $n$-dimensional Euclidean space. Then $X_0$ and $X_1$ are linearly separable if there exist $n+1$ real numbers $w_1, w_2, \ldots, w_n, k$ such that $\sum_{i=1}^n w_i x_i > k$ for every $x \in X_0$ and $\sum_{i=1}^n w_i x_i < k$ for every $x \in X_1$.

However, this definition does not give any separability measure to be used in concrete cases.

My trick: supervised clustering

In theory

In the absence of a ready-made separability measure, I have found a way to estimate the separability of classes:

  1. Perform clustering with an algorithm appropriate to your dataset (see the scikit-learn clustering page).
  2. Choose k, the number of clusters, consistently with a silhouette analysis (see scikit-learn).
  3. For these values of k, estimate the separability of the classes by measuring cluster homogeneity (see scikit-learn).
  4. Choose the k giving the best homogeneity.
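
A minimal sketch of this procedure, assuming a feature matrix X and true labels y (the clustering algorithm and the range of k are illustrative choices):

```python
from sklearn.cluster import KMeans
from sklearn.metrics import homogeneity_score, silhouette_score

results = {}
for k in range(2, 11):
    clusters = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    # Silhouette checks that k is geometrically consistent, while
    # homogeneity = 1 - H(C|K)/H(C) uses the true labels y.
    results[k] = (silhouette_score(X, clusters), homogeneity_score(y, clusters))

# The best homogeneity among silhouette-consistent values of k
# serves as the class separability estimate.
best_k = max(results, key=lambda k: results[k][1])
```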

This measure involves the conditional entropy of the class given the cluster, $H(C|K)$, normalized by the entropy of the class, $H(C)$. The lower the conditional entropy, the more information the cluster assignment K carries about class C, and therefore the more homogeneous the clusters are.
The homogeneity score $h$, bounded between 0 and 1, is as follows, with a maximum value of 1 (perfect homogeneity):

$$
h=1-\frac{H(C|K)}{H(C)}
$$

For more information on this measure, do read this research paper written by Rosenberg and Hirschberg.

It is therefore a supervised clustering: the labels are used (through the computation of the conditional entropy) to optimize the clustering.

In practice

The following image shows the correlation between class separability (and therefore cluster homogeneity) and the performance of several classifiers (Random Forest (RF), KNN, MLP (Multi-Layer Perceptron), SVM (RBF kernel) and Logistic Regression) for an imbalance rate of 1 percent, and (I hope) justifies the use of this class separability measure.

AUC vs. homogeneity

Thus, no matter how strong the class imbalance, if separability is poor there is no point in bringing out an arsenal of techniques to get around the problem. It is better to work on the data (feature engineering, creation of new variables, discussion with an expert) to increase class separability.

Ref: the introduction of this article is inspired by this article from Baptiste Rocca.

[REPOST] Make your computer invest like a human

While trading may be the most exciting and lucrative domain of application of machine learning, it is also one of the most challenging. Trading is not only about buying or selling, nor is it just about analysing the financial state of a target company. One of the reasons why it is so difficult to be a top trader is that it requires considering a large amount of data of different natures. This also explains the machine learning hype in trading. Text, speech, numbers, images… machine learning algorithms can deal with almost any type of data. In this series of articles, we will introduce an implementation of a not-so-common deep learning approach to stock price trend prediction based on financial news. Our inspiration comes from the recent research paper “Listening to Chaotic Whispers: A Deep Learning Framework for News-oriented Stock Trend Prediction” (LCW).

Would be nice if my machine were smart like Buffett!

Recent trends in research papers and blog articles

Many approaches introduced in recent research papers suffer from incompleteness. One of those approaches consists in designing an algorithm based only on the last few days’ stock prices; recurrent neural networks are largely used to that end. Another way is to use sentiment analysis in the trading policy: if you are familiar with machine learning for trading, you have certainly come across “stock trading with Twitter using sentiment analysis”. Reinforcement learning is also trendy now, as shown in a paper released in July 2018. While these solutions give impressive results, we think that they underuse the potential power of machine learning algorithms.

What makes this paper so special?

Nowadays, A.I. is trying to become more human, and algorithmic trading is no exception. Some recently published research papers try to design frameworks that imitate real investors. LCW is one of them, and that’s why we chose it.

Where is the innovation?

To us, LCW does a better job of replicating human behavior than many other papers on this topic. As you will see, this paper is mainly about text mining. Now, imagine you are an investor trying to predict the variation of a stock tomorrow. You would try to gather as much information as possible about the company over the last few days, and from there form an idea of how the stock price might evolve. This is the use case that LCW tries to solve, using a deep learning framework that takes time sequences of articles as input.

The authors have taken into account three characteristics of the learning process followed by an investor struggling with the “chaotic news”:

  • First, the Sequential Context Dependency: a single news item is more informative within a broader context than in isolation.
  • Second, the Diverse Influence: one critical news item can affect the stock price for weeks, whereas a trivial one may have no effect at all.
  • Third, “Effective and Efficient Learning”: learning from the more common situations before turning to exceptional cases.

This is not a theoretical paper but rather a math-engineering paper. The design process might have gone like this:

– We have to deal with sequences of press articles. What neural network could we use for that?

– Recurrent neural networks are used for sequence modeling. We can use them here.

– OK. Which one is easier to train and does not kill the gradient?

– The GRU.

– Good!

– Now, how do we deal with diverse influence? I want the algorithm to focus on the most important articles.

– I have read this post about the attention mechanism by Synced. We can use it.

– Alright, now give me a simple neural network to perform a three-class classification.

– A multilayer perceptron!

There you have it:

Hybrid Attention Network

As you can see, the innovation is in the design of the whole framework, and in the way the different neural networks are connected to solve the problem. We will let you read more about the architecture in paragraph 4.2 of the paper.

They have also implemented a Self-Paced Learning algorithm, which aims at making learning more effective and efficient. You may read paragraph 4.3 for more information about this algorithm. We did not have enough time to implement this part of the paper.

Our workflow

Pierrick and I are two French engineering students, currently in the second year of our engineering master’s program. We are by no means experts in trading, and we are beginners in machine learning. This is our first paper implementation, and our first technical blog post as well, so we are open to any constructive criticism of both our code and our articles.

Our workflow is divided into four steps:

  • Scraping

For the purposes of this research project we scraped all the articles published on reuters.com from 2015 to 2017. We mainly used the BeautifulSoup and urllib libraries, as well as the multiprocessing library. And yes, the whole project is in Python 3.
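
A minimal sketch of the scraping step; the URL and the HTML tags queried below are assumptions, not the actual Reuters archive layout:

```python
import urllib.request
from bs4 import BeautifulSoup

# Hypothetical archive page; the real site structure may differ.
url = "https://www.reuters.com/news/archive/businessnews"
html = urllib.request.urlopen(url).read()

soup = BeautifulSoup(html, "html.parser")
# Grab headline texts (the tag choice is illustrative).
headlines = [h.get_text(strip=True) for h in soup.find_all("h3")]
```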

  • Articles Vectorization

We chose not to follow the paper on this part. After collecting more than 1 million articles (see 5.1.1 in the paper), the authors trained a Word2Vec model on the whole vocabulary of their articles, then computed the vector mean of all the words in an article to make a vector representation of it. We preferred to use Doc2Vec for a better representation of the articles; our choice was inspired by this comparison. We used the Gensim library for that.
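
A minimal sketch with Gensim, where `articles` is an assumed list of raw article strings and the hyperparameters are illustrative, not our actual settings:

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

docs = [TaggedDocument(words=text.lower().split(), tags=[i])
        for i, text in enumerate(articles)]

# Train Doc2Vec on the corpus, then infer a vector per article.
model = Doc2Vec(docs, vector_size=100, min_count=2, epochs=20)
vector = model.infer_vector(articles[0].lower().split())
```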

  • Dataset Creation

This part consisted in creating a dataset that the HAN network can train on. Our data X is a time sequence of vectorized articles from day t-10 to day t, and the target value Y is the variation of the corresponding stock on day t+1. The Scikit-Learn and Pickle libraries were very helpful for this task.
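
A sketch of the windowing step; `article_vectors` (shape: days × articles per day × dimension), `returns`, and the class thresholds are all illustrative assumptions:

```python
import numpy as np

window = 10
days = range(window, len(returns) - 1)

# One sample per day t: the article vectors of the previous 10 days.
X = np.stack([article_vectors[t - window:t] for t in days])

# Three classes for day t+1: 0 = down, 1 = flat, 2 = up.
y = np.digitize([returns[t + 1] for t in days], bins=[-0.005, 0.005])
```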

  • Model Training

We used Keras to build the model. The TimeDistributed wrapper was of great use to apply the attention mechanism to the input and output of the GRU network.
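
A simplified Keras sketch of a HAN-like model; the dimensions are illustrative and the attention shown here is a reduced version, not the paper’s exact mechanism:

```python
from tensorflow.keras import layers, models

days, max_articles, dim = 10, 40, 100
inputs = layers.Input(shape=(days, max_articles, dim))

# News-level pooling applied to every day thanks to TimeDistributed.
daily = layers.TimeDistributed(layers.GlobalAveragePooling1D())(inputs)

# Temporal encoding of the 10-day sequence.
seq = layers.GRU(64, return_sequences=True)(daily)

# Temporal attention: one weight per day, then a weighted sum.
scores = layers.Dense(1, activation="tanh")(seq)
weights = layers.Softmax(axis=1)(scores)
context = layers.Flatten()(layers.Dot(axes=1)([weights, seq]))

outputs = layers.Dense(3, activation="softmax")(context)
model = models.Model(inputs, outputs)
model.compile(optimizer="adam", loss="categorical_crossentropy")
```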

Final word

Implementing this paper was thrilling, and we look forward to writing about each step of the implementation. Many thanks to the authors for this inspiring paper.

To stay in touch, feel free to contact Pierrick or myself on LinkedIn!

More to come, we will release the code soon !