What is class separability?

Theoretical Minimum Error and Overlapping

In the example below we have two classes, $C_0$ and $C_1$. The points of class $C_0$ follow a normal distribution with variance 4, and the points of class $C_1$ follow a normal distribution with variance 1. Class $C_0$ represents 90% of the dataset and class $C_1$ represents 10%.
The following image shows a dataset of 50 points together with the theoretical distributions of the two classes, weighted by their proportions. The overlap between the two classes is varied by changing the mean of class $C_1$.

Separability of two Gaussian curves

The theoretical minimum error probability is the area under the minimum of the two overlapping, proportion-weighted curves:
$$
P(\text{error}) = \int_R P(\text{error}\,|\,x)\,P(x)\,dx = \int_R \min\!\big(P(x|C_0)\,P(C_0),\ P(x|C_1)\,P(C_1)\big)\,dx
$$
This probability could serve as a separability measure, since it quantifies the overlap between the distributions of classes $C_0$ and $C_1$. In practice, however, we cannot compute this integral because we do not know the exact expressions of the probability densities.
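To make this concrete, the integral can still be estimated numerically when we are willing to assume the densities, as in the toy example above. Here is a minimal sketch for the two-Gaussian setup (the mean of $C_1$ and the integration grid are illustrative choices of mine, not values from the original experiment):

```python
import numpy as np
from scipy.stats import norm

# Toy setup from the example above: C0 ~ N(0, 4) with prior 0.9,
# C1 ~ N(mu1, 1) with prior 0.1. mu1 controls the overlap; 2.0 is an
# arbitrary illustrative choice.
prior_c0, prior_c1 = 0.9, 0.1
mu0, sigma0 = 0.0, 2.0  # variance 4 -> standard deviation 2
mu1, sigma1 = 2.0, 1.0  # variance 1 -> standard deviation 1

# Numerically integrate min(P(x|C0)P(C0), P(x|C1)P(C1)) over a fine grid.
x = np.linspace(-12.0, 12.0, 100_001)
joint_c0 = prior_c0 * norm.pdf(x, mu0, sigma0)
joint_c1 = prior_c1 * norm.pdf(x, mu1, sigma1)
bayes_error = np.trapz(np.minimum(joint_c0, joint_c1), x)

print(f"Theoretical minimum error probability: {bayes_error:.4f}")
```

Moving `mu1` closer to 0 increases the overlap, and the estimated minimum error grows accordingly.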

Separability in the linear case

Wikipedia gives another characterization of class separability in the linear case:

Let $X_0$ and $X_1$ be two sets of points in an $n$-dimensional Euclidean space. Then $X_0$ and $X_1$ are linearly separable if there exist $n+1$ real numbers $w_1, w_2, \ldots, w_n, k$ such that every $x \in X_0$ satisfies $\sum_{i=1}^n w_i x_i > k$ and every $x \in X_1$ satisfies $\sum_{i=1}^n w_i x_i < k$.

However, this definition does not provide a separability measure that can be used in concrete cases.
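As a side illustration of mine (not part of the Wikipedia definition), linear separability can be probed empirically: a linear classifier with very weak regularization should reach perfect training accuracy on two linearly separable sets. The synthetic data and the `C=1e6` setting below are arbitrary choices:

```python
import numpy as np
from sklearn.svm import LinearSVC

def looks_linearly_separable(X0, X1):
    """Heuristic check: a lightly regularized linear classifier that fits the
    training set perfectly strongly suggests the two sets are linearly separable."""
    X = np.vstack([X0, X1])
    y = np.r_[np.zeros(len(X0)), np.ones(len(X1))]
    clf = LinearSVC(C=1e6, max_iter=100_000).fit(X, y)
    return clf.score(X, y) == 1.0

rng = np.random.default_rng(0)
X0 = rng.normal(loc=[0.0, 0.0], size=(50, 2))
X1 = rng.normal(loc=[6.0, 6.0], size=(50, 2))  # far-away cloud: separable
print(looks_linearly_separable(X0, X1))        # True for this toy example
```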

My trick: supervised clustering

In theory

In the absence of a ready-made separability measure, I have found a way to estimate the separability of classes (a code sketch follows the list):

  1. Perform clustering with an algorithm appropriate to your dataset (see the scikit-learn clustering overview).
  2. Select candidate values of $k$, the number of clusters, using silhouette analysis (see scikit-learn's silhouette example).
  3. For these candidate values of $k$, estimate the separability of the classes by measuring cluster homogeneity (see scikit-learn's homogeneity_score).
  4. Keep the $k$ that gives the best homogeneity.
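Here is a minimal end-to-end sketch of this procedure. KMeans, the range of $k$, and the silhouette threshold are illustrative choices of mine; any clustering algorithm suited to your data works:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from sklearn.metrics import silhouette_score, homogeneity_score

# Synthetic stand-in for your dataset: X features, y class labels (90% / 10%).
X, y = make_classification(n_samples=1000, n_features=10,
                           weights=[0.9, 0.1], random_state=0)

scores = {}
for k in range(2, 11):  # candidate numbers of clusters
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = (silhouette_score(X, labels), homogeneity_score(y, labels))

# Keep the k values with an acceptable silhouette (0.1 is an arbitrary
# threshold), then pick the one with the best homogeneity.
candidates = [k for k, (sil, _) in scores.items() if sil >= 0.1] or list(scores)
best_k = max(candidates, key=lambda k: scores[k][1])
print(f"best k = {best_k}, homogeneity = {scores[best_k][1]:.3f}")
```

The homogeneity obtained for `best_k` is the separability estimate.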

This measure involves the conditional entropy of the class given the cluster, $H(C|K)$, normalized by the entropy of the class, $H(C)$. The lower the conditional entropy, the more information the cluster assignment $K$ carries about the class $C$, and therefore the more homogeneous the clusters are.
The homogeneity score $h$ is bounded between 0 and 1, with 1 indicating perfect homogeneity:

$$
h=1-\frac{H(C|K)}{H(C)}
$$
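To make the formula concrete, here is a sketch of mine that computes $h$ directly from the class/cluster contingency table and checks it against scikit-learn's `homogeneity_score` (the toy labels are illustrative):

```python
from scipy.stats import entropy
from sklearn.metrics import homogeneity_score
from sklearn.metrics.cluster import contingency_matrix

def homogeneity_from_entropy(classes, clusters):
    """h = 1 - H(C|K) / H(C), from the class (rows) x cluster (columns) table."""
    cont = contingency_matrix(classes, clusters).astype(float)
    n = cont.sum()
    h_c = entropy(cont.sum(axis=1) / n)  # H(C): entropy of class proportions
    p_k = cont.sum(axis=0) / n           # P(K=k): cluster proportions
    # H(C|K) = sum_k P(K=k) * H(C | K=k); scipy's entropy normalizes each column.
    h_c_given_k = sum(p * entropy(col) for p, col in zip(p_k, cont.T))
    return 1.0 - h_c_given_k / h_c

classes  = [0, 0, 0, 1, 1, 1]
clusters = [0, 0, 1, 1, 2, 2]
print(homogeneity_from_entropy(classes, clusters))  # 0.666...
print(homogeneity_score(classes, clusters))         # same value
```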

For more information on this measure, do read this research paper written by Rosenberg and Hirschberg.

It is therefore a form of supervised clustering: the labels are used (they enter the calculation of the conditional entropy) to optimize the clustering.

In practice

The following image shows the correlation between class separability (and therefore cluster homogeneity) and the performance of several classifiers (Random Forest (RF), KNN, MLP (Multi-Layer Perceptron), SVM with an RBF kernel, and Logistic Regression) for an imbalance rate of 1%. It justifies (I hope) the use of this class separability measure.

AUC vs. homogeneity

Thus, no matter how severe the class imbalance, if separability is poor there is no point in deploying an arsenal of techniques to get around the problem. It is better to work on the data (feature engineering, creation of new variables, discussion with a domain expert) to increase class separability.

Ref: the introduction of this article is inspired by this article from Baptiste Rocca.
