Friday, February 5, 2021

An easy way to undersample Imbalanced ⚖️ Multi-Label 🏷 data

Foreword

Recently I've been working on an NLP project using the fastai library, which I outlined in my last post. I'm working towards creating a Multi-Label Classifier, meaning that the output can be of varying length with varying binary classes. The Multi-Label nature of this project makes manipulating training data extra difficult. In this post, I'll address how I overcame some of the hurdles involving resampling and class imbalance.

What is imbalanced data?

Data is imbalanced when the frequencies of its different classes aren't equal. In the simplest case, binary classification, unequal classes lead to a situation where there is a majority class and a minority class.
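For concreteness, here is a quick way to inspect imbalance with pandas (the DataFrame below is a made-up example, not the post's real data):

```python
import pandas as pd

# Hypothetical binary dataset: 90 negative rows, 10 positive rows
df = pd.DataFrame({"label": ["negative"] * 90 + ["positive"] * 10})

# Normalized value counts make the imbalance obvious
freq = df["label"].value_counts(normalize=True)
print(freq["negative"], freq["positive"])  # 0.9 0.1
```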

Class imbalance for binary data


Imbalanced data causes problems when we train models on it. Many popular models are developed with the assumption that class frequencies are roughly equal, which means that models trained on imbalanced data will favour predicting the majority class. These problems become worse when you consider that oftentimes the class of interest is the minority class!

Let's say we were to develop an image classifier that classifies snakes as poisonous or non-poisonous. If we trained our model on data consisting primarily of non-poisonous snakes, we would have a high likelihood of predicting a poisonous snake as non-poisonous (a False Negative). Not good!

For multi-class and multi-label applications the trend is the same: classes with higher frequency are more likely to be predicted, and the larger the imbalance, the bigger the problem becomes.

Addressing the problem

For Binary and Multi-Class classification, the solution is simple: either undersample the majority class or oversample the minority classes. A lot of people have written about this before, so I'll share a link to a Towards Data Science article.
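As a minimal sketch of the undersampling idea, every class can be cut down to the size of the smallest one. This uses `groupby().sample` (available in pandas ≥ 1.1) on a toy dataset:

```python
import pandas as pd

# Hypothetical imbalanced binary data: 8 majority rows, 2 minority rows
df = pd.DataFrame({"label": ["majority"] * 8 + ["minority"] * 2})

# Undersample: shrink every class down to the size of the smallest one
n_min = df["label"].value_counts().min()
balanced = df.groupby("label", group_keys=False).sample(n=n_min, random_state=0)

print(len(balanced))  # 4 rows: 2 per class
```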

Distribution of multi-class data
In the case of Multi-Label classification, things are a bit more difficult. What if a single row in our training data has a high-frequency AND a low-frequency label? We don't want to drop the row, because we don't want to omit a rare label, but we also don't want to keep feeding the model a high-frequency one. In a complex case like this, we need a way to quantify the trade-off in keeping or removing a row.

The solution is to score each row based on the frequency of its labels and sample based on that score. A row consisting solely of rare emoji should receive a very high rarity score; this way it is unlikely to be omitted.
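A quick worked example of the scoring idea (the frequencies and labels here are made up for illustration): because the score is the inverse of the mean squared label frequency, common labels drag it down and rare labels push it up, so rows full of rare labels are the most likely to survive weighted sampling.

```python
# Hypothetical label frequencies: "a" appears 80% of the time, "z" only 5%
freq = {"a": 0.80, "z": 0.05}

def rarity(labels):
    # Inverse of the mean squared frequency of the row's labels
    return 1 / (sum(freq[l] ** 2 for l in labels) / len(labels))

print(round(rarity(["a"]), 2))       # common row  -> low score:  1.56
print(round(rarity(["z"]), 2))       # rare row    -> high score: 400.0
print(round(rarity(["a", "z"]), 2))  # mixed row lands in between
```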



Code

Here is the code I used to implement the logic detailed above. Please note that in the example below the label is a string containing comma-separated values (e.g. "1,2,3"). This could easily be altered to work on a list, in which case we could remove the str.split() calls.

# Calculate the frequency of our labels and normalize
freq = df.label.str.split(",").explode().value_counts(normalize=True)
# Calculate the rarity of the row based on the frequencies of its members
# Calculate the rarity of the row based on the frequencies of its members
df["rarity"] = df.label.str.split(",").apply(
    lambda x: 1 / (sum(freq[emoji] ** 2 for emoji in x) / len(x))
)
# Sample 50% of the data based on its rarity
df = df.sample(frac=0.5, weights="rarity")

Now let's take a deeper dive into the code above and see how it works.


In the example above we have already run our rarity scoring on the multi-label column, yielding the rarity column. If we wanted to sample two observations from this data, we would drop our first row even though it contains some rare labels; we drop it because the overall score of the row is the lowest. It is best to implement this solution iteratively: if we sample 50% of the data at a time and re-calculate frequencies between rounds, we can be sure that the frequencies used in each calculation stay accurate.
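The iterative version can be sketched like this (rarity_sample is a hypothetical helper wrapping the earlier snippet, and the data is a toy example; re-computing freq inside the loop is the key point):

```python
import pandas as pd

def rarity_sample(df, frac, rounds, seed=0):
    """Repeatedly score rows by label rarity and sample, re-computing
    frequencies between rounds so the scores stay accurate."""
    for i in range(rounds):
        # Frequencies are re-derived from whatever rows are still left
        freq = df.label.str.split(",").explode().value_counts(normalize=True)
        df = df.assign(
            rarity=df.label.str.split(",").apply(
                lambda x: 1 / (sum(freq[e] ** 2 for e in x) / len(x))
            )
        )
        df = df.sample(frac=frac, weights="rarity", random_state=seed + i)
    return df.drop(columns="rarity")

# Toy data: "a" is very common, "c" is rare
df = pd.DataFrame({"label": ["a"] * 10 + ["a,b"] * 4 + ["b,c", "c"]})
print(len(rarity_sample(df, frac=0.5, rounds=2)))  # 16 -> 8 -> 4 rows
```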