Foreword
Recently I've been working on an NLP project using the fastai library, which I outlined in my last post. I'm working towards creating a Multi-Label Classifier, meaning that each input can be assigned any number of labels, each of which is its own binary (present or absent) class. The Multi-Label nature of this project makes manipulating training data extra difficult. In this post, I'll address how I overcame some of the hurdles involving resampling and class imbalance.
What is imbalanced data?
Data is imbalanced when its classes don't occur with equal frequency. In the simplest case, binary classification, unequal class frequencies give us a majority class and a minority class.
Class imbalance for binary data
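As a quick illustration, here is a minimal sketch of how you might measure imbalance in a set of binary labels. The `labels` list is made-up data, and the "imbalance ratio" (majority count divided by minority count) is just one convenient summary, not a standard the rest of this post depends on.

```python
from collections import Counter

# Hypothetical binary labels: 0 is the majority class, 1 is the minority class
labels = [0] * 90 + [1] * 10

counts = Counter(labels)
total = sum(counts.values())

# Share of the dataset taken up by each class
proportions = {cls: n / total for cls, n in counts.items()}

# How many majority examples there are for each minority example
imbalance_ratio = max(counts.values()) / min(counts.values())

print(proportions)      # {0: 0.9, 1: 0.1}
print(imbalance_ratio)  # 9.0
```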
Imbalanced data causes problems when we train models on it. Many popular models are developed with the assumption that class frequencies are roughly equal. As a result, models trained on imbalanced data tend to favour predicting the majority class. The problem gets worse when you consider that the class of interest is often the minority class!
Let's say we were to develop an image classifier that classifies snakes as poisonous or non-poisonous. If we trained our model on data consisting primarily of non-poisonous snakes, it would be likely to predict a poisonous snake as non-poisonous (a false negative). Not good!
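To make the danger concrete, here is a small sketch with invented numbers: a degenerate "model" that always predicts the majority class. The 95/5 split and the variable names are assumptions for illustration only. Accuracy looks great, yet every poisonous snake is missed.

```python
# Hypothetical ground truth: 1 = poisonous (minority), 0 = non-poisonous
y_true = [0] * 95 + [1] * 5

# A degenerate "model" that always predicts the majority class
y_pred = [0] * len(y_true)

# Accuracy: fraction of predictions that match the truth
correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)

# Recall on the poisonous class: how many poisonous snakes were caught
true_pos = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
false_neg = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
recall = true_pos / (true_pos + false_neg)

print(accuracy)   # 0.95
print(false_neg)  # 5
print(recall)     # 0.0
```

This is why accuracy alone can be misleading on imbalanced data, and why it helps to look at per-class metrics such as recall.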
For multi-class and multi-label applications the trend is the same: classes with higher frequency are more likely to be predicted. The larger the imbalance, the more of a problem this becomes.
Addressing the problem
Distribution of multi-class data
Code
# Calculate the frequency of our labels and normalize
freq = df.label.str.split(",").explode().value_counts(normalize=True)

# Calculate the rarity of the row based on the frequencies of its members
df["rarity"] = df.label.str.split(",").apply(
    lambda x: 1 / (sum(freq[emoji] ** 2 for emoji in x) / len(x))
)

# Sample 50% of the data based on its rarity
df = df.sample(frac=0.5, weights="rarity")
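To see what the rarity score actually does, here is the same formula worked through by hand on a tiny, made-up dataset in plain Python (no pandas). The `rows` data and the `rarity` helper are hypothetical stand-ins for the `label` column and the lambda above. This is a sketch of the arithmetic, not a replacement for the pandas version.

```python
from collections import Counter

# Hypothetical toy data: each row is a comma-separated label string
rows = ["a", "a", "a,b", "b,c"]

# Label frequencies over all label occurrences, normalized to sum to 1
label_lists = [row.split(",") for row in rows]
counts = Counter(label for labels in label_lists for label in labels)
total = sum(counts.values())
freq = {label: n / total for label, n in counts.items()}

def rarity(labels):
    # Inverse of the mean squared frequency of the row's labels
    return 1 / (sum(freq[label] ** 2 for label in labels) / len(labels))

scores = [rarity(labels) for labels in label_lists]
for row, score in zip(rows, scores):
    print(row, round(score, 2))
```

Here `a` makes up half of all label occurrences, so rows containing only `a` score 4.0, while the row containing the rare label `c` scores about 14.4. When scores like these are passed as sampling weights, rows with rare labels are more likely to be kept, which is what flattens the label distribution.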