Friday, February 5, 2021

An easy way to undersample Imbalanced ⚖️ Multi-Label 🏷️ data


Recently I've been working on an NLP project using the fastai library, which I outlined in my last post. I'm working towards creating a Multi-Label classifier, meaning that the output can be of varying length with varying binary classes. The Multi-Label nature of this project makes manipulating training data extra difficult. In this post, I'll address how I overcame some of the hurdles involving resampling and class imbalance.

What is imbalanced data?

Data is imbalanced when the frequencies of its different classes aren't equal. In the simplest case, binary classification, unequal classes lead to a situation where there is a majority class and a minority class.
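A quick way to see how skewed a dataset is: normalize the class counts. Here's a minimal sketch with a hypothetical binary `label` column in pandas:

```python
import pandas as pd

# Hypothetical binary dataset: 90 rows of class 0, 10 rows of class 1
df = pd.DataFrame({"label": [0] * 90 + [1] * 10})

# Relative class frequencies expose the majority/minority split
freq = df["label"].value_counts(normalize=True)
print(freq)  # class 0 -> 0.9, class 1 -> 0.1
```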

Class imbalance for binary data

Imbalanced data causes problems when we train models on it. Many popular models are developed with the assumption that class frequencies are roughly equal. This means that models trained on imbalanced data will favour predicting the majority class. These problems become worse when you consider that oftentimes the class of interest is the minority class!

Let's say we were to develop an image classifier, classifying snakes as poisonous or non-poisonous. If we trained our model on data consisting primarily of non-poisonous snakes, we would have a high likelihood of predicting a poisonous snake as non-poisonous (a false negative). Not good!

For multi-class and multi-label applications the trend is the same: classes with higher frequency are more likely to be predicted, and the larger the imbalance, the bigger the problem becomes.

Addressing the problem

For the case of binary and multi-class classification, the solution is simple: either undersample the majority or oversample the minority classes. A lot of people have written about this before, so I'll share a link to a Towards Data Science article.
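For the simple cases, undersampling is a one-liner in pandas: shrink every class down to the size of the smallest one. A sketch (the `df` here is a hypothetical toy dataset, not my project's data):

```python
import pandas as pd

# Hypothetical imbalanced dataset
df = pd.DataFrame({"label": [0] * 90 + [1] * 10})

# Undersample: draw from every class only as many rows as the minority has
n_minority = df["label"].value_counts().min()
balanced = (
    df.groupby("label", group_keys=False)
      .sample(n=n_minority, random_state=42)
)
```

After this, `balanced` contains 10 rows of each class. Oversampling is the mirror image: sample the minority class with `replace=True` up to the majority count.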

Distribution of multi-class data
In the case of Multi-Label classification, things are a bit more difficult. What if a single row in our training data has a high-frequency AND a low-frequency label? In this case, we don't want to drop the row, because we don't want to omit a rare label, but we also don't want to keep feeding the model a high-frequency one. In a complex case like this, we need a way to quantify the trade-off in keeping or removing a row.

The solution is to score each row based on the frequencies of its labels and sample based on that score. A row consisting solely of rare emoji should have a very high rarity score; this way it is unlikely to be omitted.


Here is the code I used to implement the logic detailed above. Please note that in the example below the label is a string containing comma-separated values (e.g. "1, 2, 3"). This could easily be altered to work on a list, in which case we could remove the str.split() calls.

# Calculate the frequency of each label and normalize
freq = df.label.str.split(",").explode().value_counts(normalize=True)
# Score each row: the inverse of the mean squared frequency of its labels,
# so rows made up of rare labels receive high rarity scores
df["rarity"] = df.label.str.split(",").apply(lambda x: 1 / (sum(freq[label] ** 2 for label in x) / len(x)))
# Sample 50% of the data, weighted by rarity
df = df.sample(frac=0.5, weights="rarity")

Now let's take a deep dive into the code above and see how it works.

In the example above we have already run our rarity scoring on the multi-label column, yielding the rarity column. If we wanted to sample two observations from this data, we would drop the first row even though it contains some rare labels; we drop it because the overall score of the row is the lowest. It is best to implement this solution iteratively: if we sample a fraction of the data at a time and re-calculate frequencies between rounds, we can be sure that the frequencies used in the calculation stay accurate.
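The iterative version can be sketched as a small helper. This assumes the same comma-separated `label` column as before; the function name and the round/fraction parameters are my own choices, not anything standard:

```python
import pandas as pd

def rarity_sample(df, frac=0.5, rounds=2):
    """Undersample a multi-label DataFrame round by round,
    re-scoring label rarity after each round so the frequencies
    used in the weights stay accurate."""
    for _ in range(rounds):
        labels = df["label"].str.split(",")
        freq = labels.explode().value_counts(normalize=True)
        # Inverse mean squared frequency: rows of rare labels score high
        rarity = labels.apply(
            lambda row: 1 / (sum(freq[l] ** 2 for l in row) / len(row))
        )
        df = df.sample(frac=frac, weights=rarity)
    return df
```

With `frac=0.5` and `rounds=2` this keeps a quarter of the original rows, with the rarity weights recomputed on the surviving rows before the second draw.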

Wednesday, January 20, 2021

Plucking ๐Ÿ“ Emoji from strings ๐Ÿงถ in Python ๐Ÿ without breaking them ๐Ÿบ


For a recent NLP project, I found myself working with a large number of tweets containing emoji. The goal of this project may be covered in a future post but suffice to say I needed a performant way of separating my emoji from my non-emoji data. 

The Goal

The wrong way to do it

If you're like me, when you see the problem above you'll end up with something like the code below.
import emoji  # A library containing all emoji

def filter_emoji(s):
    return "".join([c for c in s if c in emoji.UNICODE_EMOJI])

def exclude_emoji(s):
    return "".join([c for c in s if c not in emoji.UNICODE_EMOJI])

print(filter_emoji("Best time in Mexico 🇲🇽 I love tacos🌮😋"))
# >> 🇲🇽🌮😋

Above we are using the emoji library to give us a lookup of all known emoji. There are ways to do this without a separate library, but I wanted to maintain as little code as possible.

I was using the exact functions defined above for a while and they were working great! Until they broke...

Zero Width Joiner

When I was using the code above on my dataset, I kept finding that the most frequently occurring emoji was "" (an empty string). At first I thought this was some sort of problem that came about because I was using a Jupyter notebook; later I thought it was an issue with my file's encoding. When I eventually came across an emoji named "Skin-tone-3", I knew something had gone horribly wrong.

The Zero Width Joiner (ZWJ) is a special character that was approved as part of Unicode 1.1 in 1993 and added to Emoji 11.0 in 2018. Its function is to create compound emoji by joining two or more emoji together. Compound emoji are used to combine attributes into more complex emoji without the introduction of new characters.
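You can see this composition directly in Python: what renders as a single glyph is really several code points, which is exactly why per-character filtering tears compound emoji apart. (The woman-astronaut emoji here is just an illustrative example.)

```python
# Woman astronaut = woman + ZWJ + rocket, one glyph built from three code points
woman_astronaut = "\U0001F469\u200D\U0001F680"  # 👩‍🚀

print(len(woman_astronaut))                    # 3, not 1
print([hex(ord(c)) for c in woman_astronaut])  # includes 0x200d, the ZWJ
```

A character-by-character membership test sees the woman, the ZWJ, and the rocket as three unrelated items, and the ZWJ itself matches nothing visible.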

Black man golfing = Golfer + Dark Skin Tone + ZWJ + Man
More complicated than you were expecting?

My top result was "" because I was ripping apart larger, more complex emoji. This was not my intention and would lead to data integrity issues!

A better way

Now that we know the wrong way to do this, here is a better way and what I'm currently using: 
import re
import emoji

emoji_regex = emoji.get_emoji_regexp()

def extract_emoji(s):
    return list(set(re.findall(emoji_regex, s)))

print(extract_emoji("this is a test 🙍🏿‍♀️, 🤵🏿, 👨‍👩‍👦"))
# Returns ["🙍🏿‍♀️", "🤵🏿", "👨‍👩‍👦"] (in any order, since we deduplicate with a set)

In this new example we are using regular expressions! With the help of the get_emoji_regexp() function from the emoji library, we can easily get a compiled regular expression matching all emoji, including compound ZWJ sequences. We then use this compiled pattern with re.findall() to find every occurrence of an emoji in our string.

In this case it is important that we use the most recent release of the emoji library. By using the most recent release we can be sure we are supporting all new emoji published by the Unicode Consortium.
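One caveat: newer releases of the emoji library have since dropped get_emoji_regexp(), so if you can't pin a version that still has it, a rough stdlib-only approximation is to match one emoji-ish code point plus any ZWJ-joined continuations, so compound sequences stay whole. This pattern is a sketch of the idea, not an exhaustive emoji definition:

```python
import re

# Rough pattern: an emoji-block code point, optional skin tones /
# variation selectors, then any number of ZWJ-joined continuations.
EMOJI_ISH = (
    "[\U0001F000-\U0001FAFF\u2600-\u27BF]"
    "[\U0001F3FB-\U0001F3FF\uFE0F]*"
)
EMOJI_RUN = re.compile(f"{EMOJI_ISH}(?:\u200D{EMOJI_ISH})*")

def extract_emoji_stdlib(s):
    # findall returns whole matches, so compound emoji come back intact
    return EMOJI_RUN.findall(s)
```

It covers the common blocks and keeps families and skin-toned emoji together, but it will miss edge cases (keycaps, flags) that the library's pattern handles.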