Wednesday, January 20, 2021

Plucking Emoji from strings 🧶 in Python 🐍 without breaking them


For a recent NLP project, I found myself working with a large number of tweets containing emoji. The goal of this project may be covered in a future post, but suffice it to say I needed a performant way of separating my emoji from my non-emoji data.

The Goal

The wrong way to do it

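If you're like me, when you see the problem above you'll end up with something like the code below.
import emoji  # A library containing all known emoji

# Note: UNICODE_EMOJI is the lookup table from the emoji releases current
# when this was written; later releases changed or removed it.

def filter_emoji(s):
    # Keep only the characters the library recognises as emoji
    return "".join([c for c in s if c in emoji.UNICODE_EMOJI])

def exclude_emoji(s):
    # Keep only the characters the library does not recognise as emoji
    return "".join([c for c in s if c not in emoji.UNICODE_EMOJI])

print(filter_emoji("Best time in Mexico 🇲🇽 I love tacos🌮😋"))
# >> 🇲🇽, 🌮, 😋 (shown separated for readability; the function returns one joined string)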

Above, we are using the emoji library to give us the set of all known emoji. There are ways to do this without a separate library, but I wanted to maintain as little code as possible.

I was using the exact functions defined above for a while and they were working great! Until they broke...

Zero Width Joiner

When I ran the code above on my dataset, I kept finding that the most frequently occurring emoji was "" (an empty string). At first I thought this was a problem caused by using a Jupyter notebook; later I suspected my file's encoding. Eventually I came across an emoji named "Skin-tone-3" and knew something had gone horribly wrong.

The Zero Width Joiner (ZWJ, U+200D) is a special character that was approved as part of Unicode 1.1 in 1993 and added to Emoji 11.0 in 2018. Its function is to create compound emoji by joining two or more emoji. Compound emoji are used to combine attributes into more complex emoji without introducing new characters.

Black man golfing = Golfer + Dark Skin Tone + ZWJ + Male Sign
More complicated than you were expecting?

My top result was "" because I was ripping apart larger, more complex emoji. This was not my intention and would lead to data integrity issues!
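To see the tearing happen without any third-party library, here is a minimal sketch that builds the man golfing sequence from its code points using only the standard library. The variable names are mine, and the code point list reflects my understanding of how this particular sequence is encoded (including a trailing variation selector that the breakdown above leaves out), so treat it as an illustration and not a reference.

```python
import unicodedata

# "Man golfing: dark skin tone", written out one code point at a time.
man_golfing_dark = (
    "\U0001F3CC"  # golfer (the base emoji)
    "\U0001F3FF"  # dark skin tone modifier
    "\u200D"      # zero width joiner
    "\u2642"      # male sign
    "\uFE0F"      # variation selector-16 (asks for emoji presentation)
)

# One glyph on screen, but five code points in the string.
print(len(man_golfing_dark))  # 5

# Print each code point with its Unicode name.
for ch in man_golfing_dark:
    print(f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"))

# Iterating character by character, as the first version of the code did,
# hands back the individual pieces, including the invisible joiner.
pieces = list(man_golfing_dark)
invisible = [ch for ch in pieces if ch in ("\u200D", "\uFE0F")]
print(len(invisible), "of", len(pieces), "pieces render as nothing on their own")
```

Any per-character filter sees those five pieces separately, which is how a lone skin tone modifier or an invisible character can end up at the top of a frequency count.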

A better way

Now that we know the wrong way to do this, here is a better way and what I'm currently using: 
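import re

import emoji

# Compiled pattern matching every emoji the library knows about, including
# multi-character sequences (get_emoji_regexp() exists in emoji releases before 2.0)
emoji_regex = emoji.get_emoji_regexp()

def extract_emoji(s):
    # Return the distinct emoji found in s (built from a set, so order is not guaranteed)
    return list(set(re.findall(emoji_regex, s)))

print(extract_emoji("this is a test 🙍🏿‍♀️, 🤵🏿, 👨‍👩‍👦"))
# Returns ["🙍🏿‍♀️", "🤵🏿", "👨‍👩‍👦"] (in arbitrary order)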

In this new example we are using regular expressions! With the help of the get_emoji_regexp() function from the emoji library we can easily compile a regular expression of all emoji. We then use this compiled regular expression with the re.findall() function to find all occurrences of these emoji in our string.
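To make it clearer why a regular expression keeps compound emoji intact, here is a rough sketch of the idea using only the standard library. The emoji table below is a tiny hypothetical stand-in for the library's data, and the function names are made up for this example. I am assuming a longest-first alternation, which is one common way to build such a pattern and not necessarily how the emoji library does it.

```python
import re

# A tiny, hypothetical emoji table; a real one has thousands of entries.
FAMILY = "\U0001F468\u200D\U0001F469\u200D\U0001F466"  # man + ZWJ + woman + ZWJ + boy
KNOWN_EMOJI = [
    "\U0001F468",  # man
    "\U0001F469",  # woman
    "\U0001F466",  # boy
    FAMILY,        # family: man, woman, boy
    "\U0001F32E",  # taco
]

# Python tries alternatives left to right, so sorting longest first lets a
# full ZWJ sequence match before any of its component emoji can.
pattern = re.compile(
    "|".join(re.escape(e) for e in sorted(KNOWN_EMOJI, key=len, reverse=True))
)

def extract_known_emoji(s):
    """Return every match in order of appearance, duplicates included."""
    return pattern.findall(s)

def strip_known_emoji(s):
    """Return s with every match removed."""
    return pattern.sub("", s)

text = "tacos \U0001F32E with the family " + FAMILY
print(extract_known_emoji(text))
print(strip_known_emoji(text))
```

The same compiled pattern covers both directions: findall() pulls the emoji out, and sub() leaves only the non-emoji text, so the family emoji stays in one piece either way.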

In this case it is important that we use a recent release of the emoji library. By using a recent release we can be sure we are supporting all new emoji published by the Unicode Consortium.
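Since the library's API has shifted between releases, it can help to check which version is actually installed before relying on a particular function. Below is a small sketch using the standard library's importlib.metadata (Python 3.8+). The helper names are hypothetical, and the claim that get_emoji_regexp() went away in the 2.0 release is my understanding of the changelog, so verify it against the documentation for your version.

```python
from importlib import metadata

def installed_version(package):
    """Return the installed version string of a package, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None

def major_version(version_string):
    """Return the leading integer of a version string such as '1.7.0'."""
    return int(version_string.split(".")[0])

version = installed_version("emoji")
if version is None:
    print("emoji is not installed")
elif major_version(version) >= 2:
    # Assumption: 2.x dropped get_emoji_regexp() in favour of helpers like emoji_list()
    print(f"emoji {version}: get_emoji_regexp() is probably unavailable")
else:
    print(f"emoji {version}: get_emoji_regexp() should be available")
```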
