Background
For a recent NLP project, I found myself working with a large number of tweets containing emoji. The goal of that project may be covered in a future post, but suffice it to say I needed a performant way of separating my emoji from my non-emoji data.
[Image: The Goal]
The wrong way to do it
If you're like me, when you see the problem above you'll end up with something like the code below.
import emoji  # A library containing all emoji

def filter_emoji(s):
    return "".join([c for c in s if c in emoji.UNICODE_EMOJI])

def exclude_emoji(s):
    return "".join([c for c in s if c not in emoji.UNICODE_EMOJI])

print(filter_emoji("Best time in Mexico 🇲🇽 I love tacos🌮😋"))
# >> 🇲🇽, 🌮, 😋
I was using the exact functions defined above for a while and they were working great! Until they broke...
Zero Width Joiner
When I ran the code above on my dataset, I kept finding that the most frequently occurring emoji was "" (an empty string). At first I thought this was a problem caused by using a Jupyter notebook. Later I thought it was an issue with my files' encoding. Eventually I came across an emoji named "Skin-tone-3" and I knew something had gone horribly wrong.
The Zero Width Joiner (ZWJ) is a special invisible character that was approved as part of Unicode 1.1 in 1993 and added to Emoji 11.0 in 2018. Its function is to create compound emoji by joining two or more emoji. Compound emoji are used to combine attributes into more complex emoji without introducing new characters.
Black man golfing = Golfer + Black Skin Tone + ZWJ + Man
More complicated than you were expecting?
My top result was "" because I was ripping apart larger, more complex emoji. This was not my intention and would lead to data integrity issues!
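To see why character-by-character filtering tears these emoji apart, it helps to list the code points inside one of them. The sketch below uses only Python's standard unicodedata module. The helper name describe_code_points is made up for illustration, and the emoji is written with escape sequences so the invisible joiner survives copy and paste.

```python
import unicodedata

# "Man golfing: dark skin tone", spelled out as escapes so that the
# invisible code points are not lost when the snippet is copied.
MAN_GOLFING_DARK = "\U0001F3CC\U0001F3FF\u200D\u2642\uFE0F"

def describe_code_points(s):
    """Return a (code point, Unicode name) pair for every code point in s."""
    return [(f"U+{ord(c):04X}", unicodedata.name(c, "<unnamed>")) for c in s]

for code_point, name in describe_code_points(MAN_GOLFING_DARK):
    print(code_point, name)
# U+1F3CC GOLFER
# U+1F3FF EMOJI MODIFIER FITZPATRICK TYPE-6
# U+200D ZERO WIDTH JOINER
# U+2642 MALE SIGN
# U+FE0F VARIATION SELECTOR-16
```

A list comprehension over the string visits each of these five code points separately, which is how fragments such as a lone skin-tone modifier or an invisible joiner can end up being counted as emoji in their own right.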
A better way
Now that we know the wrong way to do this, here is a better way and what I'm currently using:
import re

import emoji

emoji_regex = emoji.get_emoji_regexp()

def extract_emoji(s):
    return list(set(re.findall(emoji_regex, s)))

print(extract_emoji("this is a test 🏌🏿‍♀️, 🤵🏿, 👨‍👩‍👦"))
# Returns ["🏌🏿‍♀️", "🤵🏿", "👨‍👩‍👦"] (order may vary because of set())
In this new example we are using regular expressions! With the help of the get_emoji_regexp() function from the emoji library we can easily compile a regular expression of all emoji. We then use this compiled regular expression with the re.findall() function to find all occurrences of these emoji in our string.
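What keeps compound emoji whole in a regex like this is the order of the alternatives: longer sequences have to be tried before the single characters they contain. Here is a minimal sketch of that idea using only the standard library and a tiny hand-picked sample instead of the full emoji list. SAMPLE_EMOJI, build_emoji_regex and extract_sample_emoji are illustrative names, and this is not claimed to be how the emoji library builds its pattern internally.

```python
import re

WOMAN_GOLFING = "\U0001F3CC\U0001F3FF\u200D\u2640\uFE0F"   # golfer + skin tone + ZWJ + female sign
FAMILY = "\U0001F468\u200D\U0001F469\u200D\U0001F466"      # man + ZWJ + woman + ZWJ + boy

# Hypothetical sample: two compound emoji plus the single emoji they contain.
SAMPLE_EMOJI = [
    "\U0001F3CC",  # golfer
    "\U0001F3FF",  # dark skin tone modifier
    "\U0001F468",  # man
    "\U0001F469",  # woman
    "\U0001F466",  # boy
    WOMAN_GOLFING,
    FAMILY,
]

def build_emoji_regex(emoji_sequences):
    """Compile an alternation that tries the longest sequences first."""
    ordered = sorted(emoji_sequences, key=len, reverse=True)
    return re.compile("|".join(re.escape(e) for e in ordered))

SAMPLE_REGEX = build_emoji_regex(SAMPLE_EMOJI)

def extract_sample_emoji(s):
    """Return every sample emoji found in s, keeping compound emoji intact."""
    return SAMPLE_REGEX.findall(s)

text = "golf " + WOMAN_GOLFING + " and " + FAMILY
print(extract_sample_emoji(text) == [WOMAN_GOLFING, FAMILY])  # True
```

Python's re module tries alternatives from left to right, so if the single-character entries came first, the golfer and the skin tone would be matched on their own and the compound emoji would be split again.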
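In this case it is important that we use a recent release of the emoji library, so we can be sure we support the new emoji published by the Unicode Consortium. One caveat: get_emoji_regexp() was removed in version 2.0.0 of the library, so on newer releases a function such as emoji.emoji_list() is needed instead.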