Friday, February 5, 2021

An easy way to undersample Imbalanced ⚖️ Multi-Label 🏷 data


Recently I've been working on an NLP project using the fastai library, which I outlined in my last post. I'm working towards creating a Multi-Label Classifier, meaning that each output can contain a varying number of labels, each of which is a binary class. The Multi-Label nature of this project makes manipulating training data extra difficult. In this post, I'll address how I overcame some of the hurdles involving resampling and class imbalance.

What is imbalanced data?

Data is imbalanced when its classes don't occur with equal frequency. In the simplest case of binary classification, unequal class frequencies lead to a situation where there is a majority class and a minority class.

Class imbalance for binary data

Imbalanced data causes problems when we train models on it. Many popular models are developed with the assumption that class frequencies are roughly equal. This means that models trained on imbalanced data will favour predicting the majority class. These problems become worse when considering that oftentimes the class of interest is the minority class!

Let's say we were to develop an image classifier, classifying snakes as poisonous or non-poisonous. If we trained our model on data consisting primarily of non-poisonous snakes, we would have a high likelihood of predicting a poisonous snake as non-poisonous (a False-Negative). Not good!

For multi-class and multi-label applications the trend is the same. Classes with higher frequency are more likely to be predicted. The larger the imbalance the more of a problem this can become.

Addressing the problem

For the case of Binary and Multi-Class classification, the solution is simple. Either undersample the majority or oversample the minority classes. A lot of people have written about this before so I'll share a link to a Towards Data Science article.

Distribution of multi-class data
In the case of Multi-Label classification, things are a bit more difficult. What if a single row in our training data has a high-frequency AND a low-frequency label? In this case, we don't want to drop the row because we don't want to omit a rare label, but keeping it also means keeping another copy of a high-frequency label. In a complex case like this, we need a way to quantify the trade-off in keeping or removing a row.

The solution is to score each row based on the frequency of its labels and sample based on that score. A row consisting solely of rare emoji should have a very high rarity score; this way it is unlikely to be omitted.


Here is the code I used to implement the logic I detailed above. Please note that in the example below the label is a string containing comma-separated values (ex. "1, 2, 3"). This could be easily altered to work on a list; in that case we could remove the str.split() method.

# Calculate the frequency of our labels and normalize
freq = df.label.str.split(",").explode().value_counts(normalize=True)
# Calculate the rarity of the row based on the frequencies of its members
df["rarity"] = df.label.str.split(",").apply(lambda x: (1 / (sum([freq[emoji]**2 for emoji in x]) / len(x))))
# Sample 50% of the data, weighted by rarity (rarer rows are more likely to be kept)
df = df.sample(frac=0.5, weights="rarity")

Now let's take a deep dive into the code above and see how it works.

In the example above we have already run our rarity scoring on the multi-label column, yielding the rarity column. If we wanted to sample two observations from this data, we would drop our first row even though it contains some rare results. We drop it because the overall score of the row is the lowest. It is best to implement this solution iteratively. If we sample 50% of the data at a time and re-calculate the frequencies before each pass, we can be sure that the frequencies used in the calculation are accurate.
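To make the iterative idea concrete, here is a minimal sketch of how the scoring and sampling could be wrapped in a loop. The helper names rarity_scores and undersample, the toy labels, the fixed seed and the number of rounds are all assumptions made for illustration; only the scoring formula and the weighted df.sample call come from the code above.

```python
import pandas as pd


def rarity_scores(labels: pd.Series) -> pd.Series:
    """Score each row as 1 / mean(squared label frequency)."""
    split = labels.str.split(",")
    # Normalized frequency of every individual label
    freq = split.explode().value_counts(normalize=True)
    return split.apply(lambda x: 1 / (sum(freq[label] ** 2 for label in x) / len(x)))


def undersample(df: pd.DataFrame, frac: float = 0.5, rounds: int = 2, seed: int = 0) -> pd.DataFrame:
    """Repeatedly keep `frac` of the rows, re-scoring before every pass."""
    for _ in range(rounds):
        # Frequencies are recomputed on the rows that survived the last pass
        df = df.assign(rarity=rarity_scores(df["label"]))
        df = df.sample(frac=frac, weights="rarity", random_state=seed)
    return df


# Hypothetical toy data: "a" is common, "c" and "d" are rare
df = pd.DataFrame({"label": ["a,b", "a", "a", "a,c", "b", "a", "a,b", "d"]})
scores = rarity_scores(df["label"])
sampled = undersample(df, frac=0.5, rounds=2)
print(scores)
print(sampled)
```

Because the weights are recomputed on every pass, a label that was common at the start but has already been thinned out stops being penalized in later rounds.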

Wednesday, January 20, 2021

Plucking 🍓 Emoji from strings 🧶 in Python 🐍 without breaking them 🏺


For a recent NLP project, I found myself working with a large number of tweets containing emoji. The goal of this project may be covered in a future post but suffice to say I needed a performant way of separating my emoji from my non-emoji data. 

The Goal

The wrong way to do it

If you're like me, when you see the problem above you'll end up with something like the code below.
import emoji #A library containing all emoji

def filter_emoji(s):
    return "".join([c for c in s if c in emoji.UNICODE_EMOJI])

def exclude_emoji(s):
    return "".join([c for c in s if c not in emoji.UNICODE_EMOJI])

print(filter_emoji("Best time in Mexico 🇲🇽 I love tacos🌮😋"))
# >> 🇲🇽, 🌮, 😋

Above we are using the emoji library to help us generate a list of all known emoji. There are ways to do this without the use of a separate library, but I wanted to maintain as little code as possible.

I was using the exact functions defined above for a while and they were working great! Until they broke...

Zero Width Joiner

When I was using the code above on my dataset, I kept finding that the most frequently occurring emoji was "" (an empty string). At first I thought this was some sort of problem that came about because I was using a Jupyter notebook; later I thought it was an issue with my file's encoding. I eventually came across an emoji named "Skin-tone-3" and I knew something had gone horribly wrong.

The Zero Width Joiner (ZWJ) is a special character that was approved as part of Unicode 1.1 in 1993 and added to Emoji 11.0 in 2018. Its function is to create compound emoji by joining two or more emoji. Compound emoji are used to combine attributes to create more complex emoji without the introduction of new characters.

Black man golfing = Golfer + Black Skin Tone + ZWJ + Man
More complicated than you were expecting?

My top result was "" because I was ripping apart larger, more complex emoji. This was not my intention and would lead to data integrity issues!
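To see why the character-by-character approach tears emoji apart, here is a small sketch that uses only Python's standard unicodedata module. The family emoji is just an illustrative choice; the point is that a single visible emoji is really several code points glued together by Zero Width Joiners.

```python
import unicodedata

# One visible "family" emoji: MAN + ZWJ + WOMAN + ZWJ + BOY
family = "\U0001F468\u200D\U0001F469\u200D\U0001F466"

# Iterating over the string yields code points, not whole emoji
code_points = [unicodedata.name(c) for c in family]
print(len(family))   # 5
print(code_points)   # ['MAN', 'ZERO WIDTH JOINER', 'WOMAN', 'ZERO WIDTH JOINER', 'BOY']
```

Any filter that tests one character at a time will therefore see three separate people plus two joiners instead of one family.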

A better way

Now that we know the wrong way to do this, here is a better way and what I'm currently using: 
import re
import emoji

# get_emoji_regexp() is provided by the 1.x releases of the emoji library
emoji_regex = emoji.get_emoji_regexp()

def extract_emoji(s):
    # set() removes duplicates, so the order of the result is not guaranteed
    return list(set(re.findall(emoji_regex, s)))

print(extract_emoji("this is a test 🙍🏿‍♀️, 🤵🏿, 👨‍👩‍👦"))
# Returns ["🙍🏿‍♀️", "🤵🏿", "👨‍👩‍👦"]

In this new example we are using regular expressions! With the help of the get_emoji_regexp() function from the emoji library we can easily compile a regular expression of all emoji. We then use this compiled regular expression with the re.findall() function to find all occurrences of these emoji in our string.

In this case it is important that we use a recent release of the emoji library. By using a recent release we can be sure we are supporting the new emoji published by the Unicode Consortium.

Sunday, August 2, 2020

Taking an iterative approach to Data Science 🔄


A few months back I had an idea. Wouldn't it be cool to combine my two favourite interests, Food and Data? I got to thinking about different ways I could apply my skills to food-related applications. At the start, I struggled to find direction. How could I produce something meaningful from food data? A deep neural network generating recipes? Computer vision to detect the ideal roasted marshmallow for s'mores? I was at a total loss.
Then it hit me: data and food already live together in perfect harmony. Restaurant Menus were the key! They're a structured data source that pairs dishes with rich-text descriptions and numeric prices!
Restaurant Menus - The ideal way to combine my two interests. 

Problem Identification

What problem was I looking to solve though? My project wasn't going to be useful to my users unless it solved a problem they experience. I set out to better understand the process of designing and setting menus.

I reached out to different communities of chefs and restaurant owners on Reddit. The answers I received varied a lot. Different establishments followed different processes and there wasn't much of an industry standard. The process seems incredibly informal when considering that the menu is the most important marketing tool of a restaurant.

I noticed that restaurants with visually appealing menus hadn't designed them in-house. I learned that in North America it's common for restaurants to partner with a liquor supplier. As an added bonus to the deal, the liquor supplier may offer to help with designing the restaurant menu. The downside of this is that it locks restaurants into a menu and forbids rapid revisions.

With the information above I had scoped out the problem I wanted to solve. It seems that there's an opportunity to make a "Smart" web-based menu designer. This web app will remove the barriers that keep non-designers from putting together aesthetically pleasing menus in minutes. If the app gets any traction, I plan to utilize user data to train Machine Learning models to enhance the menus created. I'll call it Popinly.

Solutions Design

To build any intelligence into this product I know that I'm going to need a lot of data. Building Machine Learning into a product from its genesis is a daunting task. It is best to start simple and build in Data Science after an initial release. To decide on the best work plan for myself, I put together an end state for this product and I'll work backwards from there.

The End State

The end state of this product will incorporate two distinct Machine Learning features.

The first feature would be a "Description Enhancer". Given an item's title, this tool will recommend adjectives to enhance the description. To build this product I'll need a large dataset of menu items. Using this dataset I could produce embeddings to generate future predictions.

The Description Enhancer

The second feature would be a price predictor. Given an item's title, description and other features, this product will predict a fair price. This product needs an immense dataset as geography will play a large role in price prediction.

Machine Learning-based tools integrated into the app


To increase the odds of reaching the end state I plan to take an iterative approach to development. Great Machine Learning features need a solid foundation to stand on. I plan on releasing Popinly in four stages.

Release One - The minimum viable product (MVP). At this stage, users will only be able to design basic menus and export them as PDFs. I'm choosing PDFs as the default format as they are portable and should be a familiar file type to all users. I need a strong user base creating menus to generate the dataset required to produce Machine Learning based features.

I aim to release the MVP before I start any further developments. Any developments beyond this could change drastically based on user feedback.

Release Two - New Formats and Styles. This update will expand on the types of formats and styles of menus Popinly produces. I enjoy seeing the menus of my favourite restaurants on Instagram. I plan to provide an option to export menus as PNGs styled for Instagram stories.

Release Three - Description Enhancer. By this point, I aim to have some users and hopefully will have a sizeable dataset. I'll need these inputs to produce this feature.


Here is a link to my product and repo, feel free to follow along!

Saturday, July 4, 2020

Is America coming together or drifting apart with time? 😰


In part two of this series of articles, we set the stage for making graphs from our dataset using principal component analysis (PCA). In this post, we will finally get into the data and draw conclusions from our speech data. 

Diving into the Data

Figure 1 - Speeches by Decade
Since our dataset spans such a broad range (1789 - 2019), let's start by visualizing our data over time. To do this, we can aggregate our speeches by decade and plot the PCA components for each decade. The result of this can be seen in Figure-1. It is interesting to note that the decades don't just form a single trend. A given decade's closest neighbour can be sixty to one hundred years away from it; some interesting examples of this are 1800 - 1900 and 1990 - 2010. With more knowledge of American History, there are likely some interesting conclusions that could result from this data, but I'll leave it up to the reader to draw their own. It is worth noting that going from 100D to 3D introduces a lot of error, and all results should be taken with caution.

The Modern Era

In Figure-1, it can be seen that speeches tend to clump together with time. This trend is most evident in the modern era (1940s onwards) and can be seen in greater detail in Figure-2. The strong clumping of decades suggests that there has been consistent word usage in the modern era. If we are to take Figure-2 at face value, the tighter grouping of the Modern Era (Yellow circle) shows that word usage has become less diverse. In contrast, the Previous Era (Red circle) shows that each decade before 1940 exhibited its own distinct word usage.

Figure 2 - Modern vs previous era comparison

Digging Deeper

One of the most intuitive ways to compare eras is to analyze word usage. In Figure-3, we can see a comparison of the top 10 most frequently used words by era. Next to each word in the modern era, we can see how the frequency of each word has altered between eras.

Figure 3 - Modern vs. Previous Rhetoric

Comparing the word usage of the modern era to the previous, several trends jump out to me. Firstly, it seems that in the modern era there are a lot more mentions of people (People, Us, President, American). Secondly, there appears to be much more urgency with the addition of "must" in the modern era. Thirdly, the addition of more geographic words (World and America) adds a more global flavour to the modern era. These three characteristics (Individualism, Urgency and Geography) are the central themes that appear most in the modern era.

In the previous era, the unique rhetoric consists of States, Government, United, Congress and Country. To me, these words all seem very common and can be grouped together as "government-y" words. In my opinion, these words appeal more to government figures (senators, congresspeople, etc) than to the people. These differences make sense when contrasting the communication mediums of the time (absence of radio and television).

Thursday, March 5, 2020

How can we see in high-dimensional space? 🙈


In part one of this series, we took presidential speech data and through the process of Stylometry, we calculated numeric fingerprints for each speech. In this post, we will cover the mathematical process needed to set the stage for visualizations in further posts.

Note: This is an optional post and is very math-heavy. Feel free to skip to post three if you aren't interested in the underlying technology. 

How can we visualize 100 Dimensional Data?

The result of Part One was fingerprints of our speech dataset. These fingerprints cannot easily be visualized as they are high dimensional data. Data is considered high dimensional if its dimensionality is "staggeringly high." In our case, 100 dimensions is very high. Knowing this limitation, it would be handy if there were an easy way of reducing the dimensionality of our data set so we can visualize it. Thankfully we can do this using principal component analysis (PCA). 

To understand PCA, let's look at some graphs. Below in figure-1, we can see an animation orbiting 3-D data. From each of these perspectives, the data looks different, but the underlying data never changes.

Figure 1 - Orbiting 3D Data
How would we go about projecting this data into 2-D? Each perspective we view the 3-D data from results in a different 2-D projection!

Let's apply some real-world intuition. If we were in a room with this cloud of data, we could move around it and shine a flashlight 🔦 onto it. The shadow cast from the flashlight is a 2-D projection of our 3-D data. The goal of PCA is to produce the most accurate projection of our data.

The projection is accurate when it preserves as much of the variance as possible. To apply this to our real-world example: We want the shadow cast by our flashlight to be as tall and wide as possible.

Figure 2 - Result of our PCA: 2D representation of 3D data
No matter how good our PCA is, there is always error involved in reducing dimensionality. This error is essential to keep in mind when making significant reductions in dimensionality (like we're doing), as trends in the data may be overlooked.
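The flashlight intuition can be sketched in a few lines of NumPy. This is a minimal illustration rather than the code behind the figures: the point cloud is randomly generated, and the variable names are assumptions. It centers the data, finds the principal directions with a singular value decomposition, projects onto the top two, and reports how much variance the 2-D "shadow" keeps.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical 3-D cloud: wide along one axis, narrower along the second, almost flat along the third
points = rng.normal(size=(500, 3)) * np.array([5.0, 2.0, 0.3])

# PCA works on centered data
centered = points - points.mean(axis=0)

# Rows of vt are the principal directions, ordered by the variance they capture
_, singular_values, vt = np.linalg.svd(centered, full_matrices=False)

# Project onto the two strongest directions: the widest possible "shadow"
projection = centered @ vt[:2].T

# Share of the total variance carried by each direction
explained = singular_values**2 / np.sum(singular_values**2)
retained = explained[:2].sum()
error = 1 - retained
print(projection.shape, retained, error)
```

The error term here is simply the variance that lived along the discarded direction, which is why aggressive reductions (such as 100-D down to 2-D or 3-D) deserve extra caution.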

The Results

In figure-2, we can see the result of PCA on our 3-D data from figure-1. This projection of our dataset contains the trend visible in the original data and does a good job preserving variance.

Applying PCA to our fingerprint Data

To apply PCA to our fingerprint data, we utilize the scikit-learn package in Python, which gives us an easy way to apply our model to our data. We can pass our high-dimensional data to the function fingerprint_pca below, and it returns our PCA components as columns alongside the original data.

import pandas as pd
from sklearn.decomposition import PCA

def fingerprint_pca(data, n_components, fingerprint_col):
    """Return Principal Component Analysis of a fingerprint column.
    Fingerprint column is expected to contain a series of lists

    Keyword arguments:
    data -- dataframe containing stylometric fingerprint data
    n_components -- number of components to return (up to 3)
    fingerprint_col -- column name containing fingerprint data
    """
    def _unnester(df, explode):
        """Return an unnested series of columns given a column of nested lists"""
        df1 = pd.concat([pd.DataFrame(df[x].tolist(), index=df.index)
            .add_prefix(x) for x in [explode]], axis=1)
        return df1.join(df.drop(explode, axis=1), how='left')
    df_explode = _unnester(data, fingerprint_col)
    # The unnested fingerprint values occupy the first 100 columns
    features = df_explode.iloc[:, 0:100].to_numpy()
    pca_model = PCA(n_components).fit(features)
    pca_res = pca_model.transform(features)
    # Name the component columns and attach each row's uid so the
    # components can be merged back onto the original data
    pca = pd.DataFrame(pca_res, columns=["x", "y", "z"][:n_components])
    pca["uid"] = data["uid"].to_numpy()
    return pca.merge(data, on="uid", how="inner")

df_pca = fingerprint_pca(data=df, n_components=2, fingerprint_col="fingerprint")
Note: The function above is a specific application of PCA to our fingerprint data. For a more generic example, reference the API documentation.

At this point, it's OK if you don't understand PCA; sometimes I feel like I don't understand it myself. That being said, it's an essential transformation and will be used for visualizations in future posts.

Tuesday, February 25, 2020

Listening to dead presidents 👂


I've always been a horrible speller (you probably know this by now if you're reading this blog). When I was in grade school, English was always my lowest mark, and I've never enjoyed reading fiction. As a consequence of this, I'm mystified by the intersections of Data Science and Written Language, known as Natural Language Processing (NLP). Knowing that there are chatbots that communicate at a higher grade level than myself, I figure that I should understand how they work.

A while back, I heard a theory that Shakespeare might have been a pen name for multiple authors and that NLP could be a tool used to validate this theory. The specific branch of NLP related to this topic is Stylometry, a technique used to analyze writing styles. I figured this could be an engaging starting point.

I figured presidential speeches would be an interesting dataset as it contained a variety of authors, spanned a long time range, and included a wide breadth of subject matter. Finding a complete set of speeches was next to impossible, so I reached out to The Miller Center, a nonpartisan affiliate of the University of Virginia that specializes in presidential scholarship and political history. I received a response incredibly quickly (thanks, Miles J.) and got to work on applying Stylometry to the dataset.

Preprocessing and Exploration

To start, we need to load our data. Since the file is JSON, each row is a key-value pair that we need to expand. While we're at it, we can parse our dates and create a unique identifier for each speech; this will come in handy later.

cols = ["title", 

# Expand dictionary
df[cols] = df.speeches.apply(pd.Series)

# Drop unexpanded dictionary
df = df.drop('speeches', axis=1)

# Parse Datetime and expand
df.loc[:,"date"] = pd.to_datetime(
df.loc[:,"year"] = pd.DatetimeIndex(
df.loc[:,"month"] = pd.DatetimeIndex(
df.loc[:,"day"] = pd.DatetimeIndex(
df.loc[:,"decade"] = divmod(df["year"], 10)[0] * 10

# Create Unique ID column to use for joining
df["uid"] = pd.util.hash_pandas_object(df)
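As a rough, self-contained sketch of the date handling and ID step, assuming a frame that already has title and date columns (the two rows below are stand-ins for illustration, not the Miller Center data):

```python
import pandas as pd

# Hypothetical stand-in for the expanded speech data
df = pd.DataFrame({
    "title": ["First Inaugural Address", "Farewell Address"],
    "date": ["1789-04-30", "1796-09-19"],
})

# Parse the date strings, then split them into their parts
df["date"] = pd.to_datetime(df["date"])
df["year"] = df["date"].dt.year
df["month"] = df["date"].dt.month
df["day"] = df["date"].dt.day

# Floor each year to the start of its decade (1789 -> 1780)
df["decade"] = df["year"] // 10 * 10

# One hash per row, usable as a join key later on
df["uid"] = pd.util.hash_pandas_object(df)
print(df)
```

The floor division gives the same result as the divmod call above; it is only a slightly shorter way to write it.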

I decided to join the data with another data source. I had a hunch that the political party of the president may be an insightful feature to aggregate on. This feature may not come in handy, but getting the mappings isn't too bad, as you can find them on Wikipedia.

I'm not an expert in US politics, but the figure to the left seems to line up with my intuition. Visualizing the number of total speeches by party, we can see that the majority of our data fall under either the Democratic or Republican parties. Since these are the two major parties of the modern era, this result makes sense.

Looking at speeches by decade over time, we can see another trend. It seems that as time progresses, so does the frequency of speeches. Again this isn't surprising.


Disclaimer: I'm not an expert in Stylometry; in fact, before starting this article, I didn't even know what it was called. Please take everything I say with caution. 

The speeches in our dataset come in a variety of different formats and require a fair amount of preprocessing before we can make comparisons. Some of the data contain HTML formatting, speaker labels or other annotations. To work around this, we trim all punctuation and stop words (uninteresting words such as "and," "or," "so") from the text using the function seriesToCorpus below.

import pandas as pd # For detecting Series inputs
from nltk.tokenize import word_tokenize # For NLP word tokenizing speeches
from nltk.corpus import stopwords # For filtering out boring words
import string # For filtering out punctuation

def seriesToCorpus(input_text, LOWER=True, TRIM_PUNCT=True):
    """Remove stopwords and punctuation from text to produce tokens
    Keyword arguments:
    input_text -- list or series of lists containing free text speeches
    LOWER -- flag to set all tokens to lower-case (Default: True)
    TRIM_PUNCT -- flag to remove all punctuation from tokens (Default: True)
    """
    # Allow for both Series and Cell inputs to ensure consistent processing
    if isinstance(input_text, pd.Series):
        input_text = input_text.to_list()
    else:
        input_text = [input_text]
    # Load common english stop words (and, an, the, who)
    stop_words = set(stopwords.words('english'))

    # If flag is set make all tokens lower-case
    if LOWER:
        tok_speech = [word_tokenize(speech.lower()) for speech in input_text]
    else:
        tok_speech = [word_tokenize(speech) for speech in input_text]
    # Flatten the per-speech token lists into a single list of tokens
    output = list()
    for speech in tok_speech:
        output.extend(speech)
    # If flag is set trim all punctuation
    if TRIM_PUNCT:
        punct = list(string.punctuation)+["--","'s","’","\'\'","``"]
        output = [w for w in output if not w in punct]
    # Drop stop words from the remaining tokens
    output = [w for w in output if not w in stop_words]

    return output

corpus = seriesToCorpus(df.text)
df['tokenized'] = df.apply(lambda x: seriesToCorpus(x.text), axis=1)

This function seriesToCorpus provides us with a way to tokenize our text consistently. These tokens are what we will use later on for further analysis. Without a comprehensive method to tokenize text, we will end up with garbage text making its way to our output and diluting our results. For example, if we did not trim off stop words, we would end up using conjunctions as the basis for our comparison in the next few steps.

If we are to combine all of our tokenized speeches, then we end up with a big list of words that we can use for comparisons. We'll refer to this as the corpus.

Tokenizing speeches and converted to a joint sorted corpus

With our corpus, we can scan for the top one hundred most frequently occurring words and use that to make comparisons. Since the corpus, by definition, contains every speech in our dataset, the top terms are an ideal way to make comparisons.

With the top words, we can make fingerprints for each speech. We do so by taking each of our most frequently occurring results from the corpus and calculating the frequency it occurs for a given speech. For our top hundred words, this gives us a 1 by 100 fingerprint for each speech. These tasks are accomplished by the functions getTopFreqWords and getFingerprint below.

from nltk.probability import FreqDist # For Frequency Distributions

def getTopFreqWords(corpus, n):
    """Return the n most frequently occurring words from a body of text"""
    fdist = FreqDist(corpus)
    return [x[0] for x in fdist.most_common(n)]

def getFingerprint(text, topWordsList):
    """Return a list of frequencies for each string in a list of strings"""
    # Token counts for this speech
    fdist = FreqDist(text)
    output = list()
    for text_key in topWordsList:
        # Loop body assumed: share of the speech's tokens equal to text_key
        output.append(fdist.freq(text_key))
    return list(output)

top_hundred = getTopFreqWords(corpus, 100)
df['fingerprint'] = df.apply(lambda x: getFingerprint(x.tokenized, top_hundred), axis=1)
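To make the fingerprint idea easier to check by hand, here is a tiny standard-library sketch of the same two steps. The function names, the two toy "speeches" and the choice of two top words (instead of one hundred) are assumptions made for illustration.

```python
from collections import Counter


def top_words(corpus, n):
    """Return the n most frequent tokens in the joint corpus."""
    return [word for word, _ in Counter(corpus).most_common(n)]


def fingerprint(tokens, top):
    """Return the share of a speech's tokens taken up by each top word."""
    counts = Counter(tokens)
    total = len(tokens)
    return [counts[word] / total if total else 0.0 for word in top]


# Two hypothetical, already tokenized speeches
speeches = [
    ["people", "must", "people", "world"],
    ["congress", "government", "congress", "people"],
]
corpus = [token for speech in speeches for token in speech]

top = top_words(corpus, 2)
prints = [fingerprint(speech, top) for speech in speeches]
print(top)     # ['people', 'congress']
print(prints)  # [[0.5, 0.0], [0.25, 0.5]]
```

Each speech ends up as a fixed-length list of ratios, so speeches of very different lengths can still be compared directly.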

Double Checking (Optional)

At this point, we've made a lot of changes to our dataset. Each operation introduces the potential for bugs that can skew our results. To account for this, we can reference back to our UID column to ensure that the changes we made to our dataset went as planned. Calling the sum of the UID column before and after the transformations above produced the same result. Obtaining the same sum means that no rows have been inserted or removed, only modified as we planned.

# Returns 8702334536768193124

# Data processing goes here

# Returns 8702334536768193124
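A minimal sketch of this kind of check, assuming a small stand-in frame (the column names and values are hypothetical): sum the uid column, run a transformation, and sum it again.

```python
import pandas as pd

# Hypothetical stand-in for the speech data
df = pd.DataFrame({"text": ["four score", "ask not", "tear down"]})
df["uid"] = pd.util.hash_pandas_object(df)

# Checksum before processing
before = int(df["uid"].sum())

# A transformation that only adds a column leaves every row in place
df["tokenized"] = df["text"].str.split()
after = int(df["uid"].sum())

# Losing a row changes the checksum
after_drop = int(df.iloc[1:]["uid"].sum())

print(before == after)       # True
print(before == after_drop)  # False
```

The sum is only a cheap sanity check. It confirms that the same set of rows is present, not that the contents of the other columns are correct.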

Wrapping things up

With the fingerprints we just made, we now have a numeric representation of every speech. This numeric array should seem a bit more familiar and lends itself a bit better to being used for further data processing and comparisons (Clustering, PCA and other analysis). In the next articles, we will use these fingerprints to compare the speech styles of parties, presidents and eras.

TLDR: Calculated the ratio of top word occurrences for a bunch of presidential speeches. Next, we'll make pretty graphs with it.

Tuesday, June 18, 2019

"Hacking" Websites for all their assets

Recently I re-watched the movie The Social Network. YouTube had been recommending me clips of the film for a while, so I finally broke down and watched it. One of my favourite scenes from the movie is the hacking montage, where Mark Zuckerberg uses some "wget magic" to download all the images off a website. I've recently been working on a project that crossed into the wget domain, so I'll cover some of my learnings here.

Disclaimer: Misusing wget can get your IP banned from sites. Always check the robots.txt file.

Before I get any further, it is essential to understand the robots.txt file. Robots.txt serves as a way to tell robots (wget and other scrapers) visiting the site where they are allowed to go. By default, wget will read this file and ignore files and folders it's told to stay away from. This tendency to follow the rules can be turned off by specifying the flag -e robots=off. It is considered proper etiquette to leave this on, though, as I can only imagine how annoying it would be to be a web admin and have someone use wget to spam you continuously.
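For a feel of how a robots.txt file is interpreted, here is a small sketch using Python's built-in urllib.robotparser. The rules and the example.com URLs are made up for illustration; a real check would read the target site's own robots.txt.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: every robot must stay out of /private/
rules = """User-agent: *
Disallow: /private/
""".splitlines()

parser = RobotFileParser()
parser.parse(rules)

# Ask whether a given robot may fetch a given URL
allowed = parser.can_fetch("wget", "https://example.com/sprites/sheet.png")
blocked = parser.can_fetch("wget", "https://example.com/private/sheet.png")
print(allowed)  # True
print(blocked)  # False
```

This is roughly the decision wget makes on our behalf for every link during a recursive download when robots handling is left on.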

The Mission

My goal with this task was to get sprite-maps that could be used in a later image recognition project. The issue with anything related to image recognition (or ML more broadly) is that you need an extensive data set to get started. To overcome this issue, I found a website that crowdsources sprite-maps from retro games. I went through a few sites but figured if I wanted to avoid the problems with duplication, I'd have to stick to one. In the end, I settled on Spriters Resource because they had a lot of pictures, and I wouldn't be breaking their robots.txt.

The Scrape

Our target organizes its images by the console each one originated from. This feature is handy because it means we can bound our retrieval to make sure we aren't getting any data we don't require. Once we figure out our console of interest, we can utilize some recursive features to crawl down the file tree for said console.

The command:

wget -nd -p -r -np -w 1 -A png

wget: The utility we're going to use for scraping. See docs here

-nd: Flattens the file hierarchy, if we retrieve nested files they will be placed in the current folder

-p: Download all files referenced on a page; without this we will just have references to images

-r: Enable recursive downloading, this makes wget crawl down the file structure

-np: Bound wget to the parent directory when retrieving recursively, this makes sure we don't follow links to places we don't want to go

-w 1: Wait one second between retrievals, this slows us down a lot but makes sure we aren't spamming

-A png: Accepted file extensions; this makes sure we will only save pictures.

- Our base URL to start at

In the command above, if we switch the "base-folder/" text with any console contained by the website, then we can retrieve all of our resources. Happy Scraping!