
Tuesday, February 25, 2020

Listening to dead presidents 👂

Preamble

I've always been a horrible speller (you probably know this by now if you're reading this blog). When I was in grade school, English was always my lowest mark, and I've never enjoyed reading fiction. As a consequence, I'm mystified by the intersection of Data Science and Written Language, known as Natural Language Processing (NLP). Knowing that there are chatbots that communicate at a higher grade level than I do, I figure I should understand how they work.

A while back, I heard a theory that Shakespeare might have been a pen name for multiple authors and that NLP could be a tool used to validate this theory. The specific branch of NLP related to this topic is Stylometry, a technique used to analyze writing styles. I figured this could be an engaging starting point.

I figured presidential speeches would be an interesting dataset as it contained a variety of authors, spanned a long time range, and included a wide breadth of subject matter. Finding a complete set of speeches was next to impossible, so I reached out to The Miller Center, a nonpartisan affiliate of the University of Virginia that specializes in presidential scholarship and political history. I received a response incredibly quickly (thanks, Miles J.) and got to work on applying Stylometry to the dataset.

Preprocessing and Exploration

To start, we need to load our data. Since the file is JSON, each row is a key-value pair that we need to expand. While we're at it, we can parse our dates and create a unique identifier for each speech; this will come in handy later.

cols = ["title", 
        "intro",
        "president",
        "date",
        "text"]

# Expand dictionary
df[cols] = df.speeches.apply(pd.Series)

# Drop unexpanded dictionary
df = df.drop('speeches', axis=1)

# Parse Datetime and expand
df.loc[:,"date"] = pd.to_datetime(df.date)
df.loc[:,"year"] = pd.DatetimeIndex(df.date).year
df.loc[:,"month"] = pd.DatetimeIndex(df.date).month
df.loc[:,"day"] = pd.DatetimeIndex(df.date).day
df.loc[:,"decade"] = divmod(df["year"], 10)[0] * 10

# Create Unique ID column to use for joining
df["uid"] = pd.util.hash_pandas_object(df)

I decided to join the data with another data source. I had a hunch that the political party of each president might be an insightful feature to aggregate on. This feature may not come in handy, but getting the mappings isn't too bad, as you can find them on Wikipedia.
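
To give a sense of what that join looks like, here's a minimal sketch; the dictionary and the "party" column name are placeholders for illustration, not the actual mapping file I used.

# Illustrative president-to-party mapping (only a few entries shown)
party_map = {
    "George Washington": "Unaffiliated",
    "Abraham Lincoln": "Republican",
    "Franklin D. Roosevelt": "Democratic",
}

# Attach each speech's party as a new column
df["party"] = df["president"].map(party_map)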

I'm not an expert in US politics, but the figure to the left seems to line up with my intuition. Visualizing the total number of speeches by party, we can see that the majority of our data falls under either the Democratic or Republican parties. Since these are the two major parties of the modern era, this result makes sense.

Looking at speeches by decade, we can see another trend: as time progresses, so does the frequency of speeches. Again, this isn't surprising.
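
For reference, the aggregations behind those two figures are one-liners (using the party column from the sketch above); the plotting calls are omitted here.

# Count speeches per party and per decade for the two figures above
speeches_by_party = df.groupby("party")["uid"].count()
speeches_by_decade = df.groupby("decade")["uid"].count()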

Stylometry

Disclaimer: I'm not an expert in Stylometry; in fact, before starting this article, I didn't even know what it was called. Please take everything I say with caution. 

The speeches in our dataset come in a variety of different formats and require a fair amount of preprocessing before we can make comparisons. Some of the data contain HTML formatting, speaker labels or other annotations. To work around this, we trim all punctuation and stop words (uninteresting words such as "and," "or," "so") from the text using the function seriesToCorpus below.

from nltk.tokenize import word_tokenize # For NLP word tokenizing speeches
from nltk.corpus import stopwords # For filtering out boring words
import string # For filtering out punctuation

def seriesToCorpus(input_text, LOWER=True, TRIM_PUNCT=True):
    """Remove stopwords and punctuation from text to produce tokens
    
    Keyword arguments:
    input_text -- list or series of lists containing free text speeches 
    LOWER -- flag to set all tokens to lower-case (Default: True)
    TRIM_PUNCT -- flag to remove all punctuation from tokens (Default: True)
    """
    
    # Allow for both Series and single-cell inputs to ensure consistent processing
    if isinstance(input_text, pd.Series):
        input_text = input_text.to_list()
    else:
        input_text = [input_text]
    
    # Load common english stop words (and, an, the, who)
    stop_words = set(stopwords.words('english')) 

    # If flag is set make all tokens lower-case
    if LOWER:
        tok_speech = [word_tokenize(speech.lower()) for speech in input_text]
    else: 
        tok_speech = [word_tokenize(speech) for speech in input_text]
                        
    output = list()
    for speech in tok_speech:
        output.extend(speech)
        
    # If flag is set trim all punctuation 
    if TRIM_PUNCT:
        punct = list(string.punctuation)+["--","'s","’","\'\'","``"]
        output = [w for w in output if not w in punct] 
        
    # Remove common stop words from the token list
    output = [w for w in output if not w in stop_words]

    return output

corpus = seriesToCorpus(df.text)
df['tokenized'] = df.apply(lambda x: seriesToCorpus(x.text), axis=1)

This function seriesToCorpus provides us with a way to tokenize our text consistently. These tokens are what we will use later on for further analysis. Without a consistent method to tokenize text, garbage makes its way into our output and dilutes our results. For example, if we did not trim off stop words, we would end up using conjunctions as the basis for our comparison in the next few steps.
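
As a quick sanity check, here's what the function does to a made-up sentence (not from the dataset); the exact output depends on NLTK's English stop word list.

# Illustrative call on a made-up sentence
seriesToCorpus("The state of the union is strong, and our future is bright.")
# -> ['state', 'union', 'strong', 'future', 'bright']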

If we combine all of our tokenized speeches, we end up with a big list of words that we can use for comparisons. We'll refer to this as the corpus.

Speeches tokenized and combined into a single sorted corpus

With our corpus, we can scan for the one hundred most frequently occurring words and use those to make comparisons. Since the corpus, by definition, contains every speech in our dataset, its top terms give us a common basis for comparing speeches.

With the top words, we can make a fingerprint for each speech. We do so by taking each of the most frequently occurring words from the corpus and calculating how often it occurs in a given speech. For our top hundred words, this gives us a 1 by 100 fingerprint for each speech. These tasks are accomplished by the functions getTopFreqWords and getFingerprint below.

from nltk.probability import FreqDist # For Frequency Distributions

def getTopFreqWords(corpus, n):
    """Return the n most frequently occurring words from a body of text"""
    fdist = FreqDist(corpus)
    return [x[0] for x in fdist.most_common(n)]

def getFingerprint(text, topWordsList):
    """Return a list of frequencies for each string in a list of strings"""
    # Build the distribution once, then look up each top word's relative frequency
    fdist = FreqDist(text)
    return [fdist.freq(text_key) for text_key in topWordsList]

top_hundred = getTopFreqWords(corpus, 100)
df['fingerprint'] = df.apply(lambda x: getFingerprint(x.tokenized, top_hundred), axis=1)
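
For clarity, FreqDist.freq returns a relative frequency (the count of a token divided by the total number of tokens), not a raw count; a toy example:

# 'a' appears 2 times out of 4 tokens
FreqDist(['a', 'b', 'a', 'c']).freq('a')  # 0.5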


Double Checking (Optional)

At this point, we've made a lot of changes to our dataset. Each operation introduces the potential for bugs that can skew our results. To account for this, we can reference back to our UID column to ensure that the changes we made to our dataset went as planned. Calling the sum of the UID column before and after the transformations above produced the same result. Obtaining the same sum means that no rows have been dropped or duplicated; the only changes are the new columns we added on purpose.


print(df.uid.sum())
# Returns 8702334536768193124

# Data processing goes here

print(df.uid.sum())
# Returns 8702334536768193124

Wrapping things up

With the fingerprints we just made, we now have a numeric representation of every speech. This numeric array should feel a bit more familiar and lends itself better to further processing and comparison (clustering, PCA and other analyses). In the next articles, we will use these fingerprints to compare the speech styles of parties, presidents and eras.
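
As a preview of what that processing looks like, the fingerprints stack neatly into a matrix; a minimal sketch, assuming NumPy is available:

import numpy as np

# One row per speech, one column per top word
X = np.vstack(df["fingerprint"].to_list())
print(X.shape)  # (number of speeches, 100)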

TLDR: Calculated the ratio of top word occurrences for a bunch of presidential speeches. Next, we'll make pretty graphs with it.


Tuesday, June 18, 2019

"Hacking" Websites for all their assets

Recently I re-watched the movie The Social Network. YouTube had been recommending me clips of the film for a while, so I finally broke down and watched it. One of my favourite scenes from the movie is the hacking montage, where Mark Zuckerberg uses some "wget magic" to download all the images off a website. I've recently been working on a project that crossed into the wget domain, so I'll cover some of what I learned here.

Disclaimer: Misusing wget can get your IP banned from sites. Always check the robots.txt file.

Before I go any further, it is essential to understand the robots.txt file. Robots.txt serves as a way to tell robots (wget and other scrapers) visiting the site where they are allowed to go. By default, wget will read this file and ignore files and folders it's told to stay away from. This tendency to follow the rules can be turned off by specifying the flag -e robots=off. It is considered proper etiquette to leave it on, though, as I can only imagine how annoying it would be to be a web admin and have someone use wget to spam you continuously.
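
A quick way to check a site's rules before scraping is to print the file itself (the target here matches the site used later in this post):

# Fetch robots.txt and dump it to the terminal
wget -q -O - https://www.spriters-resource.com/robots.txt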

The Mission

My goal with this task was to get sprite-maps that could be used in a later image recognition project. The issue with anything related to image recognition (or ML more broadly) is that you need an extensive dataset to get started. To overcome this, I found a website that crowdsources sprite-maps from retro games. I went through a few sites, but figured that if I wanted to avoid problems with duplication, I'd have to stick to one. In the end, I settled on Spriters Resource because they had a lot of pictures and I wouldn't be breaking their robots.txt.

The Scrape

Our target organizes its images by the console they originated from. This is handy because it means we can bound our retrieval and make sure we aren't grabbing any data we don't need. Once we figure out our console of interest, we can use wget's recursive features to crawl down the file tree for that console.

The command:


wget -nd -p -r -np -w 1 -A png https://www.spriters-resource.com/base-folder/

wget: The utility we're going to use for scraping. See docs here

-nd: Flattens the file hierarchy, if we retrieve nested files they will be placed in the current folder

 -p: Download all files referenced on a page, without this we will just have references to images

-r: Enable recursive downloading, this makes wget crawl down the file structure

-np: Bound wget to the parent directory when retrieving recursively, this makes sure we don't follow links to places we don't want to go

-w 1: Wait one second between retrievals, this slows us down a lot but makes sure we aren't spamming

 -A png: Accepted file extensions, makes sure we will only save pictures.

https://www.spriters-resource.com/base-folder/: Our base URL to start at

In the command above, if we swap the "base-folder/" text for any console folder on the website, we can retrieve all of our resources; an example is shown below. Happy scraping!
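
For example, if the console of interest were the SNES (the folder name here is hypothetical; check the actual URL on the site first), the call would look like:

wget -nd -p -r -np -w 1 -A png https://www.spriters-resource.com/snes/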

Friday, May 24, 2019

Do these numbers scare you as well?

The gif in question
Recently, while browsing math-related Wikipedia pages, I came across a very interesting gif. The image was on the Wikipedia page for Pascal's Triangle, and at first glance it seemed completely out of place. I thought it must have been some sort of last-minute edit to sneak a spooky gif into an otherwise technical Wikipedia page.

My first thought was that the gif was some sort of creepy mutating dome of retro video game sprites. Each frame looks eerily familiar, and I felt like I'd seen it all before.

Attempting to learn more about the image proved to be an unfruitful endeavour. Reading the rest of the article and scanning the profile of the submitter yielded nothing. The article's edit history contained some discussion from the editors asking how the gif got there, but nothing else.

Not only did the image seem out of place, but it came alongside a very technical description that took me a while to wrap my head around. I'll include the description below, along with a few images to try to show what it's describing.

Each frame represents a row in Pascal's triangle. Each column of pixels is a number in binary with the least significant bit at the bottom. Light pixels represent ones and the dark pixels are zeroes.

Pascal's Triangle

Binary representation of row 5
Frame 5 of the gif
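
To make the encoding concrete, here's the mapping for the row shown above (assuming rows are counted from one, so row 5 is 1 4 6 4 1):

# Row 5 of Pascal's triangle written in binary; read each string as a
# column of pixels with the least significant bit at the bottom
row5 = [1, 4, 6, 4, 1]
print([format(n, 'b') for n in row5])  # ['1', '100', '110', '100', '1']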

I hope the images above speak to you, but if they don't do anything for you, then the entirety of this article can be simplified to something along the lines of "math used to make interesting images that look similar to retro video game sprites".

Once I had an idea as to what I was looking at, I set out to recreate the image in Python. I'll go over some of the more important parts of the code here. A link to the complete repo, with instructions for getting it running yourself, can be found on my GitHub.

We start by generating Pascal's Triangle. I got a lot of help from Stack Overflow here to get started, but modified the code to produce binary numbers.

def make_triangle(n_rows):
    """Return pascals triangle in binary.

    Keyword arguments:
    n_rows -- the number of rows in the triangle
    """
    # Function to add binary numbers
    def bin_add(*args): return bin(sum(int(x, 2) for x in args))[2:]

    results = []
    for _ in range(n_rows):
        row = [bin(1)]  # Append a binary 1 to the start of a row
        if results:  # If there are existing results (row > 1)
            last_row = results[-1]
            # The following is just a fancy way to say "For each result in the last row add it with its neighbor"
            # Zip functions collects the previous row with itself and a version indexed one element ahead
            # The bin_add(*pair) unpacks the pair and calls the bin_add function with the results
            row.extend([bin_add(*pair) for pair in zip(last_row, last_row[1:])])
            row.append(bin(1))  # Append a binary 1 to the end of a row
        results.append(row)
    return results
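
A quick peek at the output helps explain a quirk that comes up later: the leading ones keep their bin() prefix, while the computed entries are bare binary strings. gen_frame cleans this up when it filters each string for digits.

print(make_triangle(4))
# [['0b1'], ['0b1', '0b1'], ['0b1', '10', '0b1'], ['0b1', '11', '11', '0b1']]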

Once we have Pascal's triangle generated, we can take a row of it and use it to generate a frame. This was one of those problems where, in retrospect, I could have saved a lot of time if I'd spent more time planning. Instead, I jumped right into it and chose the wrong data types to represent some of the elements.

If I had to redo it all from scratch, I'd probably use a NumPy array to represent the screen. I believe some of the functionality in NumPy would have made it easier to deal with the cases where a row of Pascal's triangle is either very short or very long (the two "if" statements in gen_frame below). Regardless, implementing it all with just str and list makes me appreciate the SciPy libraries even more.

from math import floor, ceil  # For centring the frame
from struct import pack  # For packing pixel values into bytes
from PIL import Image  # For building and saving each frame

# SIZE is a (width, height) tuple defined elsewhere in the repo
def gen_frame(row, filename, frame_dim, interpol):
    """Return an image for a given a row of pascal triangle."""
    frame = []

    # For each element in a row (represented as binary_strs) unpack it into a list of single 0's and 1's
    for binary_str in row:
        binary_str = ''.join(str(i) for i in binary_str if i.isdigit())
        binary_list = list(binary_str.zfill(SIZE[0]))
        binary_list = [int(i) for i in binary_list]

        # If the binary_list is longer than the output dimensions, trim off the LSBs
        if len(binary_list) > SIZE[0]:
            binary_list = binary_list[:SIZE[0]]

        # Append the binary_list to the frame
        frame.extend(binary_list)

    # If the frame doesn't fill the full canvas, pad the remaining space with 0's
    l_append = [0]*(floor((SIZE[0] - len(row))/2))*SIZE[1]
    r_append = [0]*(ceil((SIZE[0] - len(row))/2))*SIZE[1]
    canvas = l_append+frame+r_append

    # If the canvas exceeds the output size, trim off the edges
    if len(canvas) > (SIZE[0]*SIZE[1]):
        offset = (((len(frame))-(SIZE[0])*SIZE[1]))/2

        # Make sure the offset doesn't cause screen tearing
        if offset % SIZE[0] != 0:
            offset += SIZE[0]/2

        canvas = canvas[int(offset):]

    # Set image interpolation behaviour based on user input
    interpol = Image.LANCZOS if interpol else Image.NEAREST

    # Pack the frame into a byte and generate an image with it
    data = pack('B'*len(canvas), *[pixel*255 for pixel in canvas])
    img = Image.frombuffer('L', SIZE, data)
    img = img.rotate(-90)
    img = img.resize(frame_dim, interpol)
    img.save(filename)

Running the code above for x rows yields x images in your working folder. The next function makes a temp folder to work in and wraps the gif writing in a context manager. This makes it easier to keep track of the images we've generated and lets us stitch them together into a gif more easily.

import os  # For working-directory and path handling
import shutil  # For clearing out an old temp folder
from re import search  # For pulling the frame number out of file names
from imageio import get_writer, imread  # For assembling the gif

def gen_gif(n_rows, frame_rate, frame_dim, interpol):
    """Generate a gif with n_rows number of frames and with frame timing of f_time.

    Keyword arguments:
    n_rows -- the number of rows in pascals triangle
    frame_rate -- the frame rate to generate the gif with
    frame_dim -- the dimensions to resize the frame to
    interpol -- flag to interpolate the image when upscaling to frame_dim resolution
    """
    triangle = make_triangle(n_rows)  # Generate pascals triangle of n_rows

    # Make a temp folder and send all outputs to it
    temp_path = os.path.join(os.getcwd(), r'temp')
    if os.path.exists(temp_path):
        shutil.rmtree(temp_path)
        os.makedirs(temp_path)
    else:
        os.makedirs(temp_path)
    os.chdir(temp_path)

    # For each row in pascals triangle generate a frame based off it
    for idx, row in enumerate(triangle):
        gen_frame(row, "frame_{0}.png".format(idx), frame_dim, interpol)

    # Generate output gif given the files generated above
    with get_writer('pascals_triangle_{}.gif'.format(n_rows), mode='I', duration=frame_rate) as writer:
        filenames = [file for file in os.listdir() if file.endswith('.png')]

        # Sort files by numbers found in string containing numbers (ex. frame_6.png)
        def sortby(x):
            return int(search(r'\d+', x).group())

        filenames.sort(key=sortby)
        for filename in filenames:
            image = imread(filename)
            writer.append_data(image)

Another interesting issue I ran into was sorting the images in the temp folder. The images all follow the pattern "frame_x.png", where we want to sort by x, but the .png extension is part of the name. If we scan the directory and use the sort method on our list with default arguments, our elements are sorted as strings. This leaves us with results looking like:

['frame_0.png', 'frame_1.png', 'frame_10.png', 'frame_100.png', 'frame_101.png', 'frame_102.png', 'frame_103.png', 'frame_104.png', 'frame_105.png']

The sorting we see above obviously isn't what we want. To sort properly, we can pass our own custom sortby function into the list's sort method; you can see it defined near the end of the snippet above. This function uses a regular expression to pull the integer out of the file name, and that integer is what the sort method uses for ordering (e.g. frame_10.png --> 10).
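
Here's the same idea in isolation, on a few made-up file names:

from re import search

names = ['frame_0.png', 'frame_10.png', 'frame_2.png']
names.sort(key=lambda x: int(search(r'\d+', x).group()))
print(names)  # ['frame_0.png', 'frame_2.png', 'frame_10.png']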

Everything gets wrapped together by the imageio library. We throw our images and some metadata at its get_writer and save the result under our filename. Below is my best attempt at recreating the original gif; aside from the noticeable difference in colour, it's good enough for me.

My best attempt at recreation
TL;DR: Made gifs using Python. Check out the full code on the GitHub repo