Preamble
I've always been a horrible speller (you probably know this by now if you're reading this blog). When I was in grade school, English was always my lowest mark, and I've never enjoyed reading fiction. As a consequence, I'm mystified by the intersection of Data Science and written language, known as Natural Language Processing (NLP). Knowing that there are chatbots that communicate at a higher grade level than I do, I figure that I should understand how they work.
A while back, I heard a theory that Shakespeare might have been a pen name for multiple authors and that NLP could be a tool used to validate this theory. The specific branch of NLP related to this topic is Stylometry, a technique used to analyze writing styles. I figured this could be an engaging starting point.
I figured presidential speeches would make an interesting dataset, as they come from a variety of authors, span a long time range, and cover a wide breadth of subject matter. Finding a complete set of speeches was next to impossible, so I reached out to The Miller Center, a nonpartisan affiliate of the University of Virginia that specializes in presidential scholarship and political history. I received a response incredibly quickly (thanks, Miles J.) and got to work on applying Stylometry to the dataset.
Preprocessing and Exploration
To start, we need to load our data. Since the file is JSON, each row is a key-value pair that we need to expand. While we're at it, we can parse our dates and create a unique identifier for each speech; this will come in handy later.
import pandas as pd

# Assumes df is already loaded from the JSON file, with one dictionary per row in the "speeches" column
cols = ["title",
        "intro",
        "president",
        "date",
        "text"]
# Expand dictionary
df[cols] = df.speeches.apply(pd.Series)
# Drop unexpanded dictionary
df = df.drop('speeches', axis=1)
# Parse Datetime and expand
df["date"] = pd.to_datetime(df["date"])
df["year"] = df["date"].dt.year
df["month"] = df["date"].dt.month
df["day"] = df["date"].dt.day
# Round the year down to the start of its decade (e.g. 1863 -> 1860)
df["decade"] = df["year"] // 10 * 10
# Create Unique ID column to use for joining
df["uid"] = pd.util.hash_pandas_object(df)
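To make the expansion step easier to picture, here is a minimal sketch on a two-row toy table. The titles, names and texts are made-up placeholders, not rows from the Miller Center data, and I'm using pd.json_normalize here as an alternative to apply(pd.Series), so treat it as an illustration and not the exact code I ran.

```python
import pandas as pd

# Toy stand-in for the loaded JSON: one dictionary per row in a "speeches" column.
# All values below are placeholders, not real rows from the dataset.
raw = pd.DataFrame({
    "speeches": [
        {"title": "Speech A", "president": "President X", "date": "1861-03-04", "text": "first text"},
        {"title": "Speech B", "president": "President Y", "date": "1933-03-04", "text": "second text"},
    ]
})

# json_normalize turns a list of dictionaries into one column per key.
toy = pd.json_normalize(raw["speeches"].tolist())

# Parse the dates, then derive the year and the decade it falls in.
toy["date"] = pd.to_datetime(toy["date"])
toy["year"] = toy["date"].dt.year
toy["decade"] = toy["year"] // 10 * 10

print(toy[["title", "year", "decade"]])
```

Integer-dividing the year by ten and multiplying back is all the decade column needs, so 1861 lands in the 1860 bucket and 1933 in the 1930 bucket.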
I decided to join the data with another data source. I had a hunch that the political party of the president might be an insightful feature to aggregate on. This feature may not come in handy, but getting the mappings isn't too bad, as you can find them on Wikipedia.
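For anyone following along, the join can be sketched roughly like this. The three-row table and the party_by_president dictionary are hypothetical stand-ins I made up for illustration; the real lookup would need an entry for every president, spelled the same way as in the speeches data.

```python
import pandas as pd

# Hypothetical three-row stand-in for the speeches table.
speeches = pd.DataFrame({
    "president": ["George Washington", "Abraham Lincoln", "Franklin D. Roosevelt"],
    "title": ["Speech A", "Speech B", "Speech C"],
})

# Hand-built lookup (assumed name); the full version would cover every president.
party_by_president = {
    "George Washington": "Unaffiliated",
    "Abraham Lincoln": "Republican",
    "Franklin D. Roosevelt": "Democratic",
}

# Map each speech to its president's party.
speeches["party"] = speeches["president"].map(party_by_president)

# Names that fail to match come back as NaN, so list them before aggregating.
unmatched = speeches.loc[speeches["party"].isna(), "president"].unique()

print(speeches)
print("Unmatched presidents:", list(unmatched))
```

Checking for unmatched names is worth the extra line, since a small spelling difference between the two sources would otherwise silently drop speeches from any per-party aggregate.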
I'm not an expert in US politics, but the figure to the left seems to line up with my intuition. Visualizing the number of total speeches by party, we can see that the majority of our data fall under either the Democratic or Republican parties. Since these are the two major parties of the modern era, this result makes sense.
Looking at speeches by decade, we can see another trend: the number of speeches per decade rises as time progresses. Again, this isn't surprising.
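The counts behind both charts boil down to two small aggregations. Here is a sketch on a made-up four-row table; the column names party and decade are my assumption about how the joined data is laid out.

```python
import pandas as pd

# Made-up rows; only the column names are meant to mirror the real table.
toy = pd.DataFrame({
    "party": ["Democratic", "Republican", "Democratic", "Whig"],
    "decade": [1930, 1980, 1960, 1840],
})

# Number of speeches per party, largest first.
speeches_by_party = toy["party"].value_counts()

# Number of speeches per decade, in chronological order.
speeches_by_decade = toy.groupby("decade").size()

print(speeches_by_party)
print(speeches_by_decade)
```

Either result can be handed straight to a bar chart with its .plot.bar() method.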
Stylometry
Disclaimer: I'm not an expert in Stylometry; in fact, before starting this article, I didn't even know what it was called. Please take everything I say with caution.
The speeches in our dataset come in a variety of different formats and require a fair amount of preprocessing before we can make comparisons. Some of the data contain HTML formatting, speaker labels or other annotations. To work around this, we trim all punctuation and stop words (uninteresting words such as "and," "or," "so") from the text using the function seriesToCorpus below.
import pandas as pd # For detecting Series inputs
from nltk.tokenize import word_tokenize # For NLP word tokenizing speeches
from nltk.corpus import stopwords # For filtering out boring words
import string # For filtering out punctuation

def seriesToCorpus(input_text, LOWER=True, TRIM_PUNCT=True):
    """Remove stopwords and punctuation from text to produce tokens

    Keyword arguments:
    input_text -- a single free text speech, or a Series of free text speeches
    LOWER -- flag to set all tokens to lower-case (Default: True)
    TRIM_PUNCT -- flag to remove all punctuation from tokens (Default: True)
    """
    # Allow for both Series and Cell inputs to ensure consistent processing
    if isinstance(input_text, pd.Series):
        input_text = input_text.to_list()
    else:
        input_text = [input_text]
    # Load common English stop words (and, an, the, who)
    stop_words = set(stopwords.words('english'))
    # If flag is set make all tokens lower-case
    if LOWER:
        tok_speech = [word_tokenize(speech.lower()) for speech in input_text]
    else:
        tok_speech = [word_tokenize(speech) for speech in input_text]
    # Flatten the per-speech token lists into a single list
    output = list()
    for speech in tok_speech:
        output.extend(speech)
    # If flag is set trim all punctuation
    if TRIM_PUNCT:
        punct = list(string.punctuation)+["--","'s","’","\'\'","``"]
        output = [w for w in output if w not in punct]
    # Remove stop words
    output = [w for w in output if w not in stop_words]
    return output

# One flat list of tokens for the whole dataset
corpus = seriesToCorpus(df.text)
# One list of tokens per speech
df['tokenized'] = df.apply(lambda x: seriesToCorpus(x.text), axis=1)
The function seriesToCorpus provides us with a way to tokenize our text consistently. These tokens are what we will use later on for further analysis. Without a comprehensive method to tokenize text, we will end up with garbage text making its way to our output and diluting our results. For example, if we did not trim off stop words, we would end up using conjunctions as the basis for our comparison in the next few steps.
If we are to combine all of our tokenized speeches, then we end up with a big list of words that we can use for comparisons. We'll refer to this as the corpus.
Figure: Tokenizing speeches and converting them to a joint sorted corpus
With our corpus, we can scan for the top one hundred most frequently occurring words and use them to make comparisons. Since the corpus, by definition, contains every speech in our dataset, its top terms give us a shared vocabulary against which every speech can be measured.
With the top words, we can make fingerprints for each speech. We do so by taking each of the most frequently occurring words from the corpus and calculating how frequently it occurs in a given speech, as a share of that speech's tokens. For our top hundred words, this gives us a 1 by 100 fingerprint for each speech. These tasks are accomplished by the functions getTopFreqWords and getFingerprint below.
from nltk.probability import FreqDist # For Frequency Distributions

def getTopFreqWords(corpus, n):
    """Return the n most frequently occurring words from a body of text"""
    fdist = FreqDist(corpus)
    return [x[0] for x in fdist.most_common(n)]

def getFingerprint(text, topWordsList):
    """Return a list of frequencies for each string in a list of strings"""
    # Count the tokens once, then look up each top word
    fdist = FreqDist(text)
    output = list()
    for text_key in topWordsList:
        output.append(fdist.freq(text_key))
    return output

top_hundred = getTopFreqWords(corpus, 100)
df['fingerprint'] = df.apply(lambda x: getFingerprint(x.tokenized, top_hundred), axis=1)
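To see what a fingerprint looks like, here is a tiny worked example. It is a sketch only: I've swapped nltk's FreqDist for collections.Counter so it runs without any downloads, the helper names top_words and fingerprint are made up for this example, and the two "speeches" are invented token lists.

```python
from collections import Counter

def top_words(tokens, n):
    """Return the n most common tokens, most frequent first."""
    return [word for word, _ in Counter(tokens).most_common(n)]

def fingerprint(tokens, vocabulary):
    """Return each vocabulary word's share of the tokens in one speech."""
    counts = Counter(tokens)
    total = len(tokens)
    # An empty speech gets an all-zero fingerprint instead of a division error.
    return [counts[word] / total if total else 0.0 for word in vocabulary]

# Two invented, already-tokenized "speeches".
speech_a = ["people", "government", "people", "nation"]
speech_b = ["nation", "war", "people", "peace", "people"]

# The corpus is every speech's tokens joined together.
corpus = speech_a + speech_b

vocabulary = top_words(corpus, 2)
print(vocabulary)                         # ['people', 'nation']
print(fingerprint(speech_a, vocabulary))  # [0.5, 0.25]
print(fingerprint(speech_b, vocabulary))  # [0.4, 0.2]
```

Each speech ends up as a list of the same length and in the same word order, which is what makes two fingerprints directly comparable.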
Double Checking (Optional)
At this point, we've made a lot of changes to our dataset. Each operation introduces the potential for bugs that can skew our results. To account for this, we can refer back to our UID column to ensure that the changes we made to our dataset went as planned. Summing the UID column before and after the transformations above produced the same result. Obtaining the same sum means that no rows have been inserted or removed; the existing rows have only gained or changed columns, as we planned.
print(df.uid.sum())
# Returns 8702334536768193124
# Data processing goes here
print(df.uid.sum())
# Returns 8702334536768193124
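Here is a small sketch of why the check works, on a made-up three-row table (the column names and values are placeholders). Adding a column leaves the stored UIDs alone, while dropping a row changes their sum.

```python
import pandas as pd

# Made-up three-row table standing in for the speeches data.
toy = pd.DataFrame({"president": ["A", "B", "C"], "text": ["one", "two", "three"]})

# One hash per row, computed once and stored alongside the data.
toy["uid"] = pd.util.hash_pandas_object(toy)
checksum_before = toy["uid"].sum()

# Adding or modifying other columns does not touch the stored UIDs.
toy["length"] = toy["text"].str.len()
checksum_same = toy["uid"].sum()

# Dropping a row removes its UID, so the sum no longer matches.
dropped = toy.drop(index=1)
checksum_dropped = dropped["uid"].sum()

print(checksum_before == checksum_same)
print(checksum_before == checksum_dropped)
```

Keep in mind that a sum ignores order, so this confirms which rows are present but says nothing about whether they were shuffled.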
Wrapping things up
With the fingerprints we just made, we now have a numeric representation of every speech. This numeric array should seem a bit more familiar and lends itself a bit better to being used for further data processing and comparisons (clustering, PCA and other analysis). In the next articles, we will use these fingerprints to compare the speech styles of parties, presidents and eras.
TLDR: Calculated the ratio of top word occurrences for a bunch of presidential speeches. Next, we'll make pretty graphs with it.