Searching for Meaning in Trump’s Tweets.

Searching for Meaning in Trump’s Tweets.

It’s harder than you might think. An in-depth NLP analysis, using LDA, TSNE, Spacy, Gensim, and XX-Berts for good measure (although the latter is not really necessary).

Part one deals with some basic data preparation and very basic stats. In part two we are going to take a look at the word to vec (and Bert embeddings) transformation and address the use of vector similarity for clustering (spoiler — it doesn’t work and using KMeans on the result is misleading to say the least). Finally in part 3 we are going to apply LDA, combine it with the results of word to vec embeddings and attempt to identify main themes together with their graphical representation in good old 2D.

First thing is first, I am not trying to make any political statements here. It’s way too cliché these days. The real purpose behind this exercise is to look at possible applications of the clustering algorithm I described in my previous two articles (here and here). I was curious to see how it would fair in a really high dimensional space and with Spacey’s 300-feature and Bert family 768-feature vectors — NLP was really the obvious choice. It turns out it is not all about using the right clustering algo (although in itself it proved to be very instructive).

Trump’s tweets were relatively easy to obtain and have the advantage of being relatively short (being tweets — this is a given), fairly messy (plenty of abbreviations, links, hashtags etc.) and cover a range of subjects, as well as, having plenty of purely slogan/single uttering kind of messages (of which “CHINA!” is probably my all time favourite — reminds me of Father Jack’s DRINK! punctuating some of the best Father Ted episodes). In short — it is a real challenge, especially if we start looking at topic recognition.

Data preparation

Let’s get the imports out of the way:

We are going to need the standard plotting pack, pandas and numpy (naturally), also going to use some regex, time module.

%matplotlib inline
import matplotlib.pylab as plt
import seaborn as sns

import pandas as pd
pd.options.display.max_columns = 999
import numpy as np
import datetime
import re

I will mostly use spacy for cleaning, but for whatever reason I had nltk stopwords on hand so I used a bit of nltk as well:

from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords
stop_words = stopwords.words('english')
tokenizer = RegexpTokenizer(r'\w+')
import spacy
import en_core_web_lg
nlp = en_core_web_lg.load()

Could probably get away with the smaller version of spacy’s vocabulary, but using the large version just because we can.

For the time being I am leaving out pretrained Bert vectors.

The dataset I am using lives here: http://www.trumptwitterarchive.com/archive

It can be a bit tricky to load, I ended up using semicolon as a delimiter and using this load it in. I am using 4 years worth of tweets — up to and including May of this year. No particular reason for truncating it there, of course.

tr = pd.read_csv(
    'trump_tweet.csv',
    delimiter=';',
    engine='python',
    encoding='utf-8-sig', 
    quotechar='"', 
    escapechar='\\', 
    error_bad_lines=False
)

Quick peek:

Image for post

Before we even start cleaning and preparing the dataset for NLP analysis, we can take a quick look at some of the trends, mentions etc. In part, it’s good to do it early because we would want to get rid of retweets for the NLP part. Arguably, this will not limit us to the tweets written by Mr Trump alone. There are still going to be multiple tweets sent by his staff and plenty of promotional automated tweets sent both from handheld devices and from web management suits. Still I felt like retweets can be verbose enough to muddle the picture significantly and we know he is not the author, so out they go. I will stick with the clean up here and will come back to the pre-clean dataset to look at the accounts he regularly retweets in the next section.

All RTs have zero favourite count as those are kept with the original, so I am going to use that to filter. You can also use regex or str module to find everything that starts with RT as well. I also restrict tweets to the ones sent from Android and from iPhone.

As a little side note — I have seen some articles suggesting that Trump was using Android as a personal phone and that iPhone was being used by his staff. This has changed in 2017, now everything is sent from iPhone. Comparing the two streams I don’t see a strong evidence to suggest that the nature of tweets from one device is significantly different to the ones from the other, so I am ignoring this suggestion here.

nlp data-science machine-learning twitter trump deep learning

Bootstrap 5 Complete Course with Examples

Bootstrap 5 Tutorial - Bootstrap 5 Crash Course for Beginners

Nest.JS Tutorial for Beginners

Hello Vue 3: A First Look at Vue 3 and the Composition API

Building a simple Applications with Vue 3

Deno Crash Course: Explore Deno and Create a full REST API with Deno

How to Build a Real-time Chat App with Deno and WebSockets

Convert HTML to Markdown Online

HTML entity encoder decoder Online

Most popular Data Science and Machine Learning courses — July 2020

Most popular Data Science and Machine Learning courses — August 2020. This list was last updated in August 2020 — and will be updated regularly so as to keep it relevant

How To Build A Data Science Career In 2021

In Conversation With Dr Suman Sanyal, NIIT University,he shares his insights on how universities can contribute to this highly promising sector and what aspirants can do to build a successful data science career.

PyTorch for Deep Learning | Data Science | Machine Learning | Python

PyTorch for Deep Learning | Data Science | Machine Learning | Python. PyTorch is a library in Python which provides tools to build deep learning models. What python does for programming PyTorch does for deep learning. Python is a very flexible language for programming and just like python, the PyTorch library provides flexible tools for deep learning.

Machine Learning And Deep Learning Interview Questions For Data Science Interview

This video will help you get an idea about the top machine learning and deep learning interview questions that are crucial to crack any data science interview. We have included conceptual, theoretical and practical questions on machine learning and deep learning techniques. Let’s begin!

Deep Learning vs Machine Learning vs Artificial Intelligence vs Data Science

This "Deep Learning vs Machine Learning vs AI vs Data Science" video talks about the differences and relationship between Artificial Intelligence, Machine Learning, Deep Learning, and Data Science.