Twitter allows collecting tweets using tweepy, a Python library for accessing the Twitter API. Here I don’t intend to give a tutorial on how to collect tweets, as there are already several good ones (take a look at the references below), but rather to give you a full example of how to process the tweet objects in order to build a clean dataframe on which we can perform social media analyses.
**TL;DR:** Along the way, we will flatten the Twitter JSON, select the relevant text field among the several options (main tweet, retweet, quote, etc.), clean it (remove non-alphabetic characters), translate non-English tweets, compute the sentiment of the text, and associate a location using either the user-defined location or the automatic geolocation.
**Libraries to use:** pandas, country converter, GeoPy, spaCy, googletrans, NLTK.
Each tweet object comes in JSON format, a mix of ‘root-level’ attributes and child objects (which are represented with the _{}_ notation). The Twitter developer page gives the following example:
{
  "created_at": "Wed Oct 10 20:19:24 +0000 2018",
  "id": 1050118621198921728,
  "id_str": "1050118621198921728",
  "text": "To make room for more expression, we will now count all emojis as equal—including those with gender and skin t… https://t.co/MkGjXf9aXm",
  "user": {},
  "entities": {}
}
This is, of course, a small sample of the huge dictionary that composes each tweet. Another popular reference is this Twitter Status Object map.
For most kinds of analyses, we will need attributes such as the tweet text, the user screen name or the tweet place. Unfortunately, as you can see, these attributes don’t come in a clean format; instead, they are spread across the JSON levels. For example, the tweet location coordinates are located in
tweet_object['place']['bounding_box']['coordinates']
Because of this, the collected tweets require a substantial amount of cleaning and transforming, which is the purpose of this post.
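As a minimal illustration (assuming a single raw tweet has already been collected; raw_tweet is a hypothetical variable holding its JSON string), direct nested access breaks as soon as a parent field is missing or null, which is precisely what the flattening step below takes care of:
import json
# 'raw_tweet' is a hypothetical string containing one tweet object in JSON format
tweet_object = json.loads(raw_tweet)
# Direct nested access fails for most tweets, because 'place' is None
# unless the tweet is geo-tagged
try:
    coordinates = tweet_object['place']['bounding_box']['coordinates']
except (KeyError, TypeError):
    coordinates = None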
I recently carried out a language localization project where I needed to do a social media analysis on Twitter. For this, I collected 52,830 tweets over the course of several days containing the following keywords: ‘#FIFA20’, ‘#FIFA21’, ‘FIFA20’, ‘FIFA21’, ‘FIFA 20’, ‘FIFA 21’ and ‘#EASPORTSFIFA’. Then, in order to do a proper analysis on them, I first had to clean each tweet object so I could draw meaningful conclusions.
Due to the nature of that project, I was mainly interested in the location of the tweet (country and coordinates), the sentiment of the English version of the text, and the language the text was tweeted in. The goal of the processing steps was to extract and polish these attributes. You can find the details of the project in the following repository:
Let’s use this dataset to exemplify the tweet-processing steps!
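A note on the input: in what follows I will assume the collected tweets were saved as one JSON object per line (the typical output of a tweepy stream listener); the file name below is just a placeholder. A minimal sketch to load them into the tweets_data list used later:
import json
tweets_data = []
# 'fifa_tweets.txt' is a placeholder file with one tweet JSON object per line
with open('fifa_tweets.txt', 'r') as file:
    for line in file:
        line = line.strip()
        if line:
            tweets_data.append(json.loads(line))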
As we saw, there are multiple fields in the Twitter JSON which contain textual data. In a typical tweet, there’s the tweet text, the user description and the user location. In a tweet longer than 140 characters, there’s also the extended tweet child JSON. And in a quoted tweet, there’s both the original tweet text and the commentary that accompanies the quote.
To analyze tweets at scale, we will want to flatten the tweet JSON into a single level. This will allow us to store the tweets in a DataFrame format. To do this, we will define the function flatten_tweets(), which gathers several fields regarding text and location (the latter stored in place). Take a look:
def flatten_tweets(tweets):
    """ Flattens out tweet dictionaries so relevant JSON is
    in a top-level dictionary. """
    tweets_list = []

    # Iterate through each tweet
    for tweet_obj in tweets:

        ''' User info '''
        # Store the user screen name in 'user-screen_name'
        tweet_obj['user-screen_name'] = tweet_obj['user']['screen_name']
        # Store the user-defined location in 'user-location'
        tweet_obj['user-location'] = tweet_obj['user']['location']

        ''' Text info '''
        # Check if this is a 140+ character tweet
        if 'extended_tweet' in tweet_obj:
            # Store the extended tweet text in 'extended_tweet-full_text'
            tweet_obj['extended_tweet-full_text'] = \
                tweet_obj['extended_tweet']['full_text']

        if 'retweeted_status' in tweet_obj:
            # Store the retweeted user screen name in
            # 'retweeted_status-user-screen_name'
            tweet_obj['retweeted_status-user-screen_name'] = \
                tweet_obj['retweeted_status']['user']['screen_name']
            # Store the retweet text in 'retweeted_status-text'
            tweet_obj['retweeted_status-text'] = \
                tweet_obj['retweeted_status']['text']
            if 'extended_tweet' in tweet_obj['retweeted_status']:
                # Store the extended retweet text in
                # 'retweeted_status-extended_tweet-full_text'
                tweet_obj['retweeted_status-extended_tweet-full_text'] = \
                    tweet_obj['retweeted_status']['extended_tweet']['full_text']

        if 'quoted_status' in tweet_obj:
            # Store the quoted user screen name in
            # 'quoted_status-user-screen_name'
            tweet_obj['quoted_status-user-screen_name'] = \
                tweet_obj['quoted_status']['user']['screen_name']
            # Store the quoted tweet text in 'quoted_status-text'
            tweet_obj['quoted_status-text'] = \
                tweet_obj['quoted_status']['text']
            if 'extended_tweet' in tweet_obj['quoted_status']:
                # Store the extended quoted tweet text in
                # 'quoted_status-extended_tweet-full_text'
                tweet_obj['quoted_status-extended_tweet-full_text'] = \
                    tweet_obj['quoted_status']['extended_tweet']['full_text']

        ''' Place info '''
        # Store the country, country code and coordinates
        if 'place' in tweet_obj:
            try:
                tweet_obj['place-country'] = \
                    tweet_obj['place']['country']
                tweet_obj['place-country_code'] = \
                    tweet_obj['place']['country_code']
                tweet_obj['location-coordinates'] = \
                    tweet_obj['place']['bounding_box']['coordinates']
            except (KeyError, TypeError):
                # 'place' is None when the tweet is not geo-tagged
                pass

        tweets_list.append(tweet_obj)

    return tweets_list
Now, you may want to study all the text fields (main, retweet or quote); however, here I will keep just one text field for simplicity. For this, we define a function select_text(tweets) that selects the main text, whether the tweet is a regular tweet or a retweet, and we drop the quoted text, as it is usually repetitive and not very informative.
def select_text(tweets):
    ''' Assigns the main text to only one column depending
    on whether the tweet is a RT/quote or not'''
    tweets_list = []

    # Iterate through each tweet
    for tweet_obj in tweets:
        if 'retweeted_status-extended_tweet-full_text' in tweet_obj:
            tweet_obj['text'] = \
                tweet_obj['retweeted_status-extended_tweet-full_text']
        elif 'retweeted_status-text' in tweet_obj:
            tweet_obj['text'] = tweet_obj['retweeted_status-text']
        elif 'extended_tweet-full_text' in tweet_obj:
            tweet_obj['text'] = tweet_obj['extended_tweet-full_text']

        tweets_list.append(tweet_obj)

    return tweets_list
We now build the DataFrame. Notice that we choose the main columns (fields) relevant for a social media analysis. This includes the tweet language, lang, and the user-location, which is set manually by the user. We also keep the country, country_code and coordinates fields from place. These fields only appear when the tweet is geo-tagged, which is usually the case for less than 10% of all tweets. The following code block builds the dataframe:
import pandas as pd

# Flatten tweets
tweets = flatten_tweets(tweets_data)

# Select the main text field
tweets = select_text(tweets)

columns = ['text', 'lang', 'user-location', 'place-country',
           'place-country_code', 'location-coordinates',
           'user-screen_name']

# Create a DataFrame from `tweets`
df_tweets = pd.DataFrame(tweets, columns=columns)

# Replace NaNs by Nones
df_tweets.where(pd.notnull(df_tweets), None, inplace=True)
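As a quick sanity check (just a suggested inspection, not part of the original pipeline), we can peek at the resulting DataFrame, the language distribution and the fraction of geo-tagged tweets:
# First rows of the clean DataFrame
print(df_tweets.head())
# Most common tweet languages
print(df_tweets['lang'].value_counts().head())
# Fraction of geo-tagged tweets (non-null 'place-country')
print(df_tweets['place-country'].notnull().mean())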