In this blog, we analyze the sentiment of tweets posted during the August 2016 Presidential debate in Ohio. We perform both data categorization and content analysis to answer three questions: whether a tweet is relevant, which candidate is mentioned most often in the tweets, and what the sentiment of each tweet is.

First, read the data into R. The data is also available in a SQL database, but here we load the CSV file.

# Load the tweet data set and preview the first rows
data = read.csv("Sentiment.csv")
head(data)
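
For reference, the same columns could also be pulled from the SQLite database that ships with the data set; the sketch below assumes the file is named database.sqlite and the table is called Sentiment, so adjust both to your copy.

library(DBI)
library(RSQLite)

# Connect to the bundled SQLite file and query the same two columns
con = dbConnect(RSQLite::SQLite(), "database.sqlite")
data_sql = dbGetQuery(con, "SELECT text, sentiment FROM Sentiment")
dbDisconnect(con)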

Structuring the data is the most vital part of this process. The “Sentiment” data set contains many other columns that are not relevant here, so we keep only the “text” and “sentiment” columns and drop the rest.

library(tidyverse)

# Keep only the tweet text and its sentiment label
datas = data %>% select(text, sentiment)
head(datas)

# Share of each sentiment class, rounded to two decimals
round(prop.table(table(datas$sentiment)), 2)

Output after structuring the data:

Data Cleaning:

library(tm)
library(SnowballC)

# Build a corpus from the tweet text
corpus = VCorpus(VectorSource(datas$text))

# Standard clean-up: lower-case, strip numbers and punctuation,
# remove English stop words, stem the words, collapse extra whitespace
corpus = tm_map(corpus, content_transformer(tolower))
corpus = tm_map(corpus, removeNumbers)
corpus = tm_map(corpus, removePunctuation)
corpus = tm_map(corpus, removeWords, stopwords("english"))
corpus = tm_map(corpus, stemDocument)
corpus = tm_map(corpus, stripWhitespace)

# Inspect the first cleaned tweet
as.character(corpus[[1]])

Output after cleaning the text:

To count how often each word occurs across the whole collection of tweets, we build a document-term matrix: one row per tweet, one column per term, with each cell holding the number of times that term appears in that tweet. This gives the corpus a numerical representation we can work with.

# Build the document-term matrix and check its dimensions
dtm = DocumentTermMatrix(corpus)
dtm
dim(dtm)

# Drop terms that appear in fewer than about 0.1% of tweets (sparsity above 0.999)
dtm = removeSparseTerms(dtm, 0.999)
dim(dtm)

Below is the list of words from our text that appear at least 100 times.
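
One way to produce that list straight from the matrix is tm's findFreqTerms with a cut-off of 100; a short sketch:

# Terms that occur at least 100 times across all tweets
freq_terms = findFreqTerms(dtm, lowfreq = 100)
freq_terms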

Finally, we visualize the text data with a word cloud, which gives a quick view of the most frequently used words for each sentiment.
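
A minimal sketch of such per-sentiment word clouds, assuming the wordcloud and RColorBrewer packages are installed and that the sentiment labels are "Positive", "Negative" and "Neutral" as in the original file; the rows of dtm line up with datas because the corpus was built from datas$text:

library(wordcloud)
library(RColorBrewer)

# Draw a word cloud for the tweets carrying a given sentiment label
plot_sentiment_cloud = function(label) {
  rows = which(datas$sentiment == label)
  freq = sort(colSums(as.matrix(dtm[rows, ])), decreasing = TRUE)
  wordcloud(words = names(freq), freq = freq,
            max.words = 100, random.order = FALSE,
            colors = brewer.pal(8, "Dark2"))
}

set.seed(123)
plot_sentiment_cloud("Positive")
plot_sentiment_cloud("Negative")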
