In my previous articles, we learned how to scrape, process, and analyze employee reviews from Indeed.com. Feel free to take a look and offer feedback. I would love to hear how you would improve the code, in particular how to cope dynamically with changes to the website’s HTML.

In this article, I would like to take our dataset a step further to solve a sentiment classification problem. Specifically, we will be assigning sentiment targets to each review and then using a binary classification algorithm to predict those targets.
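As a rough illustration of what assigning sentiment targets could look like, here is a minimal sketch on a toy DataFrame. The cut-offs (4 and 5 stars as positive, 1 and 2 stars as negative, 3 stars dropped as neutral), the `sentiment` column, and the `label_sentiment` helper are assumptions made for this example, not necessarily the rule used later in the analysis.

```python
import pandas as pd

# Toy stand-in for the scraped reviews; only the column names come from the article.
reviews = pd.DataFrame({
    "rating": [5, 4, 3, 2, 1],
    "rating_description": [
        "Great place to work",
        "Good benefits",
        "It was okay",
        "Poor management",
        "Would not recommend",
    ],
})


def label_sentiment(frame, positive_min=4, negative_max=2):
    """Hypothetical helper: keep clearly positive or negative reviews and add a 0/1 target."""
    # Drop ratings that fall between the two thresholds (assumed neutral).
    is_polar = (frame["rating"] >= positive_min) | (frame["rating"] <= negative_max)
    labeled = frame[is_polar].copy()
    # 1 = positive review, 0 = negative review.
    labeled["sentiment"] = (labeled["rating"] >= positive_min).astype(int)
    return labeled


labeled_reviews = label_sentiment(reviews)
print(labeled_reviews[["rating", "sentiment"]])
```

Dropping the middle rating keeps the problem strictly binary; folding 3-star reviews into one of the two classes would be an equally defensible choice.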

We’ll be importing raw employee reviews which we scraped from Indeed.com. I would suggest you review my previous article to understand how we were able to obtain this dataset.

First, let’s import our dataset and start processing it. We only need the ‘rating’ and ‘rating_description’ columns for this analysis. For a more detailed explanation of text pre-processing, head over to my NLP preprocessing article.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt 
import seaborn as sns
import contractions
import random
import fasttext
# Note: `spell` exists in older autocorrect releases; newer ones expose `Speller` instead
from autocorrect import spell
from nltk.probability import FreqDist
from nltk.tokenize import word_tokenize
import nltk
from nltk.corpus import stopwords, wordnet
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
from nltk.stem import WordNetLemmatizer
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split
from sklearn.model_selection import StratifiedKFold
from imblearn.pipeline import make_pipeline
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.metrics import precision_recall_curve
from sklearn.metrics import roc_auc_score
from sklearn.metrics import auc
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
from warnings import simplefilter
simplefilter(action='ignore', category=FutureWarning)
pd.set_option('display.max_colwidth', 200)
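# Load the scraped reviews; read_csv opens and closes the file itself
df = pd.read_csv('apple_scrape.csv')
print(df.head())
# Keep only the two columns needed for sentiment classification
df = df[['rating', 'rating_description']]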
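Before going further, it can be worth checking the two columns for missing text and looking at how the ratings are distributed, since a skewed distribution affects how the classes should be balanced later. The sketch below uses a small in-memory CSV in place of `apple_scrape.csv`; its rows and the extra `job_title` column are invented purely for illustration.

```python
import io

import pandas as pd

# Toy CSV standing in for apple_scrape.csv (contents are made up).
toy_csv = io.StringIO(
    "rating,rating_description,job_title\n"
    "5,Great team,Specialist\n"
    "4,,Genius\n"
    "1,Long hours,Specialist\n"
    "5,Good pay,Technician\n"
)

# Read only the two columns the analysis needs.
toy_df = pd.read_csv(toy_csv, usecols=["rating", "rating_description"])

# Drop reviews with no text, since there is nothing to classify.
clean_df = toy_df.dropna(subset=["rating_description"]).reset_index(drop=True)

# Count how many reviews fall under each star rating.
rating_counts = clean_df["rating"].value_counts().sort_index()
print(rating_counts)
```

The same three steps should apply to the real `df` by swapping the toy frame for it.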