End-To-End Topic Modeling in Python: Latent Dirichlet Allocation (LDA)

Topic Model: In a nutshell, a topic model is a type of statistical model used to discover the abstract “topics” that occur in a collection of documents and that best represent the information in them. Many techniques are used to obtain topic models. This post aims to demonstrate the implementation of LDA: a widely used topic modeling technique.


Latent Dirichlet Allocation (LDA)

By definition, LDA is a generative probabilistic model for a given corpus. The basic idea is that each document is represented as a random mixture over latent topics, where each topic is characterized by a distribution over words.
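In probabilistic terms, this means each word w in a document d is assumed to be generated by first drawing a topic k from the document’s topic mixture and then drawing the word from that topic’s word distribution, so that p(w | d) = Σₖ p(w | topic k) · p(topic k | d).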

LDA algorithm under the hood

Given M documents, N words, and an estimated number of topics K, LDA uses this information to output (1) the K topics, (2) psi, the distribution of words for each topic k, and (3) phi, the distribution of topics for each document i.

The alpha parameter is the Dirichlet prior concentration parameter that represents document-topic density: with a higher alpha, documents are assumed to be made up of more topics, resulting in a more uniform topic distribution per document; with a lower alpha, each document concentrates on fewer topics.

The beta parameter is the analogous prior concentration parameter for topic-word density: with a higher beta, topics are assumed to be made up of most of the words in the vocabulary, resulting in a more uniform word distribution per topic.
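To build intuition for these two concentration parameters, here is a minimal sketch (with made-up dimensions, separate from the tutorial pipeline) of how alpha shapes the per-document topic mixtures that LDA’s generative story draws from a Dirichlet; the same logic applies to beta and the per-topic word distributions:

# Toy illustration: how the Dirichlet concentration parameter alpha shapes
# topic mixtures (K = 5 topics; all numbers here are made up for demonstration)
import numpy as np

rng = np.random.default_rng(0)
K = 5
sparse_mix = rng.dirichlet([0.1] * K)   # low alpha: mass piles onto a few topics
dense_mix = rng.dirichlet([10.0] * K)   # high alpha: mass spreads over many topics
print(np.round(sparse_mix, 2))  # peaked: dominated by one or two topics
print(np.round(dense_mix, 2))   # close to uniform across the five topics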

LDA Implementation

The complete code is available as a Jupyter Notebook on GitHub.

  1. Loading data
  2. Data cleaning
  3. Exploratory analysis
  4. Preparing data for LDA analysis
  5. LDA model training
  6. Analyzing LDA model results

Loading data

For this tutorial, we’ll use a dataset of papers published at the NIPS conference. The NIPS conference (Neural Information Processing Systems) is one of the most prestigious yearly events in the machine learning community. At each NIPS conference, a large number of research papers are published. The CSV file contains information on the different NIPS papers that were published from 1987 until 2016 (29 years!). These papers discuss a wide variety of topics in machine learning, from neural networks to optimization methods and many more.

First, we will explore the CSV file to determine what type of data we can use for the analysis and how it is structured. A research paper typically consists of a title, an abstract and the main text.

In[1]: 
# Importing modules
import pandas as pd
import os
os.chdir('..')  # move up one directory so the relative data path below resolves (adjust for your setup)
# Read data into papers
papers = pd.read_csv('./data/NIPS Papers/papers.csv')
# Print head
papers.head()


Data Cleaning

Drop Redundant Columns

For the analysis of the papers, we are only interested in the text data associated with each paper and the year it was published. Since the file contains metadata such as IDs and filenames, we remove all the columns that do not contain useful text information.

In[2]:
# Remove the columns
papers = papers.drop(columns=['id', 'event_type', 'pdf_name'])
# Print out the first rows of papers
papers.head()

Remove punctuation / lowercasing

Now, we will perform some simple preprocessing on the paper_text column to make it more amenable to analysis. We will use a regular expression to remove any punctuation from the text, and then lowercase it.

# Load the regular expression library
import re
# Remove punctuation
papers['paper_text_processed'] = papers['paper_text'].map(lambda x: re.sub(r'[,\.!?]', '', x))
# Convert the text to lowercase
papers['paper_text_processed'] = papers['paper_text_processed'].map(lambda x: x.lower())
# Print out the first rows of papers
papers['paper_text_processed'].head()

Exploratory Analysis

In order to verify whether the preprocessing happened correctly, we can make a word cloud of the text of the research papers. This will give us a visual representation of the most common words. Visualization is key to understanding whether we are still on the right track! In addition, it allows us to verify whether we need additional preprocessing before further analyzing the text data.

Python has a massive number of open-source libraries! Instead of trying to develop a method to create word clouds ourselves, we’ll use Andreas Mueller’s wordcloud library.

# Import the wordcloud library
from wordcloud import WordCloud
# Join the different processed texts together.
long_string = ','.join(list(papers['paper_text_processed'].values))
# Create a WordCloud object
wordcloud = WordCloud(background_color="white", max_words=5000, contour_width=3, contour_color='steelblue')
# Generate a word cloud
wordcloud.generate(long_string)
# Visualize the word cloud
wordcloud.to_image()

Prepare text for LDA Analysis

LDA does not work directly on text data. First, it is necessary to convert the documents into a simple vector representation. This representation will then be used by LDA to determine the topics. Each entry of a ‘document vector’ corresponds to the number of times a word occurs in the document (the bag-of-words, or BOW, representation).
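As a toy illustration (two made-up sentences rather than the NIPS papers), this is what such document vectors look like:

# Toy bag-of-words example on two made-up sentences
from sklearn.feature_extraction.text import CountVectorizer

toy_docs = ["the cat sat on the mat", "the dog sat"]
vectorizer = CountVectorizer()
vectors = vectorizer.fit_transform(toy_docs)
print(vectorizer.get_feature_names_out())  # ['cat' 'dog' 'mat' 'on' 'sat' 'the']
print(vectors.toarray())  # [[1 0 1 1 1 2], [0 1 0 0 1 1]]

Each row is a document vector whose entries count word occurrences, which is exactly the input LDA expects.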

Next, we will convert the list of documents into a list of vectors, all with length equal to the vocabulary size.

We’ll then plot the 10 most common words based on the outcome of this operation (the list of document vectors). As a check, these words should also occur in the word cloud.

# Load the library with the CountVectorizer method
from sklearn.feature_extraction.text import CountVectorizer
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('whitegrid')
%matplotlib inline

# Helper function
def plot_10_most_common_words(count_data, count_vectorizer):
    # use get_feature_names() instead on scikit-learn versions older than 1.0
    words = count_vectorizer.get_feature_names_out()
    total_counts = np.zeros(len(words))
    for t in count_data:
        total_counts += t.toarray()[0]

    count_dict = sorted(zip(words, total_counts), key=lambda x: x[1], reverse=True)[0:10]
    words = [w[0] for w in count_dict]
    counts = [w[1] for w in count_dict]
    x_pos = np.arange(len(words))

    plt.figure(2, figsize=(15, 15 / 1.6180))
    plt.subplot(title='10 most common words')
    sns.set_context("notebook", font_scale=1.25, rc={"lines.linewidth": 2.5})
    sns.barplot(x=x_pos, y=counts, palette='husl')
    plt.xticks(x_pos, words, rotation=90)
    plt.xlabel('words')
    plt.ylabel('counts')
    plt.show()

# Initialise the count vectorizer with the English stop words
count_vectorizer = CountVectorizer(stop_words='english')
# Fit and transform the processed text
count_data = count_vectorizer.fit_transform(papers['paper_text_processed'])
# Visualise the 10 most common words
plot_10_most_common_words(count_data, count_vectorizer)

LDA model training and results visualization

The only parameter we will tweak is the number of topics in the LDA algorithm. Typically, one would calculate the ‘perplexity’ metric to determine the best number of topics, iterating over different numbers of topics until the lowest perplexity is found. We’ll cover model evaluation and tuning, along with exploring Gensim, a widely used natural language processing toolkit, in the next article.
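Although the full evaluation workflow is deferred to that article, here is a minimal sketch of what such a perplexity sweep could look like with scikit-learn (it assumes the count_data matrix built above; lower perplexity is better, and ideally it would be measured on held-out documents):

# Sketch: compare perplexity across candidate numbers of topics
from sklearn.decomposition import LatentDirichletAllocation

for k in [5, 10, 15, 20]:
    lda_k = LatentDirichletAllocation(n_components=k, random_state=0)
    lda_k.fit(count_data)
    print("k=%d: perplexity=%.1f" % (k, lda_k.perplexity(count_data)))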

import warnings
warnings.simplefilter("ignore", DeprecationWarning)

# Load the LDA model from scikit-learn
from sklearn.decomposition import LatentDirichletAllocation as LDA

# Helper function
def print_topics(model, count_vectorizer, n_top_words):
    # use get_feature_names() instead on scikit-learn versions older than 1.0
    words = count_vectorizer.get_feature_names_out()
    for topic_idx, topic in enumerate(model.components_):
        print("\nTopic #%d:" % topic_idx)
        print(" ".join([words[i]
                        for i in topic.argsort()[:-n_top_words - 1:-1]]))

# Tweak the two parameters below
number_topics = 5
number_words = 10

# Create and fit the LDA model
lda = LDA(n_components=number_topics)
lda.fit(count_data)

# Print the topics found by the LDA model
print("Topics found via LDA:")
print_topics(lda, count_vectorizer, number_words)
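
Beyond the topic-word lists, it can be instructive to inspect the per-document topic mixtures (the phi distribution discussed earlier); a quick sketch using the fitted model:

# Per-document topic mixtures: transform() returns an M x K matrix whose
# rows are the (normalized) topic distributions for each document
doc_topic_dist = lda.transform(count_data)
print(doc_topic_dist[0])           # topic mixture of the first paper
print(doc_topic_dist[0].argmax())  # its most dominant topic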

Analyzing LDA model results

The pyLDAvis package is designed to help users interpret the topics in a topic model that has been fitted to a corpus of text data. The interactive visualization pyLDAvis produces is helpful for both:

  1. Better understanding and interpreting individual topics, and
  2. Better understanding the relationships between the topics.

For (1), you can manually select each topic to view its top most frequent and/or “relevant” terms, using different values of the λ parameter. This can help when you’re trying to assign a human-interpretable name or “meaning” to each topic.

For (2), exploring the Intertopic Distance Plot can help you learn about how topics relate to each other, including potential higher-level structure between groups of topics.

%%time
from pyLDAvis import sklearn as sklearn_lda  # on pyLDAvis >= 3.4, use pyLDAvis.lda_model instead
import pickle
import pyLDAvis

LDAvis_data_filepath = os.path.join('./ldavis_prepared_' + str(number_topics))

# This is a bit time consuming - make the if statement True
# if you want to execute the visualization prep yourself
if 1 == 1:
    LDAvis_prepared = sklearn_lda.prepare(lda, count_data, count_vectorizer)
    # pickle requires binary mode
    with open(LDAvis_data_filepath, 'wb') as f:
        pickle.dump(LDAvis_prepared, f)

# Load the pre-prepared pyLDAvis data from disk
with open(LDAvis_data_filepath, 'rb') as f:
    LDAvis_prepared = pickle.load(f)
pyLDAvis.save_html(LDAvis_prepared, './ldavis_prepared_' + str(number_topics) + '.html')

Machine learning has become increasingly popular over the past decade, and recent advances in computational availability have led to exponential growth in people looking for ways to incorporate new methods to advance the field of Natural Language Processing. Often, we treat topic models as black-box algorithms, but hopefully this post shed some light on the underlying math, the intuitions behind it, and the high-level code needed to get you started with any textual data.

As described above, in the next article we’ll go one step deeper into understanding how one can evaluate the performance of topic models and tune their hyper-parameters to the point where they could be deployed into production.

Originally published by Shashank Kapadia at https://towardsdatascience.com/end-to-end-topic-modeling-in-python-latent-dirichlet-allocation-lda-35ce4ed6b3e0
