Topic Modeling with Non-negative Matrix Factorization (NMF)

Natural language processing (NLP) is one of the trendier areas of data science. Its end applications are many: chatbots, recommender systems, search, virtual assistants, and more.

So it is worthwhile for budding data scientists to understand at least the basics of NLP, even if their career takes them in a completely different direction. And who knows, some topics extracted through NLP might just give your next model that extra analytical boost. In this post, we look at why topic modeling is important and how it helps us as data scientists.

Topic modeling, just as it sounds, is using an algorithm to discover the topic or set of topics that best describes a given text document. You can think of each topic as a word or a set of words.

Topic modeling is the practice of using a quantitative algorithm to tease out the key topics that a body of text is about. It has a lot in common with PCA, which identifies the key quantitative trends (those that explain the most variance) within your features. The outputs of PCA are a way of summarizing our features; for example, PCA lets us go from something like 500 features to 10 summary features. In the text setting, those 10 summary features are basically topics.

NMF

Non-Negative Matrix Factorization (NMF) is an unsupervised technique, so the model is not trained on any labeled topics. It works by decomposing (or factorizing) high-dimensional vectors into a lower-dimensional representation. The input and both factor matrices are constrained to be non-negative, so every coefficient in the lower-dimensional representation is non-negative as well.

Starting from the original matrix (A), NMF gives you two matrices (W and H) whose product approximates A. With A laid out as articles by words, W is articles by topics (the weight of each topic in each article) and H is topics by words (the topics it found, each expressed as weights over words).
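To make the shapes concrete, here is a minimal sketch of the factorization on a tiny, made-up count matrix. It follows scikit-learn's convention, where A is articles by words, `fit_transform` returns the articles-by-topics matrix and `components_` holds the topics-by-words matrix. The numbers in A and the choice of two topics are assumptions for illustration only.

```python
import numpy as np
from sklearn.decomposition import NMF

# Toy article-by-word count matrix A: 4 articles (rows), 5 words (columns).
# The values are invented purely for illustration.
A = np.array([
    [3, 2, 0, 0, 1],
    [4, 1, 0, 0, 0],
    [0, 0, 5, 3, 1],
    [0, 0, 2, 4, 0],
], dtype=float)

n_topics = 2  # assumed number of topics for this toy example
model = NMF(n_components=n_topics, init="nndsvda", random_state=0, max_iter=500)

W = model.fit_transform(A)  # articles by topics
H = model.components_       # topics by words

print(W.shape, H.shape)     # shapes of the two factors
print(np.round(W @ H, 1))   # W @ H is a non-negative approximation of A
```

Because the product W @ H only approximates A, the printed reconstruction will be close to the original counts rather than identical to them.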

Getting The Data:

This is one of the most crucial steps in the process. As the old adage goes, ‘garbage in, garbage out’. When dealing with text as our features, it’s really critical to try and reduce the number of unique words (i.e. features) since there are going to be a lot. This is our first defense against too many features.
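One common way to cut down the number of unique words is to drop stop words and very rare words when building the vocabulary. The sketch below uses scikit-learn's `CountVectorizer` on a hypothetical four-sentence corpus; the sentences and the `min_df=2` threshold are assumptions chosen for illustration, not settings taken from this post.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical mini-corpus standing in for real articles.
docs = [
    "The team won the game in the final minute",
    "The game was won by the home team",
    "The new graphics card renders the game faster",
    "A faster graphics card is the best upgrade",
]

# Baseline: keep every token the default tokenizer produces.
raw = CountVectorizer()
raw.fit(docs)

# Reduced: drop English stop words and words found in fewer than 2 documents.
reduced = CountVectorizer(stop_words="english", min_df=2)
reduced.fit(docs)

print(len(raw.vocabulary_), len(reduced.vocabulary_))  # vocabulary sizes
print(sorted(reduced.vocabulary_))                     # surviving words
```

On a real corpus the thresholds need tuning: too aggressive a cutoff removes words that distinguish topics, too lenient a cutoff leaves thousands of noisy features.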

I searched far and wide for an exciting dataset and finally selected the 20 Newsgroups dataset. I’m just being sarcastic: I selected a dataset that is both easy to interpret and easy to load in scikit-learn. The dataset is easy to interpret because the 20 newsgroups are known, so the generated topics can be compared to the known topics being discussed. Headers, footers, and quotes are excluded from the dataset.

from sklearn.datasets import fetch_20newsgroups
documents = fetch_20newsgroups(remove=("headers", "footers", "quotes")).data

Now that the text is processed, we can use it to create features by turning it into numbers. There are a few different ways to do this. I use word counts as features.
