How to Use Latent Dirichlet Allocation (LDA)?

Here, I won’t cover the mathematical foundations of LDA. I will only discuss how to interpret the results of LDA topic modeling.

During LDA topic modeling, we create many different topic groups. As researchers, we are the ones who decide the number of groups in the output, but we do not know in advance which number is best. Therefore, we build models with different numbers of groups, then examine and compare them to decide which topic model makes more sense, is most meaningful, and has the clearest distinctions within the model. The model that makes the most sense is then chosen among all the candidates.
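The compare-several-topic-counts loop above can be sketched with scikit-learn. This is a minimal illustration with a made-up mini-corpus and arbitrary candidate counts; perplexity is one numeric signal (lower is better), but, as described above, the final choice should still be a human judgment about which topics read most sensibly.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Toy documents standing in for the real review corpus.
docs = [
    "great taste and fast shipping",
    "the coffee tastes great",
    "fast shipping and good packaging",
    "bad packaging but the tea tastes fine",
]
dtm = CountVectorizer().fit_transform(docs)

# Fit one model per candidate topic count and record its perplexity.
for n_topics in (2, 3, 4):
    lda = LatentDirichletAllocation(n_components=n_topics, random_state=0)
    lda.fit(dtm)
    print(n_topics, round(lda.perplexity(dtm), 2))
```

In practice you would also print each model's top words (shown later) and pick the count whose topics are the most interpretable, not merely the one with the lowest perplexity.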

It must be noted that LDA is subjective by nature. Different people may reach different conclusions about which topic groups are the most meaningful. We are looking for the most reasonable topic groups, and people with different backgrounds and different domain expertise may not be on the same page about which those are.

LDA is an unsupervised clustering method. When we talk about unsupervised clustering methods, we need to mention K-Means clustering, one of the most well-known of them. It is very practical, useful in many cases, and has been used for text mining for years. In contrast to K-Means clustering, where each word can belong to only one cluster (hard clustering), LDA allows ‘fuzzy’ memberships (soft clustering). Soft clustering allows overlap among the clusters, whereas in hard clustering the clusters are mutually exclusive. What does this mean? In LDA, a word can belong to more than one group, while this is not possible in K-Means clustering. This trade-off makes it easier for LDA to find similarities between words. However, soft clustering makes it harder to obtain distinct groups, because the same words can appear in different groups. We will experience this drawback in our further analysis.
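The hard-vs-soft distinction can be seen directly in scikit-learn's output shapes. In this hypothetical mini-corpus (not the Amazon data), K-Means returns exactly one cluster label per document, while LDA returns a probability distribution over both topics:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.cluster import KMeans

docs = [
    "dog cat pet animal",
    "cat kitten pet food",
    "stock market money trade",
    "money bank trade finance",
]
dtm = CountVectorizer().fit_transform(docs)

# Hard clustering: each document gets exactly one cluster label.
km_labels = KMeans(n_clusters=2, random_state=0, n_init=10).fit_predict(dtm)

# Soft clustering: each document gets a probability over both topics.
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(dtm)  # each row sums to 1

print(km_labels)      # one label per document
print(doc_topics[0])  # a distribution over the 2 topics
```

The same soft membership applies to words: each row of `lda.components_` assigns every vocabulary term a weight in every topic, which is exactly why the same word can surface in several topic groups.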

Once the topic modeling technique is applied, the researcher’s job as a human is to interpret the results and see whether the mix of words in each topic makes sense. If it doesn’t, we can try changing the number of topics, the terms in the document-term matrix, the model parameters, or even a different model.

Preparation of the data

**Brief information about the data I use:** The data used in this study were downloaded from Kaggle, where they were uploaded by the Stanford Network Analysis Project. The original data come from the study ‘From amateurs to connoisseurs: modeling the evolution of user expertise through online reviews’ by J. McAuley and J. Leskovec (2013). This data set consists of reviews of fine foods from Amazon. It includes all 568,454 reviews spanning 1999 to 2012. Reviews include product and user information, ratings, and a plain-text review.

In this study, I will focus on ‘good reviews’ on Amazon. I defined ‘good reviews’ as reviews with a 4-star or 5-star customer rating (out of 5). In other words, if an Amazon review is 4 or 5 stars, it is labeled a ‘good review’ in this study; 1-, 2-, or 3-star reviews are labeled ‘bad reviews’.
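The good/bad split described above can be sketched with pandas. The tiny DataFrame here is made up for illustration; the column names `Score` and `Text` follow the Kaggle fine-foods file, but check your own copy of the data:

```python
import pandas as pd

# Stand-in for the real reviews file.
reviews = pd.DataFrame({
    "Score": [5, 4, 3, 2, 1],
    "Text": ["love it", "very good", "okay", "stale", "awful"],
})

# 4- and 5-star reviews are 'good'; 1-, 2-, and 3-star reviews are 'bad'.
reviews["label"] = reviews["Score"].apply(lambda s: "good" if s >= 4 else "bad")
good_reviews = reviews[reviews["label"] == "good"]
print(good_reviews["Text"].tolist())  # the two 4+ star reviews
```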

Data preparation is a critical part, because if we fail to prepare the data properly, we can’t perform topic modeling. Here, I won’t dive into detail, because data preparation is not the focus of this study. However, be ready to spend some time on it in case you run into problems. If you adapt the code I provide here to your own dataset, you should not have any issues.
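As a flavor of what this preparation typically involves, here is a minimal cleaning sketch. These steps (lowercasing, stripping leftover HTML tags, keeping letters only) are common choices for review text, not necessarily the exact pipeline used in this study:

```python
import re

def clean(text):
    """Lowercase and strip HTML tags, punctuation, and digits."""
    text = text.lower()
    text = re.sub(r"<[^>]+>", " ", text)   # drop HTML tags left in reviews
    text = re.sub(r"[^a-z\s]", " ", text)  # keep letters only
    return re.sub(r"\s+", " ", text).strip()

print(clean("Great <br/> product!!! 10/10 would buy again."))
# -> "great product would buy again"
```

Stop-word removal and, optionally, lemmatization would usually follow before building the document-term matrix.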

#python #topic-modeling #machine-learning #data-science
