With a focus on Latent Dirichlet Allocation


In natural language processing, the term topic means a set of words that “go together”. These are the words that come to mind when thinking of this topic. Take sports. Some such words are athlete, soccer, and stadium.

A topic model is one that automatically discovers topics occurring in a collection of documents. A trained model may then be used to discern which of these topics occur in new documents. The model can also pick out which portions of a document cover which topics.

Consider Wikipedia. It has millions of documents covering hundreds of thousands of topics. Wouldn’t it be great if these could be discovered automatically? Plus a finer map of which documents cover which topics. These would be useful adjuncts for people seeking to explore Wikipedia.

We could also discover emerging topics, as documents get written about them. In some settings (such as news) where new documents are constantly being produced and recency matters, this would help us detect trending topics.

This post covers a statistically powerful and widely used approach to this problem.

Latent Dirichlet Allocation

This approach involves building explicit statistical models of topics and documents.

A topic is modeled as a probability distribution over a fixed set of words (the lexicon). This formalizes “the set of words that come to mind when referring to this topic”. A document is modeled as a probability distribution over a fixed set of topics. This reveals the topics the document covers.

The aim of learning is to discover, from a corpus of documents, good word distributions of the various topics, as well as good topic proportions in the various documents. The number of topics is a parameter to this learning.
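
To make these two distributions concrete, here is a minimal sketch in Python. The topic names and probabilities are made-up illustrative numbers, not output from any trained model.

# Made-up illustrative numbers: a topic is a distribution over the lexicon,
# and a document is a distribution over topics.
sports_topic = {"athlete": 0.30, "soccer": 0.25, "stadium": 0.20, "tennis": 0.25}
doc_topic_proportions = {"sports": 0.70, "technology": 0.30}

# Both are probability distributions, so each sums to 1.
assert abs(sum(sports_topic.values()) - 1.0) < 1e-9
assert abs(sum(doc_topic_proportions.values()) - 1.0) < 1e-9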

Generating A Document

At this stage, it will help to describe how to generate a synthetic document from a learned model. This will reveal key aspects of how this model operates that we haven’t delved into yet.

First, we’ll pick the topics this document will cover. One way to do this is to first pick a random document from our corpus, then set the new document’s topic proportions to those of the seed document.

Next, we’ll set the document length, call it n.

Next, we will repeat the following n times:

sample a topic from the document’s topic proportions
sample a word from the chosen topic’s words-distribution

This will emit a sequence of n words. These words will come annotated with the topics they were sampled from.

The resulting document is gibberish. A bag of words sampled from a mix of topics. That’s not a problem — it wasn’t meant to be read. It does reveal which words were generated from which topics, which can be insightful.

Example

Lexicon: {athlete, football, soccer, tennis, computer, smartphone, laptop, printer, Intel, Apple, Google}
Num topics: 3
Topic 1: {athlete, football, soccer, tennis}
Topic 2: {computer, smartphone, laptop, printer}
Topic 3: {Intel, Apple, Google}
Topic proportions in a document: { 2 ⇒ 70%, 3 ⇒ 30% }

In the above, we’ve described a topic as a set of words. We interpret this as: all the words in the set are equiprobable; the remaining words in the lexicon have zero probability.

Let’s see a 4-word generated document.

Topic:  2      3       2          2
Word: laptop Intel smartphone computer

Topic 3’s proportion in this document (25%) is close to its proportion (30%) in its sampling distribution.
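
Here is a short Python sketch of the two-step generation loop described above, using the example’s topics 2 and 3 and their 70/30 proportions. The fixed document length of 4 and the use of Python’s random module are choices made for illustration.

import random

# Topics from the example above: each is a set of equiprobable words.
topics = {
    2: ["computer", "smartphone", "laptop", "printer"],
    3: ["Intel", "Apple", "Google"],
}

# Topic proportions of the document to generate: 70% topic 2, 30% topic 3.
topic_proportions = {2: 0.70, 3: 0.30}

n = 4  # document length
document = []
for _ in range(n):
    # 1. Sample a topic from the document's topic proportions.
    t = random.choices(list(topic_proportions), weights=list(topic_proportions.values()))[0]
    # 2. Sample a word from the chosen topic's (uniform) word distribution.
    w = random.choice(topics[t])
    document.append((t, w))

# A bag of (topic, word) pairs, e.g. [(2, 'laptop'), (3, 'Intel'), (2, 'smartphone'), (2, 'computer')]
print(document)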

Learning

As usual, this is where things get especially interesting.

First, let’s remind ourselves of the aim of learning. It is to discover, from a corpus of documents, the word distributions of the various topics, and the topic proportions in the various documents. In short, what words describe which topic, and which topics are covered in which document.

The algorithm we’ll describe is in wide use. It is also not hard to understand. It is a form of Gibbs Sampling.

This algorithm works by initially assigning the topics to the various words in the corpus somehow, then iteratively improving these assignments. During its operation, the algorithm keeps track of certain statistics on the current assignments. These statistics help the algorithm in its subsequent learning. When the algorithm terminates, it is easy to “read off” the per-topic word distributions and the per-document topic proportions from the final topic assignments.

Let’s start by describing the statistics mentioned in the previous paragraph. These take the form of two matrices of counts: topic_word and doc_topic. Both are derived from the current assignment of topics to the words in the corpus. topic_word(t,w) counts the number of occurrences of topic t for word w. doc_topic(d,t) counts the number of occurrences of topic t in document d.

Let’s see a numeric example to make sure we got it right. Below we see a two-document corpus along with an assignment of topics to its words. The lexicon is A, B, C.

Doc 1’s words:  A B A C A        Doc 2’s words:  B C C B
Doc 1’s topics: 1 1 1 2 2        Doc 2’s topics: 2 2 2 2

Actually let’s first use this opportunity to muse about some peculiarities we see. In doc 1, notice that A is assigned sometimes to topic 1 and sometimes to topic 2. This is plausible if word A has a high probability in both topics. In doc 2, notice that B is consistently assigned to topic 2. This is plausible if Doc 2 covers only topic 2, and B has a positive probability in topic 2’s distribution.

Okay, now to the two matrices of counts.

topic_word:           doc_topic:
  A B C                    1 2
1 2 1 0                 d1 3 2
2 1 2 3                 d2 0 4

A few entries are striking. Perhaps doc 2 prefers topic 2. Perhaps topic 2 prefers word C.
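
As a sanity check, here is a short Python sketch that derives both count matrices from the assignment in this example. The variable names mirror the matrices above; representing them as dictionaries keyed by pairs is just a convenience.

from collections import defaultdict

# The two-document corpus and its current topic assignment, as above.
corpus = [
    (["A", "B", "A", "C", "A"], [1, 1, 1, 2, 2]),  # doc 1
    (["B", "C", "C", "B"],      [2, 2, 2, 2]),     # doc 2
]

topic_word = defaultdict(int)  # (t, w) -> count
doc_topic = defaultdict(int)   # (d, t) -> count
for d, (words, assigned_topics) in enumerate(corpus, start=1):
    for w, t in zip(words, assigned_topics):
        topic_word[(t, w)] += 1
        doc_topic[(d, t)] += 1

print(topic_word[(2, "C")])  # 3, matching the topic_word matrix above
print(doc_topic[(2, 2)])     # 4, matching the doc_topic matrix above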

Okay, let’s start explaining the learning. The first step is to label the words in the corpus with randomly-sampled topics. This sounds easy enough, but there is a bit more to it. Instead of hard-coding this random sampling, it is better to sample from suitable prior distributions. This gives us a potentially powerful mechanism to inject domain knowledge or results from external text analyses.

This priors-based mechanism works as follows. First, we make copies of the two matrices we introduced earlier. Call them prior_topic_word and prior_doc_topic respectively. As before, the entries in these matrices are counts. These counts capture our prior beliefs.

These prior matrices influence the initial assignment of topics. As learning progresses, this influence diminishes, albeit not to zero.

How exactly do we sample the initial assignment of topics from these counts? First, we calculate

P(w|t) = prior_topic_word(t,w) / sum_w’ prior_topic_word(t,w’)

P(t|d) = prior_doc_topic(d,t) / sum_t’ prior_doc_topic(d,t’)

P(w|t) is just the fraction of the assignments of topic t whose word is w. P(t|d) is just the fraction of the words in document d whose assigned topic is t.

Next, we sample the assignments from these. More specifically, we sample the topic for word w in document d from a distribution whose numerator is P(w|t)P(t|d).

This may be understood as follows. P(w|t)P(t|d) is exactly the probability of generating word w via topic t in document d under our generative model. Viewed as a function of t, it captures the likelihood that t was the topic used during this step.

Now let’s discuss setting the values of these counts in the two prior matrices. For our purposes here, all we care about is that no topic be preferred over another. Such preferences would be unwanted biases. We can achieve this by setting all the counts in each matrix to the same positive number; 1 is the simplest choice, by Occam’s razor reasoning.

prior_topic_word(t,w)=1 for every topic t and word w
prior_doc_topic(d,t)=1 for every document d and topic t

Okay, so the topic assignments will be sampled from these counts and come out uniformly random.
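
Here is a minimal Python sketch of this initialization, with every prior count set to 1 as suggested; the variable and function names are mine. Since the priors are uniform, P(w|t)P(t|d) is the same for every topic t, so the sampled assignments indeed come out uniformly at random.

import random

num_topics = 3
lexicon = ["A", "B", "C"]
corpus = [["A", "B", "A", "C", "A"], ["B", "C", "C", "B"]]
topics = list(range(1, num_topics + 1))

# Uniform priors: every count is 1, so no topic is preferred over another.
prior_topic_word = {(t, w): 1 for t in topics for w in lexicon}
prior_doc_topic = {(d, t): 1 for d in range(len(corpus)) for t in topics}

def p_word_given_topic(w, t):
    # P(w|t) from the prior counts.
    return prior_topic_word[(t, w)] / sum(prior_topic_word[(t, w2)] for w2 in lexicon)

def p_topic_given_doc(t, d):
    # P(t|d) from the prior counts.
    return prior_doc_topic[(d, t)] / sum(prior_doc_topic[(d, t2)] for t2 in topics)

# Initial assignment: sample each word's topic with weight P(w|t) * P(t|d).
assignments = []
for d, words in enumerate(corpus):
    doc_assignment = []
    for w in words:
        weights = [p_word_given_topic(w, t) * p_topic_given_doc(t, d) for t in topics]
        doc_assignment.append(random.choices(topics, weights=weights)[0])
    assignments.append(doc_assignment)

print(assignments)  # a topic drawn uniformly at random for each word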

Subsequent to this initial assignment, we will repeatedly do the following in the hopes of improving the assignment and consequently, our models learned from it:

1. Pick a word w from a document d in the corpus
2. Sample a topic t’ from the distribution whose numerator is Q(w|t)Q(t|d)
3. Set w’s topic in d to t’.
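
The post does not define Q. A standard reading, which is an assumption here, is that Q(w|t) and Q(t|d) are the analogues of P(w|t) and P(t|d) computed from the current topic_word and doc_topic counts plus the prior counts, with the word being resampled first removed from those counts. Below is a self-contained sketch of the whole loop (initialization plus repeated sweeps) under that assumption; the number of sweeps is an arbitrary choice.

import random
from collections import defaultdict

num_topics = 3
lexicon = ["A", "B", "C"]
corpus = [["A", "B", "A", "C", "A"], ["B", "C", "C", "B"]]
topics = list(range(1, num_topics + 1))
PRIOR = 1  # uniform prior count for both matrices, as suggested above

# Start from a uniformly random initial assignment (what the uniform priors give us).
assignments = [[random.choice(topics) for _ in doc] for doc in corpus]

# Current count matrices, derived from the assignment.
topic_word = defaultdict(int)  # (t, w) -> count
doc_topic = defaultdict(int)   # (d, t) -> count
for d, doc in enumerate(corpus):
    for w, t in zip(doc, assignments[d]):
        topic_word[(t, w)] += 1
        doc_topic[(d, t)] += 1

def q_word_given_topic(w, t):
    # Assumed reading of Q(w|t): topic t's current count for word w,
    # smoothed by the prior, normalized over the lexicon.
    return (topic_word[(t, w)] + PRIOR) / sum(topic_word[(t, w2)] + PRIOR for w2 in lexicon)

def q_topic_given_doc(t, d):
    # Assumed reading of Q(t|d): document d's current count for topic t,
    # smoothed by the prior, normalized over the topics.
    return (doc_topic[(d, t)] + PRIOR) / sum(doc_topic[(d, t2)] + PRIOR for t2 in topics)

for _ in range(100):  # number of sweeps over the corpus; an arbitrary choice
    for d, doc in enumerate(corpus):
        for i, w in enumerate(doc):
            # 1. Pick a word w in document d; remove its current assignment from the counts.
            t_old = assignments[d][i]
            topic_word[(t_old, w)] -= 1
            doc_topic[(d, t_old)] -= 1
            # 2. Sample a topic t' with weight Q(w|t) * Q(t|d).
            weights = [q_word_given_topic(w, t) * q_topic_given_doc(t, d) for t in topics]
            t_new = random.choices(topics, weights=weights)[0]
            # 3. Set w's topic in d to t', and restore the counts.
            assignments[d][i] = t_new
            topic_word[(t_new, w)] += 1
            doc_topic[(d, t_new)] += 1

# "Read off" the learned models from the final counts.
for t in topics:
    print("topic", t, {w: round(q_word_given_topic(w, t), 2) for w in lexicon})
for d in range(len(corpus)):
    print("doc", d, {t: round(q_topic_given_doc(t, d), 2) for t in topics})

On this toy two-document corpus the learned distributions are necessarily noisy, but the same loop is how, on a real corpus, we read off the per-topic word distributions and per-document topic proportions mentioned earlier.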

