
Understanding the LDA Algorithm

Latent Dirichlet Allocation (LDA) is a text mining algorithm that extracts topics from a collection of documents. In a nutshell, LDA assumes that each document defines a distribution over topics, and each topic defines a distribution over words. Each word is generated by first sampling a topic from its document's topic distribution, and then sampling the word from that topic's word distribution. Training an LDA model means solving for the parameters of these two kinds of distributions (doc-topic and topic-word) given many documents; evaluating it usually means predicting the topic distribution of a new, unseen document.

This article introduces LDA in a top-down fashion: starting with a NeurIPS paper example, it first formulates topic modeling as a parameter estimation problem; it then introduces some essential math tools (the Dirichlet distribution, Gibbs sampling, etc.); and finally, it derives the LDA training and testing algorithms.

👉 Check out my LDA implementation if you are interested! https://github.com/mistylight/Understanding_the_LDA_Algorithm

Example

Imagine you are analyzing papers published in a machine learning conference. There are M=1000 accepted papers, and each paper has an N=200-word abstract. The conference covers K=3 topics: vision, language, and graph. In order to analyze which topics each paper addresses, you make the following assumptions:

  • Each paper defines a theme, which is a distribution over topics, e.g.

    {Vision: 0.7, Language: 0.2, Graph: 0.1}
  • Each topic defines a distribution over words (same for all papers), e.g.

    Vision: {"image": 0.05, "recognition": 0.01, "detection": 0.01, ...}
    Language: {"text": 0.07, "semantic": 0.02, "summarization": 0.01, ...}
    Graph: {"node": 0.04, "edge": 0.04, "clustering": 0.01, ...}
  • The words in the papers are generated using the following procedure (as illustrated in the figure below):

    for m = 1…M // loop across documents

    • Randomly generate a theme $\vec\theta$, e.g. {Vision: 0.7, Language: 0.2, Graph: 0.1}
    • for n = 1…N // loop across words
      • Randomly generate a topic $z$ according to the theme, e.g. Vision
      • Randomly generate a word $w$ according to the topic, e.g. "image"
    Generating words in one paper according to the LDA model. Inspired by CS 221 course note [3]

During LDA training, we are interested in solving the following 2 problems:

  • Finding the topic distribution for each of the M existing papers ($\mathbf{\Theta} \in \mathbb{R}^{M\times K}$);
  • Finding the word distribution for each of the K topics ($\mathbf{\Phi} \in \mathbb{R}^{K\times V}$, where V is the vocabulary size).

We can then use these two matrices in LDA testing:

  • Given a new unseen paper, find its topic distribution ($\vec\theta_{\text{new}} \in \mathbb{R}^K$).

The Dirichlet distribution 101

Prior, data, and posterior

Let’s take a closer look at the parameters we want to solve for: the topic distribution of each of the M documents ($\mathbf{\Theta}$) and the word distribution of each of the K topics ($\mathbf{\Phi}$). In order to estimate them, the LDA algorithm makes some assumptions about their prior. What is a prior? It’s our belief about the parameters before observing the data. In contrast, after you observe the data, you get the posterior. Usually we are interested in the posterior, since it incorporates the information in the data and is therefore suitable for estimating the parameters using methods like MAP (maximum a posteriori) or EAP (expected a posteriori).

Here’s an example: Imagine you have a coin which you believe to be fair, so you think its probability of showing heads is most likely around 0.5 (prior $P(p_{\text{coin}})$). However, after tossing it several times, it always shows heads (data). Your belief about its heads probability therefore gradually shifts from 0.5 toward 1 as the evidence accumulates (posterior $P(p_{\text{coin}}\mid \text{data})$). In this example, both the prior and the posterior of the coin’s heads probability can be nicely modeled with the Beta distribution. We can report either the expectation or the maximum of the posterior to answer the question “What do you think is the probability of heads after observing several tosses?”

Example prior $P(p_{\text{coin}})$ and example posterior $P(p_{\text{coin}}\mid \text{data})$ after tossing the coin and observing heads only.

Similarly, given a paper, we may have a prior that its topic distribution is most likely symmetric, e.g. {Vision: 1/3, Language: 1/3, Graph: 1/3} (prior $P(\vec\theta)$). As with the coin, you may imagine this as a three-sided die. After reading the paper through, you never encounter a single word about graphs; 99% of the time you see words like “image”, “captioning”, “text”, and so on (data $\mathbf{W}$). At this point your belief about its topic ratio (the probabilities of the die) may shift toward something like {Vision: 0.5, Language: 0.5, Graph: 0.0} (posterior $P(\vec\theta \mid \mathbf{W})$). In this example, both the prior and the posterior can be nicely modeled with the Dirichlet distribution. We can report either the expectation or the maximum of the posterior to answer the question “What do you think is the topic distribution of the paper after reading it through?”

Example prior $P(\vec\theta)$ and example posterior $P(\vec\theta\mid \text{data})$ after observing only vision- and language-related words in the paper. (figure credit)

Note that in the above two examples, the Beta/Dirichlet distributions aren’t our only choices. We choose them mainly because they are mathematically convenient: if the prior is Beta/Dirichlet, then the posterior is also Beta/Dirichlet. This algebraic convenience is what makes them “conjugate priors”.

The Dirichlet Distribution: Jar of dice

We’ve been using sloppy terms like “belief” to describe the following intuitions:

  1. A prior/posterior is a distribution that models the probability density of the estimated parameter. If we make the analogy that the topic distribution of a paper (a.k.a. the “theme”) is like a K-sided die, then the Dirichlet distribution is like a jar of dice. Each time you reach in (which we call “sampling”), you get a new die. In the paper example, before you observe the data, the jar is likely to give you a fair die; after you observe the data, it is likely to give you a biased one.
  2. The expectation of the posterior shifts as you accumulate more data. This is roughly equivalent to sampling from the jar many times and averaging the dice you get.

Now let’s take a closer look at the Dirichlet distribution, our jar of dice.


Recall that a “die” refers to a multinomial distribution $\text{Mult}(p_1,...,p_n)$ with $\sum_{i=1}^n p_i=1$ and $p_i \geq 0$. A jar of dice is then a multivariate distribution over the $n$ variables $\vec{p} = [p_1, ...,p_n]$, where $\sum p_i=1$ and $p_i \geq 0$. And remember our ultimate goal is to estimate the die by solving for the posterior, which is the jar after observing some data.

So, what is the data? Intuitively, the best way to collect data in order to estimate a die is to toss it many times and count the occurrences of each outcome. Here’s an example: Imagine you toss a three-sided doc-topic die 100 times, and this is what you get:

  • Vision: 50 (50%)
  • Language: 40 (40%)
  • Graph: 10 (10%)

And suppose your prior is that the die should be fair. The most naive way to incorporate this belief is to add a “pseudo-count” to your observation. The amount of pseudo-count depends on the strength of your prior: if you really think the die should be as fair as possible despite the observation, you probably want to add a large pseudo-count (e.g. 100) to your data, which pulls the expectation of the posterior much closer to a fair die:

  • Vision: 50+100 (37.5%)
  • Language: 40+100 (35%)
  • Graph: 10+100 (27.5%)

Otherwise you may want to add a tiny bit of pseudo-count (e.g. 1), which perturbs the observation by a small amount:

  • Vision: 50+1 (49.5%)
  • Language: 40+1 (39.8%)
  • Graph: 10+1 (10.7%)

And it turns out the Dirichlet distribution respects this pseudo-count intuition: it is parameterized by a vector $\vec{\alpha}=\{\alpha_1,...,\alpha_n\}$, where each $\alpha_i$ can be regarded as the pseudo-count for the $i$-th outcome of the die. We denote the Dirichlet distribution as $P(\vec{p};\vec{\alpha})$, meaning it is a distribution over the die probabilities $\vec{p}$ parameterized by the pseudo-counts $\vec\alpha$. It has the following two nice properties:

  • The Dirichlet posterior $\text{Dir}(\vec{p}\mid \vec{x};\vec\alpha)$ after observing data $\vec{x}=[x_1,...,x_n]$ ($x_i$ being the number of occurrences of the $i$-th outcome) is a new Dirichlet distribution incorporating the prior’s pseudo-counts with the observed data, which makes it straightforward to compute the posterior:

    $$\text{Dir}(\vec{p}\mid \vec{x}; \vec{\alpha}) = \text{Dir}(\vec{p}; \vec{\alpha}+\vec x)$$

  • The expectation of a Dirichlet prior $\text{Dir}(\vec{p};\vec{\alpha})$ is equal to the normalized ratios of the $\alpha_i$s, which makes it straightforward to estimate the value of the parameter from a Dirichlet posterior:

    $$\mathbb{E}[\text{Dir}(\vec{p}; \vec{\alpha})] = \left[\dfrac{\alpha_1}{\sum_{i=1}^n \alpha_i}, \dfrac{\alpha_2}{\sum_{i=1}^n \alpha_i},...,\dfrac{\alpha_n}{\sum_{i=1}^n \alpha_i}\right]$$

We are not going to dive into too much detail: the definition of the Dirichlet distribution can be found here, and there are several interpretations of the intuition behind its density function [3]. Personally, I found it helpful to simply regard it as a convenient math tool invented ad hoc to model the jar of dice (i.e. a distribution over multinomial distributions). Perhaps surprisingly, the LDA algorithm will not make use of its density function; a solid grasp of the above two properties is sufficient.
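To make these two properties concrete, here is a minimal NumPy sketch that reproduces the pseudo-count arithmetic from the three-sided die example above:

```python
import numpy as np

# Prior pseudo-counts for a 3-sided doc-topic die (Vision, Language, Graph).
alpha = np.array([1.0, 1.0, 1.0])

# Observed counts after "tossing" the die 100 times.
counts = np.array([50, 40, 10])

# Property 1: the posterior is again Dirichlet, with parameters alpha + counts.
posterior_params = alpha + counts

# Property 2: the expectation of a Dirichlet is its normalized parameter vector.
posterior_mean = posterior_params / posterior_params.sum()
print(posterior_mean)        # ≈ [0.495, 0.398, 0.107], matching the "+1 pseudo-count" example

# A stronger prior (pseudo-count 100 per face) pulls the estimate toward fairness.
strong_prior_mean = (counts + 100) / (counts + 100).sum()
print(strong_prior_mean)     # ≈ [0.375, 0.35, 0.275]
```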

Back to our problem

Now, going back to our problem: the parameters to estimate can be imagined as M doc-topic dice (each with K faces, as in $\mathbf{\Theta}\in\mathbb{R}^{M\times K}$) and K topic-word dice (each with V faces, as in $\mathbf{\Phi}\in\mathbb{R}^{K\times V}$). The M doc-topic dice share one prior, while the K topic-word dice share another.

Extending our analogy: imagine there are 2 jars of dice. One jar is filled with doc-topic dice, from which you randomly take M of them ($\mathbf{\Theta}=\{\vec{\theta_1},...,\vec{\theta_M}\}$, where each $\vec\theta_m$ is a K-dim distribution over topics); these become the topic distributions of the M documents. The other jar is filled with topic-word dice, from which you randomly take K of them ($\mathbf{\Phi}=\{\vec{\varphi_1},...,\vec{\varphi_K}\}$, where each $\vec\varphi_k$ is a V-dim distribution over words); these become the word distributions of the K topics.


Our goal is to estimate the values of $\mathbf{\Theta}$ and $\mathbf{\Phi}$ by solving for their posteriors and taking the expectations:

|  | doc-topic dice $\mathbf\Theta$ | topic-word dice $\mathbf\Phi$ |
| --- | --- | --- |
| prior | $\text{Dir}(\vec\alpha)$, where $\vec\alpha$ is a hyperparameter | $\text{Dir}(\vec\beta)$, where $\vec\beta$ is a hyperparameter |
| data | the number of words assigned to each topic in a document (e.g. 100 words under Vision, 200 words under Language, 300 words under Graph) | the occurrences of different words generated by the same topic (e.g. for topic Vision, "image" appeared 10 times, "recognition" appeared 20 times, etc.) |
| posterior | to be solved by LDA | to be solved by LDA |

The LDA model

Problem formulation

Now refining the random sampling parts in our initial pseudo-code, we can summarize the LDA generative procedure as follows (see the figure below for an example):

// initializing the M doc-topic dice and the K topic-word dice

  • for m = 1…M // loop across documents
    • Randomly sample a doc-topic die $\vec{\theta_m} \sim \text{Dir}(\vec\alpha)$
      // K-dim probability distribution over all topics,
      // e.g. {Vision: 0.7, Language: 0.2, Graph: 0.1}
  • for k = 1…K // loop across topics
    • Randomly sample a topic-word die $\vec{\varphi_k} \sim \text{Dir}(\vec\beta)$
      // V-dim probability distribution over all words,
      // e.g. {"image": 0.05, "recognition": 0.01, "detection": 0.01, ...}

// Generate the words

  • for m = 1…M // loop across documents
    • for n = 1…N // loop across words
      • Randomly sample a topic $z_{mn} \sim \text{Mult}(\vec{\theta_m})$
        // integer in $\{1,...,K\}$ representing a topic, e.g. Vision
      • Randomly sample a word $w_{mn} \sim \text{Mult}(\vec{\varphi}_{z_{mn}})$
        // integer in $\{1,...,V\}$ representing a word, e.g. "image"
[Figure: an example run of the LDA generative procedure]
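If you prefer code to pseudo-code, here is a minimal NumPy sketch of the same generative procedure (the corpus sizes, prior values, and random seed below are made up for illustration, not taken from any real dataset):

```python
import numpy as np

rng = np.random.default_rng(0)

M, N, K, V = 5, 20, 3, 8          # documents, words per doc, topics, vocabulary size (toy values)
alpha = np.full(K, 0.5)           # Dirichlet prior over topics for each document
beta = np.full(V, 0.1)            # Dirichlet prior over words for each topic

theta = rng.dirichlet(alpha, size=M)   # M doc-topic dice, shape (M, K)
phi = rng.dirichlet(beta, size=K)      # K topic-word dice, shape (K, V)

Z = np.zeros((M, N), dtype=int)        # latent topic of each word
W = np.zeros((M, N), dtype=int)        # observed word (as a vocabulary index)

for m in range(M):
    for n in range(N):
        Z[m, n] = rng.choice(K, p=theta[m])        # sample a topic from the document's theme
        W[m, n] = rng.choice(V, p=phi[Z[m, n]])    # sample a word from that topic's word die
```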

Our goals are as follows:

  • For the training-set documents, infer the latent variables $\mathbf{\Theta}$, $\mathbf{\Phi}$ (i.e. the M doc-topic dice and the K topic-word dice) from the observed words $\mathbf{W}$ (the Dirichlet priors $\vec\alpha, \vec\beta$ are hyperparameters of the LDA model) by taking the expectation of the posterior:

    $$\begin{aligned} \hat{\mathbf{\Theta}} &= \mathbb{E}[P(\mathbf{\Theta} \mid \mathbf{W} ; \vec\alpha, \vec\beta)]\\ \hat{\mathbf{\Phi}} &= \mathbb{E}[P(\mathbf{\Phi} \mid \mathbf{W} ; \vec\alpha, \vec\beta)] \end{aligned}$$

  • For an unseen test document with words $\mathbf{W}_{\text{new}}$, infer its latent variable $\vec\theta_{\text{new}}$ (remember we sample a new doc-topic die for each document, but reuse the topic-word dice across all documents, so we only need to infer a new $\vec\theta$, not a new $\vec\varphi$). This is achieved by computing the expectation of the posterior:

$$\hat{\vec{\theta}}_{\text{new}} = \mathbb{E}[P(\vec\theta_{\text{new}}\mid \mathbf{W}\cup\mathbf{W}_{\text{new}} ; \vec\alpha, \vec\beta)]$$

The hidden variable Z

How do we estimate the posteriors of $\mathbf\Theta$ and $\mathbf\Phi$? Well, remember we discussed that it’s straightforward to compute the Dirichlet posteriors as long as we have the data handy:

|  | doc-topic dice $\mathbf\Theta$ | topic-word dice $\mathbf\Phi$ |
| --- | --- | --- |
| prior | $\text{Dir}(\vec\alpha)$, where $\vec\alpha$ is a hyperparameter | $\text{Dir}(\vec\beta)$, where $\vec\beta$ is a hyperparameter |
| data | the number of words assigned to each topic in a document (e.g. 100 words under Vision, 200 words under Language, 300 words under Graph) | the occurrences of different words generated by the same topic (e.g. for topic Vision, "image" appeared 10 times, "recognition" appeared 20 times, etc.) |
| posterior | to be solved by LDA | to be solved by LDA |

The problem is that we don’t really know which topic is behind each word! If you are familiar with the EM algorithm, you probably know this is called a “hidden variable”. Here, the topic variable $\mathbf Z\in\mathbb{R}^{M\times N}$ (one topic for each of the N words in each of the M documents) is crucial for computing the posterior (see the table above), yet it’s unknown to us. One way to bypass this challenge is to sample the value of $\mathbf Z$ given the words $\mathbf W \in \mathbb{R}^{M\times N}$:

$$\mathbf Z \sim P(\mathbf Z \mid \mathbf W)$$

The intuition here is that we want to know which topics are most likely given the words we see, so using those topics to estimate the parameters makes the most sense. We also make the approximation of using one sample of $\mathbf Z$ to represent the average over all possible values of $\mathbf Z$. This strategy is nowadays widely adopted by neural networks with stochastic components (e.g. VAEs).

Mathematically, we can dive deeper to see why sampling $\mathbf Z$ from $P(\mathbf Z \mid \mathbf W)$ makes sense. The reader may refer to the appendix for a more in-depth explanation (warning: math ahead).

Gibbs Sampling

But how do you generate samples from a distribution? Simple as it may seem, this isn’t a trivial question. That said, there are many well-developed algorithms that do exactly this. Gibbs sampling is the one we will use here because it’s mathematically convenient under the LDA framework. We’ll introduce the algorithm without explaining why it works; interested readers may refer to this tutorial (or section 0.4 of [3] if you read Chinese).

Gibbs sampling in 2D. Figure credit: Jessica Stringham's blog.

Algorithm: Gibbs sampling

  • Input: the math expression of $p(\vec x)$. That is, for any given $\vec x$, we know how to compute its probability; this, however, doesn’t always make it feasible to compute statistics like expectations or variances, which usually require an integral that can be intractable in practice.

  • Output: samples $\{\vec x^{(1)},\vec x^{(2)},...,\vec x^{(T)}\}$ drawn from $p(\vec x)$. We can then make use of these samples to approximate the statistics we are interested in, e.g. the expectation as the average across all samples.

    Sampling from a probability distribution.

// random initialization for $\vec x^{(0)} \in \mathbb{R}^D$

  • $[x^{(0)}_1,x_2^{(0)},...,x_D^{(0)}]\leftarrow \text{random vector}$

// iterative sampling

  • for t in 0…T // repeat while not converged
    • for d in 1…dim($\vec x$) // loop through the dimensions of $\vec x$
      • sample $x^{(t+1)}_d \sim P(\cdot \mid x_1^{(t+1)},...,x_{d-1}^{(t+1)},x_{d+1}^{(t)},...,x_D^{(t)}) \quad (*)$
    • output $\vec x^{(t+1)}$ as a new sample

In one sentence, what (*) does is: freeze the values of all the other dimensions, and sample the current dimension conditioned on those values. For instance, in 2D, Gibbs sampling alternates between sampling $x$ while freezing $y$, and sampling $y$ while freezing $x$:

Gibbs sampling path in 2D. Figure credit: Jessica Stringham's blog.

And in the 3D scenario, each Gibbs sampling iteration (i.e. the inner loop) is equivalent to:

$$\begin{aligned} x_1^{(t+1)} &\sim P(\cdot \mid x_2^{(t)}, x_3^{(t)}) &&\text{// freeze }x_2, x_3\text{, only sample }x_1\\ x_2^{(t+1)} &\sim P(\cdot \mid x_1^{(t+1)}, x_3^{(t)}) &&\text{// freeze }x_1, x_3\text{, only sample }x_2\\ x_3^{(t+1)} &\sim P(\cdot \mid x_1^{(t+1)}, x_2^{(t+1)}) &&\text{// freeze }x_1, x_2\text{, only sample }x_3 \end{aligned}$$
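As a self-contained illustration (not specific to LDA), here is a sketch of Gibbs sampling for a 2D standard Gaussian with correlation $\rho$, where both conditionals $P(x\mid y)$ and $P(y\mid x)$ are 1D Gaussians and therefore easy to sample from:

```python
import numpy as np

def gibbs_2d_gaussian(rho=0.8, T=5000, seed=0):
    """Sample from a 2D standard Gaussian with correlation rho via Gibbs sampling."""
    rng = np.random.default_rng(seed)
    x, y = 0.0, 0.0                      # arbitrary initialization
    cond_std = np.sqrt(1 - rho ** 2)     # std of the conditionals x|y and y|x
    samples = []
    for _ in range(T):
        x = rng.normal(rho * y, cond_std)   # freeze y, sample x ~ P(x | y)
        y = rng.normal(rho * x, cond_std)   # freeze x, sample y ~ P(y | x)
        samples.append((x, y))
    return np.array(samples)

samples = gibbs_2d_gaussian()
print(samples.mean(axis=0))              # ≈ [0, 0]
print(np.corrcoef(samples.T)[0, 1])      # ≈ 0.8
```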

LDA Training

Recall that, as mentioned at the beginning of this article, the goal of LDA training is to estimate $\mathbf \Theta$ and $\mathbf \Phi$ given the words we see, which requires sampling $\mathbf Z$ from $P(\mathbf Z \mid \mathbf W)$. Let’s do it with Gibbs sampling.

As you have probably noticed, the only tricky part is the (*) line, which means “freeze the values of all the other dimensions, and sample the current dimension conditioned on those values.” For sampling $\mathbf Z$, this means “freeze the topics of all the other words, and sample the topic of the current word,” which can be written as:

$$z_j \sim P(z_j=k\mid \mathbf{Z}_{\neg j}, \mathbf{W})$$

That is, to sample the topic of the $j$-th word in the corpus (imagine “flattening” the corpus by concatenating the words of all papers into one long word list, with the index $j$ running through this list), we hold the topics of all the other words ($\mathbf Z_{\neg j}$) fixed.


And if we go a little further, each step of the Gibbs sampling process can be transformed into…

$$\begin{aligned}
&\quad P(z_j=k\mid \mathbf{Z}_{\neg j}, \mathbf{W}) \\
&\quad\textcolor{darkgreen}{\texttt{// P(A|B) ∝ P(A, B) when B is observed.}}\\
&\propto P(z_j=k, w_j=t \mid \mathbf{Z}_{\neg j}, \mathbf{W}_{\neg j}) \\
&\quad\textcolor{darkgreen}{\texttt{// Marginalize over θ and φ by taking the integral.}}\\
&= \int_{\vec\theta_m}\int_{\vec\varphi_k} P(z_j=k, w_j=t, \vec\theta_m ,\vec\varphi_k \mid \mathbf{Z}_{\neg j},\mathbf{W}_{\neg j})\, d\vec\theta_m\, d\vec\varphi_k\\
&\quad\textcolor{darkgreen}{\texttt{// The above probability is the product of the probabilities of the following events:}}\\
&\quad\textcolor{darkgreen}{\texttt{// 1) θm is sampled from the posterior Dirichlet distribution given } \mathbf{Z}_{\neg j}, \mathbf{W}_{\neg j}\texttt{;}}\\
&\quad\textcolor{darkgreen}{\texttt{// 2) Topic k is sampled from θm with probability θmk;}}\\
&\quad\textcolor{darkgreen}{\texttt{// 3) φk is sampled from the posterior Dirichlet distribution given } \mathbf{Z}_{\neg j}, \mathbf{W}_{\neg j}\texttt{;}}\\
&\quad\textcolor{darkgreen}{\texttt{// 4) Word t is sampled from φk with probability φkt.}}\\
&= \int_{\vec\theta_m} \theta_{mk}\, P(\vec\theta_m\mid \mathbf{Z}_{\neg j}, \mathbf{W}_{\neg j})\, d\vec{\theta}_m \cdot \int_{\vec\varphi_k} \varphi_{kt}\, P(\vec\varphi_{k}\mid \mathbf{Z}_{\neg j},\mathbf{W}_{\neg j})\, d\vec\varphi_k\\
&\quad\textcolor{darkgreen}{\texttt{// The above two integrals are exactly the following two expectations:}}\\
&= \mathbb{E}[\theta_{mk} \mid \mathbf{Z}_{\neg j}, \mathbf{W}_{\neg j}] \cdot \mathbb{E}[\varphi_{kt}\mid \mathbf{Z}_{\neg j}, \mathbf{W}_{\neg j}] \\
&\quad\textcolor{darkgreen}{\texttt{// Recall that the posteriors of Θ and Φ are Dirichlet distributions,}}\\
&\quad\textcolor{darkgreen}{\texttt{// whose parameters are the sums of the pseudo-counts α, β and the observed counts.}}\\
&\quad\textcolor{darkgreen}{\texttt{// So we count the co-occurrences between the M documents and the K topics, } \mathbf{N}_{M\times K,\neg j}\texttt{,}}\\
&\quad\textcolor{darkgreen}{\texttt{// and the co-occurrences between the K topics and the V words, } \mathbf{N}_{K\times V,\neg j}\texttt{,}}\\
&\quad\textcolor{darkgreen}{\texttt{// and use them to compute the expectations of the Dirichlet posteriors.}}\\
&\quad\textcolor{darkgreen}{\texttt{// The } \neg j \texttt{ subscript means that word j is excluded from the co-occurrence counts.}}\\
&= \dfrac{\alpha_k+N_{mk, \neg j}}{\sum_{k'} (\alpha_{k'}+N_{mk',\neg j})} \cdot \dfrac{\beta_t+N_{kt,\neg j}}{\sum_{t'}(\beta_{t'}+N_{kt',\neg j})} \quad (**)
\end{aligned}$$

This part is quite dense. In plain English, what it does is: repeat until convergence; loop through each word; for each word, count the doc-topic and topic-word co-occurrences without taking into account the current word and its topic, then re-sample the topic of the current word according to (**). In the end, the sequence of values of $\mathbf Z$ generated by Gibbs sampling will be samples from $P(\mathbf Z \mid \mathbf W)$.

For estimating the parameters $\mathbf \Theta$ and $\mathbf\Phi$, we can either take the average over multiple samples, or just approximate with one sample. If we go with one sample, then the estimates $\hat{\mathbf\Theta}, \hat{\mathbf{\Phi}}$ will be:

$$\begin{aligned} \theta_{mk} &= \dfrac{\alpha_k+N_{mk}}{\sum_{k'} (\alpha_{k'}+N_{mk'})} \\ \varphi_{kt} &= \dfrac{\beta_t+N_{kt}}{\sum_{t'} (\beta_{t'}+N_{kt'})} \end{aligned}$$

by computing the expectations of their Dirichlet posteriors, i.e. combining the pseudo-counts with the observed counts. Here, after the sampling, $N_{mk}$ is the number of words assigned to topic k in document m, and $N_{kt}$ is the number of co-occurrences between word t and topic k in the entire corpus. For $\mathbf\Theta$, we combine the pseudo-counts $\vec\alpha$ with the observed doc-topic co-occurrences; for $\mathbf \Phi$, we combine the pseudo-counts $\vec\beta$ with the observed topic-word co-occurrences.

We may also take multiple samples of $\mathbf Z$, use each of them to estimate $\mathbf \Theta$ and $\mathbf\Phi$ separately, and average the results. This takes more computational resources, but can provide a more reliable estimate.
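Putting (**) and the estimation formulas together, here is a stripped-down sketch of the training loop (assuming symmetric scalar priors $\alpha$ and $\beta$ and a fixed document length N, with no burn-in or convergence check; the implementation in the linked repo may differ in details). `W` is an (M, N) matrix of word indices like the one produced by the generative sketch earlier:

```python
import numpy as np

def lda_train(W, K, alpha, beta, n_iters=200, seed=0):
    """Collapsed Gibbs sampling for LDA.

    W: (M, N) array of word indices; alpha, beta: symmetric scalar priors.
    Returns the estimated doc-topic matrix theta and topic-word matrix phi.
    """
    rng = np.random.default_rng(seed)
    M, N = W.shape
    V = W.max() + 1

    Z = rng.integers(0, K, size=(M, N))     # random initial topic assignments
    N_mk = np.zeros((M, K))                 # doc-topic co-occurrence counts
    N_kt = np.zeros((K, V))                 # topic-word co-occurrence counts
    for m in range(M):
        for n in range(N):
            N_mk[m, Z[m, n]] += 1
            N_kt[Z[m, n], W[m, n]] += 1

    for _ in range(n_iters):
        for m in range(M):
            for n in range(N):
                k, t = Z[m, n], W[m, n]
                N_mk[m, k] -= 1             # exclude the current word (the "¬j" part)
                N_kt[k, t] -= 1
                # (**): doc-topic term × topic-word term (the first denominator is
                # constant in k, so it can be dropped from the proportionality).
                p = (alpha + N_mk[m]) * (beta + N_kt[:, t]) / (V * beta + N_kt.sum(axis=1))
                k_new = rng.choice(K, p=p / p.sum())
                Z[m, n] = k_new
                N_mk[m, k_new] += 1
                N_kt[k_new, t] += 1

    # Estimate Theta and Phi as the expectations of their Dirichlet posteriors.
    theta = (N_mk + alpha) / (N_mk + alpha).sum(axis=1, keepdims=True)
    phi = (N_kt + beta) / (N_kt + beta).sum(axis=1, keepdims=True)
    return theta, phi
```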

LDA Testing

The goal of LDA testing is to estimate the topic distribution $\vec\theta_{\text{new}}$ of a new document. The only difference compared to LDA training is that the evidence is now $\mathbf W+\mathbf W_{\text{new}}$, and all we need to do is sample the topics of the words in the new document. The Gibbs sampling formula becomes:

$$\begin{aligned} &\quad P(z_j=k\mid \mathbf{Z}_{\neg j}, \mathbf{W}+\mathbf{W}_{\text {new}}) \\ &\propto P(z_j=k, w_j=t \mid \mathbf{Z}_{\neg j}, \mathbf{W},\mathbf{W}_{\text{new}, \neg j}) \\ &= \dfrac{\alpha_k+N_{k, \neg j}}{\sum_{k'} (\alpha_{k'}+N_{k',\neg j})} \cdot \dfrac{\beta_t+N_{kt}+N_{\text{new},kt, \neg j}}{\sum_{t'}(\beta_{t'}+N_{kt'}+N_{\text{new},kt',\neg j})} \quad (***) \end{aligned}$$

where $N_{k,\neg j}$ is the number of words assigned to topic k in the new document (excluding the $j$-th word), and $N_{\text{new},kt,\neg j}$ is the number of co-occurrences of word t with topic k in the new document (excluding the $j$-th word).

The estimation for $\vec\theta_{\text{new}}$ would be (where $N_k$ refers to the count of words with topic k in the new document, after we are done with the topic sampling):

$$\theta_{\text{new},k} = \dfrac{\alpha_k+N_k}{\sum_{k'}(\alpha_{k'}+N_{k'})}$$
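And a matching sketch for testing: the topic-word counts `N_kt` from training are frozen, and only the topics of the new document's words are re-sampled, following (***). Same assumptions as the training sketch above:

```python
import numpy as np

def lda_test(w_new, N_kt, K, alpha, beta, n_iters=100, seed=0):
    """Infer theta_new for one unseen document.

    w_new: 1D array of word indices; N_kt: (K, V) topic-word counts from training.
    """
    rng = np.random.default_rng(seed)
    V = N_kt.shape[1]
    z = rng.integers(0, K, size=len(w_new))            # topic of each new word
    n_k = np.bincount(z, minlength=K).astype(float)    # topic counts in the new doc
    n_new_kt = np.zeros((K, V))                        # topic-word counts in the new doc
    for j, t in enumerate(w_new):
        n_new_kt[z[j], t] += 1

    for _ in range(n_iters):
        for j, t in enumerate(w_new):
            k = z[j]
            n_k[k] -= 1                                # exclude the current word
            n_new_kt[k, t] -= 1
            # (***): the frozen training counts N_kt are added to the new doc's counts.
            p = (alpha + n_k) * (beta + N_kt[:, t] + n_new_kt[:, t]) \
                / (V * beta + N_kt.sum(axis=1) + n_new_kt.sum(axis=1))
            k_new = rng.choice(K, p=p / p.sum())
            z[j] = k_new
            n_k[k_new] += 1
            n_new_kt[k_new, t] += 1

    # Expectation of the Dirichlet posterior: pseudo-counts + topic counts, normalized.
    return (n_k + alpha) / (n_k + alpha).sum()
```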

Example: Topics shift in NeurIPS (19xx-20xx)

This section uses a practical example (papers in NeurIPS) to demonstrate LDA training and testing. All the code can be found at https://github.com/mistylight/Understanding_the_LDA_Algorithm.

Data

We use the NeurIPS papers dataset from this Kaggle competition, which contains the content of the accepted NeurIPS papers from 1987 to 2016. We select the earliest 1000 papers and the latest 1000 papers, and train two separate LDA models on them. I’m particularly interested in one question: how did the topics of NeurIPS shift over the course of ~30 years?

For each paper, I keep the first 200 words of the abstract and limit the vocabulary size to 10000. I set the number of topics to K=10, as I expect the NeurIPS conference to have some level of topic diversity, but not too much.
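For reference, the preprocessing described above can be sketched roughly as follows (the tokenization, stopword handling, and data loading in the actual repo may differ; `abstracts` is assumed to be a list of abstract strings):

```python
import re
from collections import Counter

def preprocess(abstracts, n_words=200, vocab_size=10000):
    """Tokenize abstracts, keep the first n_words tokens, and map them to vocabulary indices."""
    docs = [re.findall(r"[a-z]+", a.lower())[:n_words] for a in abstracts]
    counts = Counter(w for doc in docs for w in doc)
    vocab = {w: i for i, (w, _) in enumerate(counts.most_common(vocab_size))}
    # Drop out-of-vocabulary tokens; real code would also remove stopwords.
    return [[vocab[w] for w in doc if w in vocab] for doc in docs], vocab
```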

LDA Training

Here, I’ll show the results for the topic-word dice $\mathbf\Phi$, as they are usually of higher interest to us: what topics are present in the corpus? What does each topic look like?

The 19xx papers

Here are the 10 topics and their top 10 words for the earliest 1000 papers (in other words, they are published in the 20th century). The deeper the color is, the more weight the word has in the corresponding topic:

[Figure: top 10 words for each of the 10 topics learned from the 19xx papers; deeper color = higher weight]

We can roughly see what most of the topics correspond to (note that this interpretation is purely subjective!):

  • Topic 0: training ML models
  • Topic 1: math symbols / circuit design
  • Topic 2: signal processing
  • Topic 3: reinforcement learning
  • Topic 4: neural networks
  • Topic 5: kernel methods
  • Topic 6: speech recognition
  • Topic 7: biological inspiration for neural networks
  • Topic 8: computer vision
  • Topic 9: probability theory / bayesian networks

The 20xx papers

Here are the 10 topics and their top 10 words for the latest 1000 papers (so they are published around 2016):

[Figure: top 10 words for each of the 10 topics learned from the 20xx papers; deeper color = higher weight]

And we can also try to interpret the meaning of each topic, though this time some of them are harder to explain:

  • Topic 0: training deep neural networks
  • Topic 1: general machine learning
  • Topic 2: (?)
  • Topic 3: generative models (GAN/VAE)
  • Topic 4: neural networks
  • Topic 5: (?)
  • Topic 6: (?)
  • Topic 7: reinforcement learning
  • Topic 8: graph analysis
  • Topic 9: computer vision

It’s interesting to see that some topics remain popular after ~30 years, e.g. reinforcement learning and computer vision; some topics aren’t as popular as they used to be, e.g. kernel methods; and some topics have gained popularity in recent years, e.g. graph analysis.

LDA Testing

Given a new unseen paper, LDA testing aims to find its topic distribution. Here, I’ll give an example of using the LDA model trained on the 19xx papers to predict the topics of an unseen 19xx paper:

Title: Decomposition of Reinforcement Learning for Admission Control
Abstract: This paper presents predictive gain scheduling, a technique for simplifying reinforcement learning problems by decomposition. Link admission
control of self-similar call traffic is used to demonstrate the technique.
The control problem is decomposed into on-line prediction of near-future call arrival rates, and precomputation of policies for Poisson call arrival processes. At decision time, the predictions are used to select among the policies. Simulations show that this technique results in significantly faster learning without any performance loss, compared to a reinforcement learning controller that does not decompose the problem.

The predicted topic distribution is the following K=10-dim vector (recall that this is $\vec\theta_{\text{new}}$):

[0.1299, 0.0005, 0.0005, 0.6124, 0.0005, 0.0005, 0.2144, 0.0403, 0.0005, 0.0005]

where each number corresponds to one of the topics. The top topic is Topic 3 (with a weight of over 60%), which aligns with our intuition that this paper is highly relevant to reinforcement learning.

Closing Thoughts

When I began learning the LDA algorithm in my junior year, I started with the most well-known and comprehensive tutorial in Chinese: LDA Math Gossip (LDA数学八卦) [3]. It is a great tutorial if you are purely interested in the math and absolutely not in a hurry to understand and implement LDA for your project. While I truly appreciate that the tutorial assumes no prior knowledge of statistical ML and explains every component of the algorithm with great clarity, I got distracted by the amount of details and proofs that could have been treated as a black box without much harm. It was so hard to keep track of all the variables and their contexts that when I first read it, I was completely lost. I needed to frequently go back and forth to answer the questions popping into my mind: What is phi? Why are you sampling Z in order to estimate phi? Wait, what is Z again? And so on. I believe this struggle isn’t unique to me, and it became the motivation for writing this article.

At the same time, I believe these problems are inevitable if a tutorial is written in a bottom-up fashion. It’s like showing every single pixel at the beginning and gradually zooming out. I try to do the reverse: show the whole picture at the beginning and gradually zoom in. The downside is that I can’t go as deeply and rigorously into the math as LDA Math Gossip does, which might be unsatisfying for some readers, but my hope is that this article will be friendlier to engineers like me, especially those looking for intuition about the algorithm who can’t afford the time to go through all the details.

References

[1] Latent Dirichlet Allocation. David M. Blei, Andrew Y. Ng, Michael I. Jordan. 2003. https://www.jmlr.org/papers/volume3/blei03a/blei03a.pdf

[2] Parameter estimation for text analysis. Gregor Heinrich. 2005. http://www.arbylon.net/publications/text-est.pdf

[3] LDA数学八卦 (LDA Math Gossip). 靳志辉. 2013. https://bloglxm.oss-cn-beijing.aliyuncs.com/lda-LDA数学八卦.pdf

Appendix

You’ve been warned…


Why do we want to sample Z

In order to compute the expectations of $\mathbf\Theta$ and $\mathbf\Phi$, we need to marginalize over the joint probability distribution $P(\mathbf\Theta, \mathbf\Phi, \mathbf Z \mid \mathbf W)$:

$$\begin{aligned}
\hat{\mathbf \Theta} &= \mathbb{E}_{\mathbf\Theta}[P(\mathbf{\Theta}, \mathbf{\Phi}, \mathbf{Z} \mid \mathbf{W} )] \\
&\quad\texttt{// This is how the expectation is defined...}\\
&=\sum_{\mathbf{Z}}\int_{\mathbf{\Theta}}\int_{\mathbf\Phi}\mathbf{\Theta} \cdot P(\mathbf{\Theta}, \mathbf{\Phi}, \mathbf{Z} \mid \mathbf{W} ) \, d\mathbf{\Theta}\, d\mathbf{\Phi} \\
&\quad\texttt{// P(A,B,C) = P(A|B,C) P(B|C) P(C)}\\
&=\sum_{\mathbf{Z}}\int_{\mathbf{\Theta}}\int_{\mathbf\Phi} \mathbf{\Theta} \cdot P(\mathbf{\Theta}\mid \mathbf{Z},\mathbf{W})\cdot \underbrace{P(\mathbf{\Phi}\mid \mathbf{Z},\mathbf{W},\sout{\mathbf{\Theta}})}_{\mathbf{\Phi}\text{'s estimation only depends on }\mathbf{Z} \text{ and }\mathbf{W}}\cdot P(\mathbf{Z} \mid \mathbf{W} )\, d\mathbf{\Theta}\, d\mathbf{\Phi}\\
&\quad\texttt{// It turns out Θ and Φ are conditionally independent given Z and W,}\\
&\quad\texttt{// because the posterior of Θ (doc-topic dice) can be estimated solely from Z,}\\
&\quad\texttt{// which determines the topic distribution of each document;}\\
&\quad\texttt{// the posterior of Φ (topic-word dice) can be estimated solely from Z and W,}\\
&\quad\texttt{// which determines the word distribution of each topic.}\\
&=\sum_{\mathbf{Z}}\int_{\mathbf{\Theta}}\int_{\mathbf\Phi} \mathbf{\Theta} \cdot \underbrace{P(\mathbf{\Theta}\mid \mathbf{Z},\mathbf{W})}_{\text{posterior of doc-topic dice}}\cdot \underbrace{P(\mathbf{\Phi}\mid \mathbf{Z},\mathbf{W})}_{\text{posterior of topic-word dice}}\cdot P(\mathbf{Z} \mid \mathbf{W})\, d\mathbf{\Theta}\, d\mathbf{\Phi}\\
&\quad\texttt{// Recall that the posteriors of Θ and Φ are Dirichlet distributions,}\\
&\quad\texttt{// whose parameters are the sums of the pseudo-counts α, β and the observed counts.}\\
&\quad\texttt{// So we count the co-occurrences between the M documents and the K topics, } \mathbf{N}_{M\times K}\texttt{,}\\
&\quad\texttt{// and the co-occurrences between the K topics and the V words, } \mathbf{N}_{K\times V}\texttt{,}\\
&\quad\texttt{// which yields the Dirichlet posteriors of Θ and Φ as follows:}\\
&= \sum_{\mathbf{Z}}\int_{\mathbf{\Theta}}\int_{\mathbf\Phi} \mathbf{\Theta} \cdot \text{Dir}(\mathbf{\Theta}\mid \vec\alpha+\mathbf{N}_{M\times K})\cdot \text{Dir}(\mathbf{\Phi}\mid \vec\beta+\mathbf{N}_{K\times V})\cdot P(\mathbf{Z} \mid \mathbf{W})\, d\mathbf{\Theta}\, d\mathbf{\Phi} \\
&\quad\texttt{// Just re-arranging, nothing special...}\\
&= \sum_{\mathbf{Z}}P(\mathbf{Z} \mid \mathbf{W})\cdot \underbrace{\int_{\mathbf{\Theta}}\mathbf{\Theta} \cdot \text{Dir}(\mathbf{\Theta}\mid \vec\alpha+\mathbf{N}_{M\times K})\, d\mathbf{\Theta}}_{=\mathbb{E}_{\mathbf\Theta}[\text{Dir}(\mathbf\Theta\mid\vec\alpha+\mathbf{N}_{M\times K})]}\cdot\underbrace{\int_{\mathbf\Phi} \text{Dir}(\mathbf{\Phi}\mid \vec\beta+\mathbf{N}_{K\times V})\, d\mathbf{\Phi} }_{=1}\\
&\quad\texttt{// The first integral is the expectation of Θ, and the second integral sums to 1.}\\
&= \sum_{\mathbf{Z}}P(\mathbf{Z} \mid \mathbf{W})\cdot \mathbb{E}_{\mathbf\Theta}[\text{Dir}(\mathbf\Theta\mid\vec\alpha+\mathbf{N}_{M\times K})]\\
&\quad\texttt{// The expectation of Dir(·) is the (normalized) ratio between its parameters.}\\
&= \underbrace{\sum_{\mathbf{Z}}P(\mathbf{Z} \mid \mathbf{W})}_{\text{intractable}}\cdot \underbrace{\left[\hat{\theta}_{mk}=\dfrac{\alpha_k+N_{mk}}{\sum_{k'} (\alpha_{k'}+N_{mk'})}\right]_{M\times K}}_{\text{trivial to compute given }\mathbf{Z}}
\end{aligned}$$

The integrals w.r.t. $\mathbf\Theta, \mathbf\Phi$ are feasible to compute once we have $\mathbf Z$: since we are looking for the posteriors of the doc-topic and topic-word dice, and we’ve assumed their priors to be Dirichlet, we can just combine our observations (the occurrences of topics in each document and of words under each topic) with the pseudo-counts $\vec\alpha$ and $\vec\beta$ to get the parameters of the Dirichlet posteriors. And since we know how to compute the expectation of a Dirichlet distribution (which is itself an integral of the same form), these two parts aren’t of much concern.

The real trouble is the summation over all possible values of $\mathbf Z$! Since $\mathbf{Z}$ is an $(M \times N)$-dim discrete variable (remember, it’s the topic of each word) with $K$ possible values per dimension, exhaustive enumeration is intractable. Is there any chance we can bypass it?

The tool we use to handle this intractability is sampling. In computer science, a common way to work with a complicated distribution is to generate samples from it. That is, if you are interested in some very complicated distribution $p(\vec{x})$, you generate a large number of samples $\vec x_i \sim p(\vec x)$. If you then wish to compute some property of the distribution, e.g. its expectation, you can simply take the average of your samples as a reasonable approximation. Otherwise, you would need to compute an integral to calculate the expectation exactly, which is typically intractable in practice.

Given $n$ samples $[\mathbf Z_1, \mathbf Z_2, ..., \mathbf Z_n]$ drawn from $P(\mathbf{Z}\mid\mathbf{W})$ (or, if our computational resources are really limited, even just one sample, i.e. $n=1$), we can approximate

$$\sum_{\mathbf Z} P(\mathbf Z\mid \mathbf W)f(\mathbf Z)\approx \dfrac{1}{n} \sum_{i=1}^n f(\mathbf Z_i)$$

where $f(\mathbf Z)$ can be any function of $\mathbf Z$. In our calculation of $\mathbb{E}_{\mathbf \Theta}[P(\mathbf\Theta\mid \mathbf W)]$ above, $f(\mathbf Z)$ is the matrix with entries $\dfrac{\alpha_k+N_{mk}}{\sum_{k'} (\alpha_{k'}+N_{mk'})}$.
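As a tiny standalone illustration of this idea (not specific to LDA), here is a sketch that approximates an expectation by a sample average instead of an integral, using a Beta distribution whose exact moments are known so the answer can be checked:

```python
import numpy as np

rng = np.random.default_rng(0)

# Approximate E[f(X)] for X ~ Beta(2, 5) and f(x) = x**2 by averaging over samples,
# instead of computing the integral analytically.
samples = rng.beta(2, 5, size=100_000)
print((samples ** 2).mean())     # Monte Carlo estimate
print(2 * 3 / (7 * 8))           # exact E[X^2] = a(a+1) / ((a+b)(a+b+1)) = 6/56 ≈ 0.107
```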

So the next question becomes: how do we generate samples from a distribution? Which brings us back to the Gibbs sampling section 😃