Notes on Probabilistic Latent Semantic Analysis (PLSA)


PLEASE NOTE: THIS PAGE IS NOT AND WILL NOT BE UPDATED.

I highly recommend you read the more detailed version at http://arxiv.org/abs/1212.3900

Formulation of PLSA

There are two ways to formulate PLSA. They are equivalent but may lead to different inference procedures.

  1. P(d,w) = P(d) \sum_{z} P(w|z)P(z|d)
  2. P(d,w) = \sum_{z} P(w|z)P(d|z)P(z)

Let’s see why these two equations are equivalent by using Bayes' rule.

P(z|d) = \frac{P(d|z)P(z)}{P(d)}
P(z|d)P(d) =P(d|z)P(z)
P(w|z)P(z|d)P(d) =P(w|z)P(d|z)P(z)
P(d) \sum_{z} P(w|z)P(z|d) = \sum_{z} P(w|z)P(d|z)P(z)
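
The equivalence can also be checked numerically. The following is a minimal sketch, assuming randomly drawn toy parameters; the sizes, the variable names and the helper `normalize` are made up for illustration. It starts from the parameters of formulation 2, derives P(d) and P(z|d) by Bayes' rule, and compares the two joint distributions.

```python
import random

random.seed(0)

def normalize(values):
    """Scale non-negative numbers so that they sum to 1."""
    total = sum(values)
    return [v / total for v in values]

n_topics, n_docs, n_words = 2, 3, 4  # toy sizes, chosen arbitrarily

# Parameters of formulation 2: P(z), P(d|z) as [z][d], P(w|z) as [z][w].
p_z = normalize([random.random() for _ in range(n_topics)])
p_d_given_z = [normalize([random.random() for _ in range(n_docs)])
               for _ in range(n_topics)]
p_w_given_z = [normalize([random.random() for _ in range(n_words)])
               for _ in range(n_topics)]

# Parameters of formulation 1, derived with Bayes' rule.
p_d = [sum(p_d_given_z[z][d] * p_z[z] for z in range(n_topics))
       for d in range(n_docs)]
p_z_given_d = [[p_d_given_z[z][d] * p_z[z] / p_d[d] for z in range(n_topics)]
               for d in range(n_docs)]

def joint_1(d, w):
    """P(d,w) = P(d) sum_z P(w|z) P(z|d)."""
    return p_d[d] * sum(p_w_given_z[z][w] * p_z_given_d[d][z]
                        for z in range(n_topics))

def joint_2(d, w):
    """P(d,w) = sum_z P(w|z) P(d|z) P(z)."""
    return sum(p_w_given_z[z][w] * p_d_given_z[z][d] * p_z[z]
               for z in range(n_topics))

# Largest difference between the two formulations over all (d, w) pairs.
max_gap = max(abs(joint_1(d, w) - joint_2(d, w))
              for d in range(n_docs) for w in range(n_words))
# The joint distribution should sum to 1 over all (d, w) pairs.
total_mass = sum(joint_2(d, w) for d in range(n_docs) for w in range(n_words))
```

Up to floating-point error, `max_gap` should be zero and `total_mass` should be one.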

The likelihood of the whole data set is (we assume that all words are generated independently, and n(d,w) denotes the number of times word w occurs in document d):

D = \prod_{d} \prod_{w} P(d,w)^{n(d,w)}

The log-likelihoods of the whole data set for (1) and (2) are:

L_{1} = \sum_{d} \sum_{w} n(d,w) \log [ P(d) \sum_{z} P(w|z)P(z|d) ]

L_{2} = \sum_{d} \sum_{w} n(d,w) \log [ \sum_{z} P(w|z)P(d|z)P(z) ]
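
As a rough illustration, L_{1} can be evaluated directly from a count matrix. This is only a sketch: the function name `log_likelihood_1` and the nested-list layout of the parameters are assumptions made here, not something the text prescribes.

```python
import math

def log_likelihood_1(counts, p_d, p_w_given_z, p_z_given_d):
    """L_1 = sum_d sum_w n(d,w) log[ P(d) sum_z P(w|z) P(z|d) ].

    counts[d][w] is n(d,w), p_w_given_z[z][w] is P(w|z),
    and p_z_given_d[d][z] is P(z|d).
    """
    n_topics = len(p_w_given_z)
    total = 0.0
    for d, row in enumerate(counts):
        for w, n_dw in enumerate(row):
            if n_dw == 0:
                continue  # pairs that never occur contribute nothing
            mix = sum(p_w_given_z[z][w] * p_z_given_d[d][z]
                      for z in range(n_topics))
            total += n_dw * math.log(p_d[d] * mix)
    return total

# Toy check: one document, one topic, two equally likely words seen once each.
value = log_likelihood_1([[1, 1]], [1.0], [[0.5, 0.5]], [[1.0]])
```

In the toy check every observed pair has probability 0.5, so `value` equals 2 log 0.5.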

EM

For L_{1} or L_{2}, the optimization is hard due to the log of a sum. Therefore, an algorithm called Expectation-Maximization (EM) is usually employed. Before we introduce anything about EM, please note that EM is only guaranteed to find a local optimum (although it may be a global one).

First, we see how EM works in general. As we have shown for PLSA, we usually want to estimate the likelihood of the data, namely P(X|\theta), given the parameter \theta. The easiest way is to obtain a maximum likelihood estimator by maximizing P(X|\theta). However, sometimes we also want to include hidden variables that are useful for our task. Therefore, what we really want to maximize is P(X|\theta)=\sum_{z}P(X|z,\theta)P(z|\theta), the likelihood with the hidden variables summed out. Our attention now turns to this likelihood. Again, directly maximizing it is usually difficult. Instead, we derive a lower bound of the likelihood and maximize that lower bound.

We need Jensen’s Inequality to help us obtain this lower bound. For any convex function f(x) and any \lambda \in [0,1], Jensen’s Inequality states that:

\lambda f(x) + (1-\lambda) f(y) \geq f(\lambda x + (1-\lambda) y)

Thus, it is not difficult to show (by induction on the number of outcomes) that:

E[f(x)] = \sum_{x} P(x) f(x) \geq f(\sum_{x} P(x) x) = f(E[x])

and for concave functions (such as the logarithm), the inequality is reversed:

E[f(x)] \leq f(E[x])
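
Both directions of the inequality are easy to check on a small example. The sketch below uses an arbitrary three-point distribution (the numbers are made up) with the convex function x^2 and the concave function log x.

```python
import math

probs = [0.2, 0.5, 0.3]     # an arbitrary distribution P(x)
values = [1.0, 4.0, 10.0]   # arbitrary positive outcomes x

mean_x = sum(p * x for p, x in zip(probs, values))  # E[x]

# Convex f(x) = x^2: E[f(x)] >= f(E[x]).
mean_square = sum(p * x * x for p, x in zip(probs, values))
square_mean = mean_x ** 2

# Concave f(x) = log x: E[f(x)] <= f(E[x]).
mean_log = sum(p * math.log(x) for p, x in zip(probs, values))
log_mean = math.log(mean_x)
```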

Back to our likelihood, we can obtain the following bound by using the concave version of Jensen’s Inequality, where q(z) is any distribution over z with q(z) > 0 and the expectation is taken with respect to q(z):

 \log \sum_{z}P(X|z,\theta)P(z|\theta)= \log \sum_{z}P(X|z,\theta)P(z|\theta)\frac{q(z)}{q(z)}
 = \log E[\frac{P(X|z,\theta)P(z|\theta)}{q(z)}]
 \geq E[\log \frac{P(X|z,\theta)P(z|\theta)}{q(z)}]

Therefore, we obtain a lower bound of the log-likelihood, and we want to make it as tight as possible. EM is an algorithm that maximizes this lower bound iteratively. Usually, EM first fixes the current value of \theta and maximizes the bound over q(z), and then uses the new q(z) to obtain a new guess of \theta, which is essentially a two-stage maximization process. The first step can be shown as follows:

E[\log \frac{P(X|z,\theta)P(z|\theta)}{q(z)}] = \sum_{z} q(z) \log \frac{P(X|z,\theta)P(z|\theta)}{q(z)}
= \sum_{z} q(z) \log \frac{P(z|X,\theta)P(X|\theta)}{q(z)}
= \sum_{z} q(z) \log P(X|\theta) + \sum_{z} q(z) \log \frac{P(z|X,\theta)}{q(z)}
= \log P(X|\theta) - \sum_{z} q(z) \log \frac{q(z)}{P(z|X,\theta)}
= \log P(X|\theta) - KL(q(z) || P(z|X,\theta))

The first term does not depend on q(z). Therefore, in order to maximize the whole expression over q(z), we need to minimize the KL divergence between q(z) and P(z|X,\theta), which leads to the optimal solution q(z) = P(z|X,\theta). So, in the E-step, we usually use the current guess of \theta to calculate the posterior distribution of the hidden variable as the new q(z). The M-step is problem-dependent. We will see how to do that in later discussions.
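
The decomposition can be verified on a toy model with a binary hidden variable. This is a sketch under made-up numbers: `prior` stands for P(z|\theta) and `lik` for P(X|z,\theta) at one fixed observation X; both are hypothetical values chosen only for illustration.

```python
import math

prior = [0.3, 0.7]   # P(z | theta), hypothetical values
lik = [0.2, 0.05]    # P(X | z, theta) for one fixed X, hypothetical values

evidence = sum(l * p for l, p in zip(lik, prior))           # P(X | theta)
posterior = [l * p / evidence for l, p in zip(lik, prior)]  # P(z | X, theta)
log_evidence = math.log(evidence)

def lower_bound(q):
    """sum_z q(z) log[ P(X|z,theta) P(z|theta) / q(z) ]."""
    return sum(qz * math.log(l * p / qz)
               for qz, l, p in zip(q, lik, prior) if qz > 0)

def kl(q, p):
    """KL(q || p) for two discrete distributions on the same support."""
    return sum(qz * math.log(qz / pz) for qz, pz in zip(q, p) if qz > 0)

q_guess = [0.5, 0.5]                       # an arbitrary choice of q(z)
gap = log_evidence - lower_bound(q_guess)  # should equal KL(q || posterior)
tight = lower_bound(posterior)             # should equal log P(X | theta)
```

For any q the gap between the log-likelihood and the bound is the KL divergence to the posterior, and the bound becomes tight when q is the posterior itself.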

Another explanation of EM is in terms of optimizing a so-called Q-function. We write the joint distribution of the data X and the hidden variables H as P(X,H|\theta)=P(H|X,\theta)P(X|\theta). Therefore, the complete log-likelihood is:

L_{c}(\theta) = \log P(X,H|\theta) = \log P(X|\theta) + \log P(H|X,\theta) = L(\theta) + \log P(H|X,\theta)

Our goal is to maximize L(\theta). Instead of maximizing it directly, we can iteratively maximize the improvement L(\theta)-L(\theta^{(n)}) over the current estimate \theta^{(n)}:

L(\theta) - L(\theta^{(n)}) = L_{c}(\theta) - \log P(H|X,\theta) - L_{c}(\theta^{(n)}) + \log P(H|X,\theta^{(n)})
= L_{c}(\theta) - L_{c}(\theta^{(n)}) + \log \frac{P(H|X,\theta^{(n)})}{P(H|X,\theta)}

Now, taking the expectation of both sides with respect to P(H|X,\theta^{(n)}) (the left-hand side does not depend on H), we have:

L(\theta) - L(\theta^{(n)}) = \sum_{H} L_{c}(\theta)P(H|X,\theta^{(n)}) - \sum_{H} L_{c}(\theta^{(n)})P(H|X,\theta^{(n)}) + \sum_{H} P(H|X,\theta^{(n)})\log \frac{P(H|X,\theta^{(n)})}{P(H|X,\theta)}

The last term is always non-negative since it can be recognized as the KL divergence between P(H|X,\theta^{(n)}) and P(H|X,\theta). Therefore, we obtain a lower bound of the likelihood:

L(\theta) \geq \sum_{H} L_{c}(\theta)P(H|X,\theta^{(n)}) + L(\theta^{(n)}) - \sum_{H} L_{c}(\theta^{(n)})P(H|X,\theta^{(n)})

The last two terms can be treated as constants as they do not contain the variable \theta, so the lower bound is essentially the first term, which is sometimes called the “Q-function”:
Q(\theta;\theta^{(n)}) = E(L_{c}(\theta)) = \sum_{H} L_{c}(\theta) P (H|X,\theta^{(n)})
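
One step the derivation leaves implicit is why maximizing the Q-function cannot decrease the likelihood. A short sketch, assuming \theta^{(n+1)} denotes the maximizer of Q(\theta;\theta^{(n)}) (notation introduced here for illustration), so that Q(\theta^{(n+1)};\theta^{(n)}) \geq Q(\theta^{(n)};\theta^{(n)}), and plugging \theta^{(n+1)} into the lower bound on L(\theta):

```latex
L(\theta^{(n+1)}) \geq Q(\theta^{(n+1)};\theta^{(n)}) + L(\theta^{(n)}) - Q(\theta^{(n)};\theta^{(n)})
\geq Q(\theta^{(n)};\theta^{(n)}) + L(\theta^{(n)}) - Q(\theta^{(n)};\theta^{(n)})
= L(\theta^{(n)})
```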

EM of Formulation 1

In the case of Formulation 1, let us introduce hidden variables R(z,w,d) to indicate which hidden topic z is selected to generate w in d (\sum_{z} R(z,w,d) = 1). Therefore, the complete likelihood can be formulated as:

L_{c1} = \sum_{d} \sum_{w} n(d,w) \sum_{z} R(z,w,d) \log [ P(d) P(w|z)P(z|d) ]
= \sum_{d} \sum_{w} n(d,w) \sum_{z} R(z,w,d) [ \log P(d) + \log P(w|z) + \log P(z|d) ]

Taking the expectation over the hidden variables, with E[R(z,w,d)] = P(z|w,d), we can write our Q-function E[L_{c1}]:

 E[L_{c1}] = \sum_{d} \sum_{w} n(d,w) \sum_{z} P(z|w,d) [ \log P(d) + \log P(w|z) + \log P(z|d) ]

For the E-step, simply using Bayes' rule, we can obtain:

P(z|w,d) = \frac{P(z,w,d)}{P(w,d)}
= \frac{P(w|z)P(z|d)P(d)}{\sum_{z} P(w|z)P(z|d)P(d)}
= \frac{P(w|z)P(z|d)}{\sum_{z} P(w|z)P(z|d)}
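
The E-step above can be sketched as a small function. The name `e_step`, the nested-list layout and the toy parameters are assumptions made for illustration only.

```python
def e_step(p_w_given_z, p_z_given_d, d, w):
    """Return [P(z|w,d) for each z] from P(w|z) and P(z|d).

    p_w_given_z[z][w] is P(w|z) and p_z_given_d[d][z] is P(z|d).
    P(d) cancels between numerator and denominator, so it is not needed.
    """
    n_topics = len(p_w_given_z)
    weights = [p_w_given_z[z][w] * p_z_given_d[d][z] for z in range(n_topics)]
    total = sum(weights)
    return [wt / total for wt in weights]

# Hypothetical toy parameters: 2 topics, 2 words, 1 document.
p_w_given_z = [[0.9, 0.1],   # topic 0 prefers word 0
               [0.2, 0.8]]   # topic 1 prefers word 1
p_z_given_d = [[0.5, 0.5]]   # document 0 mixes both topics equally

posterior = e_step(p_w_given_z, p_z_given_d, d=0, w=0)
```

Here `posterior` puts most of its mass on topic 0, since word 0 is far more likely under that topic.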

For the M-step, we need to maximize the Q-function subject to the normalization constraints, which we add with Lagrange multipliers \alpha, \beta_{z} (one per topic) and \gamma_{d} (one per document). The posteriors P(z|w,d) are held fixed at their E-step values:

H = E[L_{c1}]+ \alpha [1-\sum_{d} P(d) ]+ \sum_{z} \beta_{z} [1- \sum_{w} P(w|z)]
+ \sum_{d} \gamma_{d} [1- \sum_{z} P(z|d)]

and take all derivatives:

\frac{\partial H}{\partial P(d)} = \sum_{w} \sum_{z} n(d,w) \frac{P(z|w,d)}{P(d)} - \alpha = 0
\rightarrow \sum_{w} \sum_{z} n(d,w) P(z|w,d) - \alpha P(d) = 0
\frac{\partial H}{\partial P(w|z)} = \sum_{d} n(d,w) \frac{P(z|w,d)}{P(w|z)} - \beta_{z} = 0
\rightarrow \sum_{d} n(d,w) P(z|w,d) - \beta_{z} P(w|z) = 0
\frac{\partial H}{\partial P(z|d)} = \sum_{w} n(d,w) \frac{P(z|w,d)}{P(z|d)} - \gamma_{d} = 0
\rightarrow \sum_{w} n(d,w) P(z|w,d) - \gamma_{d} P(z|d) = 0

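Therefore, summing each equation over its normalized variable to eliminate the multiplier, using \sum_{z} P(z|w,d) = 1 and writing n(d) = \sum_{w} n(d,w), we can easily obtain: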

P(d) = \frac{\sum_{w} \sum_{z} n(d,w) P(z|w,d)}{\sum_{d} \sum_{w} \sum_{z} n(d,w) P(z|w,d)}
= \frac{n(d)}{\sum_{d} n(d)}
P(w|z) = \frac{\sum_{d} n(d,w) P(z|w,d)}{\sum_{w} \sum_{d} n(d,w) P(z|w,d) }
P(z|d) = \frac{\sum_{w} n(d,w) P(z|w,d)}{\sum_{z} \sum_{w} n(d,w) P(z|w,d) }
= \frac{\sum_{w} n(d,w) P(z|w,d)}{n(d)}
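
Putting the E-step and M-step updates together gives a complete EM loop for Formulation 1. The following is a minimal pure-Python sketch, not a tuned implementation: the function name `plsa_em_1`, the random initialization, the fixed iteration count and the toy count matrix are all assumptions made here, and every document is assumed to contain at least one word.

```python
import math
import random

def normalize(values):
    """Scale non-negative numbers so that they sum to 1."""
    total = sum(values)
    return [v / total for v in values]

def plsa_em_1(counts, n_topics, n_iters=50, seed=0):
    """EM for formulation 1; counts[d][w] is n(d,w).

    Returns P(d), P(w|z) as [z][w], P(z|d) as [d][z], and the
    log-likelihood L_1 evaluated at the start of each iteration.
    """
    rng = random.Random(seed)
    n_docs, n_words = len(counts), len(counts[0])
    # P(d) = n(d) / sum_d n(d) has a closed form and never changes.
    p_d = normalize([sum(row) for row in counts])
    p_w_z = [normalize([rng.random() for _ in range(n_words)])
             for _ in range(n_topics)]
    p_z_d = [normalize([rng.random() for _ in range(n_topics)])
             for _ in range(n_docs)]
    history = []
    for _ in range(n_iters):
        acc_w_z = [[0.0] * n_words for _ in range(n_topics)]
        acc_z_d = [[0.0] * n_topics for _ in range(n_docs)]
        log_lik = 0.0
        for d in range(n_docs):
            for w in range(n_words):
                n_dw = counts[d][w]
                if n_dw == 0:
                    continue
                # E-step: P(z|w,d) is proportional to P(w|z) P(z|d).
                weights = [p_w_z[z][w] * p_z_d[d][z] for z in range(n_topics)]
                mix = sum(weights)
                log_lik += n_dw * math.log(p_d[d] * mix)
                for z in range(n_topics):
                    expected = n_dw * weights[z] / mix  # n(d,w) P(z|w,d)
                    acc_w_z[z][w] += expected
                    acc_z_d[d][z] += expected
        # M-step: normalize the expected counts.
        p_w_z = [normalize(row) for row in acc_w_z]
        p_z_d = [normalize(row) for row in acc_z_d]
        history.append(log_lik)
    return p_d, p_w_z, p_z_d, history

# Toy corpus: 4 documents over 4 words, with two rough word clusters.
counts = [[4, 3, 0, 0], [5, 2, 1, 0], [0, 0, 3, 4], [0, 1, 2, 5]]
p_d, p_w_z, p_z_d, history = plsa_em_1(counts, n_topics=2)
```

The values in `history` should never decrease from one iteration to the next, which is a convenient sanity check when experimenting with the updates.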

EM of Formulation 2

Using a similar method, we introduce hidden variables to indicate which z is selected to generate w and d, and we have the following complete likelihood:

L_{c2} = \sum_{d} \sum_{w} n(d,w) \sum_{z} R(z,w,d) \log [ P(z) P(w|z)P(d|z) ]
= \sum_{d} \sum_{w} n(d,w) \sum_{z} R(z,w,d) [ \log P(z) + \log P(w|z) + \log P(d|z) ]

Therefore, the Q-function E[L_{c2}] would be :

E[L_{c2}] = \sum_{d} \sum_{w} n(d,w) \sum_{z} P(z|w,d) [ \log P(z) + \log P(w|z) + \log P(d|z) ]

For the E-step, again simply using Bayes' rule, we can obtain:

P(z|w,d) = \frac{P(z,w,d)}{P(w,d)}
= \frac{P(w|z)P(d|z)P(z)}{\sum_{z} P(w|z)P(d|z)P(z)}

For the M-step, we maximize the constrained version of the Q-function, with Lagrange multipliers \alpha, \beta_{z} and \gamma_{z} (one of each per topic) for the normalization constraints:

H = E[L_{c2}] + \alpha [1-\sum_{z} P(z) ] + \sum_{z} \beta_{z} [1- \sum_{w} P(w|z)]
+ \sum_{z} \gamma_{z} [1- \sum_{d} P(d|z)]

and take all derivatives:

\frac{\partial H}{\partial P(z)}= \sum_{d} \sum_{w} n(d,w) \frac{P(z|w,d)}{P(z)} - \alpha = 0
\rightarrow \sum_{d} \sum_{w} n(d,w) P(z|w,d) - \alpha P(z)= 0

\frac{\partial H}{\partial P(w|z)} = \sum_{d} n(d,w) \frac{P(z|w,d)}{P(w|z)} - \beta_{z} = 0
\rightarrow \sum_{d} n(d,w) P(z|w,d) - \beta_{z} P(w|z) = 0

\frac{\partial H}{\partial P(d|z)} = \sum_{w} n(d,w) \frac{P(z|w,d)}{P(d|z)} - \gamma_{z} = 0
\rightarrow \sum_{w} n(d,w) P(z|w,d) - \gamma_{z} P(d|z) = 0

Therefore, we can easily obtain:

P(z) = \frac{\sum_{d} \sum_{w} n(d,w) P(z|w,d)}{\sum_{d} \sum_{w} \sum_{z} n(d,w) P(z|w,d)}
= \frac{\sum_{d} \sum_{w} n(d,w) P(z|w,d)}{\sum_{d} \sum_{w} n(d,w)}
P(w|z)= \frac{\sum_{d} n(d,w) P(z|w,d)}{\sum_{w} \sum_{d} n(d,w) P(z|w,d) }
P(d|z) = \frac{\sum_{w} n(d,w) P(z|w,d)}{\sum_{d} \sum_{w} n(d,w) P(z|w,d) }
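
The updates for Formulation 2 lead to an analogous EM loop. Again this is only a sketch under assumptions: the function name `plsa_em_2`, the random initialization, the iteration count and the toy count matrix are chosen here for illustration.

```python
import math
import random

def normalize(values):
    """Scale non-negative numbers so that they sum to 1."""
    total = sum(values)
    return [v / total for v in values]

def plsa_em_2(counts, n_topics, n_iters=50, seed=0):
    """EM for formulation 2; counts[d][w] is n(d,w).

    Returns P(z), P(w|z) as [z][w], P(d|z) as [z][d], and the
    log-likelihood L_2 evaluated at the start of each iteration.
    """
    rng = random.Random(seed)
    n_docs, n_words = len(counts), len(counts[0])
    p_z = normalize([rng.random() for _ in range(n_topics)])
    p_w_z = [normalize([rng.random() for _ in range(n_words)])
             for _ in range(n_topics)]
    p_d_z = [normalize([rng.random() for _ in range(n_docs)])
             for _ in range(n_topics)]
    history = []
    for _ in range(n_iters):
        acc_z = [0.0] * n_topics
        acc_w_z = [[0.0] * n_words for _ in range(n_topics)]
        acc_d_z = [[0.0] * n_docs for _ in range(n_topics)]
        log_lik = 0.0
        for d in range(n_docs):
            for w in range(n_words):
                n_dw = counts[d][w]
                if n_dw == 0:
                    continue
                # E-step: P(z|w,d) is proportional to P(w|z) P(d|z) P(z).
                weights = [p_w_z[z][w] * p_d_z[z][d] * p_z[z]
                           for z in range(n_topics)]
                joint = sum(weights)  # P(d,w)
                log_lik += n_dw * math.log(joint)
                for z in range(n_topics):
                    expected = n_dw * weights[z] / joint  # n(d,w) P(z|w,d)
                    acc_z[z] += expected
                    acc_w_z[z][w] += expected
                    acc_d_z[z][d] += expected
        # M-step: normalize the expected counts.
        p_z = normalize(acc_z)
        p_w_z = [normalize(row) for row in acc_w_z]
        p_d_z = [normalize(row) for row in acc_d_z]
        history.append(log_lik)
    return p_z, p_w_z, p_d_z, history

# Toy corpus: 4 documents over 4 words, with two rough word clusters.
counts = [[4, 3, 0, 0], [5, 2, 1, 0], [0, 0, 3, 4], [0, 1, 2, 5]]
p_z, p_w_z, p_d_z, history = plsa_em_2(counts, n_topics=2)

# The fitted joint P(d,w) should sum to 1 over all (d, w) pairs.
total_mass = sum(p_z[z] * p_w_z[z][w] * p_d_z[z][d]
                 for z in range(2) for d in range(4) for w in range(4))
```

As with Formulation 1, the recorded log-likelihood should be non-decreasing across iterations.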


