Notes on Language Models (1)


Query Likelihood Language Models

The basic idea behind Query Likelihood Language Models (QLLM) is that a query is a sample drawn from a language model. In other words, we want to compute the likelihood that, given a document language model \theta_{D}, the posed query Q would be generated. Formally, this is expressed as P(Q|\theta_{D}). Two questions immediately arise from this formulation: (1) How do we choose the model that represents \theta_{D}? (2) How do we estimate \theta_{D}?

Multinomial Language Model

One popular choice for \theta_{D} is the multinomial distribution. The standard multinomial distribution is

P(X_{1}=x_{1},X_{2}=x_{2},\ldots,X_{k}=x_{k})=\frac{n!}{x_{1}!\,x_{2}!\cdots x_{k}!}\,p_{1}^{x_{1}}p_{2}^{x_{2}}\cdots p_{k}^{x_{k}}, \quad n=\sum_{i=1}^{k}x_{i}
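
As a quick sanity check, here is a minimal Python sketch of this probability for a toy example (the function name and the numbers are made up for illustration, not part of the original notes):

from math import factorial

def multinomial_pmf(counts, probs):
    # P(X_1=x_1, ..., X_k=x_k) for a multinomial with category probabilities probs
    n = sum(counts)                      # total number of trials, n = sum_i x_i
    coef = factorial(n)
    for x in counts:
        coef //= factorial(x)            # n! / (x_1! x_2! ... x_k!)
    prob = 1.0
    for x, p in zip(counts, probs):
        prob *= p ** x                   # p_1^{x_1} ... p_k^{x_k}
    return coef * prob

print(multinomial_pmf([2, 1, 1], [0.5, 0.3, 0.2]))   # 12 * 0.015 = 0.18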

In language modeling, we usually ignore the multinomial coefficient, which gives a unigram language model in which the order of the text sequence does not matter. Using the multinomial distribution to model \theta_{D}, we obtain

P(Q|\theta_{D}) = \prod_{w \in Q} P(w | \theta_{D}) =\prod_{w \in V} P(w | \theta_{D})^{c(w,Q)}

where c(w,Q) is the number of times term w appears in query Q. Now, the problem becomes estimating P(w|\theta_{D}). In theory, we need to estimate it for every term in the vocabulary. However, since c(w,Q) is 0 for any term w that does not appear in the query, we only need to care about the terms in the query.
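
To make this concrete, below is a small Python sketch (the variable names and the toy model are my own, not from the notes) that computes \log P(Q|\theta_{D}) by summing c(w,Q)\log P(w|\theta_{D}) over the terms that occur in the query; it assumes every query term has non-zero probability under \theta_{D}:

from collections import Counter
from math import log

def query_log_likelihood(query_terms, term_probs):
    # log P(Q|theta_D) = sum_w c(w,Q) * log P(w|theta_D)
    counts = Counter(query_terms)        # c(w, Q)
    return sum(c * log(term_probs[w]) for w, c in counts.items())

theta_D = {"language": 0.05, "model": 0.04, "query": 0.01}   # hypothetical P(w|theta_D)
print(query_log_likelihood(["query", "language", "model"], theta_D))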

The key point here is that we do not know \theta_{D}! How can we calculate P(w|\theta_{D}) for the terms in the query when we do not actually have a model in hand? One way is to use document D itself as a sample to estimate \theta_{D}. Essentially, we choose the model \theta_{D} such that \hat{\theta}_{D}= \arg\max_{\theta} P(D|\theta).

One simple way to obtain such an estimate is the Maximum Likelihood (ML) estimator. Let us now derive the ML estimator for the multinomial language model. First, we usually work with the log-likelihood rather than the product of probabilities, simply to avoid very small numbers: \log P(D|\theta) = \sum_{w \in V} c(w,D) \log P(w|\theta). Then, using a Lagrange multiplier to enforce the constraint \sum_{w \in V} P(w|\theta) = 1, we obtain:

L =\log P(D|\theta) + \lambda (1-\sum_{w \in V} P(w|\theta))

Taking derivatives with respect to P(w|\theta) and \lambda:

\frac{\partial L}{\partial P(w|\theta)} = \frac{c(w,D)}{P(w|\theta)}-\lambda = 0
\frac{\partial L}{\partial \lambda } = 1 - \sum_{w \in V} P(w|\theta) = 0

From the first equation, we get P(w|\theta) = \frac{c(w,D)}{\lambda}. Summing over all terms and using the second equation, \sum_{w \in V} P(w|\theta) = \frac{1}{\lambda}\sum_{w \in V} c(w,D) = 1, so \lambda = \sum_{w \in V} c(w,D). Therefore,

P(w|\theta) = \frac{c(w,D)}{\sum_{w \in V} c(w,D)} =\frac{c(w,D)}{|D|}
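
The estimate is simply the relative frequency of w in D. As a small sketch (toy document and illustrative names of my own), the following Python snippet computes this ML estimate and checks that the probabilities sum to 1, as the constraint requires:

from collections import Counter

def ml_estimate(doc_terms):
    # Maximum likelihood estimate: P(w|theta_D) = c(w, D) / |D|
    counts = Counter(doc_terms)                  # c(w, D)
    doc_len = sum(counts.values())               # |D|
    return {w: c / doc_len for w, c in counts.items()}

doc = "a language model assigns a probability to a piece of text".split()
theta_D = ml_estimate(doc)
print(theta_D["a"])             # 3/11, since "a" occurs 3 times in an 11-term document
print(sum(theta_D.values()))    # ~1.0 (up to floating point), as the constraint requires

Plugging \hat{\theta}_{D} into P(Q|\theta_{D}) above gives the basic query likelihood score; note that under the pure ML estimate, any query term that does not occur in D receives probability zero.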