# Notes on Language Models (1)
[toc]
## Query Likelihood Language Models
The basic idea behind Query Likelihood Language Models (QLLM) is that a query is a sample drawn from a language model. In other words, we want to compute the likelihood that, given a document language model $\theta_d$, the posed query $q$ would be generated. Formally, this can be expressed as $P(q \mid \theta_d)$. Two questions immediately arise from this formulation: (1) How do we choose a model to represent $P(q \mid \theta_d)$? (2) How do we estimate $\theta_d$?
## Multinomial Language Model
One popular choice for $P(q \mid \theta_d)$ is the multinomial distribution. The original multinomial distribution is

$$P(X_1 = c_1, \dots, X_k = c_k) = \frac{n!}{c_1! \cdots c_k!} \prod_{i=1}^{k} p_i^{c_i}, \qquad n = \sum_{i=1}^{k} c_i.$$

In language modeling, we usually ignore the coefficient $\frac{n!}{c_1! \cdots c_k!}$ and therefore obtain the unigram language model, in which the order of the text sequence does not matter. Using the multinomial distribution to model $P(q \mid \theta_d)$, we obtain

$$P(q \mid \theta_d) = \prod_{w \in V} P(w \mid \theta_d)^{c(w, q)},$$

where $c(w, q)$ is the number of times that term $w$ appears in query $q$. Now the problem becomes estimating $P(w \mid \theta_d)$. In theory, we need to estimate it for every term in our vocabulary $V$. However, since $c(w, q)$ could be $0$ (meaning that term $w$ does not show up in the query), we only care about the terms in the query.
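The unigram query likelihood can be sketched in a few lines of Python. This is a minimal illustration, not a full retrieval system; the term probabilities below are made-up numbers, and we work in log space for the reason discussed later (products of small probabilities underflow):

```python
import math

def query_log_likelihood(query_terms, term_probs):
    """Compute log of prod_w P(w | theta_d)^c(w, q) over the query terms.

    term_probs: dict mapping term -> P(w | theta_d). Assumed to assign
    nonzero probability to every query term (e.g. via smoothing),
    otherwise the log is undefined.
    """
    log_p = 0.0
    for w in query_terms:
        # Summing once per occurrence is equivalent to c(w, q) * log P(w | theta_d)
        log_p += math.log(term_probs[w])
    return log_p

# Hypothetical document model with made-up probabilities summing to 1
theta_d = {"language": 0.3, "model": 0.3, "query": 0.2, "the": 0.2}
score = query_log_likelihood(["query", "language", "model"], theta_d)
```

Documents would then be ranked by this score: the higher the log-likelihood, the better the document model explains the query.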
The key point here is that we do not know $\theta_d$! How can we calculate $P(w \mid \theta_d)$ for the terms in the query when we do not actually have a model in hand? One way is to use the document $d$ itself as a sample to estimate $\theta_d$. Therefore, essentially, we choose the model that best explains the document, i.e., $\hat{\theta}_d = \arg\max_{\theta} P(d \mid \theta)$.
One simple way to estimate $\theta_d$ is the Maximum Likelihood Estimator (MLE). Let us now derive the MLE for the multinomial language model. First, we usually work with the log-likelihood rather than the product of probabilities, just to avoid very small numbers:

$$\log P(d \mid \theta) = \sum_{w \in V} c(w, d) \log P(w \mid \theta).$$

Then, writing $\theta_w = P(w \mid \theta)$ and using a Lagrange multiplier to enforce the constraint $\sum_{w \in V} \theta_w = 1$, we obtain:

$$L(\theta, \lambda) = \sum_{w \in V} c(w, d) \log \theta_w + \lambda \left( \sum_{w \in V} \theta_w - 1 \right).$$

Take the derivatives with respect to $\theta_w$ and $\lambda$:

$$\frac{\partial L}{\partial \theta_w} = \frac{c(w, d)}{\theta_w} + \lambda = 0, \qquad \frac{\partial L}{\partial \lambda} = \sum_{w \in V} \theta_w - 1 = 0.$$

From the first equation, we get $\theta_w = -\frac{c(w, d)}{\lambda}$, and substituting into the constraint gives $-\frac{1}{\lambda} \sum_{w \in V} c(w, d) = 1$, so $\lambda = -\sum_{w \in V} c(w, d) = -|d|$. Therefore,

$$\hat{P}(w \mid \theta_d) = \frac{c(w, d)}{|d|}.$$
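The resulting estimate, $\hat{P}(w \mid \theta_d) = c(w, d) / |d|$, is just the relative frequency of each term in the document. A minimal sketch (the toy document below is made up for illustration):

```python
from collections import Counter

def mle_language_model(doc_terms):
    """Maximum likelihood estimate: P(w | theta_d) = c(w, d) / |d|."""
    counts = Counter(doc_terms)   # c(w, d) for each term w
    total = len(doc_terms)        # |d|, the document length in tokens
    return {w: c / total for w, c in counts.items()}

# Hypothetical toy document
doc = ["the", "language", "model", "the", "model"]
theta_d = mle_language_model(doc)
```

Note that under this estimate, any query term that does not appear in the document gets probability $0$, which zeroes out the entire query likelihood; this is why smoothing is typically applied in practice.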