Notes on Language Models (1)


[toc]

Query Likelihood Language Models

The basic idea behind Query Likelihood Language Models (QLLM) is that a query is a sample drawn from a language model. In other words, given a document language model \theta_{D}, we want to compute how likely the posed query Q would be generated from it. Formally, this can be expressed as P(Q|\theta_{D}). Two questions immediately arise from this formulation: (1) How do we choose the model to represent \theta_{D}? (2) How do we estimate \theta_{D}?

Multinomial Language Model

One popular choice for \theta_{D} is the multinomial distribution. The standard multinomial distribution is

P(X_{1}=x_{1},X_{2}=x_{2},\ldots,X_{n}=x_{n})=\frac{n!}{x_{1}!\,x_{2}!\cdots x_{n}!}\,p_{1}^{x_{1}}p_{2}^{x_{2}}\cdots p_{n}^{x_{n}}
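
As a quick sanity check, here is a minimal Python sketch that evaluates this PMF directly from the definition above; the counts and probabilities are made up purely for illustration.

```python
from math import factorial, prod

def multinomial_pmf(counts, probs):
    """Multinomial PMF: (n! / (x_1! ... x_n!)) * p_1^{x_1} ... p_n^{x_n}."""
    n = sum(counts)
    coefficient = factorial(n) // prod(factorial(x) for x in counts)
    return coefficient * prod(p ** x for x, p in zip(counts, probs))

# Hypothetical example: 3-event distribution, counts (2, 1, 0), probabilities (0.5, 0.3, 0.2)
print(multinomial_pmf([2, 1, 0], [0.5, 0.3, 0.2]))  # 3 * 0.075 = 0.225
```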

In language modeling, we usually ignore the coefficient, which gives us a unigram language model in which the order of the text sequence does not matter. Using the multinomial distribution to model \theta_{D}, we obtain

P(Q|\theta_{D}) = \prod_{w \in Q} P(w | \theta_{D}) =\prod_{w \in V} P(w | \theta_{D})^{c(w,Q)}

where c(w,Q) is the number of times term w appears in query Q. Now, the problem becomes estimating P(w|\theta_{D}). In theory, we need to estimate it for every term in our vocabulary. However, since c(w,Q) can be 0 (meaning that term w does not appear in the query), only the terms that occur in the query contribute to the product.
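
To make this concrete, here is a small sketch that computes the query likelihood in log space. The document model \theta_{D} here is a toy, hand-specified dictionary of term probabilities, since we have not yet discussed how to estimate it.

```python
import math
from collections import Counter

def query_log_likelihood(query_terms, term_probs):
    """log P(Q | theta_D) = sum over query terms of c(w, Q) * log P(w | theta_D)."""
    counts = Counter(query_terms)  # c(w, Q)
    # Assumes every query term has a nonzero probability in term_probs.
    return sum(c * math.log(term_probs[w]) for w, c in counts.items())

# Hypothetical document model: probabilities made up for illustration only.
theta_D = {"language": 0.03, "model": 0.05, "retrieval": 0.01}
query = ["language", "model", "model"]
print(query_log_likelihood(query, theta_D))
```

Summing log-probabilities here simply avoids the numerical underflow that multiplying many small probabilities would cause, the same reason the log-likelihood is used in the derivation below.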

The key point here is that we do not know \theta_{D}! How can we calculate P(w|\theta_{D}) for the terms in the query when we do not actually have a model in hand? One way is to use document D as a sample to estimate \theta_{D}. In other words, we choose the model \theta_{D} such that \hat{\theta}_{D}= \arg\max_{\theta} P(D|\theta).

One simple way to obtain this estimate is the Maximum Likelihood Estimator (MLE). Let us now derive the MLE for the multinomial language model. First, we usually work with the log-likelihood rather than the product of probabilities, simply to avoid very small numbers: \log P(D|\theta) = \sum_{w \in V} c(w,D) \log P(w|\theta). Then, using a Lagrange multiplier to enforce the constraint \sum_{w \in V} P(w|\theta) = 1, we obtain:

L =\log P(D|\theta) + \lambda (1-\sum_{w \in V} P(w|\theta))

Taking the derivatives with respect to P(w|\theta) and \lambda and setting them to zero:

\frac{\partial L}{\partial P(w|\theta)} = \frac{c(w,D)}{P(w|\theta)}-\lambda = 0
\frac{\partial L}{\partial \lambda } = 1 - \sum_{w \in V} P(w|\theta) = 0

From the first equation, we get P(w|\theta) = \frac{c(w,D)}{\lambda}. Summing this over the vocabulary and applying the second equation gives \sum_{w \in V} c(w,D) = \lambda \sum_{w \in V} P(w|\theta) = \lambda, so \lambda = \sum_{w \in V} c(w,D). Therefore,

P(w|\theta) = \frac{c(w,D)}{\sum_{w \in V} c(w,D)} =\frac{c(w,D)}{|D|}
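
Putting the pieces together, here is a minimal sketch that estimates P(w|\theta_{D}) with this maximum likelihood formula and then scores a query with it. The document and query are toy examples chosen so that every query term occurs in the document.

```python
import math
from collections import Counter

def mle_unigram(document_terms):
    """Estimate P(w | theta_D) = c(w, D) / |D| from the document itself."""
    counts = Counter(document_terms)
    doc_length = sum(counts.values())  # |D|
    return {w: c / doc_length for w, c in counts.items()}

def query_log_likelihood(query_terms, term_probs):
    """log P(Q | theta_D) = sum_w c(w, Q) * log P(w | theta_D)."""
    counts = Counter(query_terms)
    return sum(c * math.log(term_probs[w]) for w, c in counts.items())

# Toy example: every query term happens to occur in the document,
# so no zero probabilities arise here.
document = "a language model is a probability distribution over text".split()
query = "language model probability".split()

theta_D = mle_unigram(document)
print(theta_D["a"])                          # 2/9
print(query_log_likelihood(query, theta_D))  # each query term has probability 1/9
```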
