Daily Archives: November 3, 2010


Two important distributions related to the Dirichlet distribution

In this post, we review two important facts about the Dirichlet distribution. The basic setting is as follows. Suppose we have a discrete space $latex \mathcal{X} = \{ \mathcal{X}_{1},\mathcal{X}_{2}, \ldots,\mathcal{X}_{K} \}$ whose $latex K$ elements are the outcomes that can be observed in a “random experiment”. Suppose we have $latex N$ observations $latex \mathbf{X} = \{X_{1}, X_{2},\ldots, X_{N}\}$ drawn independently from a Multinomial distribution with parameter $latex \theta$, where $latex \theta_{i} = P(X_{j} = \mathcal{X}_{i})$. After placing a Dirichlet prior $latex \mbox{Dir}(\alpha)$ on $latex \theta$, we are interested in the following two questions:

  • What is the posterior distribution $latex P(\theta | \mathbf{X}, \alpha)$?
  • What is the predictive distribution $latex P(X_{N+1} | \mathbf{X}, \alpha)$?
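
To make the setting concrete, here is a minimal simulation sketch of the generative process described above. It assumes outcomes are encoded as the integers 0 to K-1, and the hyperparameter values, the seed, and the function names `sample_dirichlet` and `sample_observations` are hypothetical choices made only for illustration. The Dirichlet draw uses the standard construction of normalizing independent Gamma variables.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw theta ~ Dir(alpha) by normalizing independent Gamma(alpha_k, 1) draws."""
    gammas = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(gammas)
    return [g / total for g in gammas]


def sample_observations(theta, n, rng):
    """Draw n outcome indices in 0..K-1, each with P(X_j = k) = theta[k]."""
    return rng.choices(range(len(theta)), weights=theta, k=n)


rng = random.Random(0)          # fixed seed, chosen arbitrarily for repeatability
alpha = [2.0, 1.0, 1.0]         # hypothetical hyperparameters, K = 3
theta = sample_dirichlet(alpha, rng)
observations = sample_observations(theta, 10, rng)
```

In practice `theta` is unobserved; only `observations` and `alpha` are available, which is why the two questions above are the interesting ones.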

Posterior Distribution

We can readily derive the posterior distribution using Bayes' rule:

 P(\theta | \mathbf{X}, \alpha) = \frac{P(\mathbf{X}|\theta)P(\theta|\alpha)}{\int_{\theta} P(\mathbf{X}|\theta)P(\theta|\alpha) \, d\theta} \\ = \frac{\Bigl( \prod_{k=1}^{K} \theta_{k}^{n(k)} \Bigr) \Bigl( \frac{\Gamma(\sum_{k} \alpha_{k})}{\prod_{k} \Gamma(\alpha_{k})} \prod_{k=1}^{K} \theta_{k}^{\alpha_{k} - 1} \Bigr) }{\int_{\theta} \Bigl( \prod_{k=1}^{K} \theta_{k}^{n(k)} \Bigr) \Bigl( \frac{\Gamma(\sum_{k} \alpha_{k})}{\prod_{k} \Gamma(\alpha_{k})} \prod_{k=1}^{K} \theta_{k}^{\alpha_{k} - 1} \Bigr) \, d\theta } \\ = \frac{\Gamma \Bigl( \sum_{k} ( \alpha_{k} + n(k) ) \Bigr)}{\prod_{k=1}^{K} \Gamma \Bigl( \alpha_{k} + n(k) \Bigr)} \prod_{k=1}^{K} \theta_{k}^{\alpha_{k} + n(k) -1 }

where $latex n(k)$ is the number of times outcome $latex \mathcal{X}_{k}$ appears in the data. The last step follows because the prior's normalizing constant cancels between numerator and denominator, and the remaining integral in the denominator is the reciprocal of the normalizing constant of a Dirichlet density with parameters $latex \alpha_{k} + n(k)$. Therefore, the posterior distribution is again a Dirichlet distribution, $latex \mbox{Dir} \Bigl( \alpha_{1}+n(1),\alpha_{2}+n(2),\ldots,\alpha_{K}+n(K) \Bigr)$. If we only have one observation $latex X$, this posterior distribution is simply $latex \mbox{Dir} \Bigl( \alpha_{1}+\delta_{1}(X),\alpha_{2}+\delta_{2}(X),\ldots,\alpha_{K}+\delta_{K}(X) \Bigr)$, where $latex \delta_{k}(X)$ is $latex 1$ if $latex X$ takes the value of outcome $latex \mathcal{X}_{k}$ and $latex 0$ otherwise. This notation extends to the general case, where $latex n(k) = \sum_{i=1}^{N} \delta_{k}(X_{i})$ and the posterior distribution is $latex \mbox{Dir} \Bigl( \alpha_{1}+\sum_{i=1}^{N} \delta_{1}(X_{i}),\alpha_{2}+\sum_{i=1}^{N} \delta_{2}(X_{i}),\ldots,\alpha_{K}+\sum_{i=1}^{N} \delta_{K}(X_{i}) \Bigr)$.
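
The posterior update amounts to adding the observed counts to the prior parameters. A minimal sketch, assuming outcomes are encoded as the integers 0 to K-1 (the function name `posterior_parameters` and the example data are hypothetical, introduced only for illustration):

```python
from collections import Counter


def posterior_parameters(alpha, observations):
    """Return the parameters alpha_k + n(k) of the Dirichlet posterior.

    alpha        -- list of K prior parameters
    observations -- iterable of outcome indices in 0..K-1
    """
    counts = Counter(observations)          # n(k) for every observed outcome
    return [a + counts[k] for k, a in enumerate(alpha)]


alpha = [2.0, 1.0, 1.0]                     # hypothetical prior, K = 3
observations = [0, 2, 0, 1, 0]              # counts n = (3, 1, 1)
posterior = posterior_parameters(alpha, observations)
print(posterior)                            # [5.0, 2.0, 2.0]
```

Because the update only depends on the counts, processing the observations one at a time or all at once gives the same posterior parameters.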

Predictive Distribution

Using the posterior distribution derived above, we obtain the predictive distribution as follows:

 P(X_{N+1} = \mathcal{X}_{i} | \mathbf{X}, \alpha) = \int_{\theta} P(X_{N+1} = \mathcal{X}_{i} | \theta) P(\theta | \mathbf{X}, \alpha) \, d\theta \\ =\int_{\theta} \theta_{i} \Biggl[ \frac{\Gamma \Bigl( \sum_{k} ( \alpha_{k} + n(k) ) \Bigr)}{\prod_{k=1}^{K} \Gamma \Bigl( \alpha_{k} + n(k) \Bigr)} \prod_{k=1}^{K} \theta_{k}^{\alpha_{k} + n(k) -1 } \Biggr] \, d\theta \\ = \mathbb{E}[\theta_{i} | \mathbf{X}, \alpha] \\ = \frac{\alpha_{i} + n(i)}{\sum_{k} \Bigl( \alpha_{k} + n(k) \Bigr)} = \frac{\alpha_{i} + \sum_{j=1}^{N} \delta_{i}(X_{j})}{\sum_{k} \alpha_{k} + N }

where the second-to-last line follows from observing that the posterior distribution is a Dirichlet distribution, so the integral is simply the expectation of $latex \theta_{i}$ under that distribution. Note that the final expression can be rewritten as:

 \frac{\alpha_{i} + \sum_{j=1}^{N} \delta_{i}(X_{j})}{\sum_{k} \alpha_{k} + N } = \Bigl( \frac{\alpha_{i}}{\sum_{k} \alpha_{k}} \Bigr) \Bigl( \frac{\sum_{k} \alpha_{k}}{\sum_{k} \alpha_{k} + N} \Bigr) + \Bigl( \frac{1}{N} \sum_{j=1}^{N} \delta_{i}(X_{j}) \Bigr) \Bigl( \frac{N}{\sum_{k}\alpha_{k} + N }\Bigr)

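This is a weighted sum of the prior mean of $latex \theta_{i}$ and the maximum likelihood estimate (MLE) of $latex \theta_{i}$, with weights that sum to one; as $latex N$ grows, the weight shifts from the prior mean to the MLE. One extreme case of the predictive distribution is when no data points have been observed yet. Setting $latex N = 0$ in the equation above, we have: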

 P(X_{N+1} = \mathcal{X}_{i} | \alpha) = \frac{\alpha_{i}}{\sum_{k} \alpha_{k}}
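
The predictive formulas above can be checked numerically. The sketch below computes the predictive distribution both directly and as the weighted sum of the prior mean and the MLE, and then evaluates the no-data case. The function names and example values are hypothetical, and when there are no observations the MLE is undefined, so the sketch sets it to zero (its weight is zero in that case anyway).

```python
def predictive(alpha, counts):
    """P(X_{N+1} = outcome i | data) = (alpha_i + n(i)) / (sum_k alpha_k + N)."""
    total = sum(alpha) + sum(counts)
    return [(a + n) / total for a, n in zip(alpha, counts)]


def predictive_as_mixture(alpha, counts):
    """Same quantity, written as w * prior_mean + (1 - w) * MLE."""
    alpha_sum = sum(alpha)
    n_obs = sum(counts)
    w_prior = alpha_sum / (alpha_sum + n_obs)    # weight on the prior mean
    result = []
    for a, n in zip(alpha, counts):
        prior_mean = a / alpha_sum
        # Assumption: with no data the MLE is undefined; use 0.0 since its weight is 0.
        mle = n / n_obs if n_obs > 0 else 0.0
        result.append(w_prior * prior_mean + (1.0 - w_prior) * mle)
    return result


alpha = [2.0, 1.0, 1.0]                          # hypothetical prior, K = 3
counts = [3, 1, 1]                               # n(k), so N = 5

direct = predictive(alpha, counts)               # (5/9, 2/9, 2/9)
mixture = predictive_as_mixture(alpha, counts)   # agrees up to rounding
no_data = predictive(alpha, [0, 0, 0])           # prior mean
print(no_data)                                   # [0.5, 0.25, 0.25]
```

Here the prior contributes as if it were `sum(alpha)` = 4 pseudo-observations, so with N = 5 real observations the prior mean receives weight 4/9 and the MLE receives weight 5/9.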

These two distributions are heavily used in Dirichlet/Multinomial Bayesian modeling and are also stepping stones toward understanding the Dirichlet process.