In this post, we review two important facts about the Dirichlet distribution. The basic setting used here is as follows. Suppose we have a discrete space $latex \mathcal{X} = \{ \mathcal{X}_{1},\mathcal{X}_{2}, \ldots,\mathcal{X}_{K} \}$; these are the $latex K$ outcomes that can be observed from the “random experiments”. Suppose we have $latex N$ observations $latex \{X_{1}, X_{2},\ldots, X_{N}\}$ that are distributed according to a Multinomial distribution with parameter $latex \theta$, where $latex \theta_{i} = P(X_{j} = \mathcal{X}_{i})$. After placing a Dirichlet prior $latex \mbox{Dir}(\alpha)$ on $latex \theta$, we are interested in the following two questions:
- What is the posterior distribution?
- What is the predictive distribution?
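Before answering these questions, here is a minimal Python sketch of the setup, assuming numpy is available; the concrete values of `K`, `N`, and `alpha` below are illustrative, not from the original post:

```python
import numpy as np

rng = np.random.default_rng(0)

K = 4                                    # number of outcomes in the discrete space
N = 100                                  # number of observations
alpha = np.array([1.0, 2.0, 3.0, 4.0])   # Dirichlet hyperparameters

# Draw theta ~ Dir(alpha), then N observations X_j ~ Multinomial(theta),
# where X_j takes values in {0, ..., K-1}, standing in for the K outcomes.
theta = rng.dirichlet(alpha)
X = rng.choice(K, size=N, p=theta)
```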
Posterior Distribution
We can readily derive the posterior distribution as below:

$latex P(\theta \mid X_{1},\ldots,X_{N}) \propto P(X_{1},\ldots,X_{N} \mid \theta) \, P(\theta) \propto \prod_{k=1}^{K} \theta_{k}^{n(k)} \prod_{k=1}^{K} \theta_{k}^{\alpha_{k}-1} = \prod_{k=1}^{K} \theta_{k}^{\alpha_{k}+n(k)-1}$

where $latex n(k)$ is the number of times outcome $latex \mathcal{X}_{k}$ appears in the data. Therefore, the posterior distribution is indeed a Dirichlet distribution $latex \mbox{Dir} \Bigl( \alpha_{1}+n(1),\alpha_{2}+n(2),\ldots,\alpha_{K}+n(K) \Bigr)$. If we only have one observation $latex X$, this posterior distribution is simply $latex \mbox{Dir} \Bigl( \alpha_{1}+\delta_{1}(X),\alpha_{2}+\delta_{2}(X),\ldots,\alpha_{K}+\delta_{K}(X) \Bigr)$ where $latex \delta_{k}(X)$ is $latex 1$ only if $latex X$ takes the value of outcome $latex \mathcal{X}_{k}$, and $latex 0$ otherwise. This notation also extends to the general case: the posterior distribution is $latex \mbox{Dir} \Bigl( \alpha_{1}+\sum_{i=1}^{N} \delta_{1}(X_{i}),\alpha_{2}+\sum_{i=1}^{N} \delta_{2}(X_{i}),\ldots,\alpha_{K}+\sum_{i=1}^{N} \delta_{K}(X_{i}) \Bigr)$.
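As a quick sanity check, the conjugate update is just count-and-add. A minimal sketch, reusing the hypothetical `alpha`, `X`, and `K` from the setup sketch above:

```python
# n(k): number of times outcome k appears among the observations
n = np.bincount(X, minlength=K)

# Conjugate update: posterior is Dir(alpha_1 + n(1), ..., alpha_K + n(K))
alpha_post = alpha + n

# Single-observation case: add 1 only to the component delta_k(X) selects
alpha_post_single = alpha.copy()
alpha_post_single[X[0]] += 1
```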
Predictive Distribution
By using the posterior distribution derived above, we can derive the predictive distribution as follows:

$latex P(X_{N+1} = \mathcal{X}_{i} \mid X_{1},\ldots,X_{N}) = \int P(X_{N+1} = \mathcal{X}_{i} \mid \theta) \, P(\theta \mid X_{1},\ldots,X_{N}) \, d\theta$
$latex = \int \theta_{i} \, P(\theta \mid X_{1},\ldots,X_{N}) \, d\theta$
$latex = \mathbb{E} \bigl[ \theta_{i} \mid X_{1},\ldots,X_{N} \bigr]$
$latex = \frac{\alpha_{i}+n(i)}{\sum_{k=1}^{K} \alpha_{k}+N}$

where the second-to-last line is derived by observing that the posterior distribution is a Dirichlet distribution and the expression is essentially an expectation of $latex \theta_{i}$ under that distribution; the last line is the known mean of that Dirichlet distribution. Note that the final line of the equation can be re-written as:

$latex \frac{\alpha_{i}+n(i)}{\sum_{k=1}^{K} \alpha_{k}+N} = \frac{\sum_{k=1}^{K} \alpha_{k}}{\sum_{k=1}^{K} \alpha_{k}+N} \cdot \frac{\alpha_{i}}{\sum_{k=1}^{K} \alpha_{k}} + \frac{N}{\sum_{k=1}^{K} \alpha_{k}+N} \cdot \frac{n(i)}{N}$
It is a weighted sum of the prior mean of $latex \theta_{i}$, namely $latex \frac{\alpha_{i}}{\sum_{k=1}^{K} \alpha_{k}}$, and the MLE of $latex \theta_{i}$, namely $latex \frac{n(i)}{N}$. One extreme case of the predictive distribution is when there are no data points at all ($latex N = 0$). By using the equation derived above, we have:

$latex P(X_{1} = \mathcal{X}_{i}) = \frac{\alpha_{i}}{\sum_{k=1}^{K} \alpha_{k}}$

which is simply the prior mean of $latex \theta_{i}$.
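Numerically, the forms above agree. A small sketch, again continuing the hypothetical example: the predictive as a ratio of pseudo-counts, as a prior-mean/MLE mixture, and in the no-data case:

```python
# Predictive: P(X_{N+1} = i | data) = (alpha_i + n(i)) / (sum_k alpha_k + N)
predictive = (alpha + n) / (alpha.sum() + N)

# Equivalent weighted sum of the prior mean and the MLE of theta
prior_mean = alpha / alpha.sum()
mle = n / N
w = alpha.sum() / (alpha.sum() + N)
assert np.allclose(predictive, w * prior_mean + (1.0 - w) * mle)

# Extreme case with no data (N = 0, all counts zero): the predictive
# reduces to the prior mean alpha_i / sum_k alpha_k
assert np.allclose((alpha + 0.0) / (alpha.sum() + 0.0), prior_mean)
```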
These two distributions are heavily used in Dirichlet/Multinomial Bayesian modeling and are also milestones for understanding the Dirichlet Process.