Maximum Likelihood as Minimizing KL-Divergence
While studying Product of Experts and Restricted Boltzmann Machines recently, I came across an interesting technical point worth demonstrating: maximum likelihood estimation can be viewed as minimizing a KL divergence between two distributions.
We start with some simple notation. Let \( P(x_{i} \, | \, \boldsymbol{\Theta})\) be the distribution that generates each data point \( x_{i} \). We define the model distribution as:
\begin{equation}
P_{\boldsymbol{\Theta}}(x) = P(x \, | \, \boldsymbol{\Theta})
\end{equation}We define the following distribution as the “empirical data distribution”:
\begin{equation}
P_{D}(x) = \frac{1}{N}\sum_{i=1}^{N}\delta(x - x_{i})\label{eq:data_distribution}
\end{equation}where \(\delta\) is the Dirac delta function. We can verify that this is a valid distribution by integrating over \( x\): \(\int P_{D}(x) \, dx = \frac{1}{N}\sum_{i=1}^{N}\int \delta(x - x_{i}) \, dx = 1\). Using this empirical data distribution, we wish to calculate the following KL divergence:
\begin{align}
\mathrm{KL}[ P_{D}(x) \, || \, P_{\boldsymbol{\Theta}}(x)] &= \int P_{D}(x) \log \frac{P_{D}(x)}{P_{\boldsymbol{\Theta}}(x)} \, dx \\
&= \int P_{D}(x) \log P_{D}(x) \, dx - \int P_{D}(x) \log P_{\boldsymbol{\Theta}}(x) \, dx \\
&= -H[P_{D}(x)] - \int P_{D}(x) \log P_{\boldsymbol{\Theta}}(x) \, dx \\
&= -H[P_{D}(x)] - \Bigl\langle \log P_{\boldsymbol{\Theta}}(x) \Bigr\rangle_{P_{D}(x)}
\end{align}where \(\bigl\langle \cdot \bigr\rangle_{P_{D}(x)}\) denotes an expectation over \(P_{D}(x)\). The entropy term \(-H[P_{D}(x)]\) does not depend on the parameters \(\boldsymbol{\Theta}\), so we can drop it: minimizing the KL divergence on the left hand side is equivalent to minimizing \(-\bigl\langle \log P_{\boldsymbol{\Theta}}(x) \bigr\rangle_{P_{D}(x)}\), i.e., maximizing the expectation itself. This expectation is:
\begin{align}
\Bigl\langle \log P_{\boldsymbol{\Theta}}(x) \Bigr\rangle_{P_{D}(x)} &= \int P_{D}(x) \log P(x \, | \, \boldsymbol{\Theta}) \, dx \\
&= \int \Bigl[ \frac{1}{N}\sum_{i=1}^{N}\delta(x - x_{i}) \Bigr] \log P(x \, | \, \boldsymbol{\Theta}) \, dx \\
&= \frac{1}{N} \sum_{i=1}^{N} \log P(x_{i} \, | \, \boldsymbol{\Theta})
\end{align}This is exactly the average log-likelihood of the dataset (the log-likelihood scaled by the constant factor \(1/N\)). Maximizing it over \(\boldsymbol{\Theta}\) is therefore equivalent to minimizing \(\mathrm{KL}[ P_{D}(x) \, || \, P_{\boldsymbol{\Theta}}(x)]\): maximum likelihood estimation minimizes the KL divergence from the empirical data distribution to the model distribution.
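As a quick sanity check, here is a minimal numerical sketch of this equivalence for a categorical model (the script and its helper names such as `avg_log_likelihood` are my own illustration, not part of the original derivation). For a discrete sample the Dirac deltas in the empirical distribution reduce to point masses, so both \(\mathrm{KL}[P_{D} \, || \, P_{\boldsymbol{\Theta}}]\) and the average log-likelihood can be computed exactly; they should differ only by the constant \(-H[P_{D}]\), and the empirical frequencies should optimize both.

```python
import numpy as np

# Illustrative sketch: for a categorical model over K symbols,
# KL[P_D || P_theta] should equal -H[P_D] minus the average log-likelihood,
# for every candidate theta.
rng = np.random.default_rng(0)
K, N = 4, 1000
data = rng.choice(K, size=N, p=[0.1, 0.2, 0.3, 0.4])

# Empirical data distribution P_D: relative frequency of each symbol.
P_D = np.bincount(data, minlength=K) / N

def avg_log_likelihood(theta):
    """(1/N) * sum_i log P(x_i | theta)."""
    return np.mean(np.log(theta[data]))

def kl_empirical_to_model(theta):
    """KL[P_D || P_theta], with the convention 0 * log 0 = 0."""
    mask = P_D > 0
    return np.sum(P_D[mask] * (np.log(P_D[mask]) - np.log(theta[mask])))

neg_entropy = np.sum(P_D[P_D > 0] * np.log(P_D[P_D > 0]))  # -H[P_D]

for theta in [np.full(K, 0.25),                    # uniform model
              np.array([0.05, 0.15, 0.30, 0.50]),  # some other guess
              P_D]:                                 # maximum likelihood estimate
    kl = kl_empirical_to_model(theta)
    ll = avg_log_likelihood(theta)
    # kl + ll equals -H[P_D] (a constant) for every theta, so kl is
    # smallest exactly where ll is largest, namely at theta = P_D.
    print(f"theta={theta}, KL={kl:.4f}, avg log-lik={ll:.4f}, "
          f"KL+LL={kl + ll:.4f}, -H[P_D]={neg_entropy:.4f}")
```

Every candidate \(\boldsymbol{\Theta}\) should report the same value of \(\mathrm{KL} + \langle \log P_{\boldsymbol{\Theta}} \rangle_{P_{D}}\), and the last candidate, \(\boldsymbol{\Theta} = P_{D}\), attains \(\mathrm{KL} = 0\) together with the largest average log-likelihood, exactly as the derivation predicts.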