Posted on Aug 17, 2019
In recent years, variational auto-encoders (VAEs) have become widely used for unsupervised learning (Kingma and Welling, 2013). A VAE is built from standard neural networks that learn a representation of the input, so it can be trained with stochastic gradient descent. It comprises two modules: the encoder and the decoder. The encoder maps the input \(\mathbf{x}\) to a latent variable \(\mathbf{z}\), which captures the representation. The decoder reconstructs the input from the latent variable.
Let \(\mathbf{z} = z_{1:m}\) denote the latent variables and \(\mathbf{x} = x_{1:n}\) denote the observations. Under the generative process, the marginal likelihood of \(\mathbf{x}\) is \begin{align} p_{\theta}(\mathbf{x}) = \int p_{\theta}(\mathbf{x} | \mathbf{z})\,p_{\theta}(\mathbf{z})\, \text{d}\mathbf{z} \end{align} where \(\theta\) represents the trainable parameters of the neural network. The framework uses the maximum likelihood principle to generate samples similar to the observed training data. The output distribution (of the generated samples) is generally chosen to be Gaussian for mathematical convenience, i.e., \(p_{\theta}(\mathbf{x} | \mathbf{z}) = \mathcal{N}(\mathbf{x}\, |\, f_{\theta}(\mathbf{z}),\, \sigma^2 \mathbb{I})\). The mean \(f_{\theta}(\mathbf{z})\) is modeled with a neural network, and the covariance is the identity matrix \(\mathbb{I}\) scaled by \(\sigma^2\), where \(\sigma \in \mathbb{R}_{> 0}\) is a hyperparameter.
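To make the Gaussian output distribution concrete, here is a minimal sketch of \(\log \mathcal{N}(\mathbf{x}\, |\, f_{\theta}(\mathbf{z}),\, \sigma^2 \mathbb{I})\) in plain Python. The function name `gaussian_log_likelihood` and the example values are assumptions made for illustration, and the list `decoder_mean` simply stands in for the output of the decoder network \(f_{\theta}(\mathbf{z})\), which is not implemented here.

```python
import math

def gaussian_log_likelihood(x, mean, sigma):
    """Return log N(x | mean, sigma^2 I) for equal-length sequences x and mean."""
    n = len(x)
    squared_error = sum((xi - mi) ** 2 for xi, mi in zip(x, mean))
    # Normalizing constant of an isotropic Gaussian plus the scaled squared error.
    return -0.5 * n * math.log(2.0 * math.pi * sigma ** 2) - squared_error / (2.0 * sigma ** 2)

# Hypothetical observation and decoder output (stand-in for f_theta(z)).
x = [0.5, -1.0, 2.0]
decoder_mean = [0.4, -0.8, 1.5]
print(gaussian_log_likelihood(x, decoder_mean, sigma=1.0))
```

For a fixed \(\sigma\), the only term that depends on the decoder is the squared error, so maximizing this log-likelihood amounts to minimizing a squared reconstruction error scaled by \(1 / (2\sigma^2)\).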
In the vanilla implementation of the VAE, a standard Gaussian prior is assumed on the latent variables, \(p_{\theta}(\mathbf{z}) = \mathcal{N}(\mathbf{z}\, |\, \mathbf{0},\, \mathbb{I})\). We take the approximate posterior distribution (also known as the recognition model) over the latent space to be a diagonal Gaussian, i.e., \(q_{\phi}(\mathbf{z} | \mathbf{x}) = \mathcal{N}(\mathbf{z}\, |\, \mathbf{\mu}_{\phi}(\mathbf{x}),\, \mathbf{\sigma}_{\phi}^2(\mathbf{x}))\). The figure (on the right) shows the graphical model of a variational auto-encoder, where the generative distributions are parameterized by \(\theta\) (neural networks) and the variational distribution is parameterized by \(\phi\). Following the derivation from the previous section, the ELBO is \begin{align}\label{eq:vae_objective} \mathcal{L}(\theta, \phi; \mathbf{x}) = - \mathcal{KL} \left( q_{\phi}(\mathbf{z} | \mathbf{x})\, ||\, p_{\theta}(\mathbf{z}) \right) + \mathbb{E}_{q_{\phi}(\mathbf{z} | \mathbf{x})} \left[ \log\,p_{\theta}(\mathbf{x} | \mathbf{z}) \right] \end{align} The gradient with respect to the variational parameters \(\phi\) is approximated using a Monte Carlo gradient estimator. The second term in the above equation is the expected log-likelihood, which can be read as the negative expected reconstruction error, while the first term acts as a regularizer that pushes the variational distribution toward the prior.
When both distributions are Gaussian, the Kullback-Leibler divergence has a closed form. Writing it with the negative sign with which it enters the ELBO, \begin{align} -\mathcal{KL} \left( q(\mathbf{z})\, ||\, p(\mathbf{z}) \right) &= \int q(\mathbf{z}) \left(\log p(\mathbf{z}) - \log q(\mathbf{z}) \right) \text{d}\mathbf{z}\\ &= \int \mathcal{N}(\mathbf{z}\, |\, \mathbf{\mu}, \mathbf{\sigma}^2) \log \mathcal{N}(\mathbf{z}\, |\, \mathbf{0}, \mathbf{I}) \text{d}\mathbf{z} - \int \mathcal{N}(\mathbf{z}\, |\, \mathbf{\mu}, \mathbf{\sigma}^2) \log \mathcal{N}(\mathbf{z}\, |\, \mathbf{\mu}, \mathbf{\sigma}^2) \text{d}\mathbf{z} \\ &= - \frac{M}{2} \log(2 \pi) - \frac{1}{2} \sum_{m=1}^M (\mu_m^2 + \sigma_m^2) + \frac{M}{2} \log(2\pi) + \frac{1}{2} \sum_{m=1}^M(1 + \log \sigma_m^2) \\ &= \frac{1}{2} \sum_{m=1}^M \left( 1 + \log \sigma_m^2 - \mu_m^2 - \sigma_m^2 \right) \end{align} where \(M\) is the dimensionality of the latent space. Although the expectation of the log-likelihood can be approximated using Monte Carlo estimates, we cannot backpropagate through the sampling step directly, because drawing \(\mathbf{z}\) from \(q_{\phi}\) is not a differentiable function of \(\phi\). This issue is addressed with the reparameterization trick (Rezende et al., 2014; Kingma and Welling, 2013), which moves the sampling to an input layer; this makes the sample a differentiable transformation of a fixed random source. A sample from \(\mathcal{N}(\mathbf{z}\, |\, \mu_{\phi}(\mathbf{x}),\, \sigma_{\phi}^2(\mathbf{x}))\) can be generated thus: \begin{align} \begin{split} \epsilon &\sim \mathcal{N}(\mathbf{0}, \mathbf{I}) \\ \mathbf{z} &= \mu_{\phi} + \sigma_{\phi} \odot \epsilon \end{split} \end{align} where \(\odot\) denotes element-wise multiplication. This technique moves the randomness from the latent variable \(\mathbf{z}\) to \(\epsilon\), which does not depend on the parameters \(\phi\), thus allowing gradient computation.
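As a sanity check on the two ingredients above, the following sketch computes the closed-form divergence \(\mathcal{KL}(q\, ||\, p) = \frac{1}{2} \sum_m (\mu_m^2 + \sigma_m^2 - 1 - \log \sigma_m^2)\) and compares it with a Monte Carlo estimate built from reparameterized samples \(\mathbf{z} = \mu + \sigma \odot \epsilon\). It uses only the Python standard library. All function names, the values of `mu` and `sigma`, and the sample count are assumptions chosen for illustration; in a real VAE, \(\mu\) and \(\sigma\) would come from the encoder network.

```python
import math
import random

def kl_diag_gaussian_to_standard(mu, sigma):
    """Closed-form KL( N(mu, diag(sigma^2)) || N(0, I) )."""
    return 0.5 * sum(m * m + s * s - 1.0 - 2.0 * math.log(s) for m, s in zip(mu, sigma))

def reparameterized_sample(mu, sigma, rng):
    """Draw eps ~ N(0, I) and return (z, eps) with z = mu + sigma * eps element-wise."""
    eps = [rng.gauss(0.0, 1.0) for _ in mu]
    z = [m + s * e for m, s, e in zip(mu, sigma, eps)]
    return z, eps

def log_q_minus_log_p(z, mu, sigma):
    """log q(z) - log p(z) for q = N(mu, diag(sigma^2)) and p = N(0, I)."""
    total = 0.0
    for zi, m, s in zip(z, mu, sigma):
        log_q = -0.5 * math.log(2.0 * math.pi * s * s) - (zi - m) ** 2 / (2.0 * s * s)
        log_p = -0.5 * math.log(2.0 * math.pi) - zi * zi / 2.0
        total += log_q - log_p
    return total

def monte_carlo_kl(mu, sigma, num_samples, rng):
    """Estimate KL(q || p) as the average of log q(z) - log p(z) over samples z ~ q."""
    total = 0.0
    for _ in range(num_samples):
        z, _eps = reparameterized_sample(mu, sigma, rng)
        total += log_q_minus_log_p(z, mu, sigma)
    return total / num_samples

rng = random.Random(0)   # fixed seed so the run is repeatable
mu = [0.5, -1.0]         # hypothetical encoder mean
sigma = [0.8, 1.5]       # hypothetical encoder standard deviation
closed_form = kl_diag_gaussian_to_standard(mu, sigma)
estimate = monte_carlo_kl(mu, sigma, 20000, rng)
print(closed_form, estimate)
```

The two printed numbers should agree up to Monte Carlo noise. Because \(\epsilon\) is drawn independently of \(\mu\) and \(\sigma\), each sample is a deterministic function of them with \(\partial z_m / \partial \mu_m = 1\) and \(\partial z_m / \partial \sigma_m = \epsilon_m\), which is what lets an automatic differentiation framework pass gradients through the sample.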