\( \newcommand{\N}{\mathcal{N}} \newcommand{\X}{\mathcal{X}} \newcommand{\R}{\mathbb{R}} \newcommand{\Pd}{\mathscr{P}} \newcommand{\Exp}[1]{\mathbb{E}\left[#1\right]} \newcommand{\Prob}[1]{\mathbf{P}\left(#1\right)} \newcommand{\bfX}{\mathbf{X}} \newcommand{\bfy}{\mathbf{y}} \)


All Gauss Everything

Multivariate Normal Distribution

A random vector is Gaussian distributed \(X \sim \N(\mu, \Sigma)\) with mean \(\mu \in \R^k\) and positive definite covariance matrix \(\Sigma \in \R^{k \times k}\) if its pdf is \[f_X(x) = \frac{1}{(2\pi)^{k/2} |\Sigma|^{1/2}} \exp \left(-\frac{1}{2}(x-\mu)^T \Sigma^{-1} (x -\mu)\right).\]
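As a quick sanity check, the density can be evaluated numerically; a minimal numpy sketch (the helper name mvn_pdf is ours, not from any library):

```python
import numpy as np

def mvn_pdf(x, mu, Sigma):
    """Evaluate the N(mu, Sigma) density at x, following the formula above."""
    k = len(mu)
    diff = x - mu
    _, logdet = np.linalg.slogdet(Sigma)          # log|Sigma|
    quad = diff @ np.linalg.solve(Sigma, diff)    # (x-mu)^T Sigma^{-1} (x-mu)
    return np.exp(-0.5 * (k * np.log(2 * np.pi) + logdet + quad))

# example: the standard bivariate normal at the origin has density 1/(2 pi)
print(mvn_pdf(np.zeros(2), np.zeros(2), np.eye(2)))
```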

Proposition.

An affine transformation of a Gaussian distributed vector \(X \sim \N(\mu, \Sigma)\) is again Gaussian, \[\nu + U X \sim \N(\nu + U\mu, U\Sigma U^T).\]

This can be seen easily by considering the moment generating function.

\(\Box\)

Proposition.

If \(Z = (X, Y)^T\) is jointly Gaussian, \begin{align} \begin{pmatrix} X \\ Y \end{pmatrix} \sim \N \left( \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix} , \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix} \right), \end{align} where \(X\) has dimension \(p\), \(Y\) has dimension \(q\) and \(p+q=k\), then the marginal distributions are also Gaussian, \begin{align} X \sim \N(\mu_1, \Sigma_{11}), \quad Y \sim \N(\mu_2, \Sigma_{22}), \end{align} and the conditional distributions are also Gaussian, \begin{align} X|Y &\sim \N(\mu_1 + \Sigma_{12} \Sigma_{22}^{-1}(Y - \mu_2),\ \Sigma_{11} - \Sigma_{12} \Sigma_{22}^{-1}\Sigma_{12}^T), \end{align} \begin{align} Y|X &\sim \N(\mu_2 + \Sigma_{12}^T \Sigma_{11}^{-1}(X - \mu_1),\ \Sigma_{22} - \Sigma_{12}^T \Sigma_{11}^{-1}\Sigma_{12}), \end{align} where \(\Sigma_{21} = \Sigma_{12}^T\) by symmetry.

Claim 1.

For a regular (invertible) symmetric matrix \begin{align} A = \begin{pmatrix} A_{11} & A_{12} \\ A_{12}^T & A_{22} \end{pmatrix} \end{align} with inverse \[B = A^{-1} = \begin{pmatrix} B_{11} & B_{12} \\ B_{21} & B_{22} \end{pmatrix},\] where both matrices are partitioned conformably into blocks (so that \(A_{11}\) and \(B_{11}\) are \(p \times p\) and \(A_{22}\) and \(B_{22}\) are \(q \times q\)) and \(A_{11}\), \(A_{22}\) are themselves invertible, we have \begin{align} B_{11} &= (A_{11} - A_{12}A_{22}^{-1}A_{12}^T)^{-1} \\ B_{22} &= (A_{22} - A_{12}^T A_{11}^{-1}A_{12})^{-1} \\ B_{12} &= B_{21}^T = -A_{11}^{-1}A_{12}(A_{22}-A_{12}^T A_{11}^{-1} A_{12})^{-1} \end{align}

We have \begin{align} I = AB &= \begin{pmatrix} A_{11}B_{11} + A_{12} B_{21} & A_{11}B_{12} + A_{12} B_{22} \\ A_{21}B_{11} + A_{22} B_{21} & A_{21}B_{12} + A_{22} B_{22} \end{pmatrix} \\ &= \begin{pmatrix} I_1 & 0 \\ 0 & I_2 \end{pmatrix}. \end{align} Since \(A_{21} = A_{12}^T\), this can be treated as a system of linear equations in the blocks of \(B\), which we solve to obtain the formulas above.

\(\Box\)
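These block formulas are easy to verify numerically; a small sketch with a random positive definite matrix (block sizes chosen arbitrarily):

```python
import numpy as np

rng = np.random.default_rng(0)
p, q = 2, 3
M = rng.standard_normal((p + q, p + q))
A = M @ M.T + (p + q) * np.eye(p + q)     # symmetric positive definite, hence regular

A11, A12, A22 = A[:p, :p], A[:p, p:], A[p:, p:]
B = np.linalg.inv(A)

B11 = np.linalg.inv(A11 - A12 @ np.linalg.inv(A22) @ A12.T)
B22 = np.linalg.inv(A22 - A12.T @ np.linalg.inv(A11) @ A12)
B12 = -np.linalg.inv(A11) @ A12 @ B22     # = B21^T

assert np.allclose(B[:p, :p], B11)
assert np.allclose(B[p:, p:], B22)
assert np.allclose(B[:p, p:], B12)
```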

Claim 2.

For symmetric \(A\) \begin{align} |A| &= |A_{22}|| A_{11} - A_{12} A_{22}^{-1} A_{12}^T | \\ &= |A_{11}|| A_{22} - A_{12}^T A_{11}^{-1} A_{12} | . \end{align}

\begin{align} A &= \begin{pmatrix} A_{11} & 0 \\ A_{12}^T & I_2 \end{pmatrix} \begin{pmatrix} I_1 & A_{11}^{-1}A_{12} \\ 0 & A_{22} - A_{12}^T A_{11}^{-1} A_{12} \end{pmatrix} \\ & = \begin{pmatrix} I_1 & A_{12} \\ 0 & A_{22} \end{pmatrix} \begin{pmatrix} A_{11} - A_{12} A_{22}^{-1} A_{12}^T & 0 \\ A_{22}^{-1}A_{12}^T & I_2 \end{pmatrix} \end{align} All four factors are block triangular, so their determinants are the products of the determinants of their diagonal blocks; taking determinants of both factorisations gives the claim.

\(\Box\)
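The determinant identity can be checked the same way:

```python
import numpy as np

rng = np.random.default_rng(1)
p, q = 2, 3
M = rng.standard_normal((p + q, p + q))
A = M @ M.T + (p + q) * np.eye(p + q)
A11, A12, A22 = A[:p, :p], A[:p, p:], A[p:, p:]

det_A = np.linalg.det(A)
via_A22 = np.linalg.det(A22) * np.linalg.det(A11 - A12 @ np.linalg.inv(A22) @ A12.T)
via_A11 = np.linalg.det(A11) * np.linalg.det(A22 - A12.T @ np.linalg.inv(A11) @ A12)

assert np.allclose([det_A, det_A], [via_A22, via_A11])
```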

Now write \begin{align} \Sigma^{-1} = \begin{pmatrix} \Sigma^{11} & \Sigma^{12} \\ \Sigma^{21} & \Sigma^{22} \end{pmatrix}. \end{align} By Claim 1 we have \begin{align} \Sigma^{11} &= (\Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{12}^T)^{-1} \\ &= \Sigma_{11}^{-1} + \Sigma_{11}^{-1}\Sigma_{12}(\Sigma_{22} - \Sigma_{12}^T\Sigma_{11}^{-1}\Sigma_{12})^{-1} \Sigma_{12}^T \Sigma_{11}^{-1} \end{align} \begin{align} \Sigma^{22} &= (\Sigma_{22} - \Sigma_{12}^T\Sigma_{11}^{-1}\Sigma_{12})^{-1} \\ &= \Sigma_{22}^{-1} + \Sigma_{22}^{-1}\Sigma_{12}^T (\Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{12}^T)^{-1} \Sigma_{12}\Sigma_{22}^{-1} \end{align} \begin{align} \Sigma^{12} &= -\Sigma_{11}^{-1}\Sigma_{12}(\Sigma_{22}-\Sigma_{12}^T \Sigma_{11}^{-1} \Sigma_{12})^{-1} = (\Sigma^{21})^T, \end{align} where the second equalities for \(\Sigma^{11}\) and \(\Sigma^{22}\) are instances of the Woodbury matrix identity.

Now we can write \begin{align} &(z - \mu)^T \Sigma^{-1}(z - \mu) \\ &= ((x - \mu_1)^T, (y - \mu_2)^T) \begin{pmatrix} \Sigma^{11} & \Sigma^{12} \\ \Sigma^{21} & \Sigma^{22} \end{pmatrix} \begin{pmatrix} x - \mu_1 \\ y - \mu_2 \end{pmatrix} \\ &= (x - \mu_1)^T \Sigma^{11}(x - \mu_1) \\ &\quad+ 2(x - \mu_1)^T \Sigma^{12}(y - \mu_2) \\ &\quad+ (y - \mu_2)^T \Sigma^{22}(y - \mu_2) \\ &= (x - \mu_1)^T \Sigma_{11}^{-1}(x - \mu_1) \\ &\quad+ (\Sigma_{12}^T \Sigma_{11}^{-1}(x - \mu_1))^T \\ &\quad\quad\,(\Sigma_{22} - \Sigma_{12}^T\Sigma_{11}^{-1}\Sigma_{12})^{-1} \\ &\quad\quad\, (\Sigma_{12}^T \Sigma_{11}^{-1}(x - \mu_1) )\\ &\quad- 2(\Sigma_{12}^T \Sigma_{11}^{-1}(x - \mu_1))^T(\Sigma_{22}-\Sigma_{12}^T \Sigma_{11}^{-1} \Sigma_{12})^{-1}(y - \mu_2) \\ &\quad+ (y - \mu_2)^T (\Sigma_{22} - \Sigma_{12}^T\Sigma_{11}^{-1}\Sigma_{12})^{-1}(y - \mu_2) \\ &= (x - \mu_1)^T \Sigma_{11}^{-1}(x - \mu_1) \\ &\quad+ ((y - \mu_2)-\Sigma_{12}^T \Sigma_{11}^{-1}(x - \mu_1))^T \\ &\quad\quad\, (\Sigma_{22} - \Sigma_{12}^T\Sigma_{11}^{-1}\Sigma_{12})^{-1} \\ &\quad\quad\, ((y - \mu_2)-\Sigma_{12}^T \Sigma_{11}^{-1}(x - \mu_1)) \end{align}

Now let \begin{align} \Sigma_X &:= \Sigma_{11}, \\ \mu_{Y|X} &:= \mu_2 + \Sigma_{12}^T \Sigma_{11}^{-1}(x - \mu_1), \\ \Sigma_{Y|X} &:= \Sigma_{22} - \Sigma_{12}^T\Sigma_{11}^{-1}\Sigma_{12}, \end{align} and \begin{align} Q(x,y) &:= ((x - \mu_1)^T, (y - \mu_2)^T) \Sigma^{-1} \begin{pmatrix} x - \mu_1 \\ y - \mu_2 \end{pmatrix}, \\ Q_X(x) & := (x - \mu_1)^T \Sigma_X^{-1}(x - \mu_1), \\ Q_{Y|X}(y) &:= (y - \mu_{Y|X})^T \Sigma_{Y|X}^{-1}(y - \mu_{Y|X}). \end{align} We already showed that \[Q(x,y) = Q_X(x) + Q_{Y|X}(y),\] and from Claim 2 we have \[|\Sigma| = |\Sigma_{11}||\Sigma_{22} - \Sigma_{12}^T\Sigma_{11}^{-1}\Sigma_{12}| = |\Sigma_X||\Sigma_{Y|X}|.\]

For the marginal pdf it holds that \begin{align} f_X(x) &= \int_y f_Z(x,y)dy \\ &= \int_y \frac{1}{(2\pi)^{k/2} |\Sigma|^{1/2}} \exp\left( -\frac{1}{2} Q(x,y) \right) dy \\ &= \frac{1}{(2\pi)^{p/2} |\Sigma_X|^{1/2}} \exp\left( -\frac{1}{2} Q_X(x) \right) \\ &\quad\quad \int_y \frac{1}{(2\pi)^{q/2} |\Sigma_{Y|X}|^{1/2}} \exp\left( -\frac{1}{2} Q_{Y|X}(y) \right) dy \\ &= \frac{1}{(2\pi)^{p/2} |\Sigma_X|^{1/2}} \exp\left( -\frac{1}{2} Q_X(x) \right), \end{align} since the last integrand is the density of \(\N(\mu_{Y|X}, \Sigma_{Y|X})\) and therefore integrates to one.

Lastly, for the conditional distribution \begin{align} f_{Y|X}(y) &= \frac{f_Z(x,y)}{f_X(x)} \\ &= \frac{(2\pi)^{p/2} |\Sigma_X|^{1/2}}{(2\pi)^{k/2} |\Sigma|^{1/2}} \exp\left( -\frac{1}{2}(Q_X(x) + Q_{Y|X}(y) - Q_X(x)) \right) \\ &= \frac{1}{(2\pi)^{q/2} |\Sigma_{Y|X}|^{1/2}} \exp\left( -\frac{1}{2} Q_{Y|X}(y) \right) . \end{align}

\(\Box\)
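The conditional formula itself can be sanity checked numerically via \(f_{Y|X}(y) = f_Z(x,y)/f_X(x)\); a small sketch using scipy, with arbitrary dimensions and numbers:

```python
import numpy as np
from scipy.stats import multivariate_normal

# arbitrary example: X is 2-dimensional, Y is 1-dimensional
mu1, mu2 = np.array([0.0, 1.0]), np.array([2.0])
S11 = np.array([[2.0, 0.5], [0.5, 1.0]])
S12 = np.array([[0.3], [0.2]])
S22 = np.array([[1.5]])
mu = np.concatenate([mu1, mu2])
Sigma = np.block([[S11, S12], [S12.T, S22]])

x, y = np.array([0.5, 0.7]), np.array([1.8])

# conditional Y | X = x from the proposition
m_cond = mu2 + S12.T @ np.linalg.solve(S11, x - mu1)
S_cond = S22 - S12.T @ np.linalg.solve(S11, S12)

ratio = multivariate_normal(mu, Sigma).pdf(np.concatenate([x, y])) \
        / multivariate_normal(mu1, S11).pdf(x)
assert np.isclose(ratio, multivariate_normal(m_cond, S_cond).pdf(y))
```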

Gaussian Processes

A Gaussian Process is a mapping from an input space \(\X\) to random variables, \(f:\X \to \Pd(\R).\) Every finite collection of these random variables is jointly Gaussian distributed.

Since a Gaussian distributed random vector \(X = (X_1,\dots,X_n)^T \sim \N(\mu,\Sigma)\) can itself be viewed as a mapping to random variables \(X:\{1,\dots,n\}\to \Pd(\R)\), a Gaussian process can be considered a generalisation of the Gaussian distribution from vectors to functions.

A Gaussian Process is completely defined by a mean function and covariance function \begin{align} m(x) &= \Exp{f(x)}, \\ k(x,x') &= \Exp{(f(x)-m(x))(f(x') -m(x'))}. \end{align}

So we can specify the mean function \(m\) and the kernel function \(k\) and get a Gaussian process \(f \sim GP(m,k).\)

For multiple inputs \(\bfX = (x_1, \dots, x_n)^T \in \X^n\) we have \begin{align} f(\bfX) \sim \N(m(\bfX), k(\bfX,\bfX)), \end{align} where \begin{align} m(\bfX)= \begin{pmatrix} m(x_1) \\ \vdots \\ m(x_n) \end{pmatrix}, \quad k(\bfX,\bfX) = \begin{pmatrix} k(x_1,x_1) & \cdots & k(x_1,x_n) \\ \vdots & & \vdots \\ k(x_n, x_1) & \cdots & k(x_n,x_n) \end{pmatrix}. \end{align}

A common choice for the kernel is the Squared Exponential (SE) \[k(x,x') = \sigma^2 \exp\left(-\frac{(x-x')^2}{2\ell^2}\right),\] which is a positive definite kernel.
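For scalar inputs, evaluating the mean vector and kernel matrix could look like the following sketch (zero mean and the SE kernel are just one possible choice; the helper names are ours):

```python
import numpy as np

def se_kernel(x1, x2, sigma=1.0, ell=1.0):
    """Squared exponential kernel sigma^2 * exp(-(x - x')^2 / (2 ell^2))."""
    return sigma**2 * np.exp(-((x1 - x2) ** 2) / (2 * ell**2))

def zero_mean(x):
    """Constant zero mean function m(x) = 0."""
    return np.zeros_like(x)

def gram(kernel, X1, X2):
    """Kernel matrix with entries kernel(x_i, x_j') for 1-d input arrays."""
    return kernel(X1[:, None], X2[None, :])

X = np.linspace(0.0, 5.0, 6)    # six training inputs
m_X = zero_mean(X)              # m(X)
K = gram(se_kernel, X, X)       # k(X, X), shape (6, 6)
```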

Predictions.

If we know the function values \(\bfy = (y_1, \dots, y_n)\) at training inputs \(\bfX = (x_1, \dots, x_n)^T \in \X^n\) of a deterministic model \[F(x_i) = y_i \in \R,\] we can specify a prior in terms of a mean and kernel function and then make predictions by conditioning on the training data.

For test inputs \(\bfX_* = (x_1^*, \dots, x_k^*)\) we can calculate the distribution of the test outputs \(\bfy_* = (y_1^*, \dots, y_k^*)\) given our data \((\bfX, \bfy)\) by noting that \begin{align} \begin{pmatrix} \bfy \\ \bfy_* \end{pmatrix} &\sim \N \left( \begin{pmatrix} m(\bfX) \\ m(\bfX_*) \end{pmatrix}, \begin{pmatrix} k(\bfX,\bfX) & k(\bfX,\bfX_*) \\ k(\bfX_*,\bfX) & k(\bfX_*,\bfX_*) \end{pmatrix} \right) \\ &=: \N\left( \begin{pmatrix} \mu \\ \mu_* \end{pmatrix}, \begin{pmatrix} K & K_* \\ K_*^T & K_{**} \end{pmatrix} \right). \end{align} By the conditioning proposition above, \begin{align} \bfy_* \,|\, \bfX, \bfy, \bfX_* &\sim \N(\mu_* + K_*^T K^{-1}(\bfy - \mu),\,K_{**} - K_*^T K^{-1}K_*), \end{align} since conditioning on the data \((\bfX, \bfy)\) is the same as conditioning on \(F(\bfX)=\bfy\).

Posterior GP.

From this formula the definition of the posterior Gaussian process \(GP(\hat m,\hat k)\) given data \((\bfX, \bfy)\) is clear: \begin{align} \hat m (x) &:= m(x) + k(x, \bfX)\, k(\bfX, \bfX)^{-1} (\bfy - m(\bfX)), \\ \hat k (x, x') &:= k(x,x') - k(x,\bfX)\, k(\bfX, \bfX)^{-1} k(\bfX, x'). \end{align} Evaluating the posterior GP at the test inputs \(\bfX_*\) yields exactly the predictive distribution of the prior GP at \(\bfX_*\) given \((\bfX, \bfy)\).
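A minimal numpy sketch of these prediction and posterior formulas (the helper is ours; it expects kernel and mean callables like the se_kernel and zero_mean sketched above, and the noise argument anticipates the noisy-data adjustment below):

```python
import numpy as np

def gp_posterior(X_train, y_train, X_test, kernel, mean, noise=0.0):
    """Posterior mean and covariance at X_test, conditioning on (X_train, y_train)."""
    K = kernel(X_train[:, None], X_train[None, :]) + noise**2 * np.eye(len(X_train))
    K_star = kernel(X_train[:, None], X_test[None, :])       # k(X, X_*)
    K_ss = kernel(X_test[:, None], X_test[None, :])          # k(X_*, X_*)

    alpha = np.linalg.solve(K, y_train - mean(X_train))      # K^{-1} (y - m(X))
    post_mean = mean(X_test) + K_star.T @ alpha              # posterior mean at X_*
    post_cov = K_ss - K_star.T @ np.linalg.solve(K, K_star)  # posterior covariance at X_*
    return post_mean, post_cov
```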

This process can look like the plots below, where the mean function is the solid blue line and the shaded region is the 95% confidence interval at each point. On the right we have observed data and are therefore more certain about how the function looks around these points.

Noisy Training Data.

If we only know \[F(x_i) + \epsilon_i = y_i,\] with \(\epsilon_i \sim \N(0,\sigma_\epsilon^2)\) independent across \(i\), then we have to adjust the predictions and the posterior GP by replacing \(K\) with \(K + \sigma_\epsilon^2 \mathbf{I}\).

Sampling

A sample from a Gaussian Process \(GP(m,k)\) is a realisation \(f_i :\X \to \R\), i.e. a function. As we are only able to evaluate the function at a finite number of points \(\bfX_* = (x_1^*, \dots, x_k^*)\), we can simply sample from \[\N(m(\bfX_*), k(\bfX_*, \bfX_*)) = \N(\mu_*, K_{**}),\] or from \[\N(\mu_* + K_*^T K^{-1}(\bfy - \mu),\,K_{**} - K_*^T K^{-1}K_*)\] if we have observed data.
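In code, such a draw at a fixed set of test points could look like the following sketch (the jitter term is purely a numerical safeguard, not part of the model):

```python
import numpy as np

def sample_gp(mean_vec, cov, n_samples=10, jitter=1e-10, seed=0):
    """Draw n_samples realisations of f(X_*) from N(mean_vec, cov)."""
    rng = np.random.default_rng(seed)
    L = np.linalg.cholesky(cov + jitter * np.eye(len(mean_vec)))  # cov = L L^T
    Z = rng.standard_normal((len(mean_vec), n_samples))
    return mean_vec[:, None] + L @ Z   # each column is one sampled function
```

Passing the prior mean vector and kernel matrix gives prior samples; passing the posterior quantities from the prediction sketch above gives samples conditioned on the data.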

Drawing 10 samples can look something like this:

Bayesian Selection of Kernel Parameters.

Since we want the predictions to be as accurate as possible, we can first adapt the hyperparameters of the mean and kernel functions to fit our training data.

First note that \[ \Prob{f | \bfX, \bfy, \theta} = \frac{\Prob{\bfy| \bfX, f, \theta}\Prob{f|\theta}}{\Prob{\bfy | \bfX, \theta}}, \] where \(\theta\) are the hyperparameters and \[ \Prob{\bfy | \bfX, f, \theta} = \prod_{i=1}^n \Prob{y_i| f(x_i), \theta} \] is the likelihood of \(f\) (where the observation noise of the model plays the major role).

The marginal likelihood, also called the evidence, is defined by \[\Prob{\bfy | \bfX, \theta} = \int \Prob{\bfy | \bfX, f, \theta} \Prob{f |\theta} df.\]

The posterior of the hyperparameters is \[ \Prob{\theta | \bfX, \bfy} = \frac{\Prob{\bfy | \bfX, \theta}\Prob{\theta}}{\Prob{\bfy | \bfX}}. \] If we assume a flat prior, so \(\theta\) is chosen uniformly, the log posterior equals the log marginal likelihood up to an additive constant, \begin{align} &\log \Prob{\theta | \bfX, \bfy} = \log \Prob{\bfy | \bfX, \theta} + \mathrm{const} \\ &\quad= -\frac{1}{2}(\bfy - m(\bfX))^T K^{-1}(\bfy - m(\bfX)) -\frac{1}{2}\log|K| -\frac{n}{2}\log(2\pi) + \mathrm{const}, \end{align} which can be maximised over \(\theta\).
This is called type II maximum likelihood estimation.
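A sketch of this log marginal likelihood in numpy, using the Cholesky factorization discussed in the implementation tips below (kernel and mean are callables as in the sketches above, and the observation noise is treated as one of the hyperparameters):

```python
import numpy as np

def log_marginal_likelihood(X, y, kernel, mean, noise):
    """log p(y | X, theta) for a GP prior with Gaussian observation noise."""
    n = len(X)
    K = kernel(X[:, None], X[None, :]) + noise**2 * np.eye(n)
    L = np.linalg.cholesky(K)                             # K = L L^T
    r = y - mean(X)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, r))   # K^{-1} (y - m(X))
    return (-0.5 * r @ alpha
            - np.sum(np.log(np.diag(L)))                  # = -0.5 * log|K|
            - 0.5 * n * np.log(2 * np.pi))
```

Maximising this over the hyperparameters (e.g. \(\sigma\), \(\ell\) and \(\sigma_\epsilon\)) with any numerical optimiser carries out the type II maximum likelihood estimation.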

Implementation Tips.

Explicitly inverting a matrix is numerically problematic, but since the kernel matrices are usually positive definite we can use the Cholesky factorization instead.
Write \(K = U^T U\) for an upper triangular matrix \(U.\) Solve the system \(U^T L = K_*\) for \(L\) (fast). Since then \(L = (U^{-1})^T K_*\) we have \[L^T L = K_*^T U^{-1} (U^{-1})^T K_* = K_*^T K^{-1} K_*.\]
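With scipy's triangular solver the trick could look like this sketch (the small matrices are arbitrary and just for illustration):

```python
import numpy as np
from scipy.linalg import cholesky, solve_triangular

K = np.array([[2.0, 0.5], [0.5, 1.0]])            # some positive definite K
K_star = np.array([[0.3, 0.1, 0.4], [0.2, 0.7, 0.5]])

U = cholesky(K, lower=False)                      # upper triangular, K = U^T U
L = solve_triangular(U.T, K_star, lower=True)     # solves U^T L = K_*
quad = L.T @ L                                    # = K_*^T K^{-1} K_*

assert np.allclose(quad, K_star.T @ np.linalg.inv(K) @ K_star)
```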

For sampling we can use a standard normal vector \(Z \sim \N(0, \mathbf{I})\) and unwhiten it with \(K_{**}\). That is, let \(U^T U = K_{**}\) and then \[U^T Z \sim \N(0, U^T \mathbf{I}U) = \N(0, K_{**}).\]
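The same unwhitening in numpy; np.linalg.cholesky returns the lower triangular factor, which plays the role of \(U^T\) above (numbers are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
K_ss = np.array([[1.0, 0.8], [0.8, 1.0]])   # some k(X_*, X_*)
Ut = np.linalg.cholesky(K_ss)               # lower triangular, K_ss = Ut Ut^T
z = rng.standard_normal(2)
sample = Ut @ z                             # distributed as N(0, K_ss)
```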
