Bayesian Inference
We assume that we have a model and want to estimate its parameters from observed data.
Let's denote the observed data by \(D\) and the parameters by \(Q\), which are both considered to be random variables.
Define
\[\pi(q|d) = \P(Q=q|D=d)\]
the probability of parameter \(q\) given the observed data \(d.\)
Similarly, let
\[\pi(d|q) = \P(D=d|Q=q)\]
be the likelihood of the parameter \(q\). It is the probability that we observe the data \(d\) given the underlying model parameter \(q.\)
By Bayes' theorem \begin{align} \pi(q|d) = \frac{\pi(d|q)\pi_0(q)}{\pi_D(d)} = \frac{\pi(d|q)\pi_0(q)}{\int \pi(d|q)\pi_0(q)dq}, \end{align} where \(\pi_0(q) = \P(Q=q)\) is the prior information about the parameter and \(\pi_D(d) = \P(D=d)\) is the probability of the data. \(\pi(q|d)\) is called the posterior information about the parameter (after we observed the data).
In general, \(\pi_0\) is chosen through domain knowledge (e.g. the physical bounds of the parameter) and \(\pi(d|q)\) is derived from the model assumptions. Calculating \(\pi_D(d)\) is hard, but there are tricks to avoid it.
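As a concrete illustration, here is a minimal sketch (for a hypothetical one-dimensional parameter) of how the posterior can be approximated on a grid by evaluating \(\pi(d|q)\pi_0(q)\) pointwise and normalising numerically; the toy likelihood and flat prior below are purely illustrative.

    import numpy as np

    # Approximate the posterior on a grid: evaluate pi(d|q) * pi_0(q) pointwise
    # and normalise numerically (the integral pi_D(d) becomes a Riemann sum).
    def grid_posterior(likelihood, prior, q_grid):
        unnormalised = likelihood(q_grid) * prior(q_grid)
        evidence = np.sum(unnormalised) * (q_grid[1] - q_grid[0])  # approximates pi_D(d)
        return unnormalised / evidence                             # pi(q|d) on the grid

    q_grid = np.linspace(0.0, 1.0, 1001)
    posterior = grid_posterior(lambda q: np.exp(-(q - 0.3)**2 / 0.02),  # toy likelihood
                               lambda q: np.ones_like(q),               # flat prior
                               q_grid)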
Model Assumptions
Our assumption is that we have a correct model parametrised by \(Q\) and observe the data with additive Gaussian noise, \[d_i = f(t_i;Q) + \epsilon_i,\] where \(\epsilon_i \sim \mathcal{N}(0,\sigma^2)\) independently. Note that \(\sigma\) is assumed to be known (for example the measurement precision).
For this model assumption we can easily compute the likelihood \begin{align} \pi(d|q) &= \prod_{i=1}^N \frac{1}{\sqrt{2\pi}\sigma} \exp\left(-\frac{(d_i-f(t_i;q))^2}{2\sigma^2}\right) \\ &= \frac{1}{(2\pi\sigma^2)^{N/2}}\exp\left(-\frac{1}{2\sigma^2}\sum_{i=1}^N (d_i - f(t_i;q))^2\right). \end{align}
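In practice one works with the logarithm of this expression. A small sketch, assuming a model function f(t, q) that returns the model predictions at all observation times (the names and signature are my own, not fixed by the text):

    import numpy as np

    # Log of the Gaussian likelihood above; f(t, q) is assumed to return the
    # model predictions f(t_i; q) for all observation times t_i at once.
    def gaussian_log_likelihood(q, f, t, d, sigma):
        residuals = d - f(t, q)
        N = len(d)
        return -0.5 * N * np.log(2 * np.pi * sigma**2) - np.sum(residuals**2) / (2 * sigma**2)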
Example
Assume that we want to investigate the population dynamics of rabbits \(x\) and foxes \(y\). We know the following relation of the populations for discrete time steps \begin{align*} x_{t+1} &= \alpha x_t - \beta x_t y_t \\ y_{t+1} &= \gamma y_t + \delta x_t y_t. \end{align*} Assume that we know that foxes die off exponentially without prey \[y_{t+1} = 0.9 y_t \quad (\gamma = 0.9). \] Further, assume we understand the population dynamics of rabbits with predator interaction completely \[x_{t+1} = 1.1 x_t - 0.15 x_t y_t \quad (\alpha = 1.1, \beta = 0.15).\] It remains to determine \(\delta\). Luckily we know that \(0.05 \le \delta \le 0.20\) and observed the following data with 5 percent precision \(\sigma = 0.05.\)
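Such data can be generated in spirit by simulating the recursion forward and adding the Gaussian observation noise. In the sketch below, the initial populations, the number of time steps and the value delta = 0.12 are illustrative assumptions, not the values behind the original figure.

    import numpy as np

    # Forward simulation of the discrete predator-prey dynamics with the known
    # parameters alpha = 1.1, beta = 0.15, gamma = 0.9 and unknown delta.
    def simulate(delta, x0=2.0, y0=1.0, steps=20, alpha=1.1, beta=0.15, gamma=0.9):
        x, y = [x0], [y0]
        for _ in range(steps):
            x_t, y_t = x[-1], y[-1]
            x.append(alpha * x_t - beta * x_t * y_t)
            y.append(gamma * y_t + delta * x_t * y_t)
        return np.array(x), np.array(y)

    # Noisy observations as in the model assumptions (sigma = 0.05).
    rng = np.random.default_rng(0)
    x_true, y_true = simulate(delta=0.12)
    data = np.stack([x_true, y_true]) + rng.normal(0.0, 0.05, size=(2, len(x_true)))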
But how do we find the posterior?
Metropolis Hastings Algorithm
Markov Chain Monte Carlo.
A Markov chain is a sequence of random variables \(X_n\), called states, such that
\[\P(X_{n+1}=x_{n+1}|X_1=x_1, \dots, X_n=x_n) = \P(X_{n+1}=x_{n+1}|X_n=x_n),\]
so state transitions only depend on the predecessor state.
A homogeneous (time-independent) Markov chain can be described by a set of states, an initial state distribution \(X_0 \sim p_0\) and a transition kernel
\[p_{ij} = \P(X_{n+1}=x_j|X_n=x_i).\]
If the state space is finite, the \(p_{ij}\) can be collected into a transition matrix \(P.\)
The distribution of the states after \(n\) steps is given by
\[p^n := p_0 P^n\]
(treating the distributions as row vectors).
If a limiting distribution \(\pi\) exists, it must satisfy
\[\pi = \pi P;\] a distribution with this property is called stationary.
If the distribution \(\pi\) is reversible, i.e. \(\pi_i p_{ij} = \pi_j p_{ji}\) for all \(i,j\), it is also stationary (detailed balance).
Theorem.
A finite, homogeneous, irreducible and aperiodic Markov chain has a unique stationary distribution, which is also the limiting state distribution for any initial state distribution.
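A quick numerical check of this statement for a hypothetical two-state chain: iterating \(p_0 P^n\) converges to the unique \(\pi\) with \(\pi = \pi P\).

    import numpy as np

    # Two-state example: irreducible and aperiodic, so the theorem applies.
    P = np.array([[0.9, 0.1],
                  [0.4, 0.6]])
    p = np.array([1.0, 0.0])   # arbitrary initial distribution p_0
    for _ in range(50):
        p = p @ P              # p^n = p_0 P^n
    print(p)                   # approx. [0.8, 0.2]; indeed [0.8, 0.2] @ P = [0.8, 0.2]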
Now the goal is to construct a Markov chain with the posterior as stationary distribution.
Deriving the algorithm.
For an arbitrary distribution \(\omega\) which we already know up to a factor we can construct a Markov chain as follows.
As we have to satisfy the detailed balance condition we require \[P(q'|q)\,\omega(q) = P(q|q')\,\omega(q').\] We separate the transition probability \(P\) into a proposal probability \(J\) and an acceptance probability \(A\), \[P(q'|q) = J(q'|q) A(q'|q).\] Detailed balance then reads \[\frac{A(q'|q)}{A(q|q')} = \frac{\omega(q')J(q|q')}{\omega(q)J(q'|q)}.\] This is satisfied by \[A(q'|q) = \min\left(1,\frac{\omega(q')J(q|q')}{\omega(q)J(q'|q)}\right),\] since whenever the ratio is below one, \(A(q'|q)\) equals that ratio and \(A(q|q')=1\), and vice versa. It is clear that we only need to know \(\omega\) up to a factor. If we choose \[J(.|q) \sim \mathcal{N}(q,V)\] the conditions of the above theorem hold and \(\omega\) is the unique stationary and limiting distribution of the Markov chain.
It does not seem that impressive to be able to generate samples from a distribution when we already know it up to a factor.
But wait a second!
We know
\[\pi(q|d) \propto \pi(d|q)\pi_0(q)\]
and the normalising factor
\[\pi_D(d) = \int \pi(d|q)\pi_0(q)dq\]
is hard to compute whereas \(\pi(d|q)\pi_0(q)\) is not. An ideal application of MCMC.
The algorithm.
1. Start at a random guess \(q=q_0.\)
2. Propose \(q' \sim J(.|q)\).
3. Accept \(q'\) with probability \[A(q'|q) = \min\left(1,\frac{\pi(d|q')\pi_0(q')J(q|q')}{\pi(d|q)\pi_0(q)J(q'|q)}\right),\] otherwise stay at the current \(q\).
4. Repeat 2. and 3.
If all visited \(q\)s are then collected (with the current \(q\) repeated whenever a proposal is rejected) and a burn-in period is discarded, the \(q\)s are distributed according to \(\pi(.|d)\).
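Putting the pieces together for the example, a minimal Metropolis sampler for \(\delta\) could look as follows; it assumes the simulate() function and the data array from the earlier sketch, and the starting value, step size and random seed are arbitrary choices.

    import numpy as np

    # Unnormalised log-posterior: Gaussian log-likelihood (constant dropped, see the
    # tips below) plus a log-prior that is uniform on [0.05, 0.20].
    def log_likelihood(delta, data, sigma=0.05):
        x, y = simulate(delta)
        return -np.sum((data - np.stack([x, y])) ** 2) / (2 * sigma**2)

    def log_prior(delta):
        return 0.0 if 0.05 <= delta <= 0.20 else -np.inf

    def metropolis(data, q0=0.10, steps=100_000, step_size=0.005, seed=1):
        rng = np.random.default_rng(seed)
        q = q0
        log_post = log_likelihood(q, data) + log_prior(q)
        chain = []
        for _ in range(steps):
            q_prop = q + step_size * rng.standard_normal()        # symmetric Gaussian proposal
            log_post_prop = log_likelihood(q_prop, data) + log_prior(q_prop)
            if np.log(rng.uniform()) < log_post_prop - log_post:  # accept with prob. min(1, ratio)
                q, log_post = q_prop, log_post_prop
            chain.append(q)                                       # on rejection the current q repeats
        return np.array(chain)

    samples = metropolis(data)[1000:]   # discard the first thousand samples as burn-in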
Tips.
If \(J(.|q) \sim \mathcal{N}(q,V)\) one can generate \(q'\) by first generating \(z \sim \mathcal{N}(0,I)\) and then computing \[q' = q + V^{1/2}z.\] This process is called unwhitening, and for positive definite \(V\) the square root can be obtained from a Cholesky decomposition \(V = L L^T\), taking \(V^{1/2} = L\) (equivalently \(U^T\) with the upper-triangular factor \(U = L^T\)).
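A small sketch of this recipe with an arbitrary 2x2 covariance (the matrix V below is just an example):

    import numpy as np

    # "Unwhitening": draw z ~ N(0, I) and map it to q' = q + L z, where V = L L^T,
    # so that q' ~ N(q, V).
    V = np.array([[0.04, 0.01],
                  [0.01, 0.09]])
    L = np.linalg.cholesky(V)          # lower-triangular factor, V = L @ L.T
    rng = np.random.default_rng(0)
    z = rng.standard_normal(2)         # z ~ N(0, I)
    q = np.zeros(2)
    q_prop = q + L @ z                 # q' ~ N(q, V)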
If \(J\) is symmetric (as with the Gaussian proposal) the acceptance probability simplifies to \[A(q'|q) = \min\left(1,\frac{\pi(d|q')\pi_0(q')}{\pi(d|q)\pi_0(q)}\right).\]
The constant \(\frac{1}{(2\pi\sigma^2)^{N/2}}\) in the likelihood also cancels in the fraction, which improves numerical stability.
Back to the example.
Letting the Metropolis algorithm run with a uniform prior for \(10^5\) steps and discarding the first thousand \(q\)s as burn-in, we get the following histogram.
We can see that the 95% credible interval is \([0.1195, 0.1218]\) and the maximum a posteriori (MAP) estimate is \(\delta \approx 0.1207.\)
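For completeness, this is how such summaries might be read off from the samples of the sketch above (the bin count is an arbitrary choice):

    import numpy as np

    # 95% credible interval from the empirical quantiles of the chain.
    lower, upper = np.percentile(samples, [2.5, 97.5])

    # Crude MAP estimate: midpoint of the fullest histogram bin.
    counts, edges = np.histogram(samples, bins=100)
    i = np.argmax(counts)
    map_estimate = 0.5 * (edges[i] + edges[i + 1])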
Comparing this to the data and to the true dynamics from which the data was generated, we see the impressive accuracy.