Model
Define the linear regression function as \[y(\bfx;\bfw) = \bfw^T \phi(\bfx)\] with basis functions \(\phi = (\phi_0, \phi_1, \dots, \phi_d)^T\) and \(\phi_0(\bfx) := 1\).
We assume that our variables follow a linear model with Gaussian noise \(\epsilon \sim \N(0, \beta^{-1})\) such that \[y = y(\bfx; \bfw) + \epsilon\] for some unknown \(\bfw.\)
Maximum Likelihood Estimation
If we have independently observed data \(\bfX = (\bfx_1, \dots, \bfx_N)^T\), \(\bfy = (y_1, \dots, y_N)^T\), we can choose \(\bfw\) such that the probability of having observed our data is maximised.
The probability of the observed targets given the inputs and parameters (the likelihood) is \[p(\bfy|\bfX,\bfw,\beta) = \prod_{n=1}^N \N(y_n|y(\bfx_n;\bfw), \beta^{-1}).\] Maximising this quantity is equivalent to maximising its logarithm (the log-likelihood) \[\ln p(\bfy|\bfX,\bfw,\beta) = \frac{N}{2}\ln\beta - \frac{N}{2}\ln(2\pi)-\beta E_D(\bfw),\] \[E_D(\bfw) = \frac{1}{2}\sum_{n=1}^N(y_n-\bfw^T\phi(\bfx_n))^2.\] We see that the maximum likelihood estimate of \(\bfw\) can be found by minimising the sum of squared residuals.
We can solve for \(\bfw\) analytically and obtain \[\bfw_{\text{ML}} = (\bfPhi ^T\bfPhi)^{-1}\bfPhi^T \bfy\] where \begin{align} \bfPhi = (\phi(\bfx_1),\dots, \phi(\bfx_N))^T = \begin{pmatrix} 1 & \phi_1(\bfx_1) & \dots & \phi_d(\bfx_1) \\ 1 & \phi_1(\bfx_2) & \dots & \phi_d(\bfx_2) \\ \vdots & & & \vdots \\ 1 & \phi_1(\bfx_N) & \dots & \phi_d(\bfx_N) \end{pmatrix} \end{align}
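As a quick numerical sanity check, here is a minimal NumPy sketch (the polynomial basis, the sample size and all variable names are illustrative assumptions, not taken from the text) that compares the normal-equations solution with a standard least-squares solver:

```python
import numpy as np

rng = np.random.default_rng(0)

def design_matrix(x, d=3):
    """Design matrix Phi with rows (1, phi_1(x_n), ..., phi_d(x_n));
    here we assume the polynomial basis phi_j(x) = x**j."""
    return np.vander(x, N=d + 1, increasing=True)

# synthetic data from the linear-Gaussian model y = Phi w + eps
x = rng.uniform(-1, 1, size=50)
w_true = np.array([0.5, -1.0, 2.0, 0.3])
beta = 25.0                                                # noise precision
y = design_matrix(x) @ w_true + rng.normal(0, beta ** -0.5, size=x.size)

Phi = design_matrix(x)
w_ml_normal_eq = np.linalg.solve(Phi.T @ Phi, Phi.T @ y)   # (Phi^T Phi)^{-1} Phi^T y
w_ml_lstsq, *_ = np.linalg.lstsq(Phi, y, rcond=None)       # numerically preferable
print(np.allclose(w_ml_normal_eq, w_ml_lstsq))             # True
```

In practice one would rely on the least-squares solver (or a QR decomposition) rather than forming \(\bfPhi^T\bfPhi\) explicitly.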
To avoid overfitting we can minimise the sum of squared residuals plus a regularisation term \[E_D(\bfw) + \lambda E_W(\bfw), \quad E_W(\bfw) = \frac{1}{2}\bfw^T\bfw.\] Then we have \[\bfw_{\text{ML}}^\lambda = (\lambda \bfI + \bfPhi^T \bfPhi)^{-1}\bfPhi^T \bfy.\]
\begin{align} \nabla_\bfw L(\bfw) & = \nabla_\bfw \left(E_D(\bfw) + \lambda E_W(\bfw)\right) \\ &= \nabla_\bfw \frac{1}{2}((\bfy - \bfPhi \bfw)^T(\bfy - \bfPhi \bfw) + \lambda \bfw^T\bfw) \\ &= -\bfPhi^T (\bfy - \bfPhi \bfw) + \lambda \bfw \\ & = -\bfPhi^T\bfy + (\lambda\bfI + \bfPhi^T\bfPhi) \bfw \end{align} \begin{align} \nabla^2_\bfw L(\bfw) = \lambda\bfI + \bfPhi^T\bfPhi \end{align} Thus \(\bfw = (\lambda \bfI + \bfPhi^T \bfPhi)^{-1}\bfPhi^T \bfy\) is a minimum, as the Hessian is positive definite whenever \(\lambda > 0\) or \(\bfPhi\) has full column rank.
\(\Box\)
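The regularised solution is again just a linear solve; a minimal sketch, assuming NumPy and illustrative names:

```python
import numpy as np

def ridge_solution(Phi, y, lam):
    """Regularised least squares: minimises E_D(w) + lam * E_W(w) by
    solving (lam I + Phi^T Phi) w = Phi^T y."""
    d = Phi.shape[1]
    return np.linalg.solve(lam * np.eye(d) + Phi.T @ Phi, Phi.T @ y)

# for lam -> 0 this approaches the unregularised solution
rng = np.random.default_rng(1)
Phi = rng.normal(size=(30, 5))
y = rng.normal(size=30)
print(np.allclose(ridge_solution(Phi, y, 1e-12),
                  np.linalg.lstsq(Phi, y, rcond=None)[0]))
```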
Gauss-Markov Theorem.
The ML estimator \(\bfw_\text{ML}\) is the best linear unbiased estimator (BLUE), i.e. among all linear unbiased estimators it has the smallest variance. Note that for unbiased estimators minimising the mean squared error is the same as minimising the variance: \begin{align} \bfw_\text{ML} &= \underset{\hat\bfw: \Exp{\hat\bfw}=\bfw}{\operatorname{argmin}} \text{MSE}(\hat\bfw) \\ &= \underset{\hat\bfw: \Exp{\hat\bfw}=\bfw}{\operatorname{argmin}} \Exp{(\bfw - \hat \bfw)^2} \\ &= \underset{\hat\bfw: \Exp{\hat\bfw}=\bfw}{\operatorname{argmin}} (\bfw - \Exp{\hat\bfw})^2 + \Var{\hat\bfw} \\ &= \underset{\hat\bfw: \Exp{\hat\bfw}=\bfw}{\operatorname{argmin}} \Var{\hat\bfw} \end{align}
A linear estimator is of the form \[\hat \bfw = \bfC \bfy.\] Define \(\bfD = \bfC - (\bfPhi^T \bfPhi)^{-1}\bfPhi^T\). \begin{align} \Exp{\hat \bfw} &= \Exp{\bfC \bfy} = \Exp{\bfC(\bfPhi \bfw + \epsilon)} \\ &= ((\bfPhi^T \bfPhi)^{-1}\bfPhi^T + \bfD)\bfPhi \bfw \\ &= (\bfI + \bfD\bfPhi)\bfw \end{align} Unbiasedness for every \(\bfw\) therefore requires \(\bfD\bfPhi = 0.\) \begin{align} \Var{\hat\bfw} &= \Var{\bfC\bfy} = \bfC \Var{\bfy} \bfC^T = \beta^{-1}\bfC\bfC^T \\ &=\beta^{-1} ( (\bfPhi^T \bfPhi)^{-1}\bfPhi^T \bfPhi (\bfPhi^T \bfPhi)^{-1}\\ &\quad+ (\bfPhi^T \bfPhi)^{-1}\bfPhi^T\bfD^T \\ &\quad+ \bfD \bfPhi (\bfPhi^T \bfPhi)^{-1} \\ &\quad + \bfD\bfD^T) \\ &=\beta^{-1} (\bfPhi^T \bfPhi)^{-1} + \beta^{-1}\bfD\bfD^T \\ &= \Var{\bfw_\text{ML}} + \beta^{-1}\bfD\bfD^T \ge \Var{\bfw_\text{ML}} \end{align} where the cross terms vanish because \(\bfD\bfPhi = 0\) and the inequality holds in the positive semi-definite sense since \(\bfD\bfD^T \succeq 0.\)
\(\Box\)
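To illustrate the variance formula used in the proof, here is a hedged Monte-Carlo sketch (NumPy assumed; design, sample size and precision are arbitrary choices) checking that the empirical covariance of \(\bfw_\text{ML}\) over repeated noise draws matches \(\beta^{-1}(\bfPhi^T\bfPhi)^{-1}\):

```python
import numpy as np

rng = np.random.default_rng(2)
Phi = rng.normal(size=(40, 3))           # fixed design matrix
w_true = np.array([1.0, -0.5, 2.0])
beta = 10.0                              # noise precision

# refit on many datasets that share the design but have fresh noise
estimates = []
for _ in range(20_000):
    y = Phi @ w_true + rng.normal(0, beta ** -0.5, size=Phi.shape[0])
    estimates.append(np.linalg.lstsq(Phi, y, rcond=None)[0])
estimates = np.array(estimates)

empirical_cov = np.cov(estimates, rowvar=False)
theoretical_cov = np.linalg.inv(Phi.T @ Phi) / beta
print(np.max(np.abs(empirical_cov - theoretical_cov)))   # close to zero
```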
Bayesian Estimation
When using the Bayesian framework we need a prior over the parameters, \(p(\bfw)\), and use the observed data \(\D\) to update our knowledge of the parameters with the help of Bayes' rule \[p(\bfw | \D) = \frac{p(\D | \bfw)p(\bfw)}{p(\D)}.\]
We know that \[p(\D | \bfw) = \prod_{n=1}^N \N(y_n|y(\bfx_n;\bfw), \beta^{-1})\] is the exponential of a quadratic function of \(\bfw\) and therefore the conjugate prior is given by a Gaussian distribution \[p(\bfw) = \N(\bfw | \bfm_0, \bfS_0).\] The posterior is then \[p(\bfw|\D) = \N(\bfw|\bfm, \bfS)\] where \begin{align} \bfm &= \bfS(\bfS_0^{-1}\bfm_0 + \beta \bfPhi^T \bfy) \\ \bfS^{-1} &= \bfS_0^{-1} + \beta \bfPhi^T \bfPhi. \end{align}
\begin{align} &-2\ln\left(p(\D | \bfw) p(\bfw)\right) \\ &\quad= \beta (\bfy - \bfPhi\bfw)^T(\bfy - \bfPhi\bfw) +(\bfw- \bfm_0)^T\bfS_0^{-1}(\bfw - \bfm_0) + \text{const} \\ &\quad= \beta\bfy^T\bfy - 2\beta\bfy^T\bfPhi\bfw + \beta\bfw^T\bfPhi^T\bfPhi \bfw \\ &\quad\quad\, + \bfw^T\bfS_0^{-1}\bfw - 2\bfw^T\bfS_0^{-1}\bfm_0 + \bfm_0^T\bfS_0^{-1}\bfm_0 + \text{c.} \\ &\quad= \bfw^T \bfS^{-1}\bfw - 2\bfw^T \bfS^{-1} \bfS (\beta \bfPhi^T\bfy + \bfS_0^{-1}\bfm_0) + \text{c.} \\ &\quad= \bfw^T \bfS^{-1}\bfw - 2\bfw^T \bfS^{-1} \bfm + \bfm^T \bfS^{-1} \bfm - \bfm^T \bfS^{-1} \bfm + \text{c.} \\ &\quad= (\bfw - \bfm)^T \bfS^{-1}(\bfw - \bfm) + \text{const} \end{align}
\(\Box\)
If we restrict the prior to a zero-mean isotropic Gaussian, \(p(\bfw|\alpha) = \N(\bfw| \mathbf{0}, \alpha^{-1}\bfI)\), we have \begin{align} \bfm &= \beta\bfS \bfPhi^T \bfy\\ \bfS^{-1} &= \alpha \bfI + \beta \bfPhi^T \bfPhi. \end{align} In this case we have \[\ln p(\bfw|\D) = -\frac{\beta}{2}\sum_{n=1}^N(y_n-\bfw^T\phi(\bfx_n))^2 - \frac{\alpha}{2}\bfw^T\bfw + \text{const}. \] Therefore, maximising the posterior is equivalent to the maximum likelihood estimation with regularisation \(\lambda = \frac{\alpha}{\beta}\) \[\bfw_{\text{MAP}}^\alpha = \bfw_\text{ML}^\lambda.\] Letting \(\alpha \to 0\) (making the prior broader) is equivalent to the maximum likelihood estimation without regularisation, \[\lim_{\alpha \to 0}\bfw_{\text{MAP}}^\alpha = \bfw_\text{ML}.\]
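A minimal sketch of this posterior update (NumPy assumed; the function name and the test values are illustrative):

```python
import numpy as np

def posterior(Phi, y, alpha, beta):
    """Posterior N(w | m, S) for the isotropic prior N(w | 0, alpha^{-1} I)
    and Gaussian noise with precision beta."""
    d = Phi.shape[1]
    S = np.linalg.inv(alpha * np.eye(d) + beta * Phi.T @ Phi)
    m = beta * S @ Phi.T @ y
    return m, S

# as alpha -> 0 the posterior mean approaches the ML estimate
rng = np.random.default_rng(3)
Phi = rng.normal(size=(50, 4))
y = rng.normal(size=50)
m, S = posterior(Phi, y, alpha=1e-12, beta=1.0)
print(np.allclose(m, np.linalg.lstsq(Phi, y, rcond=None)[0]))   # True
```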
Predictive Distribution
In the Bayesian framework we are usually not interested in a point estimate of \(\bfw\) but want to make predictions directly by averaging over the parameters. The predictive distribution at \(\bfx\) is given by \[p(y|\bfx, \D) = \int p(y|\bfx, \bfw)p(\bfw|\D) \d\bfw.\] Recall that \[p(y|\bfx, \bfw) = \N(y|y(\bfx;\bfw), \beta^{-1}), \quad p(\bfw|\D) = \N(\bfw|\bfm, \bfS).\] Thus, \[p(y|\bfx, \D) = \N(y|y(\bfx; \bfm), \sigma^2(\bfx))\] where \begin{align} y(\bfx; \bfm) = \bfm^T \phi(\bfx), \quad \sigma^2(\bfx) = \frac{1}{\beta} + \phi(\bfx)^T \bfS \phi(\bfx). \end{align}
Claim.
For \(p(\bfx)=\N(\bfx | \bfm, \bfS)\) and \(p(\bfy|\bfx) = \N(\bfy | \bfA \bfx + \bfb, \bfL)\) we have \begin{align} p(\bfy) = \N(\bfy | \bfA\bfm + \bfb, \bfL + \bfA \bfS \bfA^T). \end{align}
Proof.
Set \(\bfz = (\bfx, \bfy)^T\). We use the following decomposition \begin{align} -\frac{1}{2}(\bfz - \mu)^T\Sigma^{-1}(\bfz - \mu) = -\frac{1}{2} \bfz^T\Sigma^{-1} \bfz + \bfz^T \Sigma^{-1} \mu + \text{const.} \end{align}
We look at the quadratic terms. \begin{align} \ln p(\bfz) &= \ln p(\bfx) + \ln p(\bfy|\bfx) \\ &= -\frac{1}{2}(\bfx - \bfm)^T \bfS^{-1}(\bfx - \bfm)\\ &\quad\,-\frac{1}{2}(\bfy - \bfA\bfx - \bfb)^T \bfL^{-1}(\bfy - \bfA\bfx - \bfb) + \text{const} \\ &= -\frac{1}{2}\bfx^T(\bfS^{-1} + \bfA^T \bfL^{-1} \bfA)\bfx -\frac{1}{2} \bfy^T \bfL^{-1}\bfy\\ &\quad\, + \frac{1}{2}\bfy^T\bfL^{-1}\bfA \bfx + \frac{1}{2} \bfx^T\bfA^T\bfL^{-1}\bfy + \dots \\ &= -\frac{1}{2} \begin{pmatrix} \bfx \\ \bfy \end{pmatrix}^T \begin{pmatrix} \bfS^{-1} + \bfA^T \bfL^{-1} \bfA & -\bfA^T\bfL^{-1} \\ -\bfL^{-1}\bfA & \bfL^{-1} \end{pmatrix} \begin{pmatrix} \bfx \\ \bfy \end{pmatrix} + \dots \end{align} Inverting this matrix gives \begin{align} \operatorname{cov}[\bfz] = \begin{pmatrix} \bfS & \bfS \bfA^T \\ \bfA \bfS & \bfL + \bfA \bfS \bfA^T \end{pmatrix}. \end{align} The linear terms are \begin{align} \bfx^T \bfS^{-1} \bfm - \bfx^T \bfA^T \bfL^{-1}\bfb + \bfy^T\bfL^{-1} \bfb = \begin{pmatrix} \bfx \\ \bfy \end{pmatrix}^T \begin{pmatrix} \bfS^{-1}\bfm - \bfA^T \bfL^{-1} \bfb\\ \bfL^{-1}\bfb \end{pmatrix}. \end{align} By the decomposition above this coefficient vector equals \(\Sigma^{-1}\mu\), so multiplying by \(\operatorname{cov}[\bfz]\) gives \begin{align} \Exp{\bfz} = \begin{pmatrix} \bfm \\ \bfA \bfm + \bfb \end{pmatrix}. \end{align}
\(\Box\)
To obtain the predictive distribution, apply the claim with \(\bfx \to \bfw\), \(\bfy \to y\), \(\bfA \to \phi(\bfx)^T\), \(\bfb \to 0\) and \(\bfL \to \beta^{-1}\), where \(\bfm\) and \(\bfS\) are the posterior mean and covariance.
\(\Box\)
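Putting the pieces together, a sketch of the predictive mean and variance at a new input (NumPy assumed; the posterior is computed as in the previous sketch and all concrete numbers are placeholders):

```python
import numpy as np

def predictive(phi_x, m, S, beta):
    """Predictive mean and variance at a single feature vector phi(x),
    given the posterior N(w | m, S) and noise precision beta."""
    mean = m @ phi_x
    var = 1.0 / beta + phi_x @ S @ phi_x
    return mean, var

rng = np.random.default_rng(4)
Phi = rng.normal(size=(50, 4))
y = Phi @ np.array([1.0, 0.0, -2.0, 0.5]) + rng.normal(0, 0.1, size=50)
alpha, beta = 1e-2, 100.0
S = np.linalg.inv(alpha * np.eye(4) + beta * Phi.T @ Phi)
m = beta * S @ Phi.T @ y
print(predictive(rng.normal(size=4), m, S, beta))
```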
Unknown Variance
If \(\beta\) is treated as unknown, the conjugate prior is a Gaussian-gamma distribution and the predictive distribution is a Student's t-distribution.
\begin{align} p(\bfw, \beta) &= \N(\bfw|\bfm_0, \beta^{-1}\bfS_0)\Gamma(\beta|a_0, b_0), \\ p(y|\bfx, \bfw, \beta) &= \N(y|y(\bfx;\bfw), \beta^{-1}). \end{align} Then \begin{align} p(\bfw, \beta|\D) &= \N(\bfw|\bfm, \beta^{-1}\bfS)\Gamma(\beta|a, b), \end{align} where \begin{align} \bfm &= \bfS(\bfS_0^{-1}\bfm_0 + \bfPhi^T \bfy), \\ \bfS^{-1} &= \bfS_0^{-1} + \bfPhi^T \bfPhi, \\ a &= a_0 + \frac{N}{2}, \\ b &= b_0 + \frac{1}{2}(\bfy^T\bfy + \bfm_0^T \bfS_0^{-1}\bfm_0 - \bfm^T \bfS^{-1} \bfm). \end{align} \begin{align} p(y|\bfx, \D) &= \int \int p(y|\bfx, \bfw, \beta)p(\bfw, \beta|\D) \d\bfw \d\beta \\ &= \text{St}\left(\nu = 2a, \lambda = \frac{a}{b}\left(1 + \phi(\bfx)^T\bfS \phi(\bfx)\right)^{-1}, \mu = \phi(\bfx)^T \bfm\right). \end{align}
The argument of \(\exp\) in \(p(\D | \bfw, \beta) p(\bfw, \beta)\) is \[-\frac{\beta}{2}(\bfy - \bfPhi \bfw)^T(\bfy - \bfPhi \bfw) -\frac{\beta}{2}(\bfw - \bfm_0)^T\bfS_0^{-1}(\bfw - \bfm_0) - b_0 \beta.\] The prefactor is \[\frac{\beta^{p/2}}{(2\pi)^{p/2}|\bfS_0|^{1/2}}\left(\frac{\beta}{2\pi}\right)^{N/2}\frac{b_0^{a_0}}{\Gamma(a_0)} \beta^{a_0-1},\] where \(p\) denotes the dimension of \(\bfw\). Completing the square in \(\bfw\) as above we have \begin{align} \ln p(\D | \bfw, \beta) p(\bfw, \beta) &= -\frac{\beta}{2}(\bfw - \bfm)^T \bfS^{-1} (\bfw - \bfm) \\ &\quad- \beta \underbrace{\frac{1}{2}(\bfy^T\bfy + \bfm_0^T \bfS_0^{-1}\bfm_0 - \bfm^T \bfS^{-1} \bfm)}_{=:\Delta}- b_0 \beta + \text{c.} \end{align} for \begin{align} \bfm &= \bfS(\bfS_0^{-1}\bfm_0 + \bfPhi^T \bfy), \\ \bfS^{-1} &= \bfS_0^{-1} + \bfPhi^T \bfPhi. \end{align} Since \(\beta^{p/2}\exp\left(-\frac{\beta}{2}(\bfw - \bfm)^T \bfS^{-1} (\bfw - \bfm)\right) \propto \N(\bfw|\bfm, \beta^{-1}\bfS)\), the remaining dependence on \(\beta\) is \begin{align} \beta^{a_0 + N/2 - 1}\exp\left(- \beta(b_0 + \Delta) \right) \propto \Gamma(\beta|a, b) \end{align} which gives \begin{align} a &= a_0 + \frac{N}{2}, \\ b &= b_0 + \frac{1}{2}(\bfy^T\bfy + \bfm_0^T \bfS_0^{-1}\bfm_0 - \bfm^T \bfS^{-1} \bfm). \end{align}
Furthermore, \begin{align} p(y|\bfx, \D) &= \int \Gamma(\beta|a, b) \int \N(y|y(\bfx;\bfw), \beta^{-1})\N(\bfw|\bfm, \beta^{-1}\bfS) \d\bfw \d\beta \\ &= \int \Gamma(\beta|a, b) \N(y|\phi(\bfx)^T \bfm, \beta^{-1}(1 + \phi(\bfx)^T\bfS \phi(\bfx)))\d\beta \\ &=: \int \Gamma(\beta|a, b) \N(y|\mu, \beta^{-1}\delta^{-1})\d\beta \\ &= \int_0^\infty \frac{b^a \beta^{a-1}}{\Gamma(a)} \left(\frac{\beta\delta}{2\pi}\right)^{1/2}\exp\left(-\beta \frac{\delta(y-\mu)^2}{2} - \beta b\right) \d \beta \end{align} Substituting \(z = \beta \left(b + \frac{\delta(y-\mu)^2}{2}\right) =: \beta \kappa\) gives \begin{align} &= \int_0^\infty \frac{b^a z^{a-1/2}}{\Gamma(a)\kappa^{a-1/2}} \left(\frac{\delta}{2\pi}\right)^{1/2}\exp\left(-z\right) \frac{1}{\kappa} \d z \\ &= \frac{\Gamma(a+1/2)}{\Gamma(a)} b^a \left(\frac{\delta}{2\pi}\right)^{1/2} \kappa^{-a-1/2} \\ &= \frac{\Gamma(a+1/2)}{\Gamma(a)} \left(\frac{\delta}{2\pi b}\right)^{1/2} \left(1 + \frac{\delta(y-\mu)^2}{2b}\right)^{-a-1/2} \\ &= \frac{\Gamma(\nu/2+1/2)}{\Gamma(\nu/2)} \left(\frac{\lambda}{\nu\pi }\right)^{1/2} \left(1 + \frac{\lambda(y-\mu)^2}{\nu}\right)^{-\nu/2-1/2} \end{align} for \(\nu = 2a\), \(\lambda = \frac{\delta a}{b} = \frac{a}{b}\left(1 + \phi(\bfx)^T\bfS \phi(\bfx)\right)^{-1}\) and \(\mu = \phi(\bfx)^T \bfm\).
\(\Box\)
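A sketch of the update with unknown noise precision (NumPy assumed; the prior hyperparameters in the example are arbitrary placeholders), returning the Student-t parameters \(\nu\), \(\mu\) and \(\lambda\) derived above:

```python
import numpy as np

def student_t_predictive(Phi, y, phi_x, m0, S0, a0, b0):
    """Posterior for the Gaussian-gamma prior and the resulting
    Student-t predictive parameters (nu, mu, lam) at phi(x)."""
    S0_inv = np.linalg.inv(S0)
    S_inv = S0_inv + Phi.T @ Phi
    S = np.linalg.inv(S_inv)
    m = S @ (S0_inv @ m0 + Phi.T @ y)
    a = a0 + len(y) / 2
    b = b0 + 0.5 * (y @ y + m0 @ S0_inv @ m0 - m @ S_inv @ m)
    nu = 2 * a
    mu = phi_x @ m
    lam = (a / b) / (1.0 + phi_x @ S @ phi_x)   # predictive precision
    return nu, mu, lam

# illustrative usage with a weakly informative prior
rng = np.random.default_rng(5)
Phi = rng.normal(size=(30, 3))
y = Phi @ np.array([1.0, -1.0, 0.5]) + rng.normal(0, 0.2, size=30)
print(student_t_predictive(Phi, y, rng.normal(size=3),
                           m0=np.zeros(3), S0=10 * np.eye(3), a0=1.0, b0=1.0))
```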
Curve Fitting
Consider the model \[y = \sin(2 \pi x) + \epsilon\] where \(\epsilon \sim \N(0, 1/5)\).
We choose Gaussian basis functions centred at \(a_j\), \[\phi_j(x) = \exp\left(-(x-a_j)^2\right), \quad a_j \in \{0, \pm 1, \pm 2, \pm3\}.\]
We have observed the following data.
The maximum likelihood estimate looks as follows.
The Bayesian estimate with \(\alpha = 10^{-6}\) looks as follows.
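For reference, a hedged sketch of the experiment (NumPy assumed; the basis width, the input range and the number of observations are assumptions not fixed by the text):

```python
import numpy as np

rng = np.random.default_rng(6)
centres = np.array([-3, -2, -1, 0, 1, 2, 3], dtype=float)

def features(x):
    """Feature map (1, phi_1(x), ..., phi_7(x)) with Gaussian bumps at the
    centres; the unit width is an assumption."""
    x = np.atleast_1d(x)
    return np.hstack([np.ones((x.size, 1)),
                      np.exp(-(x[:, None] - centres[None, :]) ** 2)])

# observed data from y = sin(2 pi x) + eps, eps ~ N(0, 1/5)
N, beta = 25, 5.0                        # sample size assumed, beta = 1/(1/5)
x = rng.uniform(-3, 3, size=N)
y = np.sin(2 * np.pi * x) + rng.normal(0, beta ** -0.5, size=N)
Phi = features(x)

# maximum likelihood fit
w_ml = np.linalg.lstsq(Phi, y, rcond=None)[0]

# Bayesian fit with the isotropic prior, alpha = 1e-6
alpha = 1e-6
S = np.linalg.inv(alpha * np.eye(Phi.shape[1]) + beta * Phi.T @ Phi)
m = beta * S @ Phi.T @ y

# predictive mean and standard deviation on a grid (for plotting)
x_grid = np.linspace(-3, 3, 200)
F = features(x_grid)
pred_ml = F @ w_ml
pred_mean = F @ m
pred_std = np.sqrt(1 / beta + np.einsum('nd,de,ne->n', F, S, F))
```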
Bias-Variance Trade-Off
Define the squared loss as \(L(\bfx, y) =(y(\bfx) - y)^2\) where \(y\) is the observed value at \(\bfx.\)
The expected loss is then \[\Exp{L} = \int \int L(\bfx, y) p(\bfx, y) \d \bfx \d y.\]
We can write \begin{align} L(\bfx, y) &= (y(\bfx) - \Exp{y|\bfx} + \Exp{y|\bfx} - y)^2 \\ &= (y(\bfx) - \Exp{y|\bfx})^2 + (\Exp{y|\bfx} - y)^2 \\ &\quad+ 2(y(\bfx)-\Exp{y|\bfx})(\Exp{y|\bfx} - y). \end{align} The cross term vanishes when taking the expectation over \(y\) given \(\bfx\) (as \(\Exp{y|\bfx}\) is the conditional mean), and thus \[\Exp{L} = \int (y(\bfx) - \Exp{y|\bfx} )^2 p(\bfx) \d \bfx + \int \int (\Exp{y|\bfx} - y)^2 p(\bfx, y) \d\bfx \d y.\] Therefore, the loss is smallest if \(y(\bfx) = \Exp{y|\bfx}.\)
Let us make the dependence of our linear regression on the data \(\D\) explicit, \[y(\bfx; \D) = y(\bfx; \bfw) = y(\bfx).\] We can write \begin{align} (y(\bfx;\D) - \Exp{y|\bfx})^2 &= (y(\bfx;\D) - \Exp[\D]{y(\bfx; \D)} + \Exp[\D]{y(\bfx; \D)} - \Exp{y|\bfx})^2 \\ &= (y(\bfx; \D) - \Exp[\D]{y(\bfx; \D)})^2 + (\Exp[\D]{y(\bfx; \D)} - \Exp{y|\bfx})^2 \\ &\quad + 2(y(\bfx; \D) - \Exp[\D]{y(\bfx; \D)})(\Exp[\D]{y(\bfx; \D)} - \Exp{y|\bfx}) \end{align} and, since the cross term vanishes in expectation over \(\D\), obtain \begin{align} &\Exp[\D]{(y(\bfx;\D) - \Exp{y|\bfx})^2} = \\ &\quad (\Exp[\D]{y(\bfx; \D)} - \Exp{y|\bfx})^2 + \Exp[\D]{(y(\bfx; \D) - \Exp[\D]{y(\bfx; \D)})^2} \end{align}
We conclude \[\text{expected loss} = \text{bias}^2 + \text{variance} + \text{noise}\] where \begin{align} \text{bias}^2 &= \int (\Exp[\D]{y(\bfx; \D)} - \Exp{y|\bfx})^2 p(\bfx) \d \bfx, \\ \text{variance} &= \int \Exp[\D]{(y(\bfx; \D) - \Exp[\D]{y(\bfx; \D)})^2} p(\bfx) \d \bfx, \\ \text{noise} &= \int \int (y - \Exp{y|\bfx})^2 p(\bfx, y) \d \bfx \d y. \end{align}
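The decomposition can be checked empirically. Below is a hedged Monte-Carlo sketch (NumPy assumed; it reuses the curve fitting setup with an assumed ridge penalty) that estimates the bias\(^2\) and variance terms of a regularised fit by refitting on many independent datasets:

```python
import numpy as np

rng = np.random.default_rng(7)
centres = np.arange(-3, 4, dtype=float)

def features(x):
    x = np.atleast_1d(x)
    return np.hstack([np.ones((x.size, 1)),
                      np.exp(-(x[:, None] - centres[None, :]) ** 2)])

def fit(x, y, lam=1e-3):
    """Regularised least-squares fit; returns the weight vector."""
    Phi = features(x)
    return np.linalg.solve(lam * np.eye(Phi.shape[1]) + Phi.T @ Phi, Phi.T @ y)

beta = 5.0
x_test = np.linspace(-3, 3, 100)
f_true = np.sin(2 * np.pi * x_test)          # E[y|x] for this model

# predictions of the fitted model over many independent datasets
preds = []
for _ in range(500):
    x = rng.uniform(-3, 3, size=25)
    y = np.sin(2 * np.pi * x) + rng.normal(0, beta ** -0.5, size=x.size)
    preds.append(features(x_test) @ fit(x, y))
preds = np.array(preds)                      # shape: (datasets, test points)

bias_sq = np.mean((preds.mean(axis=0) - f_true) ** 2)
variance = np.mean(preds.var(axis=0))
noise = 1 / beta
print(bias_sq, variance, bias_sq + variance + noise)
```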