Model
A generalized linear model (GLM) consists of an exponential family of distributions specified (in its canonical form) by parameters \(\bfeta\), functions \(h, g, \bfu\) and a dispersion parameter \(s\) such that \[p(\bfy | \bfeta, s) = h(\bfy, s)\exp\left(\frac{1}{s}\left(\bfeta^T \bfu(\bfy)- g(\bfeta)\right)\right);\] a linear predictor \[\bfa = \bfW^T \bfx \in \R^{p};\] and a link function \(f^{-1}\) such that \[\Exp{\bfy | \bfx} = f(\bfa) = f(\bfW^T \bfx) \in \R^p.\]
The variable \(\bfy \in \R^p\) is to be predicted from \(\bfx \in \R^d\): we assume a distribution on the target variable (an exponential family) and link its mean to the linear predictor \(\bfW^T\bfx\) through the function \(f^{-1}\). The function \(f\) is called the activation function. Note that \(\bfeta\) can be thought of as a function of the input \(\bfx\), \(\bfeta = \bfeta(\bfx).\)
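To make this structure concrete, here is a minimal sketch (not from the text above) of the GLM forward pass \(\hat\bfy = f(\bfW^T\bfx)\) with a pluggable activation; the names `glm_predict` and `activation` are illustrative.

```python
import numpy as np

def glm_predict(W, x, activation):
    """GLM forward pass: linear predictor a = W^T x, prediction y_hat = f(a)."""
    a = W.T @ x           # a in R^p
    return activation(a)  # E[y | x] = f(a)

# Example: d = 3 inputs, p = 2 outputs, identity activation (linear regression).
rng = np.random.default_rng(0)
W = rng.normal(size=(3, 2))
x = rng.normal(size=3)
y_hat = glm_predict(W, x, activation=lambda a: a)
```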
Deriving the Derivatives
From now on we assume that \(\bfu = \operatorname{id}\). In this case \(\bfeta\) is called the canonical parameter. For all exponential families it holds that \[\hat \bfy := f(\bfW^T \bfx) = \Exp{\bfy|\bfeta} = \Exp{\bfu(\bfy)|\bfeta} = \nabla g(\bfeta) .\] Therefore, there must exist a functional relation \[\bfeta = \psi(\hat\bfy).\] Now we can write the log likelihood as \begin{align} \ln p (\bfy | \bfeta, s) &= \frac{1}{s} \left(\bfeta^T \bfy - g(\bfeta)\right) + \ln h(\bfy,s)\\ &= \frac{1}{s} \left(\psi(f(\bfa))^T \bfy - g(\psi(f(\bfa))) \right) + \ln h(\bfy,s). \end{align} It would be convenient to choose the link function such that \(\psi \circ f = \operatorname{id}\) (the canonical link), which gives \[\ln p (\bfy | \bfeta, s) =\frac{1}{s} \left(\bfa^T \bfy - g(\bfa) \right) + \ln h(\bfy,s).\] We then get \[\bfeta = \psi(\hat\bfy) = \psi(f(\bfa)) = \bfa = \bfW^T \bfx, \quad f^{-1} = \psi.\] Differentiating with respect to \(\bfW\) gives \begin{align} \nabla_\bfW (-\ln p (\bfy | \bfeta, s))^T &= \frac{1}{s}\nabla g(\bfa)^T \d \bfa(\bfW) - \frac{1}{s} \bfy^T \d \bfa(\bfW) \\ &= \frac{1}{s}\hat\bfy^T \d \bfa(\bfW) - \frac{1}{s} \bfy^T \d \bfa(\bfW) \\ &= \frac{1}{s}(\hat\bfy - \bfy)^T \d \bfa(\bfW) . \end{align}
The gradient can be written in matrix form as \[ \nabla_\bfW (-\ln p (\bfy | \bfeta, s)) = \frac{1}{s}\bfx(\hat\bfy - \bfy)^T \in \R^{d \times p}.\]
As \(\bfW = (\bfw_1, \dots, \bfw_p) \in \R^{d \times p} \cong \R^{dp}\) we have \(\d \bfa(\bfW) \in \R^{p \times (dp)}.\) But as the \(j\)-th entry of \(\bfa\) depends only on the \(j\)-th column \(\bfw_j\) of \(\bfW\), we can write \begin{align} \d \bfa(\bfW) = \begin{pmatrix} (\nabla_{\bfw_1} a_1)^T & 0 & \cdots & 0\\ 0 & (\nabla_{\bfw_2} a_2)^T & \cdots & 0\\ \vdots &\vdots & \ddots & \vdots \\ 0 & 0 & \cdots & (\nabla_{\bfw_p} a_p)^T \end{pmatrix} \end{align} and as all \(\nabla_{\bfw_j} a_j = \bfx\), the resulting gradient can be written as \(\frac{1}{s}\bfx(\hat\bfy - \bfy)^T.\)
\(\Box\)
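As a quick sanity check (a sketch, not part of the derivation), the following compares this closed-form gradient against central finite differences for the Bernoulli/sigmoid model treated below, with \(s = 1\) and \(p = 1\).

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def neg_log_lik(w, x, y):
    """Negative log likelihood of a Bernoulli GLM with the canonical (sigmoid) link."""
    y_hat = sigmoid(w @ x)
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

rng = np.random.default_rng(1)
x = rng.normal(size=4)
w = rng.normal(size=4)
y = 1.0

# Closed-form gradient from the derivation (s = 1, p = 1): (y_hat - y) x.
grad = (sigmoid(w @ x) - y) * x

# Central finite differences agree with the closed form.
eps = 1e-6
grad_fd = np.array([(neg_log_lik(w + eps * e, x, y) - neg_log_lik(w - eps * e, x, y)) / (2 * eps)
                    for e in np.eye(4)])
assert np.allclose(grad, grad_fd, atol=1e-6)
```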
The Hessian can be computed in terms of the following blocks: \[\nabla_{\bfw_i}\nabla_{\bfw_j} (-\ln p (\bfy | \bfeta, s)) = \frac{1}{s}\frac{\partial f_j}{\partial a_i} (\bfa) \bfx \bfx^T.\]
We have \begin{align} \nabla_{\bfW}(\hat y_j - y_j)^T = \d f_j(\bfa) \d\bfa (\bfW) = (\nabla f_j(\bfa)_1\bfx^T, \dots, \nabla f_j(\bfa)_p \bfx^T) \end{align} and thus, restricting to the block belonging to \(\bfw_i\), \begin{align} \nabla_{\bfw_i}(\hat y_j - y_j) = \nabla f_j(\bfa)_i\, \bfx = \frac{\partial f_j}{\partial a_i}(\bfa)\, \bfx, \end{align} which results in \begin{align} \nabla_{\bfw_i}\nabla_{\bfw_j} (-\ln p (\bfy | \bfeta, s)) = \frac{1}{s}\nabla_{\bfw_i}(\hat y_j - y_j)\, \bfx^T = \frac{1}{s}\frac{\partial f_j}{\partial a_i}(\bfa) \bfx \bfx^T. \end{align}
\(\Box\)
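As with the gradient, here is a small numerical check of the block formula (a sketch, not part of the derivation), using the softmax activation from the categorical example below, for which \(\partial f_j/\partial a_i = \hat y_j(\delta_{ij} - \hat y_i)\) and \(s = 1\).

```python
import numpy as np

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

def grad_wj(W, x, y, j):
    """Gradient of -log p with respect to column w_j: (y_hat_j - y_j) x."""
    y_hat = softmax(W.T @ x)
    return (y_hat[j] - y[j]) * x

d, p = 3, 4
rng = np.random.default_rng(2)
W = rng.normal(size=(d, p))
x = rng.normal(size=d)
y = np.eye(p)[1]                       # one-hot target
i, j = 0, 2
y_hat = softmax(W.T @ x)

# Closed-form block: (df_j/da_i) x x^T with df_j/da_i = y_hat_j (delta_ij - y_hat_i).
H_block = y_hat[j] * ((i == j) - y_hat[i]) * np.outer(x, x)

# Finite differences: perturb the entries of w_i and difference the gradient block for w_j.
eps = 1e-6
H_fd = np.zeros((d, d))
for k in range(d):
    Wp, Wm = W.copy(), W.copy()
    Wp[k, i] += eps
    Wm[k, i] -= eps
    H_fd[k] = (grad_wj(Wp, x, y, j) - grad_wj(Wm, x, y, j)) / (2 * eps)
assert np.allclose(H_block, H_fd, atol=1e-5)
```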
Now we can use the same Newton-Raphson optimisation algorithm for several distributions, \begin{align} \overline\bfW_\text{new} = \overline\bfW_\text{old} - \bfH^{-1}\nabla_{\overline\bfW} (-\ln p(\bfY | \bfX)), \end{align} where \(\bfH\) is the Hessian of the negative log likelihood and the matrix \(\bfW\) is in its vectorised form \(\overline \bfW \in \R^{dp}.\)
For independent observations \(\bfX = (\bfx_1, \dots, \bfx_N)^T \in \R^{N \times d}\), \(\bfY = (\bfy_1, \dots, \bfy_N)^T \in \R^{N \times p}\), where \((\bfx_n)_0 = 1\) if we use an intercept, the likelihood factorises and thus the batch gradient and Hessian are given by \begin{align} -\nabla_{\bfW}\ln p(\bfY | \bfX) &=\frac{1}{s}\sum_{n=1}^N \bfx_n(\hat\bfy_n - \bfy_n)^T \in \R^{d \times p} \\ &= \frac{1}{s}\bfX^T (\hat\bfY - \bfY), \\ -\nabla_{\bfw_i}\nabla_{\bfw_j} \ln p (\bfY | \bfX) &= \frac{1}{s}\sum_{n=1}^N \frac{\partial f_j}{\partial a_i}(\bfa_n) \bfx_n \bfx_n^T\\ &= \bfX^T \bfR \bfX, \\& \quad \text{where} \quad \bfR = \frac{1}{s}\operatorname{diag}\left(\frac{\partial f_j}{\partial a_i}(\bfa_n),\ n=1,\dots,N\right) \end{align} depends on the block indices \(i, j\).
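Putting the pieces together, the sketch below (illustrative; the names `newton_glm`, `f`, `fprime` are assumptions, not from the text) implements this batch Newton update for a single-output GLM (\(p = 1\)) with a canonical link, taking the activation \(f\) and its derivative as arguments.

```python
import numpy as np

def newton_glm(X, y, f, fprime, s=1.0, n_iter=20):
    """Newton-Raphson for a single-output GLM (p = 1) with canonical link.

    Uses the batch gradient (1/s) X^T (y_hat - y) and Hessian X^T R X
    with R = (1/s) diag(f'(a_n)).
    """
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        a = X @ w                                  # linear predictors a_n
        y_hat = f(a)                               # predictions f(a_n)
        grad = X.T @ (y_hat - y) / s               # batch gradient of -log likelihood
        H = X.T @ ((fprime(a) / s)[:, None] * X)   # X^T R X without forming diag(R)
        w = w - np.linalg.solve(H, grad)
    return w
```

The Gaussian, Bernoulli and Poisson examples below correspond to the choices \(f = \operatorname{id}\), \(f = \sigma\) and \(f = \exp\), respectively.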
Examples.
Univariate Gaussian.
Here we have \[p(y|\mu, \sigma^2) = \frac{1}{\sqrt{2\pi \sigma^2}} \exp\left(-\frac{(y-\mu)^2}{2\sigma^2} \right).\] This can be rewritten with \begin{align} \eta &= \mu, \\ s &= \sigma^2, \\ g(\eta) &= \frac{\eta^2}{2}, \\ h(y, s) &= \frac{1}{\sqrt{2\pi s}}\exp\left(-\frac{y^2}{2s}\right). \end{align} We get \(\hat y = \eta\) and therefore \(\psi = \operatorname{id}\) and \(f = \operatorname{id}.\) We obtained univariate linear regression \(\hat y (\bfx) = \bfw^T\bfx\)! The associated loss function is the mean squared error, since \[\ln p(\bfy | \bfX, \sigma^2, \bfw) = -\frac{1}{2\sigma^2} \sum_{n=1}^N \left(y_n - \hat y(\bfx_n)\right)^2 + \text{const.}\]
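For this identity-link case, a single Newton step from \(\bfw = 0\) already lands on the least-squares solution \((\bfX^T\bfX)^{-1}\bfX^T\bfy\), because the gradient and Hessian of the quadratic loss are exact. A small self-contained check (data and names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
X = np.hstack([np.ones((100, 1)), rng.normal(size=(100, 2))])   # intercept column
w_true = np.array([1.0, 2.0, -1.0])
y = X @ w_true + 0.3 * rng.normal(size=100)

# One Newton step from w = 0: gradient X^T (Xw - y)/s, Hessian X^T X / s (s cancels).
w0 = np.zeros(3)
w = w0 - np.linalg.solve(X.T @ X, X.T @ (X @ w0 - y))

# This is exactly the ordinary least-squares solution.
w_ols = np.linalg.lstsq(X, y, rcond=None)[0]
assert np.allclose(w, w_ols)
```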
Isotropic Multivariate Gaussian.
\[p(\bfy | \bfmu, \sigma^2) = \frac{1}{(2\pi \sigma^2)^{p/2}} \exp\left(-\frac{1}{2\sigma^2}(\bfy - \bfmu)^T(\bfy - \bfmu)\right). \] It is easy to see that \begin{align} \bfeta &= \bfmu, \\ s &= \sigma^2, \\ g(\bfeta) &= \frac{1}{2}\bfeta^T\bfeta, \\ h(\bfy, s) &= \frac{1}{(2\pi s)^{p/2}} \exp \left(-\frac{1}{2s}\bfy^T\bfy\right). \end{align} Again we have \(\hat \bfy = \bfeta\) and obtain multivariate linear regression \(\hat\bfy(\bfx) = \bfW^T\bfx\) with mean squared loss, since \[\ln p(\bfY | \bfX, \sigma^2, \bfW) = -\frac{1}{2\sigma^2} \sum_{n=1}^N ||\bfy_n - \hat \bfy(\bfx_n)||^2_2 + \text{const.}\]
Bernoulli.
\[p(y|\mu) = \mu^y (1-\mu)^{1-y}, \quad y \in \{0,1\}.\] Writing the mass function in the form \[p(y|\mu) = \exp\left(y\ln \mu + (1-y)\ln(1-\mu)\right) = \exp\left(\ln\left(\frac{\mu}{1-\mu}\right)y +\ln(1-\mu)\right)\] we identify \begin{align} \eta &= \ln\left(\frac{\mu}{1-\mu}\right) \quad \text{(logit function)}, \\ s &= 1, \\ f(\eta) &= \frac{1}{1 + \exp(-\eta)} \quad \text{(sigmoid function $\sigma$)},\\ g(\eta) &= \ln(1 + \exp(\eta)),\\ h(y, s) &= 1. \end{align} We obtained two-class logistic regression \(\hat y(\bfx) = \sigma(\bfw^T\bfx)\)! The log likelihood is (up to sign) the cross-entropy loss, \[\ln p(\bfy | \bfX, \bfw) = \sum_{n=1}^N \left(y_n \ln \hat y_n + (1-y_n)\ln(1-\hat y_n)\right).\] Lastly, \[f'(a) = \frac{\exp(-a)}{(1+\exp(-a))^2} = \frac{1}{1+\exp(-a)}\left(1-\frac{1}{1+\exp(-a)} \right) = \hat y (1 - \hat y). \] The same link and activation function work for the Binomial distribution, but with \(g(\eta) = n \ln(1+\exp(\eta))\) and \(h(y) = \binom{n}{y}.\)
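A minimal self-contained sketch of this model trained with the Newton update from above, on illustrative toy data (the data and variable names are assumptions, not from the text); the diagonal of \(\bfR\) is exactly \(\hat y_n(1-\hat y_n)\).

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# Toy data: two overlapping Gaussian blobs, labels 0/1, intercept column prepended.
rng = np.random.default_rng(4)
X = np.vstack([rng.normal(-1.0, 1.0, size=(50, 2)), rng.normal(1.0, 1.0, size=(50, 2))])
X = np.hstack([np.ones((100, 1)), X])
y = np.repeat([0.0, 1.0], 50)

w = np.zeros(3)
for _ in range(10):                          # Newton-Raphson iterations
    y_hat = sigmoid(X @ w)
    grad = X.T @ (y_hat - y)                 # X^T (y_hat - y)
    R = y_hat * (1.0 - y_hat)                # diagonal of R: f'(a_n) = y_hat_n (1 - y_hat_n)
    H = X.T @ (R[:, None] * X)               # X^T R X
    w = w - np.linalg.solve(H, grad)

accuracy = np.mean((sigmoid(X @ w) > 0.5) == y)
```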
Categorical.
\[p(\bfy | \bfmu) = \prod_{i=1}^p \mu_i^{y_i}, \quad \text{$\bfy$ is a one-hot encoding.} \] This mass function can be expressed as \[p(\bfy | \bfmu) = \exp\left(\sum_{i=1}^p y_i \ln \mu_i\right).\] An easy variant would be \(\bfeta = (\ln \mu_1, \dots, \ln \mu_p)^T\), but the following is also possible: \begin{align} \eta_j &= \ln \frac{\mu_j}{1-\sum_{i=1}^{p-1}\mu_i}, \quad j = 1, \dots, p-1, \quad \eta_p = 0,\\ s &= 1, \\ f_j(\bfeta) &= \frac{\exp(\eta_j)}{\sum_{i=1}^{p} \exp(\eta_i)} \quad \text{(softmax function)}, \\ g(\bfeta) &= \ln\left(\sum_{i=1}^p \exp(\eta_i) \right), \\ h(\bfy, s) &= 1. \end{align} We obtained multi-class logistic regression with \(\hat\bfy (\bfx) = \operatorname{softmax}(\bfW^T\bfx)\), whose log likelihood is (up to sign) the multi-class cross-entropy loss, \[\ln p(\bfY | \bfX, \bfW) = \sum_{n=1}^N \sum_{i=1}^p y_{ni} \ln \hat y_{ni}.\] Lastly, \[\frac{\partial f_j}{\partial a_i}(\bfa) = \frac{\delta_{ij}\exp(a_j)\sum_{k=1}^p \exp(a_k) - \exp(a_j)\exp(a_i)}{\left( \sum_{k=1}^p \exp(a_k) \right)^2 } = \hat y_j(\delta_{ij} - \hat y_i).\] The same link and activation function work for the Multinomial distribution, but with \(g(\bfeta) = n \ln\left(\sum_{i=1}^p \exp(\eta_i)\right)\) and \(h(\bfy) = \frac{n!}{\prod_{i=1}^p y_i!}.\)
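To illustrate the full block structure of the Hessian, here is a sketch (illustrative data and names; the last column of \(\bfW\) is pinned to zero, matching the convention \(\eta_p = 0\) above) that fits multi-class logistic regression by assembling the \(d \times d\) blocks \(\bfX^T\bfR\bfX\) into the Hessian over the free columns and taking Newton steps.

```python
import numpy as np

def softmax(A):
    E = np.exp(A - A.max(axis=1, keepdims=True))
    return E / E.sum(axis=1, keepdims=True)

# Toy data sampled from a known softmax model (last column of W_true pinned to 0).
rng = np.random.default_rng(5)
N, d, p = 300, 3, 3
q = p - 1                                     # free columns of W; eta_p = 0 by convention
W_true = rng.normal(size=(d, p))
W_true[:, -1] = 0.0
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, d - 1))])
labels = np.array([rng.choice(p, p=pr) for pr in softmax(X @ W_true)])
Y = np.eye(p)[labels]                         # one-hot targets

W = np.zeros((d, p))
for _ in range(10):
    Y_hat = softmax(X @ W)                                     # N x p predictions
    grad = (X.T @ (Y_hat - Y))[:, :q].reshape(-1, order="F")   # stacked gradients for w_1..w_{p-1}
    H = np.zeros((d * q, d * q))
    for i in range(q):
        for j in range(q):
            # Block (i, j): X^T R X with diag(R) = y_hat_nj (delta_ij - y_hat_ni).
            r = Y_hat[:, j] * ((i == j) - Y_hat[:, i])
            H[i * d:(i + 1) * d, j * d:(j + 1) * d] = X.T @ (r[:, None] * X)
    W[:, :q] -= np.linalg.solve(H, grad).reshape(d, q, order="F")

accuracy = np.mean(softmax(X @ W).argmax(axis=1) == labels)
```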
Poisson.
\[p(y|\lambda) = \frac{\lambda^y}{y!}\exp(-\lambda), \quad y \in \mathbb{N}.\] It is easy to rewrite this with \begin{align} \eta &= \ln \lambda, \\ s &= 1, \\ g(\eta) &= \exp(\eta), \\ h(y, s) &= \frac{1}{y!}. \end{align} We have \(\psi = \ln\) and \(f = \exp\). As expected, we obtain Poisson regression \(\hat y(\bfx) = \exp(\bfw^T\bfx)\) with \[ \ln p(\bfy | \bfX, \bfw) = \sum_{n=1}^N \left( y_n \ln \hat y_n - \hat y_n - \ln (y_n!)\right).\] Lastly, \[f'(a) = \exp(a) = \hat y.\]
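Finally, a sketch of Poisson regression with the log link fitted by the same Newton update (toy data and names are illustrative); here \(f = f' = \exp\), so the diagonal of \(\bfR\) is simply \(\hat y_n\).

```python
import numpy as np

# Toy count data with log-rate w_true^T x (intercept column prepended).
rng = np.random.default_rng(6)
X = np.hstack([np.ones((200, 1)), rng.normal(size=(200, 2))])
w_true = np.array([0.5, 0.8, -0.3])
y = rng.poisson(np.exp(X @ w_true)).astype(float)

w = np.zeros(3)
for _ in range(15):                     # Newton-Raphson with f = f' = exp
    y_hat = np.exp(X @ w)
    grad = X.T @ (y_hat - y)            # X^T (y_hat - y)
    H = X.T @ (y_hat[:, None] * X)      # X^T R X with diag(R) = y_hat_n
    w = w - np.linalg.solve(H, grad)
# w is now close to w_true.
```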