\( \newcommand{\bfx}{\mathbf{x}} \newcommand{\bfX}{\mathbf{X}} \newcommand{\bfR}{\mathbf{R}} \newcommand{\bfeta}{\boldsymbol{\eta}} \newcommand{\bfmu}{\boldsymbol{\mu}} \newcommand{\bfu}{\mathbf{u}} \newcommand{\bfw}{\mathbf{w}} \newcommand{\bfW}{\mathbf{W}} \newcommand{\bfH}{\mathbf{H}} \newcommand{\bfy}{\mathbf{y}} \newcommand{\bfY}{\mathbf{Y}} \newcommand{\bfa}{\mathbf{a}} \newcommand{\d}{\operatorname{d}} \newcommand{\Exp}[1]{\mathbb{E}\left[ #1\right]} \newcommand{\R}{\mathbb{R}} \)


Generalized Linear Models

Model

A generalized linear model (GLM) consists of an exponential family of distributions specified (in its canonical form) by parameters \(\bfeta\), functions \(h, g, \bfu\) and dispersion parameter \(s\) such that \[p(\bfy | \bfeta, s) = h(\bfy, s)\exp\left(\frac{1}{s}\left(\bfeta^T \bfu(\bfy)- g(\bfeta)\right)\right);\] a linear predictor \[\bfa = \bfW^T \bfx \in \R^{p};\] and a link function \(f^{-1}\) such that \[\Exp{\bfy | \bfx} = f(\bfa) = f(\bfW^T \bfx) \in \R^p.\]

The variable \(\bfy \in \R^p\) is to be predicted from \(\bfx \in \R^d\): we assume a distribution on the target variable (an exponential family) and link its mean to the linear predictor \(\bfW^T\bfx\) via the function \(f^{-1}\). The function \(f\) is called the activation function. Note that \(\bfeta\) can be thought of as a function of the input \(\bfx\), \(\bfeta = \bfeta(\bfx).\)
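To make the abstract pieces concrete, here is a minimal sketch (in Python; the class name `ExponentialFamily` and its fields are illustrative choices, not from the text) that encodes \(h\), \(g\), \(\bfu\) and \(s\) for the Bernoulli family and checks that the canonical form reproduces the usual log mass function.

```python
import numpy as np
from dataclasses import dataclass
from typing import Callable

@dataclass
class ExponentialFamily:
    """Canonical form p(y | eta, s) = h(y, s) * exp((eta * u(y) - g(eta)) / s)."""
    h: Callable[[float, float], float]
    g: Callable[[float], float]
    u: Callable[[float], float]
    s: float = 1.0

    def log_density(self, y: float, eta: float) -> float:
        return np.log(self.h(y, self.s)) + (eta * self.u(y) - self.g(eta)) / self.s

# Bernoulli: eta = logit(mu), g(eta) = ln(1 + exp(eta)), h = 1, u = id.
bernoulli = ExponentialFamily(h=lambda y, s: 1.0,
                              g=lambda eta: np.log1p(np.exp(eta)),
                              u=lambda y: y)

mu = 0.3
eta = np.log(mu / (1 - mu))
for y in (0.0, 1.0):
    direct = y * np.log(mu) + (1 - y) * np.log(1 - mu)  # ln(mu^y (1-mu)^(1-y))
    assert np.isclose(bernoulli.log_density(y, eta), direct)
```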

Deriving the Derivatives

From now on we assume that \(\bfu = \operatorname{id}\). In this case \(\bfeta\) is called the canonical parameter. For all exponential families it holds that \[\hat \bfy := f(\bfW^T \bfx) = \Exp{\bfy|\bfeta} = \Exp{\bfu(\bfy)|\bfeta} = \nabla g(\bfeta) .\] Therefore, there must exist a functional relation \[\bfeta = \psi(\hat\bfy).\] Now we can write the log likelihood as \begin{align} \ln p (\bfy | \bfeta, s) &= \frac{1}{s} \left(\bfeta^T \bfy - g(\bfeta)\right) + \ln h(\bfy,s)\\ &= \frac{1}{s} \left(\psi(f(\bfa))^T \bfy - g(\psi(f(\bfa))) \right) + \ln h(\bfy,s). \end{align} It would be nice if we could choose the link function such that \(\psi \circ f = \operatorname{id}\); then we obtain \[\ln p (\bfy | \bfeta, s) =\frac{1}{s} \left(\bfa^T \bfy - g(\bfa) \right) + \ln h(\bfy,s).\] In that case \[\bfeta = \psi(\hat\bfy) = \psi(f(\bfa)) = \bfa = \bfW^T \bfx, \quad f^{-1} = \psi.\] Differentiating with respect to \(\bfW\) gives \begin{align} \d (-\ln p (\bfy | \bfeta, s))(\bfW) &= \frac{1}{s}\nabla g(\bfa)^T \d \bfa(\bfW) - \frac{1}{s} \bfy^T \d \bfa(\bfW) \\ &= \frac{1}{s}\hat\bfy^T \d \bfa(\bfW) - \frac{1}{s} \bfy^T \d \bfa(\bfW) \\ &= \frac{1}{s}(\hat\bfy - \bfy)^T \d \bfa(\bfW) . \end{align}

The gradient can be written in matrix form: \[ \nabla_\bfW (-\ln p (\bfy | \bfeta, s)) = \frac{1}{s}\bfx(\hat\bfy - \bfy)^T \in \R^{d \times p}.\]

As \(\bfW = (\bfw_1, \dots, \bfw_p) \in \R^{d \times p} \cong \R^{dp}\) we have \(\d \bfa(\bfW) \in \R^{p \times (dp)}.\) But as the \(j\)-th component of \(\bfa\) depends only on the \(j\)-th column \(\bfw_j\) of \(\bfW\), we can write \begin{align} \d \bfa(\bfW) = \begin{pmatrix} (\nabla_{\bfw_1} a_1)^T & 0 & \cdots & 0\\ 0 & (\nabla_{\bfw_2} a_2)^T & \cdots & 0\\ \vdots &\vdots & \ddots & \vdots \\ 0 & 0 & \cdots & (\nabla_{\bfw_p} a_p)^T \end{pmatrix} \end{align} and as all \(\nabla_{\bfw_j} a_j = \bfx\), the resulting gradient can be written as \(\frac{1}{s}\bfx(\hat\bfy - \bfy)^T.\)

\(\Box\)
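As a quick sanity check of the resulting formula \(\frac{1}{s}\bfx(\hat\bfy - \bfy)^T\), the sketch below (softmax/categorical case with \(s = 1\), chosen only for illustration) compares the analytic gradient of the per-sample negative log-likelihood \(g(\bfa) - \bfa^T\bfy\) with a finite-difference approximation.

```python
import numpy as np

rng = np.random.default_rng(0)
d, p = 4, 3
x = rng.normal(size=d)
W = rng.normal(size=(d, p))
y = np.eye(p)[1]                          # one-hot target

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

def neg_log_lik(W):
    a = W.T @ x                           # linear predictor
    return np.log(np.exp(a).sum()) - a @ y   # g(a) - a^T y (h does not depend on W)

# analytic gradient: x (y_hat - y)^T with s = 1
y_hat = softmax(W.T @ x)
grad = np.outer(x, y_hat - y)

# finite-difference gradient
eps = 1e-6
num = np.zeros_like(W)
for i in range(d):
    for j in range(p):
        E = np.zeros_like(W); E[i, j] = eps
        num[i, j] = (neg_log_lik(W + E) - neg_log_lik(W - E)) / (2 * eps)

assert np.allclose(grad, num, atol=1e-5)
```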

The Hessian can be computed in terms of the following blocks: \[\nabla_{\bfw_i}\nabla_{\bfw_j} (-\ln p (\bfy | \bfeta, s)) = \frac{1}{s}\frac{\partial f_j}{\partial a_i} (\bfa)\, \bfx \bfx^T\]

\begin{align} \nabla_{\bfW}(\hat y_j - y_j)^T = \d f_j(\bfa) \d\bfa (\bfW) = (\nabla f_j(\bfa)_1\bfx^T, \dots, \nabla f_j(\bfa)_p \bfx^T) \end{align} and thus \begin{align} \nabla_{\bfw_i}(\hat y_j - y_j) = \nabla f_j(\bfa)_i\,\bfx = \frac{\partial f_j}{\partial a_i}(\bfa)\,\bfx, \end{align} which results in \begin{align} \nabla_{\bfw_i}\nabla_{\bfw_j} (-\ln p (\bfy | \bfeta, s)) = \frac{1}{s}\nabla_{\bfw_i}(\hat y_j - y_j)\, \bfx^T = \frac{1}{s}\frac{\partial f_j}{\partial a_i}(\bfa)\, \bfx \bfx^T. \end{align}

\(\Box\)
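The Hessian blocks can be verified numerically in the same way; a small sketch for the softmax case (the choice of family and the random test point are assumptions made only for this check):

```python
import numpy as np

rng = np.random.default_rng(1)
d, p = 4, 3
x = rng.normal(size=d)
W = rng.normal(size=(d, p))
y = np.eye(p)[0]                          # one-hot target

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

def neg_log_lik(W):
    a = W.T @ x
    return np.log(np.exp(a).sum()) - a @ y

y_hat = softmax(W.T @ x)
xxT = np.outer(x, x)
eps = 1e-4

for i in range(p):
    for j in range(p):
        # analytic block: (df_j/da_i)(a) x x^T = y_hat_j (delta_ij - y_hat_i) x x^T
        block = y_hat[j] * ((i == j) - y_hat[i]) * xxT
        # numerical block: second partials d^2 NLL / dW[k,i] dW[l,j]
        num = np.zeros((d, d))
        for k in range(d):
            for l in range(d):
                Eki = np.zeros_like(W); Eki[k, i] = eps
                Elj = np.zeros_like(W); Elj[l, j] = eps
                num[k, l] = (neg_log_lik(W + Eki + Elj) - neg_log_lik(W + Eki - Elj)
                             - neg_log_lik(W - Eki + Elj) + neg_log_lik(W - Eki - Elj)) / (4 * eps**2)
        assert np.allclose(block, num, atol=1e-4)
```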

Now we can use the same Newton-Raphson optimisation algorithm for several distributions, \begin{align} \overline\bfW_\text{new} = \overline\bfW_\text{old} - \bfH^{-1}\nabla_{\overline\bfW} \left(-\ln p(\bfY | \bfX)\right), \end{align} where \(\bfH\) is the Hessian of the negative log-likelihood and the matrix \(\bfW\) is in its vectorised form \(\overline \bfW \in \R^{dp}.\)

For independent observations \(\bfX = (\bfx_1, \dots, \bfx_N)^T \in \R^{N \times d}\), \(\bfY = (\bfy_1, \dots, \bfy_N)^T \in \R^{N \times p}\), where \((\bfx_n)_0 = 1\) if we use an intercept, the likelihood factorises and thus the batch gradient and Hessian are given by \begin{align} -\nabla_{\bfW}\ln p(\bfY | \bfX) &=\frac{1}{s}\sum_{n=1}^N \bfx_n(\hat\bfy_n - \bfy_n)^T \in \R^{d \times p} \\ &= \frac{1}{s}\bfX^T (\hat\bfY - \bfY), \\ -\nabla_{\bfw_i}\nabla_{\bfw_j} \ln p (\bfY | \bfX) &= \frac{1}{s}\sum_{n=1}^N \frac{\partial f_j}{\partial a_i}(\bfa_n) \bfx_n \bfx_n^T\\ &= \bfX^T \bfR \bfX, \\& \quad \text{where} \quad \bfR = \frac{1}{s}\operatorname{diag}\left(\frac{\partial f_j}{\partial a_i}(\bfa_n),\ n=1,\dots,N\right). \end{align}
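As an illustration, the following sketch applies these batch formulas to two-class logistic regression (\(p = 1\), \(f = \sigma\), \(s = 1\)) on synthetic data; the Newton iteration is then exactly the classical IRLS update with \(\bfR = \operatorname{diag}(\hat y_n(1-\hat y_n))\). The synthetic data and the stopping rule are arbitrary choices made for the demonstration.

```python
import numpy as np

rng = np.random.default_rng(2)
N, d = 200, 3
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, d - 1))])  # first column = intercept
w_true = np.array([-0.5, 2.0, -1.0])
y = (rng.uniform(size=N) < 1 / (1 + np.exp(-X @ w_true))).astype(float)

def sigmoid(a):
    return 1 / (1 + np.exp(-a))

w = np.zeros(d)
for _ in range(20):
    y_hat = sigmoid(X @ w)
    grad = X.T @ (y_hat - y)            # gradient of the negative log-likelihood
    R = np.diag(y_hat * (1 - y_hat))    # df/da evaluated at each a_n
    H = X.T @ R @ X                     # Hessian of the negative log-likelihood
    step = np.linalg.solve(H, grad)
    w = w - step
    if np.linalg.norm(step) < 1e-10:
        break

print(w)   # should lie reasonably close to w_true for this sample size
```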

Examples.

Univariate Gaussian.

Here we have \[p(y|\mu, \sigma^2) = \frac{1}{\sqrt{2\pi \sigma^2}} \exp\left(-\frac{(y-\mu)^2}{2\sigma^2} \right).\] This can be rewritten with \begin{align} \eta &= \mu, \\ s &= \sigma^2, \\ g(\eta) &= \frac{\eta^2}{2}, \\ h(y, s) &= \frac{1}{\sqrt{2\pi s}}\exp\left(-\frac{y^2}{2s}\right). \end{align} We get \(\hat y = \eta\) and therefore \(\psi = \operatorname{id}\) and \(f = \operatorname{id}.\) We obtain univariate linear regression \(\hat y (\bfx) = \bfw^T\bfx\)! The associated loss function is the sum of squared errors, \[-\ln p(\bfy | \bfX, \sigma^2, \bfw) = \frac{1}{2\sigma^2} \sum_{n=1}^N \left(y_n - \hat y(\bfx_n)\right)^2 + \text{const.}\]
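In fact, since \(f = \operatorname{id}\) we have \(\frac{\partial f}{\partial a} = 1\), hence \(\bfR = \frac{1}{\sigma^2}I\), and a single Newton step from any starting point \(\bfw_\text{old}\) lands directly on the ordinary least-squares solution: \[\bfw_\text{new} = \bfw_\text{old} - (\bfX^T\bfX)^{-1}\bfX^T(\bfX\bfw_\text{old} - \bfy) = (\bfX^T\bfX)^{-1}\bfX^T\bfy.\]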

Isotropic Multivariate Gaussian.

\[p(\bfy | \bfmu, \sigma^2) = \frac{1}{(2\pi \sigma^2)^{p/2}} \exp\left(-\frac{1}{2\sigma^2}(\bfy - \bfmu)^T(\bfy - \bfmu)\right). \] It is easy to see that \begin{align} \bfeta &= \bfmu, \\ s &= \sigma^2, \\ g(\bfeta) &= \frac{1}{2}\bfeta^T\bfeta, \\ h(\bfy, s) &= \frac{1}{(2\pi s)^{p/2}} \exp \left(-\frac{1}{2s}\bfy^T\bfy\right). \end{align} Again we have \(\hat \bfy = \bfeta\) and obtain multivariate linear regression \(\hat\bfy(\bfx) = \bfW^T\bfx\) with squared-error loss \[-\ln p(\bfY | \bfX, \sigma^2, \bfW) = \frac{1}{2\sigma^2} \sum_{n=1}^N ||\bfy_n - \hat \bfy(\bfx_n)||^2_2 + \text{const.}\]

Bernoulli.

\[p(y|\mu) = \mu^y (1-\mu)^{1-y}, \quad y \in \{0,1\}.\] Writing the mass function in the form \[p(y|\mu) = \exp\left(y\ln \mu + (1-y)\ln(1-\mu)\right) = \exp\left(\ln\left(\frac{\mu}{1-\mu}\right)y +\ln(1-\mu)\right)\] gives \begin{align} \eta &= \ln\left(\frac{\mu}{1-\mu}\right) \quad \text{(logit function)}, \\ s &= 1, \\ f(\eta) &= \frac{1}{1 + \exp(-\eta)} \quad \text{(sigmoid function $\sigma$)},\\ g(\eta) &= \ln(1 + \exp(\eta)),\\ h(y, s) &= 1. \end{align} We obtain two-class logistic regression \(\hat y(\bfx) = \sigma(\bfw^T\bfx)\)! Here the loss is the cross entropy \[-\ln p(\bfy | \bfX, \bfw) = -\sum_{n=1}^N \left( y_n \ln \hat y_n + (1-y_n)\ln(1-\hat y_n)\right).\] Lastly, \[f'(a) = \frac{\exp(-a)}{(1+\exp(-a))^2} = \frac{1}{1+\exp(-a)}\left(1-\frac{1}{1+\exp(-a)} \right) = \hat y (1 - \hat y). \] The same link and activation function work for the Binomial distribution, but with \(g(\eta) = n \ln(1+\exp(\eta))\) and \(h(y) = \binom{n}{y}.\)
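A quick numerical sketch (illustrative only) confirming that the logit is indeed the inverse of the sigmoid, \(\psi \circ f = \operatorname{id}\), and that \(f'(a) = \hat y(1-\hat y)\):

```python
import numpy as np

def sigmoid(a):
    return 1 / (1 + np.exp(-a))

def logit(mu):
    return np.log(mu / (1 - mu))

a = np.linspace(-4, 4, 9)
y_hat = sigmoid(a)

assert np.allclose(logit(y_hat), a)                    # psi(f(a)) = a
eps = 1e-6
num_deriv = (sigmoid(a + eps) - sigmoid(a - eps)) / (2 * eps)
assert np.allclose(num_deriv, y_hat * (1 - y_hat))     # f'(a) = y_hat (1 - y_hat)
```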

Categorical.

\[p(\bfy | \bfmu) = \prod_{i=1}^p \mu_i^{y_i}, \quad \text{$\bfy$ is a one-hot encoding.} \] This mass function can be expressed as \[p(\bfy | \bfmu) = \exp\left(\sum_{i=1}^p y_i \ln \mu_i\right).\] An easy variant would be \(\bfeta = (\ln \mu_1, \dots, \ln \mu_p)^T\), but the following is also possible: \begin{align} \eta_j &= \ln \frac{\mu_j}{1-\sum_{i=1}^{p-1}\mu_i} \quad \text{for } j < p, \quad \eta_p = 0,\\ s &= 1, \\ f_j(\bfeta) &= \frac{\exp(\eta_j)}{\sum_{i=1}^{p} \exp(\eta_i)} \quad \text{(softmax function)}, \\ g(\bfeta) &= \ln\left(\sum_{i=1}^p \exp(\eta_i) \right), \\ h(\bfy, s) &= 1. \end{align} We obtain multi-class logistic regression \(\hat\bfy (\bfx) = \operatorname{softmax}(\bfW^T\bfx)\) with the multi-class cross entropy loss \[-\ln p(\bfY | \bfX, \bfW) = -\sum_{n=1}^N \sum_{i=1}^p y_{ni} \ln \hat y_{ni}.\] Lastly, \[\frac{\partial f_j}{\partial a_i}(\bfa) = \frac{\delta_{ij}\exp(a_j)\sum_{k=1}^p \exp(a_k) - \exp(a_j)\exp(a_i)}{\left( \sum_{k=1}^p \exp(a_k) \right)^2 } = \hat y_j(\delta_{ij} - \hat y_i).\] The same link and activation function work for the Multinomial distribution, but with \(g(\bfeta) = n \ln\left(\sum_{i=1}^p \exp(\eta_i)\right)\) and \(h(\bfy) = \frac{n!}{\prod_{i=1}^p y_i!}.\)
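For completeness, here is a sketch of the batch Newton step for multi-class logistic regression, assembling the blocks \(\bfX^T\bfR\bfX\) into a full \((dp)\times(dp)\) Hessian. Note that this Hessian is only positive semi-definite (adding the same vector to every column of \(\bfW\) leaves the softmax unchanged), so the sketch adds a small ridge term to keep the linear solve well-posed; that regularisation, like the synthetic data, is an implementation choice rather than part of the derivation.

```python
import numpy as np

rng = np.random.default_rng(3)
N, d, p = 300, 3, 3
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, d - 1))])
W_true = rng.normal(size=(d, p))

def softmax_rows(A):
    E = np.exp(A - A.max(axis=1, keepdims=True))
    return E / E.sum(axis=1, keepdims=True)

# sample one-hot targets from the true model (inverse-CDF trick per row)
P = softmax_rows(X @ W_true)
labels = (P.cumsum(axis=1) > rng.uniform(size=(N, 1))).argmax(axis=1)
Y = np.eye(p)[labels]

W = np.zeros((d, p))
ridge = 1e-6
for _ in range(30):
    Y_hat = softmax_rows(X @ W)
    grad = X.T @ (Y_hat - Y)                        # d x p gradient of the negative log-likelihood
    # block Hessian: H[i, j] = X^T R_ij X, R_ij = diag(yhat_nj (delta_ij - yhat_ni))
    H = np.zeros((d * p, d * p))
    for i in range(p):
        for j in range(p):
            r = Y_hat[:, j] * ((i == j) - Y_hat[:, i])
            H[i*d:(i+1)*d, j*d:(j+1)*d] = X.T @ (r[:, None] * X)
    H += ridge * np.eye(d * p)                      # handle the softmax null direction
    step = np.linalg.solve(H, grad.flatten(order="F"))  # stack columns w_1, ..., w_p
    W = W - step.reshape((d, p), order="F")
    if np.linalg.norm(step) < 1e-8:
        break

print((softmax_rows(X @ W).argmax(axis=1) == labels).mean())   # training accuracy of the fit
```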

Poisson.

\[p(y|\lambda) = \frac{\lambda^y}{y!}\exp(-\lambda), \quad y \in \mathbb{N}.\] It is easy to rewrite this with \begin{align} \eta &= \ln \lambda, \\ s &= 1, \\ g(\eta) &= \exp(\eta), \\ h(y, s) &= \frac{1}{y!}. \end{align} We have \(\psi = \ln\), \(f = \exp\). As expected we obtain Poisson regression \(\hat y(\bfx) = \exp(\bfw^T\bfx)\) with \[ \ln p(\bfy | \bfX, \bfw) = \sum_{n=1}^N \left( y_n \ln \hat y_n - \hat y_n - \ln y_n!\right).\] Lastly, \[f'(a) = \exp(a) = \hat y.\]
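A minimal sketch of Poisson regression fitted with the same Newton update, using \(f'(a) = \hat y\) so that \(\bfR = \operatorname{diag}(\hat y_n)\) (synthetic data assumed for illustration):

```python
import numpy as np

rng = np.random.default_rng(4)
N, d = 300, 2
X = np.hstack([np.ones((N, 1)), rng.uniform(-1, 1, size=(N, d - 1))])
w_true = np.array([0.5, 1.2])
y = rng.poisson(np.exp(X @ w_true)).astype(float)

w = np.zeros(d)
for _ in range(50):
    y_hat = np.exp(X @ w)
    grad = X.T @ (y_hat - y)            # gradient of the negative log-likelihood
    H = X.T @ (y_hat[:, None] * X)      # X^T R X with R = diag(y_hat)
    step = np.linalg.solve(H, grad)
    w = w - step
    if np.linalg.norm(step) < 1e-10:
        break

print(w)   # should be close to w_true
```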

And many more...
