Appendix C — GLMM log-likelihood

The log-likelihood for a linear mixed model (LMM) was derived in Section B.7, where we also illustrated computational methods for fitting such models with MixedModels.jl by optimizing a profiled log-likelihood.

In this appendix we outline the evaluation of the log-likelihood for a generalized linear mixed model (GLMM) with a binary response, which is modelled using the Bernoulli distribution.

C.1 The Bernoulli GLMM

A Bernoulli GLMM models the components of the conditional distribution \(({{\mathcal{Y}}}|{{\mathcal{B}}}={{\mathbf{b}}})\) as independent Bernoulli random variables with expected values \({{\boldsymbol{\mu}}}={{\mathbf{g}}}^{-1}({{\boldsymbol{\eta}}})\), where \({{\boldsymbol{\eta}}}={{\mathbf{X}}}{{\boldsymbol{\beta}}}+{{\mathbf{Z}}}{{\mathbf{b}}}\) is the linear predictor and \({\mathbf{g}}^{-1}\) is an inverse link function.

We will use the logit link function, \({\boldsymbol{\eta}}={\mathbf{g}}({\boldsymbol{\mu}})\), defined component-wise from the scalar logit link, \(g\), as \[ \eta_i=g(\mu_i)=\mathrm{logit}(\mu_i)=\log\left(\frac{\mu_i}{1-\mu_i}\right)\quad i=1,\dots,n . \] The inverse link, \({\boldsymbol{\mu}}={\mathbf{g}}^{-1}({\boldsymbol{\eta}})\), is similarly defined component-wise from the inverse of the scalar logit, which is the scalar logistic function, \[ \mu_i=g^{-1}(\eta_i)=\mathrm{logistic}(\eta_i)=\frac{1}{1+e^{-\eta_i}}\quad i=1,\dots,n . \tag{C.1}\] The logit is the canonical link function (Section 6.2.2) for the Bernoulli distribution.
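
As a quick sanity check of these definitions, the scalar logit and logistic functions can be written in a couple of lines of Julia and verified to be inverses of one another. (The helper names logit and logistic below are ours, used only for illustration.)

logit(μ) = log(μ / (1 - μ))         # scalar link: μ ↦ η
logistic(η) = 1 / (1 + exp(-η))     # scalar inverse link: η ↦ μ

all(μ -> logistic(logit(μ)) ≈ μ, (0.1, 0.25, 0.5, 0.9))  # round-trip check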

As in the linear mixed model discussed in Section B.7, the \(q\)-dimensional random effects, \({\mathcal{B}}\), are expressed as \({\mathcal{B}}={\boldsymbol{\Lambda}}_{{\boldsymbol{\theta}}}\,{\mathcal{U}}\) where \({\mathcal{U}}\) has a standard, multivariate Gaussian distribution (Section B.2) \[ {\mathcal{U}}\sim{\mathcal{N}}({\mathbf{0}},{\mathbf{I}}_q) , \tag{C.2}\] with probability density function \[ f_{{\mathcal{U}}}({\mathbf{u}})=\frac{1}{\sqrt{2\pi}^q}e^{-\|{\mathbf{u}}\|^2/2} . \tag{C.3}\]

For a linear mixed model the distribution of these spherical random effects was given as \({\mathcal{U}}\sim{\mathcal{N}}({\mathbf{0}},\sigma^2{\mathbf{I}}_q)\) (Equation B.35). A dispersion parameter like \(\sigma^2\) is not present in Equation C.2 because the Bernoulli distribution does not have a separate dispersion parameter — it is entirely determined by its mean.

As is the case for the linear mixed model, the covariance factor, \({\boldsymbol{\Lambda}}_{{\boldsymbol{\theta}}}\), is sparse and patterned. It is not uncommon in practical examples, such as the one in Section C.4, for \({\boldsymbol{\theta}}\) to be one-dimensional and \({\boldsymbol{\Lambda}}_{{\boldsymbol{\theta}}}=\theta\,{\mathbf{I}}_q\) to be a scalar multiple of the \(q\times q\) identity matrix.

C.1.1 Log-likelihood for a Bernoulli GLMM

The likelihood for the parameters, \({\boldsymbol{\theta}}\) and \({\boldsymbol{\beta}}\), given the observed data, \({\mathbf{y}}\), is the value of the marginal probability mass function for the response, \({\mathcal{Y}}\), evaluated at \({\mathbf{y}}\), the observed vector of {0,1} responses. We obtain this value by integrating the product of the probability mass function for the conditional distribution, \(({\mathcal{Y}}|{\mathcal{U}}={\mathbf{u}})\), and unconditional density of \({\mathcal{U}}\) (Equation C.3), with respect to \({\mathbf{u}}\).

Recall that the probability mass for a single Bernoulli response can be written as \((1-\mu)^{1-y}\mu^y\), which is the specialization to \(n=1\) of the probability mass function for the binomial distribution \[ \binom{n}{y}(1-\mu)^{n-y}\mu^y ,\quad 0\le\mu\le 1, \quad y\in\{0,\dots,n\} . \] Because the components of the vector-valued conditional distribution, \(({\mathcal{Y}}|{\mathcal{U}}={\mathbf{u}})\), are assumed to be independent, its probability mass function can be written as the product of the probability masses for each component \[ f_{{\mathcal{Y}}|{\mathcal{U}}={\mathbf{u}}}({\mathbf{y}}|{\mathbf{u}})=\prod_{i=1}^n \left[(1-\mu_i)^{1-y_i}\mu_i^{y_i}\right] \quad\mathrm{where}\quad {\boldsymbol{\mu}}={\mathbf{g}}^{-1}({\mathbf{X}}{\boldsymbol{\beta}}+{\mathbf{Z}}{\boldsymbol{\Lambda}}_{{\boldsymbol{\theta}}}{\mathbf{u}}) , \] providing the likelihood as \[ \begin{aligned} L({\boldsymbol{\theta}},{\boldsymbol{\beta}}|{\mathbf{y}})&= \int_{{\mathbf{u}}}f_{{\mathcal{Y}}|{\mathcal{U}}={\mathbf{u}}}({\mathbf{y}}|{\mathbf{u}})f_{{\mathcal{U}}}({\mathbf{u}})\,d{\mathbf{u}}\\ &=\int_{{\mathbf{u}}}\frac{1}{\sqrt{2\pi}^q}e^{\sum_{i=1}^n(1-y_i)\log(1-\mu_i)+y_i\,\log(\mu_i)} e^{-\left\|{\mathbf{u}}\right\|^2/2}\,d{\mathbf{u}}\\ &=\int_{{\mathbf{u}}}\frac{1}{\sqrt{2\pi}^q}\exp\left(\frac{\left\|{\mathbf{u}}\right\|^2+\sum_{i=1}^n d(y_i,\mu_i)}{-2}\right)\,d{\mathbf{u}} \end{aligned} \tag{C.4}\] where the unit deviances, \(d(y_i,\mu_i)\), are \[ d(y_i,\mu_i)=-2\left[(1-y_i)\log(1-\mu_i)+y_i\log(\mu_i)\right]\quad i=1,\dots,n . \tag{C.5}\]

By converting from the logarithm of the probability mass function to the deviance scale, which is negative twice the log-probability, we get a quantity, \(\sum_{i=1}^n d(y_i,\mu_i)\), which is on the same scale as the squared length, \(\|{\mathbf{u}}\|^2\), of a standard multivariate Gaussian. The sum of the unit deviances is analogous to the sum of squared residuals, \(\|{\mathbf{y}}-{\mathbf{X}}{\boldsymbol{\beta}}\|^2\), in a linear model.

In Section B.7 we showed that the integral defining the likelihood for a linear mixed model, Equation B.43, has an analytic solution. In general, the integral in Equation C.4 does not. We will approximate the value of this integral using a quadratic approximation to the argument of the exponential function in Equation C.4 at the value of \({\mathbf{u}}\) that maximizes the integrand, which is the density of the conditional distribution, \(({\mathcal{U}}|{\mathcal{Y}}={\mathbf{y}})\), up to a scale factor. Because the scale factor does not affect the location of the maximum, the value of \({\mathbf{u}}\) that maximizes the integrand, \[ \begin{aligned} \tilde{{\mathbf{u}}}({\mathbf{y}}|{\boldsymbol{\theta}},{\boldsymbol{\beta}}) &=\arg\max_{{\mathbf{u}}}\exp\left(\frac{\left\|{\mathbf{u}}\right\|^2 + \sum_{i=1}^n d(y_i,\mu_i)}{-2}\right)\\ &=\arg\min_{{\mathbf{u}}}\left(\left\|{\mathbf{u}}\right\|^2 + \sum_{i=1}^n d(y_i,\mu_i)\right) \end{aligned} , \tag{C.6}\] is also the conditional mode — the value of \({\mathbf{u}}\) that maximizes the conditional density. The expression being minimized in Equation C.6, \(\left\|{\mathbf{u}}\right\|^2 + \sum_{i=1}^n d(y_i,\mu_i)\), is called the penalized deviance.

Using a quadratic approximation to the penalized deviance at this conditional mode (i.e. the mode of the conditional distribution of \({\mathcal{U}}\) given \({\mathcal{Y}}={\mathbf{y}}\)) is equivalent to using a multivariate Gaussian approximation to this conditional distribution. Approximating an integral like Equation C.4 by approximating the integrand as a scaled multivariate Gaussian distribution at its mode is called Laplace’s approximation (Tierney & Kadane (1986)).

The penalized iteratively re-weighted least squares (PIRLS) algorithm (Section C.4) provides a fast and stable method of determining the conditional mode, \(\tilde{{\mathbf{u}}}({\mathbf{y}}|{\boldsymbol{\theta}},{\boldsymbol{\beta}})\) (Equation C.6), thereby making it feasible to use Laplace’s approximation at scale.

Before discussing PIRLS, however, we will describe generalized linear models (GLMs) without random effects (Section C.2), for which the deviance is defined as the sum of the unit deviances and the maximum likelihood estimate of the coefficient vector, \(\widehat{{\boldsymbol{\beta}}}\), is the value that minimizes the deviance. In Section C.3 we describe the iteratively re-weighted least squares (IRLS) algorithm, which is a stable, fast algorithm to minimize the deviance.

We will illustrate the IRLS algorithm with the contra data discussed in Chapter 6 and a model like com05, which was fit in that chapter, but without the random effects. Later we will use the full com05 model to illustrate some of the computations for GLMMs.

Although 0/1 responses and the Bernoulli distribution are easy to describe, the theory of the generalized linear mixed model (GLMM) and the details of the implementation are not. Readers who wish to focus on practical applications more than on the theory should feel free to skim this appendix.

Load the packages to be used

using AlgebraOfGraphics
using BenchmarkTools
using CairoMakie
using EmbraceUncertainty: dataset
using FreqTables
using LinearAlgebra
using MixedModels
using MixedModelsMakie
using NLopt
using PooledArrays
using StatsAPI
using TypedTables

and define some constants

@isdefined(contrasts) || const contrasts = Dict{Symbol,Any}()
@isdefined(progress) || const progress = false

C.2 Generalized linear models for binary data

To introduce some terms and workflows we first consider the generalized linear model (GLM) for the Bernoulli distribution and the logit link. The linear predictor for a GLM - a model without random effects - is simply \[ {{\boldsymbol{\eta}}}= {{\mathbf{X}}}{{\boldsymbol{\beta}}} , \] and the mean response vector, \({{\boldsymbol{\mu}}}={\mathbf{g}}^{-1}({\boldsymbol{\eta}})\), is obtained by component-wise application of the scalar logistic function (Equation C.1).

The probability mass function for the Bernoulli distribution is \[ f_{{\mathcal{Y}}}(y|\mu) = \mu^y\,(1-\mu)^{(1-y)}\quad\mathrm{for}\quad y\in\{0,1\} . \]

Because the elements of \({{\mathcal{Y}}}|{{\boldsymbol{\mu}}}\) are assumed to be independent, the log-likelihood is simply the sum of contributions from each element, which, on the deviance scale, can be written in terms of the unit deviances \[ \begin{aligned} -2\,\ell({{\boldsymbol{\mu}}}|{\mathbf{y}})&= -2\,\log(L({{\boldsymbol{\mu}}}|{\mathbf{y}}))\\ &=-2\,\sum_{i=1}^n y_i\log(\mu_i)+(1-y_i)\log(1-\mu_i) . \end{aligned} \tag{C.7}\]

As described above, it is customary when working with GLMs to convert the log-likelihood to a deviance, which, for the Bernoulli distribution, is negative twice the log-likelihood. (For other distributions, the deviance may incorporate an additional term that depends only on \({\mathbf{y}}\).)

One reason for preferring the deviance scale is that the change in deviance for nested models has approximately a \(\chi^2\) distribution with degrees of freedom determined by the number of independent constraints on the parameters in the simpler model. Especially for GLMs, the deviance plays a role similar to the sum of squared residuals in linear models.

For greater numerical precision we avoid calculating \(1-\mu\) directly when evaluating expressions like Equation C.7 and instead use \[ 1 - \mu = 1 - \frac{1}{1+e^{-\eta}}=\frac{e^{-\eta}}{1+e^{-\eta}} . \] Evaluation of the last expression provides greater precision for large positive values of \(\eta\) (corresponding to values of \(\mu\) close to one, where \(1-\mu\) is small) than does first evaluating \(\mu\) followed by \(1 - \mu\).
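
A small numerical experiment (our addition, not part of the original text) makes the point: for a large positive \(\eta\) the direct evaluation of \(1-\mu\) cancels to zero, whereas the rearranged expression retains full relative precision.

let η = 40.0
  μ = 1 / (1 + exp(-η))                  # μ is very close to one
  direct = 1 - μ                         # rounds to exactly 0.0
  rearranged = exp(-η) / (1 + exp(-η))   # ≈ 4.25e-18, the correct small value
  (; direct, rearranged)
end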

After some algebra, we write the unit deviance, \(d(y_i,\eta_i)\), which is the contribution to the deviance from the \(i\)th observation, as \[ \begin{aligned} d(y_i, \eta_i)&=-2\left[y_i\log(\mu_i)+(1-y_i)\log(1-\mu_i)\right]\\ &=2\left[(1-y_i)\eta_i+\log(1+e^{-\eta_i})\right] \end{aligned} \quad i=1,\dots,n . \]

A Julia function to evaluate both the mean and the unit deviance can be written as

function meanunitdev(y::T, η::T) where {T<:AbstractFloat}
  expmη = exp(-η)
  return (; μ=inv(1 + expmη), dev=2 * ((1 - y) * η + log1p(expmη)))
end

Mathematically, log1p, read as "log of 1 plus", is defined as \(\mathrm{log1p}(x)=\log(1+x)\), but it is implemented in such a way as to provide greater accuracy when \(x\) is small. For example,

let small = eps() / 10
  @show small
  @show 1 + small
  @show log(1 + small)
  @show log1p(small)
end;
small = 2.2204460492503132e-17
1 + small = 1.0
log(1 + small) = 0.0
log1p(small) = 2.2204460492503132e-17

1 + small evaluates to 1.0 in floating point arithmetic because of round-off, producing 0 for the expression log(1 + small), whereas log1p(small) ≈ small, as it should be.

This function returns a NamedTuple of values from scalar arguments. For example,

meanunitdev(0.0, 0.21)
(μ = 0.5523079095743253, dev = 1.6072991620435673)

A Vector of such NamedTuples is a row-table (Section A.2), which can be updated in place by dot-vectorization of the scalar meanunitdev function, as shown below.

C.2.1 An example: fixed-effects only from com05

We illustrate some of these computations using only the fixed-effects specification for com05, a GLMM fit to the contra data set in Chapter 6. Because we will use the full GLMM later we reproduce com05 by loading the data, creating the binary ch variable indicating children/no-children, defining the contrasts and formula to be used, and fitting the model as in Chapter 6.

contra = let tbl = dataset(:contra)
  Table(tbl; ch=tbl.livch .≠ "0")
end
contrasts[:urban] = HelmertCoding()
contrasts[:ch] = HelmertCoding()
com05 =
  let d = contra,
    ds = Bernoulli(),
    f = @formula(
      use ~ 1 + urban + ch * age + age & age + (1 | dist & urban)
    )

    fit(MixedModel, f, d, ds; contrasts, nAGQ=9, progress)
  end

Extract the fixed-effects model matrix, \({\mathbf{X}}\), and initialize the coefficient vector, \({\boldsymbol{\beta}}\), to a copy (in case we modify it) of the estimated fixed-effects.

βm05 = copy(com05.β)
6-element Vector{Float64}:
 -0.3414913998306781
  0.3936080536502067
  0.6064861079468472
 -0.012911714642169572
  0.03321662487439253
 -0.005625046845040066

As stated above, the meanunitdev function can be applied to the vectors, \({\mathbf{y}}\) and \({{\boldsymbol{\eta}}}\), via dot-vectorization to produce a Vector{NamedTuple}, which is the typical form of a row-table.

rowtbl = meanunitdev.(com05.y, com05.X * βm05)
typeof(rowtbl)
Vector{@NamedTuple{μ::Float64, dev::Float64}} (alias for Array{@NamedTuple{μ::Float64, dev::Float64}, 1})

For display we convert the row-table to a column-table and prepend another column-table consisting of \({\mathbf{y}}\) and \({\boldsymbol{\eta}}\).

Table((; y=com05.y, η=com05.X * βm05), rowtbl)  # display as a Table
Table with 4 columns and 1934 rows:
      y    η          μ         dev
    ┌───────────────────────────────────
 1  │ 0.0  -0.87968   0.293244  0.69414
 2  │ 0.0  -0.471786  0.384194  0.969645
 3  │ 0.0  0.676178   0.662885  2.17466
 4  │ 0.0  0.429284   0.605703  1.8613
 5  │ 0.0  -0.963167  0.276245  0.646604
 6  │ 0.0  -0.772821  0.315869  0.759212
 7  │ 0.0  -0.87968   0.293244  0.69414
 8  │ 0.0  0.515028   0.625984  1.96692
 9  │ 0.0  0.371817   0.591898  1.79248
 10 │ 0.0  0.676178   0.662885  2.17466
 11 │ 1.0  -0.772821  0.315869  2.30485
 12 │ 0.0  -0.473145  0.383872  0.968601
 13 │ 0.0  0.449047   0.610413  1.88533
 14 │ 0.0  0.602596   0.64625   2.07833
 15 │ 0.0  0.645468   0.655988  2.13416
 16 │ 1.0  0.637867   0.654271  0.848467
 17 │ 0.0  -0.471786  0.384194  0.969645
 ⋮  │  ⋮       ⋮         ⋮         ⋮

The deviance for this value of \({{\boldsymbol{\beta}}}\) in this model is the sum of the unit deviances, which we write as sum applied to a generator expression. (In general we extract columns of a row-table with generator expressions that produce iterators.)

sum(r.dev for r in rowtbl)
2411.194470806229

C.2.2 Encapsulating the model in a struct

When minimizing the deviance it is convenient to have the different components of the model encapsulated in a user-created struct type so we can update the parameter values and evaluate the deviance without needing to keep track of all the pieces of the model.

struct BernoulliGLM{T<:AbstractFloat}
  X::Matrix{T}                                      # fixed-effects model matrix
  β::Vector{T}                                      # coefficient vector
  ytbl::NamedTuple{(:y, :η),NTuple{2,Vector{T}}}    # column-table of y and η
  rtbl::Vector{NamedTuple{(:μ, :dev),NTuple{2,T}}}  # row-table of μ and unit deviance
end

We also create an external constructor, which is a function defined outside the struct and of the same name as the struct, that constructs and returns an object of that type. In this case the external constructor creates a BernoulliGLM from the model matrix and the response vector, after some consistency checks on the arguments passed to it.

function BernoulliGLM(
  X::Matrix{T},
  y::Vector{T},
) where {T<:AbstractFloat}

  # check consistency of arguments
  n = size(X, 1)                  # number of rows in X
  if length(y) ≠ n || any(!in([0, 1]), y)
    throw(ArgumentError("y is not an $n-vector of 0's and 1's"))
  end

  # initial β from linear regression of y in {-1,1} coding
  β = X \ replace(y, 0 => -1)
  η = X * β

  return BernoulliGLM(X, β, (; y, η), meanunitdev.(y, η))
end

To optimize the deviance we define an extractor method that returns the deviance

StatsAPI.deviance(m::BernoulliGLM) = sum(r.dev for r in m.rtbl)

This extractor is written as a method for the generic deviance function defined in the StatsAPI package. Doing so allows us to use the deviance name for the extractor without interfering with deviance methods defined for other model types.

We also define a mutating function, setβ!, that installs a new value of β then updates η and rtbl in place.

function setβ!(m::BernoulliGLM, newβ)
  (; y, η) = m.ytbl                # destructure ytbl
  mul!(η, m.X, copyto!(m.β, newβ)) # η = X * newβ in place
  m.rtbl .= meanunitdev.(y, η)     # update rtbl in place
  return m
end

Create such a struct from X and y for model com05.

com05fe = BernoulliGLM(com05.X, com05.y)
β₀ = copy(com05fe.β)          # keep a copy of the initial values
6-element Vector{Float64}:
 -0.10753340611835722
  0.17596632075206525
  0.21944824496978524
 -0.0028547435517881996
  0.010139921834784385
 -0.002161831532409504

These initial values of \({\boldsymbol{\beta}}\) are from a least squares fit of \({\mathbf{y}}\), converted from {0,1} coding to {-1,1} coding, on the model matrix, \({\mathbf{X}}\).
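
As a quick check (our addition), these values can be reproduced directly with a least squares solve, matching the computation in the constructor.

βls = com05.X \ replace(com05.y, 0 => -1)  # least squares fit in ±1 coding
βls ≈ β₀                                   # should be true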

As a simple test of the setβ! and deviance methods we can check that com05fe produces the same deviance value for βm05 as was evaluated above.

deviance(setβ!(com05fe, βm05))
2411.194470806229

For fairness in later comparisons we restore the initial values β₀ to the model. These are rough starting estimates with a deviance that is considerably greater than that at βm05.

deviance(setβ!(com05fe, β₀))
2491.1514390537254

C.2.3 Fit the GLM using a general optimizer

We can use a general optimizer like those available in NLopt.jl to minimize the deviance. Following the instructions given at that package’s repository, we create an Opt object specifying the algorithm to be used, BOBYQA (Powell (2009)), and the dimension of the problem, then define and assign the objective function in the required form, and call optimize

function StatsAPI.fit!(m::BernoulliGLM{T}) where {T}
  opt = Opt(:LN_BOBYQA, length(m.β))
  function objective(x::Vector{T}, g::Vector{T}) where {T}
    isempty(g) || throw(
      ArgumentError("Gradient not available, g must be empty"),
    )
    return deviance(setβ!(m, x))
  end
  opt.min_objective = objective
  minf, minx, ret = optimize(opt, copy(m.β))
  @info (; code=ret, nevals=opt.numevals, minf)
  return m
end
fit!(com05fe);
[ Info: (code = :ROUNDOFF_LIMITED, nevals = 558, minf = 2409.377428160017)

The optimizer has determined a coefficient vector that reduces the deviance to 2409.38, at which point convergence was declared because changes in the objective are limited by round-off. This required about 500 evaluations of the deviance at candidate values of \({\boldsymbol{\beta}}\).

Each evaluation of the deviance is fast, requiring only a fraction of a millisecond on a laptop computer,

βopt = copy(com05fe.β)
@benchmark deviance(setβ!(m, β)) seconds = 1 setup =
  (m = com05fe; β = βopt)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  33.798 μs … 113.439 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     36.778 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   36.707 μs ±   1.706 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

   ▂        ▁      ▁▃▆▂      ▄▇█▂        ▁▁                    ▂
  ▆█▇▁▁▁▁▁▆▇█▅▄▁▃▄▃████▄▁▁▃▃▅████▇▅▄▆▅▄▆▆███▇▅▆▅▅▅▃▁▄▄▃▃▃▄▅▅▆▆ █
  33.8 μs       Histogram: log(frequency) by time        40 μs <

 Memory estimate: 16 bytes, allocs estimate: 1.

but the already large number of evaluations for these six coefficients would not scale well as this dimension increases.

Fortunately there is an algorithm, called iteratively reweighted least squares (IRLS), that uses the special structure of the GLM to provide fast and stable convergence to estimates of the coefficients, even for models with a large number of coefficients. This will be important to us in fitting GLMMs where we must optimize with respect to the random effects, whose dimension can be large.

C.3 The IRLS algorithm

As we have seen, in a GLM we are modeling the responses and the predicted values on two scales — the linear predictor scale, for \({\boldsymbol{\eta}}\), and the response scale, for \({\mathbf{y}}\) and \({\boldsymbol{\mu}}\). The scalar link function, \(\eta=g(\mu)\), and the inverse link, \(\mu=g^{-1}(\eta)\), map vectors component-wise between these two scales.

For operations like determining a new candidate value of \({\boldsymbol{\beta}}\), the linear predictor scale is preferred, because, on that scale, \({\boldsymbol{\eta}}={\mathbf{X}}{\boldsymbol{\beta}}\) is a linear function of \({\boldsymbol{\beta}}\). Thus it would be convenient if we could transform the response, \({\mathbf{y}}\), to the linear predictor scale where we could define a residual and use some form of minimizing a sum of squared residuals to evaluate a new coefficient vector (or, alternatively, evaluate an increment that will be added to the current coefficient vector). Unfortunately, a naive approach of transforming \({\mathbf{y}}\) to the linear predictor scale won’t work because the elements of \({\mathbf{y}}\) are all \(0\) or \(1\) and the logit link function maps these values to \(-\infty\) and \(\infty\), respectively.

For an iterative algorithm, however, we can use a local linear approximation to the link function to define a working residual, from which to evaluate an increment to the coefficient vector, or a working response, from which we evaluate the new coefficient vector directly. Because the link and inverse link functions are defined component-wise we will define the approximation for scalars \(y_i\), \(\mu_i\), and \(\eta_i\) and for the scalar link function, \(g\), with the understanding that these definitions apply component-wise to the vectors.

The working residual is evaluated by mapping the residual on the response scale, \(y_i-\mu_i\), through the linear approximation to the link, \(g(\mu)\), at \(\mu_i\). That is, \[ \tilde{r_i}=(y_i-\mu_i)g'(\mu_i)\quad i=1,\dots,n . \] Because the derivative, \(g'(\mu_i)\), for the logit link function is \(1/[\mu_i(1-\mu_i)]\), these working residuals are \[ \tilde{r}_i = (y_i-\mu_i)g'(\mu_i) = \frac{y_i - \mu_i}{\mu_i(1-\mu_i)}\quad i=1,\dots,n . \] Similarly, the working response on the linear predictor scale, is defined by adding the working residual to the current linear predictor value, \[ \tilde{y_i}=\eta_i + \tilde{r_i}=\eta_i +(y_i-\mu_i)g'(\mu_i)= \eta_i + \frac{y_i - \mu_i}{\mu_i(1-\mu_i)}\quad i=1,\dots,n . \]

On the linear predictor scale we can fit a linear model to the working response to obtain a new parameter vector, but we must take into account that the variances of the noise terms in this linear model, which are the working residuals, are not constant. We use weighted least squares where the weights are inversely proportional to the variance of the working residual. The variance of the random variable \({\mathcal{Y}}_i\) is \(\mu_i(1-\mu_i)\), hence the variance of the working residual is \[ \mathrm{Var}(\tilde{r_i})=g'(\mu_i)^2 \mathrm{Var}({\mathcal{Y}}_i)=\frac{\mu_i(1-\mu_i)}{\left[\mu_i(1-\mu_i)\right]^2}=\frac{1}{\mu_i(1-\mu_i)} \quad i=1,\dots,n . \]

Thus the working weights are \[ \begin{aligned} w_i&=\mu_i(1-\mu_i)\\ &=\frac{1}{1+e^{-\eta_i}}\frac{e^{-\eta_i}}{1+e^{-\eta_i}} \end{aligned} ,\quad i=1,\dots,n. \]

In practice we will use the square roots of the working weights, evaluated as \[ \sqrt{w_i}=\frac{\sqrt{e^{-\eta_i}}}{1+e^{-\eta_i}}=\frac{e^{-\eta_i/2}}{1+e^{-\eta_i}}\quad i=1,\dots,n . \]
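
Putting these definitions together, a short sketch (the helper below is ours, not part of the text's code) evaluates the working quantities for a single observation; the tblrow function in Section C.3.1 stores the same information, after scaling the working residual and working response by \(\sqrt{w_i}\).

# working quantities for one Bernoulli observation with the logit link
function workingquantities(y, η)
  μ = 1 / (1 + exp(-η))
  w = μ * (1 - μ)             # working weight (also Var(Y) for a Bernoulli response)
  rtilde = (y - μ) / w        # working residual, (y - μ)g′(μ)
  ytilde = η + rtilde         # working response
  return (; μ, w, rtwwt=sqrt(w), rtilde, ytilde)
end

workingquantities(1.0, 0.3)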

Note that \(\mathrm{Var}({\mathcal{Y}}_i)\) happens to be the inverse of \(g'(\mu_i)\) for a Bernoulli response and the logit link function. This will always be true for distributions in the exponential family and their canonical links.

At the \(k\)th iteration the IRLS algorithm updates the coefficient vector to \({\boldsymbol{\beta}}^{(k)}\), which is a weighted least squares solution that could be written as \[ {\boldsymbol{\beta}}^{(k)}= \left({\mathbf{X}}'{\mathbf{W}}{\mathbf{X}}\right)^{-1}\left({\mathbf{X}}'{\mathbf{W}}\tilde{{\mathbf{y}}}\right) , \] where \({\mathbf{W}}\) is an \(n\times n\) diagonal matrix of the working weights and \(\tilde{{\mathbf{y}}}\) is the working response, both evaluated at \({\boldsymbol{\beta}}^{(k-1)}\), the coefficient vector from the previous iteration.

In practice we use the square roots of the working weights, which we write as a diagonal matrix, \({\mathbf{W}}^{1/2}\), and a QR decomposition (Section B.6) of a weighted model matrix, \({\mathbf{W}}^{1/2}{\mathbf{X}}\), to solve for the updated coefficient vector from the weighted working response, \({\mathbf{W}}^{1/2}\tilde{{\mathbf{y}}}\), with elements \[ \begin{aligned} \sqrt{w_i}(\eta_i+\tilde{r}_i)&=\sqrt{\mu_i(1-\mu_i)}(\eta_i+\tilde{r}_i)\\ &=\sqrt{w_i}\eta_i +\frac{(y_i-\mu_i)\sqrt{\mu_i(1-\mu_i)}}{\mu_i(1-\mu_i)}\\ &=\sqrt{w_i}\eta_i +\frac{y_i-\mu_i}{\sqrt{w_i}} \end{aligned},\quad i=1,\dots,n \]
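
The weighted least squares step itself can be sketched in a single line for generic arguments (the function name and signature here are ours, for illustration only); using the QR decomposition of the weighted model matrix avoids forming \({\mathbf{X}}'{\mathbf{W}}{\mathbf{X}}\) explicitly, which is also how updateβ! below is written.

# one weighted least squares solve: β = (X′WX)⁻¹ X′W ỹ via QR of W^{1/2}X
function wlsupdate(Whalf::Diagonal, X::Matrix, ytilde::Vector)
  return qr(Whalf * X) \ (Whalf * ytilde)
end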

It is possible to write the IRLS algorithm using a weighted least squares fit of the working residuals on the model matrix to determine a parameter increment. However, in the PIRLS algorithm it is necessary to use the working response, not the working residual, so we define the IRLS algorithm in those terms too.

Furthermore, in the PIRLS algorithm we will need to allow for an offset when calculating the working response. In the presence of an offset, \({\mathbf{o}}\), a constant vector of length \(n\), the linear predictor is defined as \[ {\boldsymbol{\eta}}= {\mathbf{o}}+ {\mathbf{X}}{\boldsymbol{\beta}}. \] The mean, \({\boldsymbol{\mu}}\), the working weights and the working residuals are defined as before but the working response becomes \[ \tilde{{\mathbf{y}}}=\tilde{{\mathbf{r}}} + {\boldsymbol{\eta}}- {\mathbf{o}}. \]

For a linear model there is rarely a reason for using an offset. Instead we can simply subtract the constant vector, \({\mathbf{o}}\), from the response, \({\mathbf{y}}\), because the response and the linear predictor are on the same scale. However, this is not the case for a GLM where we must deal with the effects of the constant offset on the linear predictor scale, not on the response scale.

C.3.1 Implementation of IRLS for Bernoulli-Logit

We define a BernoulliIRLS struct with three additional elements in the rowtable: the square roots of the working weights, rtwwt, the weighted working residuals, wwres, and the weighted working response, wwresp. In the discussion above, rtwwt is the diagonal of \({\mathbf{W}}^{1/2}\), wwres is \({\mathbf{W}}^{1/2}\tilde{{\mathbf{r}}}\) and wwresp is \({\mathbf{W}}^{1/2}\tilde{{\mathbf{y}}}\).

We also add fields Xqr, in which the weighted model matrix, \({\mathbf{W}}^{1/2}{\mathbf{X}}\), is formed followed by its QR decomposition, and βcp, which holds a copy of the previous coefficient vector.

struct BernoulliIRLS{T<:AbstractFloat}
  X::Matrix{T}
  Xqr::Matrix{T}                # copy of X used in the QR decomp
  β::Vector{T}
  βcp::Vector{T}                # copy of previous β
  Whalf::Diagonal{T,Vector{T}}  # rtwwt as a Diagonal matrix
  ytbl::NamedTuple{(:y, :η),NTuple{2,Vector{T}}}
  rtbl::Vector{
    NamedTuple{(:μ, :dev, :rtwwt, :wwres, :wwresp),NTuple{5,T}},
  }
end

with constructor

function BernoulliIRLS(
  X::Matrix{T},
  y::Vector{T},
) where {T<:AbstractFloat}
  n = size(X, 1)          # number of rows of X
  if length(y) ≠ n || !all(v -> (iszero(v) || isone(v)), y)
    throw(ArgumentError("y is not an $n-vector of 0's and 1's"))
  end
  # initial β from linear least squares fit of y in {-1,1} coding
  Xqr = copy(X)
  β = qr!(Xqr) \ replace(y, 0 => -1)
  βcp = copy(β)
  η = X * β
  rtbl = tblrow.(y, η)
  Whalf = Diagonal([r.rtwwt for r in rtbl])
  return BernoulliIRLS(X, Xqr, β, βcp, Whalf, (; y, η), rtbl)
end

The tblrow function evaluates the mean, unit deviance, square root of the weight, and the weighted, working residual and weighted, working response for scalar \(y\) and \(\eta\). The offset argument, which defaults to zero, is not used in calls for BernoulliIRLS models, but will be used in Section C.4 when we discuss the PIRLS algorithm.

function tblrow(
  y::T,
  η::T,
  offset::T=zero(T),
) where {T<:AbstractFloat}
  rtexpmη = exp(-η / 2)      # square root of exp(-η)
  expmη = abs2(rtexpmη)      # exp(-η)
  denom = 1 + expmη
  μ = inv(denom)
  dev = 2 * ((1 - y) * η + log1p(expmη))
  rtwwt = rtexpmη / denom    # sqrt of working wt
  wwres = (y - μ) / rtwwt    # weighted working resid
  wwresp = wwres + rtwwt * (η - offset)
  return (; μ, dev, rtwwt, wwres, wwresp)
end
StatsAPI.deviance(m::BernoulliIRLS) = sum(r.dev for r in m.rtbl)

Next we define a mutating function, updateβ!, that evaluates \({\boldsymbol{\beta}}^{(k)}\), the updated coefficient vector at iteration \(k\), in place by weighted least squares then updates the response table.

function updateβ!(m::BernoulliIRLS)
  (; X, Xqr, β, βcp, Whalf, ytbl, rtbl) = m  # destructure m & ytbl
  (; y, η) = ytbl
  copyto!(βcp, β)                            # keep a copy of β
  copyto!(Whalf.diag, r.rtwwt for r in rtbl) # rtwwt -> Whalf
  mul!(Xqr, Whalf, X)                        # weighted model matrix
  copyto!(η, r.wwresp for r in rtbl)         # use η as temp storage
  ldiv!(β, qr!(Xqr), η)                      # weighted least squares
  rtbl .= tblrow.(y, mul!(η, X, β))          # update η and rtbl
  return m
end

For our example, we start at the same coefficient vector as we did with the general optimizer.

com05fe = BernoulliIRLS(com05.X, com05.y)
deviance(com05fe)
2491.1514390537254

The first IRLS iteration

deviance(updateβ!(com05fe))
2411.165742459929

reduces the deviance substantially.

We create a fit! method to iterate to convergence.

function StatsAPI.fit!(m::BernoulliIRLS, β₀=m.β; verbose::Bool=true)
  (; X, β, βcp, ytbl, rtbl) = m
  (; y, η) = ytbl
  rtbl .= tblrow.(y, mul!(η, X, copyto!(β, β₀)))
  olddev = deviance(m)
  verbose && @info 0, olddev   # record the deviance at initial β
  for i in 1:100               # perform at most 100 iterations
    newdev = deviance(updateβ!(m))
    verbose && @info i, newdev # iteration number and deviance
    if newdev > olddev
      @warn "failure to decrease deviance"
      copyto!(β, βcp)          # roll back changes to β, η, and rtbl
      rtbl .= tblrow.(y, mul!(η, X, β))
      break
    elseif (olddev - newdev) < (1.0e-10 * olddev)
      break                    # exit loop if deviance is stable
    else
      olddev = newdev
    end
  end
  return m
end
fit!(com05fe, β₀);
[ Info: (0, 2491.1514390537254)
[ Info: (1, 2411.165742459929)
[ Info: (2, 2409.382451337132)
[ Info: (3, 2409.3774282835857)
[ Info: (4, 2409.3774281600413)

The IRLS algorithm has converged in 4 iterations to essentially the same deviance as the general optimizer achieved after around 500 function evaluations. Each iteration of the IRLS algorithm takes more time than a deviance evaluation, but still only a fraction of a millisecond on a laptop computer.

@benchmark deviance(updateβ!(m)) seconds = 1 setup = (m = com05fe)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  67.528 μs … 161.118 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     76.638 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   76.986 μs ±   2.640 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

                         ▂▂▂▄▅▅▄▇██▆▅▄▄▅▅▄▃▂▂▂▂▂▁▁▁            ▂
  ▅▅▆▆▇▇▇▆▆▆▆▆▆▆▅▆▇▇▇██▇██████████████████████████████▇▇▇▇▇▇▇▇ █
  67.5 μs       Histogram: log(frequency) by time      84.5 μs <

 Memory estimate: 16.12 KiB, allocs estimate: 6.

C.4 GLMMs and the PIRLS algorithm

In Section B.7 we showed that, given a value of \({\boldsymbol{\theta}}\), which determines the relative covariance factor, \({\boldsymbol{\Lambda}}_{{\boldsymbol{\theta}}}\), of the random effects, \({\mathcal{B}}\), the conditional mode, \(\tilde{{\mathbf{b}}}\), of the random effects can be evaluated as the solution to a penalized least squares (PLS) problem. It is convenient to write the PLS problem in terms of the spherical random effects, \({\mathcal{U}}\sim{\mathcal{N}}({\mathbf{0}},\sigma^2{\mathbf{I}})\), with the defining relationship \({\mathcal{B}}={\boldsymbol{\Lambda}}_{{\boldsymbol{\theta}}}{\mathcal{U}}\), as in Equation B.38 \[ \tilde{{\mathbf{u}}}=\arg\min_{{\mathbf{u}}}\left( \left\|{\mathbf{y}}-{\mathbf{X}}{\boldsymbol{\beta}}-{\mathbf{Z}}{\boldsymbol{\Lambda}}_{{\boldsymbol{\theta}}}{\mathbf{u}}\right\|^2 + \left\|{\mathbf{u}}\right\|^2 \right) . \]

We wrote Equation B.38 for the LMM case as minimizing the penalized sum of squared residuals with respect to both \({\boldsymbol{\beta}}\) and \({\mathbf{u}}\). Here, and in Equation C.8 below, we minimize with respect to \({\mathbf{u}}\) only while holding \({\boldsymbol{\beta}}\) fixed.

The solution of this PLS problem, \(\tilde{\mathbf{u}}\), is the conditional mode of \({\mathcal{U}}\), in that it maximizes the density of the conditional distribution, \(({\mathcal{U}}|{\mathcal{Y}}={\mathbf{y}})\), at the observed \({\mathbf{y}}\). (In the case of a LMM, where the conditional distributions, \(({\mathcal{B}}|{\mathcal{Y}}={\mathbf{y}})\) and \(({\mathcal{U}}|{\mathcal{Y}}={\mathbf{y}})\), are multivariate Gaussian, the solution of the PLS problem is also the mean of the conditional distribution, but this property doesn’t carry over to GLMMs.)

In a Bernoulli generalized linear mixed model (GLMM) the mode of the conditional distribution, \(({\mathcal{U}}|{\mathcal{Y}}={\mathbf{y}})\), minimizes the penalized GLM deviance, \[ \tilde{{\mathbf{u}}}=\arg\min_{{\mathbf{u}}}\left( \left\|{\mathbf{u}}\right\|^2+\sum_{i=1}^n d(y_i,\eta_i({\mathbf{u}})) \right) , \tag{C.8}\] where \(d(y_i,\eta_i),\,i=1,\dots,n\) are the unit deviances defined in Section C.2. We modify the IRLS algorithm as penalized iteratively re-weighted least squares (PIRLS) to determine this conditional mode.

As with IRLS, each iteration of the PIRLS algorithm involves using the current linear predictor, \({\boldsymbol{\eta}}({\mathbf{u}})={\mathbf{X}}{\boldsymbol{\beta}}+{\mathbf{Z}}{\boldsymbol{\Lambda}}_{{\boldsymbol{\theta}}}{\mathbf{u}}\), (\({\boldsymbol{\beta}}\) and \({\boldsymbol{\theta}}\) are assumed known and fixed, and \({\mathbf{X}}{\boldsymbol{\beta}}\) is an offset) to evaluate the mean, \({\boldsymbol{\mu}}={\mathbf{g}}^{-1}({\boldsymbol{\eta}})\), of the conditional distribution, \(({\mathcal{Y}}|{\mathcal{U}}={\mathbf{u}})\), as well as the unit deviances, \(d(y_i,\eta_i)\), the square roots of the working weights, which are on the diagonal of \({\mathbf{W}}^{1/2}\), and the weighted, working response, \({\mathbf{W}}^{1/2}\tilde{{\mathbf{y}}}\). The updated spherical random effects vector, \({\mathbf{u}}\), is the solution to \[ ({\boldsymbol{\Lambda}}'{\mathbf{Z}}'{\mathbf{W}}{\mathbf{Z}}{\boldsymbol{\Lambda}}+{\mathbf{I}}){\mathbf{u}}={\boldsymbol{\Lambda}}'{\mathbf{Z}}'{\mathbf{W}}\tilde{{\mathbf{y}}} \] and is evaluated using the Cholesky factor, \({\mathbf{L}}\), of \({\boldsymbol{\Lambda}}'{\mathbf{Z}}'{\mathbf{W}}{\mathbf{Z}}{\boldsymbol{\Lambda}}+{\mathbf{I}}\).

As in the solution of the PLS problem in Section B.7, the fact that \({\mathbf{Z}}\) is sparse and that the sparsity is also present in \({\mathbf{L}}\), makes it feasible to solve for \({\mathbf{u}}\) even when its dimension is large.

C.4.1 PIRLS for com05

To illustrate the calculations we again use the com05 model, which has a single, scalar random-effects term, (1 | dist & urban), in its formula. The matrix \({\mathbf{Z}}\) is displayed as

com05re = only(com05.reterms)
Int.(collect(com05re))  # Int values for compact printing
1934×102 Matrix{Int64}:
 1  0  0  0  0  0  0  0  0  0  0  0  0  …  0  0  0  0  0  0  0  0  0  0  0  0
 1  0  0  0  0  0  0  0  0  0  0  0  0     0  0  0  0  0  0  0  0  0  0  0  0
 1  0  0  0  0  0  0  0  0  0  0  0  0     0  0  0  0  0  0  0  0  0  0  0  0
 1  0  0  0  0  0  0  0  0  0  0  0  0     0  0  0  0  0  0  0  0  0  0  0  0
 1  0  0  0  0  0  0  0  0  0  0  0  0     0  0  0  0  0  0  0  0  0  0  0  0
 1  0  0  0  0  0  0  0  0  0  0  0  0  …  0  0  0  0  0  0  0  0  0  0  0  0
 1  0  0  0  0  0  0  0  0  0  0  0  0     0  0  0  0  0  0  0  0  0  0  0  0
 1  0  0  0  0  0  0  0  0  0  0  0  0     0  0  0  0  0  0  0  0  0  0  0  0
 1  0  0  0  0  0  0  0  0  0  0  0  0     0  0  0  0  0  0  0  0  0  0  0  0
 1  0  0  0  0  0  0  0  0  0  0  0  0     0  0  0  0  0  0  0  0  0  0  0  0
 ⋮              ⋮              ⋮        ⋱  ⋮              ⋮              ⋮  
 0  0  0  0  0  0  0  0  0  0  0  0  0  …  0  0  0  0  0  0  0  0  0  0  0  1
 0  0  0  0  0  0  0  0  0  0  0  0  0     0  0  0  0  0  0  0  0  0  0  0  1
 0  0  0  0  0  0  0  0  0  0  0  0  0     0  0  0  0  0  0  0  0  0  0  0  1
 0  0  0  0  0  0  0  0  0  0  0  0  0     0  0  0  0  0  0  0  0  0  0  0  1
 0  0  0  0  0  0  0  0  0  0  0  0  0     0  0  0  0  0  0  0  0  0  0  0  1
 0  0  0  0  0  0  0  0  0  0  0  0  0  …  0  0  0  0  0  0  0  0  0  0  0  1
 0  0  0  0  0  0  0  0  0  0  0  0  0     0  0  0  0  0  0  0  0  0  0  0  1
 0  0  0  0  0  0  0  0  0  0  0  0  0     0  0  0  0  0  0  0  0  0  0  0  1
 0  0  0  0  0  0  0  0  0  0  0  0  0     0  0  0  0  0  0  0  0  0  0  0  1

but internally it is stored much more compactly because it is an indicator matrix (also called one-hot encoding), which means that all \(Z_{i,j}\in\{0,1\}\) and in each row there is exactly one value that is one (and all the others are zero). The column in which the non-zero element of each row occurs is given as an integer vector in the refs property of com05re.

com05re.refs'      # transpose for compact printing
1×1934 adjoint(::Vector{Int32}) with eltype Int32:
 1  1  1  1  1  1  1  1  1  1  1  1  1  …  102  102  102  102  102  102  102
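
Because each row of \({\mathbf{Z}}\) contains a single one, multiplying by \({\mathbf{Z}}\) amounts to indexing with refs, which is how the PIRLS code below forms the linear predictor. A quick check (our addition):

u = randn(size(com05re, 2))   # an arbitrary vector of length q
Z = collect(com05re)          # dense copy of the indicator matrix
Z * u ≈ u[com05re.refs]       # the matrix product is just a table lookup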

We define a struct

struct BernoulliPIRLS{T<:AbstractFloat,S<:Integer}
  X::Matrix{T}
  θβ::Vector{T}
  ytbl::NamedTuple{                     # column-table
    (:refs, :y, :η, :offset),
    Tuple{Vector{S},Vector{T},Vector{T},Vector{T}},
  }
  utbl::NamedTuple{                     # column-table
    (:u, :u0, :Ldiag, :pdev, :pdev0, :aGHQ),
    NTuple{6,Vector{T}},
  }
  rtbl::Vector{                         # row-table
    NamedTuple{(:μ, :dev, :rtwwt, :wwres, :wwresp),NTuple{5,T}},
  }
end

with an external constructor

function BernoulliPIRLS(
  X::Matrix{T},
  y::Vector{T},
  refs::Vector{S},
) where {T<:AbstractFloat,S<:Integer}
  # use IRLS to check X and y, obtain initial β, and establish rtbl
  irls = fit!(BernoulliIRLS(X, y); verbose=false)
  β = irls.β
  θβ = append!(ones(T, 1), β)       # initial θ = 1
  η = X * β

  # refs should contain all values from 1 to maximum(refs)
  refvals = sort!(unique(refs))
  q = length(refvals)
  if refvals ≠ 1:q
    throw(ArgumentError("sort!(unique(refs)) must be 1:$q"))
  end
  length(refs) == length(y) ||
    throw(ArgumentError("lengths of y and refs aren't equal"))

  ytbl = (; refs, y, η, offset=copy(η))

  utbl = NamedTuple(
    nm => zeros(T, q) for
    nm in (:u, :u0, :Ldiag, :pdev, :pdev0, :aGHQ)
  )
  return updatetbl!(BernoulliPIRLS(X, θβ, ytbl, utbl, irls.rtbl))
end

The reason for storing both \({\boldsymbol{\theta}}\) and \({\boldsymbol{\beta}}\) in a single vector is to provide for their simultaneous optimization with an optimizer such as those in NLopt.jl.

The utbl field in a BernoulliPIRLS struct contains vectors named u0, pdev, pdev0, and aGHQ, in addition to u and Ldiag. These are not used in the PIRLS algorithm (other than for keeping a copy of the previous \({\mathbf{u}}\)) or for optimizing Laplace’s approximation to the objective. However, they will be used in adaptive Gauss-Hermite quadrature evaluation of the objective (Section C.6), so we keep them in the struct throughout.

The updatetbl! method for this type first evaluates \({\boldsymbol{\eta}}\) via a “virtual” multiplication that forms \({\mathbf{Z}}{\boldsymbol{\Lambda}}_{{\boldsymbol{\theta}}}{\mathbf{u}}\) plus the stored offset, which is \({\mathbf{X}}{\boldsymbol{\beta}}\), then updates the rowtable from \({\mathbf{y}}\), \({\boldsymbol{\eta}}\), and the offset. For this model \({\boldsymbol{\theta}}\) is one-dimensional and \({\boldsymbol{\Lambda}}_{{\boldsymbol{\theta}}}\) is a scalar multiple of \({\mathbf{I}}_q\), the identity matrix of size \(q\), and thus the matrix multiplication by \({\boldsymbol{\Lambda}}_{{\boldsymbol{\theta}}}\) can be expressed as scalar products.

function updatetbl!(m::BernoulliPIRLS)
  (; refs, y, η, offset) = m.ytbl
  u = m.utbl.u
  θ = first(m.θβ)
  # evaluate η = offset + ZΛu where Λ is θ * I and Z is one-hot
  fill!(η, 0)
  @inbounds for i in eachindex(η, refs, offset)
    η[i] += offset[i] + u[refs[i]] * θ
  end
  m.rtbl .= tblrow.(y, η, offset)
  return m
end

The pdeviance method returns the GLM deviance plus the penalty, the squared length of u.

function pdeviance(m::BernoulliPIRLS)
  return sum(r.dev for r in m.rtbl) + sum(abs2, m.utbl.u)
end

The updateu! method is similar to updateβ! for the BernoulliIRLS type except that it is based on the diagonal matrix \({\boldsymbol{\Lambda}}'{\mathbf{Z}}'{\mathbf{W}}{\mathbf{Z}}{\boldsymbol{\Lambda}}+ {\mathbf{I}}\). Only the diagonal elements of this matrix are constructed and used to solve for the updated \({\mathbf{u}}\) vector. At convergence of the PIRLS algorithm the elements of Ldiag are replaced by their square roots.

function updateu!(m::BernoulliPIRLS)
  (; u, u0, Ldiag) = m.utbl
  copyto!(u0, u)              # keep a copy of u
  θ = first(m.θβ)             # extract the scalar θ
  fill!(u, 0)
  if iszero(θ)                # skip the update if θ == 0
    fill!(Ldiag, 1)           # L is the identity if θ == 0
    return updatetbl!(m)
  end
  fill!(Ldiag, 0)
  @inbounds for (ri, ti) in zip(m.ytbl.refs, m.rtbl)
    rtWΛ = θ * ti.rtwwt       # non-zero in i'th row of √WZΛ
    Ldiag[ri] += abs2(rtWΛ)   # accumulate Λ'Z'WZΛ
    u[ri] += rtWΛ * ti.wwresp # accumulate Λ'Z'Wỹ
  end
  Ldiag .+= 1                 # form diagonal of Λ'Z'WZΛ + I = LL'
  u ./= Ldiag                 # solve for u with diagonal LL'
  return updatetbl!(m)        # and update η and rtbl
end

Create a BernoulliPIRLS struct for the com05 model and check the penalized deviance at the initial values

m = BernoulliPIRLS(com05.X, com05.y, only(com05.reterms).refs)
pdeviance(m)
2409.377428160041

As with IRLS, the first iteration of PIRLS reduces the objective, which is the penalized deviance in this case, substantially.

pdeviance(updateu!(m))
2233.1209476357844

Create a pirls! method for this struct.

function pirls!(m::BernoulliPIRLS; verbose::Bool=false)
  (; u, u0, Ldiag) = m.utbl
  fill!(u, 0)                  # start from u == 0
  copyto!(u0, u)               # keep a copy of u
  oldpdev = pdeviance(updatetbl!(m))
  verbose && @info 0, oldpdev
  for i in 1:10                # maximum of 10 PIRLS iterations
    newpdev = pdeviance(updateu!(m))
    verbose && @info i, newpdev
    if newpdev > oldpdev       # PIRLS iteration failed
      @warn "PIRLS iteration did not reduce penalized deviance"
      copyto!(u, u0)           # restore previous u
      updatetbl!(m)            # restore η and rtbl
      break
    elseif (oldpdev - newpdev) < (1.0e-8 * oldpdev)
      copyto!(u0, u)           # keep a copy of u
      break
    else
      copyto!(u0, u)           # keep a copy of u
      oldpdev = newpdev
    end
  end
  map!(sqrt, Ldiag, Ldiag)     # replace diag(LL') by diag(L)
  return m
end

The PIRLS iterations always start from \({\mathbf{u}}=\mathbf{0}\) so that the converged value of the penalized deviance is reproducible for given values of \(\theta\) and \({\boldsymbol{\beta}}\). If we allowed the algorithm to start at whatever values are currently stored in \({\mathbf{u}}\) then there could be slight differences in the value of the penalized deviance at convergence of PIRLS, which can cause problems when trying to optimize with respect to \(\theta\) and \({\boldsymbol{\beta}}\).

pirls!(m; verbose=true);
[ Info: (0, 2409.377428160041)
[ Info: (1, 2233.1209476357844)
[ Info: (2, 2231.605935279706)
[ Info: (3, 2231.6002198321576)
[ Info: (4, 2231.600219406561)

As with IRLS, PIRLS is a fast and stable algorithm for determining the mode of the conditional distribution \(({\mathcal{U}}|{\mathcal{Y}}={\mathbf{y}})\) with \({\boldsymbol{\theta}}\) and \({\boldsymbol{\beta}}\) held fixed.

@benchmark pirls!(mm) seconds = 1 setup = (mm = m)
BenchmarkTools.Trial: 4402 samples with 1 evaluation.
 Range (min … max):  199.693 μs … 331.599 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     227.186 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   226.091 μs ±   5.569 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

                                        ▁         ▇█             
  ▂▂▂▂▁▂▂▂▂▂▂▂▂▂▂▁▁▂▂▂▁▂▁▂▂▁▁▂▂▂▂▁▂▁▁▁▁▄█▄▄▃▂▃▃▃▃▃██▇▆▃▃▅▄▃▂▂▂▂ ▃
  200 μs           Histogram: frequency by time          234 μs <

 Memory estimate: 112 bytes, allocs estimate: 1.

The time taken for the four iterations to determine the conditional mode of \({\mathbf{u}}\) is comparable to the time taken for a single call to updateβ!. Most of the time in updateβ! is spent in the QR factorization used to solve the weighted least squares problem, whereas in updateu!, and thus in pirls!, we take advantage of the fact that the solution of the penalized, weighted least squares problem is based on a diagonal matrix.

C.5 Laplace’s approximation to the log-likelihood

The PIRLS algorithm determines the value of \({\mathbf{u}}\) that minimizes the penalized deviance \[ \tilde{{\mathbf{u}}}=\arg\min_{{\mathbf{u}}}\left(\left\|{\mathbf{u}}\right\|^2+\sum_{i=1}^n d(y_i,\eta_i)\right) , \] where \(\eta_i, i=1,\dots,n\) is the \(i\)th component of \({\boldsymbol{\eta}}={\mathbf{X}}{\boldsymbol{\beta}}+{\mathbf{Z}}{\boldsymbol{\Lambda}}{\mathbf{u}}\). A quadratic approximation to the penalized deviance at \(\tilde{{\mathbf{u}}}\) is \[ \begin{aligned} \tilde{d}({\mathbf{y}},{\mathbf{u}})&=\|{\mathbf{u}}\|^2+\sum_{i=1}^n d(y_i,\eta_i)\\ &\approx\|\tilde{{\mathbf{u}}}\|^2+\sum_{i=1}^n d(y_i,\tilde{\eta}_i)+ ({\mathbf{u}}-\tilde{{\mathbf{u}}})'({\boldsymbol{\Lambda}}'{\mathbf{Z}}'{\mathbf{W}}{\mathbf{Z}}{\boldsymbol{\Lambda}}+{\mathbf{I}})({\mathbf{u}}-\tilde{{\mathbf{u}}})\\ &=\|\tilde{{\mathbf{u}}}\|^2+\sum_{i=1}^n d(y_i,\tilde{\eta}_i)+ ({\mathbf{u}}-\tilde{{\mathbf{u}}})'{\mathbf{L}}{\mathbf{L}}'({\mathbf{u}}-\tilde{{\mathbf{u}}})\\ &=\tilde{d}({\mathbf{y}},\tilde{{\mathbf{u}}})+ ({\mathbf{u}}-\tilde{{\mathbf{u}}})'{\mathbf{L}}{\mathbf{L}}'({\mathbf{u}}-\tilde{{\mathbf{u}}}) \end{aligned} \tag{C.9}\] where \({\mathbf{L}}\) is the lower Cholesky factor of \({\boldsymbol{\Lambda}}'{\mathbf{Z}}'{\mathbf{W}}{\mathbf{Z}}{\boldsymbol{\Lambda}}+{\mathbf{I}}\). (In Equation C.9 the linear term in \(({\mathbf{u}}-\tilde{{\mathbf{u}}})\) that would normally occur in such an expression is omitted because the gradient of \(\tilde{d}({\mathbf{y}},{\mathbf{u}})\) is zero at \(\tilde{{\mathbf{u}}}\).)

Laplace’s approximation to the log-likelihood uses this quadratic approximation to the penalized deviance, which, up to an additive constant, is negative twice the logarithm of the integrand, to approximate the value of the integral.

On the deviance scale, which is negative twice the log-likelihood, the approximation is \[ \begin{aligned} -2\,\ell({\boldsymbol{\theta}},{\boldsymbol{\beta}}|{\mathbf{y}})&=-2\,\log\left(L({\boldsymbol{\theta}},{\boldsymbol{\beta}}|{\mathbf{y}})\right)\\ &=-2\,\log\left(\int_{{\mathbf{u}}}\frac{1}{\sqrt{2\pi}^q}\exp\left(\frac{\left\|{\mathbf{u}}\right\|^2+\sum_{i=1}^n d(y_i,\mu_i)}{-2}\right)\,d{\mathbf{u}}\right)\\ &\approx\tilde{d}({\mathbf{y}},\tilde{{\mathbf{u}}})-2\,\log\left( \int_{\mathbf{u}}\frac{1}{\sqrt{2\pi}^q}\exp\left(\frac{[{\mathbf{u}}-\tilde{{\mathbf{u}}}]'{\mathbf{L}}{\mathbf{L}}'[{\mathbf{u}}-\tilde{{\mathbf{u}}}]}{-2}\right)\,d{\mathbf{u}} \right)\\ &=\tilde{d}({\mathbf{y}},\tilde{{\mathbf{u}}})-2\,\log\left( \int_{\mathbf{u}}\frac{1}{\sqrt{2\pi}^q}\exp\left(\frac{\left\|{\mathbf{L}}'[{\mathbf{u}}-\tilde{{\mathbf{u}}}]\right\|^2}{-2}\right)\,d{\mathbf{u}} \right)\\ &=\tilde{d}({\mathbf{y}},\tilde{{\mathbf{u}}})-2\,\log\left(|{\mathbf{L}}|^{-1}\right)\\ &=\tilde{d}({\mathbf{y}},\tilde{{\mathbf{u}}})+\log\left(|{\mathbf{L}}|^2\right) \end{aligned} \]

function laplaceapprox(m::BernoulliPIRLS)
  return pdeviance(m) + 2 * sum(log, m.utbl.Ldiag)
end
laplaceapprox(pirls!(m))
2373.518052752164

The remaining step is to optimize Laplace’s approximation to the GLMM deviance with respect to \(\theta\) and \({\boldsymbol{\beta}}\), which we do using the BOBYQA optimizer from NLopt.jl

function StatsAPI.fit!(m::BernoulliPIRLS)
  θβ = m.θβ
  pp1 = length(θβ)                # length(β) = p and length(θ) = 1
  opt = Opt(:LN_BOBYQA, pp1)
  mβ = view(θβ, 2:pp1)
  function obj(x, g)
    if !isempty(g)
      throw(ArgumentError("gradient not provided, g must be empty"))
    end
    copyto!(θβ, x)
    mul!(m.ytbl.offset, m.X, mβ)
    return laplaceapprox(pirls!(m))
  end
  opt.min_objective = obj
  lb = fill!(similar(θβ), -Inf)   # vector of lower bounds
  lb[1] = 0                       # scalar θ must be non-negative
  NLopt.lower_bounds!(opt, lb)
  minf, minx, ret = optimize(opt, copy(θβ))
  @info (; ret, fevals=opt.numevals, minf)
  return m
end
fit!(m);
[ Info: (ret = :ROUNDOFF_LIMITED, fevals = 550, minf = 2354.4744815688055)
print(
  "Converged to θ = ",
  first(m.θβ),
  " and β =",
  view(m.θβ, 2:lastindex(m.θβ)),
)
Converged to θ = 0.5683043594028967 and β =[-0.3409777149845993, 0.3933796201906975, 0.6064857599227369, -0.012926172564277872, 0.03323478854784157, -0.005626184982660486]

These estimates differ somewhat from those for model com05.

print(
  "Estimates for com05: θ = ",
  only(com05.θ),
  ", fmin = ",
  deviance(com05),
  ", and β =",
  com05.β,
)
Estimates for com05: θ = 0.5761507901895634, fmin = 2353.8241980539815, and β =[-0.3414913998306781, 0.3936080536502067, 0.6064861079468472, -0.012911714642169572, 0.03321662487439253, -0.005625046845040066]

The discrepancy in the results is because the com05 results are based on a more accurate approximation to the integral, called adaptive Gauss-Hermite quadrature, which is discussed in Section C.6.
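
One way to see the effect of the quadrature rule (an added check, not part of the original text) is to refit the model with nAGQ=1, which requests Laplace's approximation; its objective should be close to the value produced by our fit! above rather than to the nAGQ=9 value.

com05laplace =
  let d = contra,
    ds = Bernoulli(),
    f = @formula(
      use ~ 1 + urban + ch * age + age & age + (1 | dist & urban)
    )

    fit(MixedModel, f, d, ds; contrasts, nAGQ=1, progress)
  end
deviance(com05laplace)  # expected to be close to the 2354.47 obtained above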

C.5.1 Generalizations to more complex structure

There is an implicit assumption in the BernoulliPIRLS structure that random effects in the model are simple, scalar random effects associated with a single grouping factor, which is represented by m.ytbl.refs. For such models the random effects model matrix, \({\mathbf{Z}}\), is an \(n\times q\) indicator matrix, the covariance parameter, \({\boldsymbol{\theta}}\), is one-dimensional and the covariance factor, \({\boldsymbol{\Lambda}}_{{\boldsymbol{\theta}}}=\theta\,{\mathbf{I}}_q\) is a scalar multiple of the \(q\times q\) identity matrix, \({\mathbf{I}}_q\). Furthermore, \({\boldsymbol{\Lambda}}_{{\boldsymbol{\theta}}}'{\mathbf{Z}}'{\mathbf{W}}{\mathbf{Z}}{\boldsymbol{\Lambda}}_{{\boldsymbol{\theta}}}+{\mathbf{I}}_q\) is also diagonal, as is its Cholesky factor, \({\mathbf{L}}\).

We have taken advantage of the special structure of these matrices both in representations — storing \({\mathbf{L}}\) by storing only the diagonal values in the vector m.utbl.Ldiag — and in some algorithms for the PIRLS iterative step.

The PIRLS algorithm to determine the conditional mode, \(\tilde{{\mathbf{u}}}\), of the random effects, \({\mathcal{U}}\), and Laplace’s approximation to the log-likelihood for GLMMs can be generalized to models with vector-valued random effects, or with random effects associated with more than one grouping factor, or with both. The more general representation of \({\mathbf{Z}}\) and \({\mathbf{L}}\) used with linear mixed models can be adapted for GLMMs as well.

We chose to specialize the representation of GLMMs in this appendix to this specific type of random effects to be able to demonstrate adaptive Gauss-Hermite quadrature in Section C.6, which, at present, is restricted to models with a single, simple, scalar, random effects term.

C.6 Adaptive Gauss-Hermite quadrature

Recall from Equation C.9 that Laplace’s approximation to the likelihood is based on a quadratic approximation to the penalized (GLM) deviance, \(\tilde{d}({\mathbf{y}},{\mathbf{u}})\), at the conditional mode \(\tilde{{\mathbf{u}}}\). In the case of a model with a single, scalar, random effects term, like the model com05, each linear predictor value, \(\eta_i,\,i=1,\dots,n\), depends on only one element of \({\mathbf{u}}\). Writing the set of indices \(i\) for which refs[i] == j as \({\mathcal{I}}(j)\), we can express the penalized deviance, and its quadratic approximation, as sums of scalar contributions, \[ \begin{aligned} \tilde{d}({\mathbf{y}},{\mathbf{u}})&=\|{\mathbf{u}}\|^2+\sum_{i=1}^n d(y_i,\mu_i)\\ &=\sum_{j=1}^q\left(u_j^2+\sum_{i\in{\mathcal{I}}(j)}d(y_i,\mu_i)\right)\\ &\approx\sum_{j=1}^q \left(\tilde{u}_j^2+\sum_{i\in{\mathcal{I}}(j)}d(y_i,\tilde{\mu}_i)+\ell_j^2(u_j-\tilde{u}_j)^2\right) \end{aligned} \] where \(\ell_j\) is the \(j\)th diagonal element of \({\mathbf{L}}\) and \(\tilde{\mu}_i\) is the value of \(\mu_i\) when \(u_j=\tilde{u}_j\).
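
These per-group contributions are easy to accumulate from the fitted BernoulliPIRLS object using refs; the following check (our addition) confirms that they sum to the penalized deviance.

let refs = m.ytbl.refs, u = m.utbl.u
  dj = abs2.(u)               # start each contribution with the penalty u_j²
  for (ri, ti) in zip(refs, m.rtbl)
    dj[ri] += ti.dev          # add the unit deviances for i ∈ I(j)
  end
  sum(dj) ≈ pdeviance(m)      # should be true
end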

Extending the notation of Equation C.9 we write the contribution from \(u_j\) to the penalized (GLM) deviance, and to its quadratic approximation, as \[ \begin{aligned} \tilde{d}_j(u_j)&= u_j^2+\sum_{i\in{\mathcal{I}}(j)}d(y_i,\mu_i)\\ &\approx \tilde{u}_j^2 + \sum_{i\in{\mathcal{I}}(j)}d(y_i,\tilde{\mu}_i)+\ell_j^2(u_j-\tilde{u}_j)^2\\ &=\tilde{d}_j(\tilde{u}_j)+\ell_j^2(u_j-\tilde{u}_j)^2\quad j=1,\dots,q , \end{aligned} \] giving Laplace’s approximation to the scalar integral defining the contribution of \(u_j\) to negative twice the log-likelihood as \[ \begin{aligned} -2\log\int_{u_j}\frac{e^{-\tilde{d}_j(u_j)/2}}{\sqrt{2\pi}}\,du_j&\approx -2\log\int_{u_j}\frac{e^{\left(-\tilde{d}_j(\tilde{u}_j)-\ell_j^2(u_j-\tilde{u}_j)^2\right)/2}}{\sqrt{2\pi}}\,du_j\\ &=\tilde{d}_j(\tilde{u}_j)-2\log\int_{u_j}\frac{e^{-\ell_j^2(u_j-\tilde{u}_j)^2/2}}{\sqrt{2\pi}}\,du_j\\ &=\tilde{d}_j(\tilde{u}_j)-2\log\int_{z_j}\frac{e^{-z_j^2/2}}{\sqrt{2\pi}}\,\frac{dz_j}{\ell_j}\\ &=\tilde{d}_j(\tilde{u}_j)+2\log(\ell_j)-2\log\int_{z_j}\frac{e^{-z_j^2/2}}{\sqrt{2\pi}}\,dz_j\\ &=\tilde{d}_j(\tilde{u}_j)+2\log(\ell_j)-2\log(1)\\ &=\tilde{d}_j(\tilde{u}_j)+2\log(\ell_j) \end{aligned} \tag{C.10}\] using the change of variable \[ z_j=\ell_j(u_j-\tilde{u}_j)\quad j=1,\dots,q \tag{C.11}\] with inverse \[ u_j=\frac{z_j}{\ell_j}+\tilde{u}_j\quad j=1,\dots,q \tag{C.12}\] and derivative \[ \frac{du_j}{dz_j}=\frac{1}{\ell_j} . \]

The change of variable Equation C.11 allows us to express the contribution from \(u_j\) to Laplace’s approximation as a constant, \(\tilde{d}_j(\tilde{u}_j)\), plus the integral of a multiple, \(1/\ell_j\), of the density of a standard normal distribution \[ \phi(z)=\frac{e^{-z^2/2}}{\sqrt{2\pi}} . \]

The \(K\)th-order normalized Gauss-Hermite quadrature rule allows us to extend this approach to evaluate integrals of the form \[ \int_z f(z)\frac{e^{-z^2/2}}{\sqrt{2\pi}}\, dz \approx \sum_{k=1}^K w_k f(z_k) \tag{C.13}\] where the weights, \(w_k,\,k=1,\dots,K\), and the abscissae, \(z_k,\,k=1,\dots,K\), are evaluated as described in Section C.6.1. The approximation Equation C.13 is exact when \(f(z)\) is a polynomial of order \(2K-1\) or less.

We will apply a rule like Equation C.13 where \(f\) is the exponential of negative half the difference between the penalized deviance, \(\tilde{d}_j(z_j/\ell_j+\tilde{u}_j)\), and its quadratic approximation at the conditional mode, \(\tilde{u}_j\). That is, we will center the standard normal density at the conditional mode, \(\tilde{u}_j,\,j=1,\dots,q\), and scale it by the inverse of the quadratic term, \(\ell_j,\,j=1,\dots,q\), in the quadratic approximation at that value of \(u\). This is said to be an adaptive quadrature rule because we are shifting and scaling the evaluation points according to the current conditions in the iterative algorithm.

In other words we will first apply PIRLS to determine the conditional modes and the quadratic terms, then use a normalized Gauss-Hermite quadrature rule.

C.6.1 Normalized Gauss-Hermite quadrature rules

Gaussian quadrature rules provide sets of \(x\) values, called abscissae, and corresponding weights, \(w\), to approximate an integral with respect to a weight function, \(g(x)\). For a \(K\)th order rule the approximation is \[ \int f(x)g(x)\,dx \approx \sum_{k=1}^K w_k f(x_k) . \]

For the Gauss-Hermite rule the weight function is \[ g(x) = e^{-x^2} \] and the domain of integration is \((-\infty, \infty)\). A slight variation of this is the normalized Gauss-Hermite rule for which the weight function is the standard normal density \[ g(z) = \phi(z) = \frac{e^{-z^2/2}}{\sqrt{2\pi}} . \]

Thus, the expected value of \(f(\mathcal{Z})\), where \(\mathcal{Z}\sim\mathcal{N}(0,1)\), is approximated as \[ \mathbb{E}[f]=\int_{-\infty}^{\infty} f(z) \phi(z)\,dz\approx\sum_{k=1}^K w_k\,f(z_k) . \]

Naturally, there is a caveat. For the approximation to be accurate the function \(f(z)\) must behave like a low-order polynomial over the range of interest. More formally, a \(K\)th order rule is exact when \(f\) is a polynomial of order \(2K-1\) or less.

C.6.1.1 Evaluating the weights and abscissae

In the Golub-Welsch algorithm the abscissae for a particular Gaussian quadrature rule are determined as the eigenvalues of a symmetric tridiagonal matrix, and the weights are derived from the squares of the first row of the matrix of eigenvectors. For a \(K\)th order normalized Gauss-Hermite rule the tridiagonal matrix has zeros on the diagonal and the square roots of \(1,\dots,K-1\) on the super- and sub-diagonals, e.g.

sym5 = SymTridiagonal(zeros(5), sqrt.(1:4))
5×5 SymTridiagonal{Float64, Vector{Float64}}:
 0.0  1.0       ⋅        ⋅        ⋅ 
 1.0  0.0      1.41421   ⋅        ⋅ 
  ⋅   1.41421  0.0      1.73205   ⋅ 
  ⋅    ⋅       1.73205  0.0      2.0
  ⋅    ⋅        ⋅       2.0      0.0
ev = eigen(sym5);
ev.values
5-element Vector{Float64}:
 -2.8569700138728
 -1.3556261799742608
  1.3322676295501878e-15
  1.3556261799742677
  2.8569700138728056
abs2.(ev.vectors[1, :])
5-element Vector{Float64}:
 0.011257411327720667
 0.2220759220056134
 0.5333333333333334
 0.222075922005612
 0.011257411327720648

A function of k to evaluate the abscissae and weights is

function gausshermitenorm(k)
  ev = eigen(SymTridiagonal(zeros(k), sqrt.(1:(k - 1))))
  return Table((;
    abscissae=ev.values,
    weights=abs2.(ev.vectors[1, :]),
  ))
end

providing

gausshermitenorm(5)
Table with 2 columns and 5 rows:
     abscissae    weights
   ┌───────────────────────
 1 │ -2.85697     0.0112574
 2 │ -1.35563     0.222076
 3 │ 1.33227e-15  0.533333
 4 │ 1.35563      0.222076
 5 │ 2.85697      0.0112574
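
Because a \(K\)th order rule is exact for polynomials of order \(2K-1\) or less, the fifth-order rule should reproduce the even moments of the standard normal distribution up to the eighth: \(\mathbb{E}[\mathcal{Z}^2]=1\), \(\mathbb{E}[\mathcal{Z}^4]=3\), \(\mathbb{E}[\mathcal{Z}^6]=15\), and \(\mathbb{E}[\mathcal{Z}^8]=105\). A quick check, added here as an illustration rather than taken from the original code, is

let tbl = gausshermitenorm(5)
  # even moments of the standard normal evaluated with the 5th-order rule;
  # each should agree with 1, 3, 15, and 105, respectively, up to roundoff
  [sum(tbl.weights .* tbl.abscissae .^ (2k)) for k in 1:4]
end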

The weights and positions for the 9th order rule are shown in Figure C.1.

draw(
  data(gausshermitenorm(9)) *
  mapping(:abscissae => "Positions", :weights);
  figure=(; size=(600, 450)),
)
Figure C.1: Weights and positions for the 9th order normalized Gauss-Hermite quadrature rule

Notice that the magnitudes of the weights drop quite dramatically away from zero, even on a logarithmic scale (Figure C.2).

draw(
  data(gausshermitenorm(9)) * mapping(
    :abscissae => "Positions",
    :weights => log2 => "log₂(weight)",
  );
  figure=(; size=(600, 450)),
)
Figure C.2: Weights (logarithm base 2) and positions for the 9th order normalized Gauss-Hermite quadrature rule

The definition of MixedModels.GHnorm is similar to the gausshermitenorm function, with some extra provisions for ensuring symmetry of the abscissae and the weights, and for caching values once they have been calculated.

let tbl = GHnorm(9)
  Table(abscissae=tbl.z, weights=tbl.w)
end
Table with 2 columns and 9 rows:
     abscissae  weights
   ┌──────────────────────
 1 │ -4.51275   2.23458e-5
 2 │ -3.20543   0.00278914
 3 │ -2.07685   0.0499164
 4 │ -1.02326   0.244098
 5 │ 0.0        0.406349
 6 │ 1.02326    0.244098
 7 │ 2.07685    0.0499164
 8 │ 3.20543    0.00278914
 9 │ 4.51275    2.23458e-5

In particular, when \(K\) is odd the middle abscissa, at index \((K+1)/2\), is exactly zero.
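
These symmetry provisions can be verified directly; the following check, added for illustration, confirms that the abscissae are antisymmetric about zero and the weights are symmetric.

let tbl = GHnorm(9)
  # abscissae change sign under reversal, weights do not,
  # and the middle abscissa is exactly zero
  (tbl.z ≈ -reverse(tbl.z), tbl.w ≈ reverse(tbl.w), iszero(tbl.z[5]))
end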

As an example of evaluation using these weights and abscissae, consider approximating \(\mathbb{E}[g(\mathcal{X})]\), where \(\mathcal{X}\sim\mathcal{N}(\mu, \sigma^2)\). Writing \(\mathcal{X}=\mu+\sigma\mathcal{Z}\) with \(\mathcal{Z}\sim\mathcal{N}(0,1)\) gives \[ \mathbb{E}[g(\mathcal{X})] \approx \sum_{k=1}^K w_k\,g(\mu + \sigma z_k) . \]

For example, \(\mathbb{E}[\mathcal{X}^2]\) where \(\mathcal{X}\sim\mathcal{N}(2, 3^2)\) is

let μ = 2, σ = 3, ghn3 = GHnorm(3)
  sum(@. ghn3.w * abs2(μ + σ * ghn3.z))  # should be μ² + σ² = 13
end
13.0

(In general a dot, ‘.’, after the function name in a function call, as in abs2.(...), or before an operator creates a fused vectorized evaluation in Julia. The macro @. has the effect of vectorizing all operations in the subsequent expression.)
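
For comparison, the same computation written with explicit broadcasting dots instead of the @. macro (an equivalent form added here for illustration) is

let μ = 2, σ = 3, ghn3 = GHnorm(3)
  sum(ghn3.w .* abs2.(μ .+ σ .* ghn3.z))  # same fused broadcast as the @. form; should also give 13.0
end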

C.6.2 Illustration of contributions to the objective

We have provided a pdev vector in the utbl field of a BernoulliPIRLS object to allow for accumulation of the penalized deviance contributions for each component of \({\mathbf{u}}\) in the model.

function pdevcomps!(m::BernoulliPIRLS)
  (; u, pdev) = m.utbl
  pdev .= abs2.(u)        # initialize pdevj to square of uj
  @inbounds for (ri, ti) in zip(m.ytbl.refs, m.rtbl)
    pdev[ri] += ti.dev
  end
  return m
end

After PIRLS has converged, we evaluate pdevcomps! and copy the pdev column of the utbl field to its pdev0 column, which will be the baseline evaluation of the penalized deviance at the conditional modes. Other evaluations of the penalized deviance components are plotted as differences from pdev0.

pdevcomps!(pirls!(m))
copyto!(m.utbl.pdev0, m.utbl.pdev)
Table(m.utbl)
Table with 6 columns and 102 rows:
      u           u0          Ldiag    pdev     pdev0    aGHQ
    ┌────────────────────────────────────────────────────────
 1  │ -1.02424    -1.02424    2.3596   78.2322  78.2322  0.0
 2  │ -1.6554     -1.6554     1.87894  41.917   41.917   0.0
 3  │ -0.0432085  -0.0432085  1.54253  21.8681  21.8681  0.0
 4  │ 0.500796    0.500796    1.07154  2.68642  2.68642  0.0
 5  │ 1.10928     1.10928     1.29471  8.58305  8.58305  0.0
 6  │ -0.415844   -0.415844   1.49343  18.6901  18.6901  0.0
 7  │ 0.030515    0.030515    1.06624  1.46897  1.46897  0.0
 8  │ 0.216409    0.216409    1.87125  46.1746  46.1746  0.0
 9  │ 0.433361    0.433361    1.2297   9.7717   9.7717   0.0
 10 │ -0.648279   -0.648279   2.10967  58.417   58.417   0.0
 11 │ -0.328134   -0.328134   1.47521  23.5301  23.5301  0.0
 12 │ 0.499976    0.499976    1.06882  2.74129  2.74129  0.0
 13 │ 0.139717    0.139717    1.82727  43.6776  43.6776  0.0
 14 │ 0.176548    0.176548    1.10663  2.77503  2.77503  0.0
 15 │ -0.471723   -0.471723   1.51184  21.2797  21.2797  0.0
 16 │ -0.757145   -0.757145   1.25035  7.09335  7.09335  0.0
 17 │ -1.59799    -1.59799    1.32299  8.74912  8.74912  0.0
 ⋮  │     ⋮           ⋮          ⋮        ⋮        ⋮      ⋮

Consider the change in the penalized deviance from that at the conditional mode for a selection of groups, which, by default, we choose to be the first 5 groups.

function pdevdiff(
  m::BernoulliPIRLS{T};
  zvals=collect(-3.5:inv(32):3.5),
  inds=1:5,
) where {T}
  (; u, u0, Ldiag, pdev, pdev0) = m.utbl
  pdevcomps!(pirls!(m))             # assign u0
  copyto!(pdev0, pdev)              # and pdev0
  ni = length(inds)
  nz = length(zvals)
  uvals = Array{T}(undef, ni, nz)
  exact = Array{T}(undef, ni, nz)
  for (j, z) in enumerate(zvals)
    u .= u0 .+ z ./ Ldiag           # evaluate u from z
    pdevcomps!(updatetbl!(m))       # evaluate pdev
    for (i, ind) in enumerate(inds) # store selected u and pdev
      uvals[i, j] = u[ind]
      exact[i, j] = pdev[ind] - pdev0[ind]
    end
  end
  uvals = collect(uvals')           # transpose uvals
  exact = collect(exact')           # and exact
  return (; zvals, inds, uvals, exact)
end
m05pdevdiff = pdevdiff(m);

We begin with plots of the difference in the penalized deviance from its value at the conditional mode, \(\tilde{d}_j(u_j)-\tilde{d}_j(\tilde{u}_j)\), for \(j=1,\dots,5\) in Figure C.3

let (; zvals, inds, uvals, exact) = m05pdevdiff,
  fig = Figure(; size=(600, 375)),
  ax = Axis(
    fig[1, 1];
    xlabel="u",
    ylabel="Change in penalized deviance",
  )

  lins = [
    lines!(ax, view(uvals, :, j), view(exact, :, j)) for
    j in axes(uvals, 2)
  ]
  Legend(fig[1, 2], lins, string.(inds))
  fig
end
Figure C.3: Change in the penalized deviance contribution from that at the conditional mode, for each of the first 5 groups, in model com05, as a function of u.

then shift and scale the horizontal axis to the \(z\) scale for this difference in the penalized deviance (from that at the conditional mode) in Figure C.4.

let (; zvals, inds, uvals, exact) = m05pdevdiff,
  fig = Figure(; size=(600, 375)),
  ax = Axis(
    fig[1, 1];
    xlabel="z",
    ylabel="Change in penalized deviance",
  )

  lins =
    [lines!(ax, zvals, view(exact, :, j)) for j in axes(uvals, 2)]
  Legend(fig[1, 2], lins, string.(inds))
  fig
end
Figure C.4: Change in the penalized deviance contribution from that at the conditional mode, for each of the first 5 groups, in model com05, as a function of z.

The next stage is to plot, on the \(z\) scale, the difference between the penalized deviance and its quadratic approximation at \(z=0\) in Figure C.5

let (; zvals, inds, uvals, exact) = m05pdevdiff,
  fig = Figure(; size=(600, 375)),
  ax = Axis(
    fig[1, 1];
    xlabel="z",
    ylabel="Penalized deviance minus quadratic approx",
  )

  lins = [
    lines!(ax, zvals, view(exact, :, j) .- abs2.(zvals)) for
    j in axes(uvals, 2)
  ]
  Legend(fig[1, 2], lins, string.(inds))
  fig
end
Figure C.5: The difference between the contribution to the penalized deviance and its quadratic approximation, for each of the first 5 groups in model com05, as a function of z.

and, finally, the exponential of negative half of this difference in Figure C.6

let (; zvals, inds, uvals, exact) = m05pdevdiff,
  fig = Figure(; size=(600, 375)),
  ax = Axis(
    fig[1, 1];
    xlabel="z",
    ylabel="Exp of half the quadratic minus penalized deviance",
  )

  lins = [
    lines!(
      ax,
      zvals,
      exp.(0.5 .* (abs2.(zvals) .- view(exact, :, j))),
    ) for j in axes(uvals, 2)
  ]
  Legend(fig[1, 2], lins, string.(inds))
  fig
end
Figure C.6: Exponential of half the difference between the quadratic approximation and the contribution to the penalized deviance, for each of the first 5 groups in model com05, as a function of z.

Writing the function shown in Figure C.6 — the exponential of negative half the difference between the penalized deviance and its quadratic approximation — as \[ f_j(z_j)=\exp{\left(\frac{\tilde{d}_j(z_j/\ell_j+\tilde{u}_j)-\tilde{d}_j(\tilde{u}_j)-z_j^2}{-2}\right)} \quad j=1,\dots,q \] we can modify the scalar version of Laplace’s approximation, Equation C.10, as \[ \begin{multline*} -2\log\int_{u_j}\frac{e^{-\tilde{d}_j(u_j)/2}}{\sqrt{2\pi}}\,du_j\\ =-2\log\int_{z_j}\frac{f_j(z_j)\,e^{-(\tilde{d}_j(\tilde{u}_j)+z_j^2)/2}}{\sqrt{2\pi}}\,\frac{dz_j}{\ell_j}\\ =\tilde{d}_j(\tilde{u}_j)+2\log(\ell_j)-2\log\int_{z_j}f_j(z_j)\frac{e^{-z_j^2/2}}{\sqrt{2\pi}}\,dz_j \end{multline*} \tag{C.14}\]

A \(K\)th-order adaptive Gauss-Hermite quadrature approximation to the objective, negative twice the log-likelihood, for the GLMM is Laplace’s approximation minus twice the sum over \(j\) of the logarithms of the \(K\)th-order normalized Gauss-Hermite quadrature rule applied to \(f_j(z_j)\). We use the aGHQ column of m.utbl to accumulate these quadrature sums, then take the logarithm and multiply the result by -2.

function evalGHQ!(m::BernoulliPIRLS; nGHQ::Integer=9)
  (; ytbl, utbl, rtbl) = m
  (; u, u0, Ldiag, pdev, pdev0, aGHQ) = utbl
  ghqtbl = GHnorm(nGHQ)
  pdevcomps!(pirls!(m))  # ensure that u0 and pdev0 are current
  copyto!(pdev0, pdev)
  fill!(aGHQ, 0)
  for (z, w) in zip(ghqtbl.z, ghqtbl.w)
    if iszero(z)         # exp term is one when z == 0
      aGHQ .+= w
    else
      u .= u0 .+ z ./ Ldiag
      pdevcomps!(updatetbl!(m))
      aGHQ .+= w .* exp.((abs2(z) .+ pdev0 .- pdev) ./ 2)
    end
  end
  map!(log, aGHQ, aGHQ)  # log.(aGHQ) in place
  aGHQ .*= -2
  return m
end
evalGHQ!(m)
Table(m.utbl)
Table with 6 columns and 102 rows:
      u         u0          Ldiag    pdev     pdev0    aGHQ
    ┌─────────────────────────────────────────────────────────────
 1  │ 0.888261  -1.02424    2.3596   98.9905  78.2322  -0.00503286
 2  │ 0.746346  -1.6554     1.87894  65.9282  41.917   -0.00602822
 3  │ 2.88235   -0.0432085  1.54253  42.7156  21.8681  -0.0077699
 4  │ 4.71226   0.500796    1.07154  22.5154  2.68642  -0.00332321
 5  │ 4.5948    1.10928     1.29471  26.5466  8.58305  -0.00560788
 6  │ 2.60588   -0.415844   1.49343  40.3098  18.6901  -0.00726535
 7  │ 4.26289   0.030515    1.06624  21.5982  1.46897  -0.00232427
 8  │ 2.62803   0.216409    1.87125  67.5008  46.1746  -0.00639679
 9  │ 4.10316   0.433361    1.2297   28.6561  9.7717   -0.00680155
 10 │ 1.4908    -0.648279   2.10967  80.9664  58.417   -0.00583025
 11 │ 2.73092   -0.328134   1.47521  44.9692  23.5301  -0.00736271
 12 │ 4.72216   0.499976    1.06882  22.6241  2.74129  -0.00283963
 13 │ 2.60939   0.139717    1.82727  64.9152  43.6776  -0.00681333
 14 │ 4.25447   0.176548    1.10663  22.371   2.77503  -0.00457801
 15 │ 2.51322   -0.471723   1.51184  43.0725  21.2797  -0.00742932
 16 │ 2.85204   -0.757145   1.25035  29.6883  7.09335  -0.00309483
 17 │ 1.81302   -1.59799    1.32299  32.9677  8.74912  -0.00213836
 ⋮  │    ⋮          ⋮          ⋮        ⋮        ⋮          ⋮
extrema(m.utbl.aGHQ)
(-0.008514108854172236, -0.0004132698576107887)

As we see, these “correction terms” relative to Laplace’s approximation are small compared to the contributions to the objective from each component of \({\mathbf{u}}\), and in this case they are all negative. Close examination of the individual curves in Figure C.5 shows that these curves, which are \(-2\log(f_j(z))\), are more-or-less odd functions, in the sense that the value at \(-z\) is approximately the negative of the value at \(z\). If we were integrating \(\log(f_j(z_j))\,\phi(z_j)\) with a normalized Gauss-Hermite rule, the negative and positive values would largely cancel, and some of the integrals would be positive while others would be negative.

When we consider \(f_j(z_j)\), shown in Figure C.6, the exponential function converts from differences to ratios and stretches positive differences more than negative differences, resulting in values slightly greater than 1 for \(\int_z f_j(z)\phi(z) dz\) and, after taking negative twice the logarithm, correction terms that are slightly less than zero.
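
To see the overall effect of these corrections, we can compare Laplace’s approximation to the objective, which from Equation C.10 is the sum of the pdev0 column plus twice the sum of the logarithms of the Ldiag column, with the aGHQ version, which simply adds the sum of the corrections. A sketch, assuming the m just evaluated by evalGHQ!, is

let (; Ldiag, pdev0, aGHQ) = m.utbl
  laplace = sum(pdev0) + 2 * sum(log, Ldiag)  # Laplace's approximation to the objective
  (; laplace, aghq = laplace + sum(aGHQ))     # aGHQ adds the small, negative corrections
end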

C.7 Optimization of the aGHQ objective

function fitGHQ!(m::BernoulliPIRLS; nGHQ::Integer=9)
  (; Ldiag, pdev0, aGHQ) = m.utbl
  θβ = m.θβ
  pp1 = length(θβ)                # length(β) = p and length(θ) = 1
  opt = Opt(:LN_BOBYQA, pp1)
  mβ = view(θβ, 2:pp1)            # view of β within θβ
  function obj(x, g)
    if !isempty(g)
      throw(ArgumentError("gradient not provided, g must be empty"))
    end
    copyto!(θβ, x)
    mul!(m.ytbl.offset, m.X, mβ)
    evalGHQ!(m; nGHQ)
    return sum(pdev0) + sum(aGHQ) + 2 * sum(log, Ldiag)
  end
  opt.min_objective = obj
  lb = fill!(similar(θβ), -Inf)   # vector of lower bounds
  lb[1] = 0                       # scalar θ must be non-negative
  NLopt.lower_bounds!(opt, lb)
  minf, minx, ret = optimize(opt, copy(θβ))
  @info (; ret, fevals=opt.numevals, minf)
  return m
end
fitGHQ!(m)
m.θβ
┌ Warning: PIRLS iteration did not reduce penalized deviance
└ @ Main.Notebook ~/Work/EmbraceUncertainty/aGHQ.qmd:1064
┌ Warning: PIRLS iteration did not reduce penalized deviance
└ @ Main.Notebook ~/Work/EmbraceUncertainty/aGHQ.qmd:1064
[ Info: (ret = :ROUNDOFF_LIMITED, fevals = 479, minf = 2353.82419755322)
7-element Vector{Float64}:
  0.5761321679271924
 -0.3414655990254175
  0.39359939391066806
  0.6064447618771712
 -0.012909685721680265
  0.03320994962034241
 -0.005624606329593786
