## Code

```
using AlgebraOfGraphics
using BenchmarkTools
using CairoMakie
using EmbraceUncertainty: dataset
using FreqTables
using LinearAlgebra
using MixedModels
using MixedModelsMakie
using NLopt
using PooledArrays
using StatsAPI
```

The log-likelihood for a linear mixed model (LMM) was derived in Section B.7, where some computational methods for fitting such models with MixedModels.jl, by optimizing a *profiled log-likelihood*, were illustrated.

In this appendix we outline the evaluation of the log-likelihood for a generalized linear mixed model (GLMM) with a binary response, which is modelled using the Bernoulli distribution.

The Bernoulli GLMM model defines the conditional distribution \(({{\mathcal{Y}}}|{{\mathcal{B}}}={{\mathbf{b}}})\) as independent Bernoulli random variables with expected values \({{\boldsymbol{\mu}}}={{\mathbf{g}}}^{-1}({{\boldsymbol{\eta}}})\), where \({{\boldsymbol{\eta}}}={{\mathbf{X}}}{{\boldsymbol{\beta}}}+{{\mathbf{Z}}}{{\mathbf{b}}}\) is the *linear predictor* and \({\mathbf{g}}^{-1}\) is an *inverse link function*.

We will use the *logit link function*, \({\boldsymbol{\eta}}={\mathbf{g}}({\boldsymbol{\mu}})\), defined component-wise from the scalar logit link, \(g\), as \[
\eta_i=g(\mu_i)=\mathrm{logit}(\mu_i)=\log\left(\frac{\mu_i}{1-\mu_i}\right)\quad i=1,\dots,n .
\] The inverse link, \({\boldsymbol{\mu}}={\mathbf{g}}^{-1}({\boldsymbol{\eta}})\), is similarly defined component-wise from the inverse of the scalar logit, which is the scalar *logistic* function, \[
\mu_i=g^{-1}(\eta_i)=\mathrm{logistic}(\eta_i)=\frac{1}{1+e^{-\eta_i}}\quad i=1,\dots,n .
\tag{C.1}\] The logit is the *canonical link function* (Section 6.2.2) for the Bernoulli distribution.

As in the linear mixed model discussed in Section B.7, the \(q\)-dimensional random effects, \({\mathcal{B}}\), are expressed as \({\mathcal{B}}={\boldsymbol{\Lambda}}_{{\boldsymbol{\theta}}}\,{\mathcal{U}}\) where \({\mathcal{U}}\) has a standard, multivariate Gaussian distribution (Section B.2) \[ {\mathcal{U}}\sim{\mathcal{N}}({\mathbf{0}},{\mathbf{I}}_q) , \tag{C.2}\] with probability density function \[ f_{{\mathcal{U}}}({\mathbf{u}})=\frac{1}{\sqrt{2\pi}^q}e^{-\|{\mathbf{u}}\|^2/2} . \tag{C.3}\]

For a linear mixed model the distribution of these *spherical random effects* was given as \({\mathcal{U}}\sim({\mathbf{0}},\sigma^2{\mathbf{I}}_q)\) (Equation B.35). A dispersion parameter like \(\sigma^2\) is not present in Equation C.2 because the Bernoulli distribution does not have a separate dispersion parameter — it is entirely determined by its mean.

As is the case for the linear mixed model, the *covariance factor*, \({\boldsymbol{\Lambda}}_{{\boldsymbol{\theta}}}\), is sparse and patterned. It is not uncommon in practical examples, such as the one in Section C.4, for \({\boldsymbol{\theta}}\) to be one-dimensional and \({\boldsymbol{\Lambda}}_{{\boldsymbol{\theta}}}=\theta\,{\mathbf{I}}_q\), to be a scalar multiple of the \(q\times q\) identity matrix.

The likelihood for the parameters, \({\boldsymbol{\theta}}\) and \({\boldsymbol{\beta}}\), given the observed data, \({\mathbf{y}}\), is the value of the marginal probability mass function for the response, \({\mathcal{Y}}\), evaluated at \({\mathbf{y}}\), the observed vector of {0,1} responses. We obtain this value by integrating the product of the probability mass function for the conditional distribution, \(({\mathcal{Y}}|{\mathcal{U}}={\mathbf{u}})\), and unconditional density of \({\mathcal{U}}\) (Equation C.3), with respect to \({\mathbf{u}}\).

Recall that the probability mass for a single Bernoulli response can be written as \((1-\mu)^{1-y}\mu^y\), which is the specialization to \(n=1\) of the probability mass function for the binomial distribution \[
\binom{n}{y}(1-\mu)^{n-y}\mu^y ,\quad 0\le\mu\le 1, \quad y\in\{0,\dots,n\} .
\] Because the components of the vector-valued conditional distribution, \(({\mathcal{Y}}|{\mathcal{U}}={\mathbf{u}})\), are assumed to be independent, its probability mass function can be written as the product of the probability masses for each component \[
f_{{\mathcal{Y}}|{\mathcal{U}}={\mathbf{u}}}({\mathbf{y}}|{\mathbf{u}})=\prod_{i=1}^n \left[(1-\mu_i)^{1-y_i}\mu_i^{y_i}\right]
\quad\mathrm{where}\quad
{\boldsymbol{\mu}}={\mathbf{g}}^{-1}({\mathbf{X}}{\boldsymbol{\beta}}+{\mathbf{Z}}{\boldsymbol{\Lambda}}_{{\boldsymbol{\theta}}}{\mathbf{u}}) ,
\] providing the likelihood as \[
\begin{aligned}
L({\boldsymbol{\eta}},{\boldsymbol{\theta}}|{\mathbf{y}})&=
\int_{{\mathbf{u}}}f_{{\mathcal{Y}},{\mathcal{U}}={\mathbf{u}}}({\mathbf{y}},{\mathbf{u}})f_{{\mathcal{U}}}({\mathbf{u}})\,d{\mathbf{u}}\\
&=\int_{{\mathbf{u}}}\frac{1}{\sqrt{2\pi}^q}e^{\sum_{i=1}^n(1-y_i)\log(1-\mu_i)+y_i\,\log(\mu_i)}
e^{-\left\|{\mathbf{u}}\right\|^2/2}\,d{\mathbf{u}}\\
&=\int_{{\mathbf{u}}}\frac{1}{\sqrt{2\pi}^q}\exp\left(\frac{\left\|{\mathbf{u}}\right\|^2+\sum_{i=1}^n d(y_i,\mu_i)}{-2}\right)\,d{\mathbf{u}}
\end{aligned}
\tag{C.4}\] where the *unit deviances*, \(d(y_i,\mu_i)\), are \[
d(y_i,\mu_i)=-2\left[(1-y_i)\log(1-\mu_i)+y_i\log(\mu_i)\right]\quad i=1,\dots,n .
\tag{C.5}\]

By converting from the logarithm of the probability mass function to the deviance scale, which is negative twice the log-probability, we get a quantity, \(\sum_{i=1}^n d(y_i,\mu_i)\), which is on the same scale as the squared length, \(\|{\mathbf{u}}\|^2\), of a standard multivariate Gaussian. The sum of the unit deviances is analogous to the sum of squared residuals, \(\|{\mathbf{y}}-{\mathbf{X}}{\boldsymbol{\beta}}\|^2\), in a linear model.

In Section B.7 we showed that the integral defining the likelihood for a linear mixed model, Equation B.43, has an analytic solution. In general, the integral in Equation C.4 does not. We will approximate the value of this integral using a quadratic approximation to the argument of the exponential function in Equation C.4 at the value of \({\mathbf{u}}\) that maximizes the integrand, which is the density of the conditional distribution, \(({\mathcal{U}}|{\mathcal{Y}}={\mathbf{y}})\), up to a scale factor. Because the scale factor does not affect the location of the maximum, the value of \({\mathbf{u}}\) that maximizes the integrand, \[
\begin{aligned}
\tilde{{\mathbf{u}}}({\mathbf{y}}|{\boldsymbol{\theta}},{\boldsymbol{\beta}})
&=\arg\max_{{\mathbf{u}}}\exp\left(\frac{\left\|{\mathbf{u}}\right\|^2 + \sum_{i=1}^n d(y_i,\mu_i)}{-2}\right)\\
&=\arg\min_{{\mathbf{u}}}\left(\left\|{\mathbf{u}}\right\|^2 + \sum_{i=1}^n d(y_i,\mu_i)\right)
\end{aligned} ,
\tag{C.6}\] is also the *conditional mode* — the value of \({\mathbf{u}}\) that maximizes the conditional density. The expression being minimized in Equation C.6, \(\left\|{\mathbf{u}}\right\|^2 + \sum_{i=1}^n d(y_i,\mu_i)\), is called the *penalized deviance*.

Using a quadratic approximation to the penalized deviance at this conditional mode (i.e. the mode of the conditional distribution of \({\mathcal{U}}\) given \({\mathcal{Y}}={\mathbf{y}}\)) is equivalent to using a multivariate Gaussian approximation to this conditional distribution. Approximating an integral like Equation C.4 by approximating the integrand as a scaled multivariate Gaussian distribution at its mode is called Laplace’s approximation (Tierney & Kadane (1986)).

The *penalized iteratively re-weighted least squares* (PIRLS) algorithm (Section C.4) provides a fast and stable method of determining the conditional mode, \(\tilde{{\mathbf{u}}}({\mathbf{y}}|{\boldsymbol{\theta}},{\boldsymbol{\beta}})\) (Equation C.6), thereby making it feasible to use Laplace’s approximation at scale.

Before discussing PIRLS, however, we will describe generalized linear models (GLMs) without random effects (Section C.2), for which the *deviance* is defined as the sum of the unit deviances and the maximum likelihood estimate of the coefficient vector, \(\widehat{{\boldsymbol{\beta}}}\), is the value that minimizes the deviance. In Section C.3 we describe the *iteratively re-weighted least squares* (IRLS) algorithm, which is a stable, fast algorithm to minimize the deviance.

We will illustrate the IRLS algorithm with the `contra`

data discussed in Chapter 6 and a model like `com05`

, which was fit in that chapter, but without the random effects. Later we will use the full `com05`

model to illustrate some of the computations for GLMMs.

Although 0/1 responses and the Bernoulli distribution are easy to describe, the theory of the generalized linear mixed model (GLMM) and the details of the implementation are not. Readers who wish to focus on practical applications more than on the theory should feel free to skim this appendix.

Load the packages to be used

```
using AlgebraOfGraphics
using BenchmarkTools
using CairoMakie
using EmbraceUncertainty: dataset
using FreqTables
using LinearAlgebra
using MixedModels
using MixedModelsMakie
using NLopt
using PooledArrays
using StatsAPI
```

and define some constants

```
@isdefined(contrasts) || const contrasts = Dict{Symbol,Any}()
@isdefined(progress) || const progress = false
```

To introduce some terms and workflows we first consider the generalized linear model (GLM) for the Bernoulli distribution and the logit link. The linear predictor for a GLM - a model without random effects - is simply \[ {{\boldsymbol{\eta}}}= {{\mathbf{X}}}{{\boldsymbol{\beta}}} , \] and the mean response vector, \({{\boldsymbol{\mu}}}={\mathbf{g}}^{-1}({\boldsymbol{\eta}})\), is obtained by component-wise application of the scalar logistic function (Equation C.1).

The probability mass function for the Bernoulli distribution is \[ f_{{\mathcal{Y}}}(y|\mu) = \mu^y\,(1-\mu)^{(1-y)}\quad\mathrm{for}\quad y\in\{0,1\} . \]

Because the elements of \({{\mathcal{Y}}}|{{\boldsymbol{\mu}}}\) are assumed to be independent, the log-likelihood is simply the sum of contributions from each element, which, on the deviance scale, can be written in terms of the *unit deviances* \[
\begin{aligned}
-2\,\ell({{\boldsymbol{\mu}}}|{\mathbf{y}})&= -2\,\log(L({{\boldsymbol{\mu}}}|{\mathbf{y}}))\\
&=-2\,\sum_{i=1}^n y_i\log(\mu_i)+(1-y_i)\log(1-\mu_i) .
\end{aligned}
\tag{C.7}\]

As described above, it is customary when working with GLMs to convert the log-likelihood to a deviance, which, for the Bernoulli distribution, is negative twice the log-likelihood. (For other distributions, the deviance may incorporate an additional term that depends only on \({\mathbf{y}}\).)

One reason for preferring the deviance scale is that the change in deviance for nested models has approximately a \(\chi^2\) distribution with degrees of freedom determined by the number of independent constraints on the parameters in the simpler model. Especially for GLMs, the deviance plays a role similar to the sum of squared residuals in linear models.

For greater numerical precision we avoid calculating \(1-\mu\) directly when evaluating expressions like Equation C.7 and instead use \[ 1 - \mu = 1 - \frac{1}{1+e^{-\eta}}=\frac{e^{-\eta}}{1+e^{-\eta}} . \] Evaluation of the last expression provides greater precision for large negative values of \(\eta\) (corresponding to small values of \(\mu\)) than does first evaluating \(\mu\) followed by \(1 - \mu\).

After some algebra, we write the *unit deviance*, \(d(y_i,\eta_i)\), which is the contribution to the deviance from the \(i\)th observation, as \[
\begin{aligned}
d(y_i, \eta_i)&=-2\left[y_i\log(\mu_i)+(1-y_i)\log(1-\mu_i)\right]\\
&=2\left[(1-y_i)\eta_i-\log(1+e^{-\eta_i})\right]
\end{aligned}
\quad i=1,\dots,n
\]

A Julia function to evaluate both the mean and the unit deviance can be written as

```
function meanunitdev(y::T, η::T) where {T<:AbstractFloat}
= exp(-η)
expmη return (; μ=inv(1 + expmη), dev=2 * ((1 - y) * η + log1p(expmη)))
end
```

log1p

Mathematically `log1p`

, read *log of 1 plus*, is defined as \(\mathrm{log1p}(x)=\log(1+x)\) but it is implemented in such a way as to provide greater accuracy when \(x\) is small. For example,

```
let small = eps() / 10
@show small
@show 1 + small
@show log(1 + small)
@show log1p(small)
end;
```

```
small = 2.2204460492503132e-17
1 + small = 1.0
log(1 + small) = 0.0
log1p(small) = 2.2204460492503132e-17
```

`1 + small`

evaluates to `1.0`

in floating point arithmetic because of round-off, producing 0 for the expression `log(1 + small)`

, whereas `log1p(small) ≈ small`

, as it should be.

This function returns a `NamedTuple`

of values from scalar arguments. For example,

`meanunitdev(0.0, 0.21)`

`(μ = 0.5523079095743253, dev = 1.6072991620435673)`

A `Vector`

of such `NamedTuple`

s is a *row-table* (Section A.2), which can be updated in place by dot-vectorization of the scalar `meanunitdev`

function, as shown below.

We illustrate some of these computations using only the fixed-effects specification for `com05`

, a GLMM fit to the `contra`

data set in Chapter 6. Because we will use the full GLMM later we reproduce `com05`

by loading the data, creating the binary `ch`

variable indicating children/no-children, defining the contrasts and formula to be used, and fitting the model as in Chapter 6.

```
= let tbl = dataset(:contra)
contra Table(tbl; ch=tbl.livch .≠ "0")
end
:urban] = HelmertCoding()
contrasts[:ch] = HelmertCoding()
contrasts[=
com05 let d = contra,
= Bernoulli(),
ds = @formula(
f ~ 1 + urban + ch * age + age & age + (1 | dist & urban)
use
)
fit(MixedModel, f, d, ds; contrasts, nAGQ=9, progress)
end
```

Extract the fixed-effects model matrix, \({\mathbf{X}}\), and initialize the coefficient vector, \({\boldsymbol{\beta}}\), to a copy (in case we modify it) of the estimated fixed-effects.

`= copy(com05.β) βm05 `

```
6-element Vector{Float64}:
-0.3414913998306781
0.3936080536502067
0.6064861079468472
-0.012911714642169572
0.03321662487439253
-0.005625046845040066
```

As stated above, the `meanunitdev`

function can be applied to the vectors, \({\mathbf{y}}\) and \({{\boldsymbol{\eta}}}\), via dot-vectorization to produce a `Vector{NamedTuple}`

, which is the typical form of a row-table.

```
= meanunitdev.(com05.y, com05.X * βm05)
rowtbl typeof(rowtbl)
```

`Vector{@NamedTuple{μ::Float64, dev::Float64}} (alias for Array{@NamedTuple{μ::Float64, dev::Float64}, 1})`

For display we convert the row-table to a column-table and prepend another column-table consisting of \({\mathbf{y}}\) and \({\boldsymbol{\eta}}\).

`Table((; y=com05.y, η=com05.X * βm05), rowtbl) # display as a Table`

```
Table with 4 columns and 1934 rows:
y η μ dev
┌───────────────────────────────────
1 │ 0.0 -0.87968 0.293244 0.69414
2 │ 0.0 -0.471786 0.384194 0.969645
3 │ 0.0 0.676178 0.662885 2.17466
4 │ 0.0 0.429284 0.605703 1.8613
5 │ 0.0 -0.963167 0.276245 0.646604
6 │ 0.0 -0.772821 0.315869 0.759212
7 │ 0.0 -0.87968 0.293244 0.69414
8 │ 0.0 0.515028 0.625984 1.96692
9 │ 0.0 0.371817 0.591898 1.79248
10 │ 0.0 0.676178 0.662885 2.17466
11 │ 1.0 -0.772821 0.315869 2.30485
12 │ 0.0 -0.473145 0.383872 0.968601
13 │ 0.0 0.449047 0.610413 1.88533
14 │ 0.0 0.602596 0.64625 2.07833
15 │ 0.0 0.645468 0.655988 2.13416
16 │ 1.0 0.637867 0.654271 0.848467
17 │ 0.0 -0.471786 0.384194 0.969645
⋮ │ ⋮ ⋮ ⋮ ⋮
```

The deviance for this value of \({{\boldsymbol{\beta}}}\) in this model is the sum of the unit deviances, which we write as `sum`

applied to a generator expression. (In general we extract columns of a row-table with generator expressions that produce iterators.)

`sum(r.dev for r in rowtbl)`

`2411.194470806229`

When minimizing the deviance it is convenient to have the different components of the model encapsulated in a user-created `struct`

type so we can update the parameter values and evaluate the deviance without needing to keep track of all the pieces of the model.

```
struct BernoulliGLM{T<:AbstractFloat}
::Matrix{T}
X::Vector{T}
β::NamedTuple{(:y, :η),NTuple{2,Vector{T}}}
ytbl::Vector{NamedTuple{(:μ, :dev),NTuple{2,T}}}
rtblend
```

We also create an *external constructor*, which is a function defined outside the struct and of the same name as the struct, that constructs and returns an object of that type. In this case the external constructor creates a `BernoulliGLM`

from the model matrix and the response vector, after some consistency checks on the arguments passed to it.

```
function BernoulliGLM(
::Matrix{T},
X::Vector{T},
ywhere {T<:AbstractFloat}
)
# check consistency of arguments
= size(X, 1) # number of rows in X
n if length(y) ≠ n || any(!in([0, 1]), y)
throw(ArgumentError("y is not an $n-vector of 0's and 1's"))
end
# initial β from linear regression of y in {-1,1} coding
= X \ replace(y, 0 => -1)
β = X * β
η
return BernoulliGLM(X, β, (; y, η), meanunitdev.(y, η))
end
```

To optimize the deviance we define an *extractor* method that returns the deviance

`deviance(m::BernoulliGLM) = sum(r.dev for r in m.rtbl) StatsAPI.`

Why StatsAPI.deviance and not just deviance?

This extractor is written as a method for the generic `deviance`

function defined in the `StatsAPI`

package. Doing so allows us to use the `deviance`

name for the extractor without interfering with `deviance`

methods defined for other model types.

We also define a mutating function, `setβ!`

, that installs a new value of `β`

then updates `η`

and `rtbl`

in place.

```
function setβ!(m::BernoulliGLM, newβ)
= m.ytbl # destructure ytbl
(; y, η) mul!(η, m.X, copyto!(m.β, newβ)) # η = X * newβ in place
.= meanunitdev.(y, η) # update rtbl in place
m.rtbl return m
end
```

Create such a struct from `X`

and `y`

for model `com05`

.

```
= BernoulliGLM(com05.X, com05.y)
com05fe = copy(com05fe.β) # keep a copy of the initial values β₀
```

```
6-element Vector{Float64}:
-0.10753340611835722
0.17596632075206525
0.21944824496978524
-0.0028547435517881996
0.010139921834784385
-0.002161831532409504
```

These initial values of \({\boldsymbol{\beta}}\) are from a least squares fit of \({\mathbf{y}}\), converted from `{0,1}`

coding to `{-1,1}`

coding, on the model matrix, \({\mathbf{X}}\).

As a simple test of the `setβ!`

and `deviance`

methods we can check that `com05fe`

produces the same deviance value for `βm05`

as was evaluated above.

`deviance(setβ!(com05fe, βm05))`

`2411.194470806229`

For fairness in later comparisons we restore the initial values `β₀`

to the model. These are rough starting estimates with a deviance that is considerably greater than that at `βm05`

.

`deviance(setβ!(com05fe, β₀))`

`2491.1514390537254`

We can use a general optimizer like those available in NLopt.jl to minimize the deviance. Following the instructions given at that package’s repository, we create an `Opt`

object specifying the algorithm to be used, BOBYQA (Powell (2009)), and the dimension of the problem, then define and assign the objective function in the required form, and call `optimize`

```
function StatsAPI.fit!(m::BernoulliGLM{T}) where {T}
= Opt(:LN_BOBYQA, length(m.β))
opt function objective(x::Vector{T}, g::Vector{T}) where {T}
isempty(g) || throw(
ArgumentError("Gradient not available, g must be empty"),
)return deviance(setβ!(m, x))
end
= objective
opt.min_objective = optimize(opt, copy(m.β))
minf, minx, ret @info (; code=ret, nevals=opt.numevals, minf)
return m
end
```

`fit!(com05fe);`

`[ Info: (code = :ROUNDOFF_LIMITED, nevals = 558, minf = 2409.377428160017)`

The optimizer has determined a coefficient vector that reduces the deviance to 2409.38, at which point convergence was declared because changes in the objective are limited by round-off. This required about 500 evaluations of the deviance at candidate values of \({\boldsymbol{\beta}}\).

Each evaluation of the deviance is fast, requiring only a fraction of a millisecond on a laptop computer,

```
= copy(com05fe.β)
βopt @benchmark deviance(setβ!(m, β)) seconds = 1 setup =
= com05fe; β = βopt) (m
```

```
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
Range (min … max): 33.798 μs … 113.439 μs ┊ GC (min … max): 0.00% … 0.00%
Time (median): 36.778 μs ┊ GC (median): 0.00%
Time (mean ± σ): 36.707 μs ± 1.706 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
▂ ▁ ▁▃▆▂ ▄▇█▂ ▁▁ ▂
▆█▇▁▁▁▁▁▆▇█▅▄▁▃▄▃████▄▁▁▃▃▅████▇▅▄▆▅▄▆▆███▇▅▆▅▅▅▃▁▄▄▃▃▃▄▅▅▆▆ █
33.8 μs Histogram: log(frequency) by time 40 μs <
Memory estimate: 16 bytes, allocs estimate: 1.
```

but the already large number of evaluations for these six coefficients would not scale well as this dimension increases.

Fortunately there is an algorithm, called *iteratively reweighted least squares* (IRLS), that uses the special structure of the GLM to provide fast and stable convergence to estimates of the coefficients, even for models with a large number of coefficients. This will be important to us in fitting GLMMs where we must optimize with respect to the random effects, whose dimension can be large.

As we have seen, in a GLM we are modeling the responses and the predicted values on two scales — the *linear predictor scale*, for \({\boldsymbol{\eta}}\), and the *response scale*, for \({\mathbf{y}}\) and \({\boldsymbol{\mu}}\). The scalar link function, \(\eta=g(\mu)\), and the inverse link, \(\mu=g^{-1}(\eta)\), map vectors component-wise between these two scales.

For operations like determining a new candidate value of \({\boldsymbol{\beta}}\), the linear predictor scale is preferred, because, on that scale, \({\boldsymbol{\eta}}={\mathbf{X}}{\boldsymbol{\beta}}\) is a linear function of \({\boldsymbol{\beta}}\). Thus it would be convenient if we could transform the response, \({\mathbf{y}}\), to the linear predictor scale where we could define a residual and use some form of minimizing a sum of squared residuals to evaluate a new coefficient vector (or, alternatively, evaluate an increment that will be added to the current coefficient vector). Unfortunately, a naive approach of transforming \({\mathbf{y}}\) to the linear predictor scale won’t work because the elements of \({\mathbf{y}}\) are all \(0\) or \(1\) and the logit link function maps these values to \(-\infty\) and \(\infty\), respectively.

For an iterative algorithm, however, we can use a local linear approximation to the link function to define a *working residual*, from which to evaluate an increment to the coefficient vector, or a *working response*, from which we evaluate the new coefficient vector directly. Because the link and inverse link functions are defined component-wise we will define the approximation for scalars \(y_i\), \(\mu_i\), and \(\eta_i\) and for the scalar link function, \(g\), with the understanding that these definitions apply component-wise to the vectors.

The *working residual* is evaluated by mapping the residual on the response scale, \(y_i-\mu_i\), through the linear approximation to the link, \(g(\mu)\), at \(\mu_i\). That is, \[
\tilde{r_i}=(y_i-\mu_i)g'(\mu_i)\quad i=1,\dots,n .
\] Because the derivative, \(g'(\mu_i)\), for the logit link function is \(1/[\mu_i(1-\mu_i)]\), these working residuals are \[
\tilde{r}_i = (y_i-\mu_i)g'(\mu_i) = \frac{y_i - \mu_i}{\mu_i(1-\mu_i)}\quad i=1,\dots,n .
\] Similarly, the *working response* on the linear predictor scale, is defined by adding the working residual to the current linear predictor value, \[
\tilde{y_i}=\eta_i + \tilde{r_i}=\eta_i +(y_i-\mu_i)g'(\mu_i)=
\eta_i + \frac{y_i - \mu_i}{\mu_i(1-\mu_i)}\quad i=1,\dots,n .
\]

On the linear predictor scale we can fit a linear model to the working response to obtain a new parameter vector, but we must take into account that the variances of the *noise terms* in this linear model, which are the working residuals, are not constant. We use *weighted least squares* where the weights are inversely proportional to the variance of the working residual. The variance of the random variable \({\mathcal{Y}}_i\) is \(\mu_i(1-\mu_i)\), hence the variance of the working residual is \[
\mathrm{Var}(\tilde{r_i})=g'(\mu_i)^2 \mathrm{Var}({\mathcal{Y}}_i)=\frac{\mu_i(1-\mu_i)}{\left[\mu_i(1-\mu_i)\right]^2}=\frac{1}{\mu_i(1-\mu_i)}
\quad i=1,\dots,n .
\]

Thus the working weights are \[ \begin{aligned} w_i&=\mu_i(1-\mu_i)\\ &=\frac{1}{1+e^{-\eta_i}}\frac{e^{-\eta_i}}{1+e^{-\eta_i}} \end{aligned} ,\quad i=1,\dots,n. \]

In practice we will use the square roots of the working weights, evaluated as \[ \sqrt{w_i}=\frac{\sqrt{e^{-\eta_i}}}{1+e^{-\eta_i}}=\frac{e^{-\eta_i/2}}{1+e^{-\eta_i}}\quad i=1,\dots,n . \]

Note that \(\mathrm{Var}({\mathcal{Y}}_i)\) happens to be the inverse of \(g'(\mu_i)\) for a Bernoulli response and the logit link function. This will always be true for distributions in the exponential family and their canonial links.

At the \(k\)th iteration the IRLS algorithm updates the coefficient vector to \({\boldsymbol{\beta}}^{(k)}\), which is a weighted least squares solution that could be written as \[ {\boldsymbol{\beta}}^{(k)}= \left({\mathbf{X}}'{\mathbf{W}}{\mathbf{X}}\right)^{-1}\left({\mathbf{X}}'{\mathbf{W}}\tilde{{\mathbf{y}}}\right) , \] where \({\mathbf{W}}\) is an \(n\times n\) diagonal matrix of the working weights and \(\tilde{{\mathbf{y}}}\) is the working response, both evaluated at \({\boldsymbol{\beta}}^{(k-1)}\), the coefficient vector from the previous iteration.

In practice we use the square roots of the working weights, which we write as a diagonal matrix, \({\mathbf{W}}^{1/2}\), and a QR decomposition (Section B.6) of a weighted model matrix, \({\mathbf{W}}^{1/2}{\mathbf{X}}\), to solve for the updated coefficient vector from the weighted working response, \({\mathbf{W}}^{1/2}\tilde{{\mathbf{y}}}\), with elements \[ \begin{aligned} \sqrt{w_i}(\eta_i+\tilde{r}_i)&=\sqrt{\mu_i(1-\mu_i)}(\eta_i+\tilde{r}_i)\\ &=\sqrt{w_i}\eta_i +\frac{(y_i-\mu_i)\sqrt{\mu_i(1-\mu_i)}}{\mu_i(1-\mu_i)}\\ &=\sqrt{w_i}\eta_i +\frac{y_i-\mu_i}{\sqrt{w_i}} \end{aligned},\quad i=1,\dots,n \]

It is possible to write the IRLS algorithm using a weighted least squares fit of the working residuals on the model matrix to determine a parameter increment. However, in the PIRLS algorithm it is necessary to use the working response, not the working residual, so we define the IRLS algorithm in those terms too.

Furthermore, in the PIRLS algorithm we will need to allow for an *offset* when calculating the working response. In the presence of an offset, \({\mathbf{o}}\), a constant vector of length \(n\), the linear predictor is defined as \[
{\boldsymbol{\eta}}= {\mathbf{o}}+ {\mathbf{X}}{\boldsymbol{\beta}}.
\] The mean, \({\boldsymbol{\mu}}\), the working weights and the working residuals are defined as before but the working response becomes \[
\tilde{{\mathbf{y}}}=\tilde{{\mathbf{r}}} + {\boldsymbol{\eta}}- {\mathbf{o}}.
\]

For a linear model there is rarely a reason for using an offset. Instead we can simply subtract the constant vector, \({\mathbf{o}}\), from the response, \({\mathbf{y}}\), because the response and the linear predictor are on the same scale. However, this is not the case for a GLM where we must deal with the effects of the constant offset on the linear predictor scale, not on the response scale.

We define a `BernoulliIRLS`

struct with three additional elements in the rowtable: the square roots of the working weights, `rtwwt`

, the weighted working residuals, `wwres`

, and the weighted working response, `wwresp`

. In the discussion above, `rtwwt`

is the diagonal of \({\mathbf{W}}^{1/2}\), `wwres`

is \({\mathbf{W}}^{1/2}\tilde{{\mathbf{r}}}\) and `wwresp`

is \({\mathbf{W}}^{1/2}\tilde{{\mathbf{y}}}\).

We also add fields `Xqr`

, in which the weighted model matrix, \({\mathbf{W}}^{1/2}{\mathbf{X}}\), is formed followed by its QR decomposition, and `βcp`

, which holds a copy of the previous coefficient vector.

```
struct BernoulliIRLS{T<:AbstractFloat}
::Matrix{T}
X::Matrix{T} # copy of X used in the QR decomp
Xqr::Vector{T}
β::Vector{T} # copy of previous β
βcp::Diagonal{T,Vector{T}} # rtwwt as a Diagonal matrix
Whalf::NamedTuple{(:y, :η),NTuple{2,Vector{T}}}
ytbl::Vector{
rtbl NamedTuple{(:μ, :dev, :rtwwt, :wwres, :wwresp),NTuple{5,T}},
}
end
```

with constructor

```
function BernoulliIRLS(
::Matrix{T},
X::Vector{T},
ywhere {T<:AbstractFloat}
) = size(X, 1) # number of rows of X
n if length(y) ≠ n || !all(v -> (iszero(v) || isone(v)), y)
throw(ArgumentError("y is not an $n-vector of 0's and 1's"))
end
# initial β from linear least squares fit of y in {-1,1} coding
= copy(X)
Xqr = qr!(Xqr) \ replace(y, 0 => -1)
β = copy(β)
βcp = X * β
η = tblrow.(y, η)
rtbl = Diagonal([r.rtwwt for r in rtbl])
Whalf return BernoulliIRLS(X, Xqr, β, βcp, Whalf, (; y, η), rtbl)
end
```

The `tblrow`

function evaluates the mean, unit deviance, square root of the weight, and the weighted, working residual and weighted, working response for scalar \(y\) and \(\eta\). The `offset`

argument, which defaults to zero, is not used in calls for `BernoulliIRLS`

models, but will be used in Section C.4 when we discuss the PIRLS algorithm.

```
function tblrow(
::T,
y::T,
η::T=zero(T),
offsetwhere {T<:AbstractFloat}
) = exp(-η / 2) # square root of exp(-η)
rtexpmη = abs2(rtexpmη) # exp(-η)
expmη = 1 + expmη
denom = inv(denom)
μ = 2 * ((1 - y) * η + log1p(expmη))
dev = rtexpmη / denom # sqrt of working wt
rtwwt = (y - μ) / rtwwt # weighted working resid
wwres = wwres + rtwwt * (η - offset)
wwresp return (; μ, dev, rtwwt, wwres, wwresp)
end
```

`deviance(m::BernoulliIRLS) = sum(r.dev for r in m.rtbl) StatsAPI.`

Next we define a mutating function, `updateβ!`

, that evaluates \({\boldsymbol{\beta}}^{(k)}\), the updated coefficient vector at iteration \(k\), in place by weighted least squares then updates the response table.

```
function updateβ!(m::BernoulliIRLS)
= m # destructure m & ytbl
(; X, Xqr, β, βcp, Whalf, ytbl, rtbl) = ytbl
(; y, η) copyto!(βcp, β) # keep a copy of β
copyto!(Whalf.diag, r.rtwwt for r in rtbl) # rtwwt -> Whalf
mul!(Xqr, Whalf, X) # weighted model matrix
copyto!(η, r.wwresp for r in rtbl) # use η as temp storage
ldiv!(β, qr!(Xqr), η) # weighted least squares
.= tblrow.(y, mul!(η, X, β)) # update η and rtbl
rtbl return m
end
```

For our example, we start at the same coefficient vector as we did with the general optimizer.

```
= BernoulliIRLS(com05.X, com05.y)
com05fe deviance(com05fe)
```

`2491.1514390537254`

The first IRLS iteration

`deviance(updateβ!(com05fe))`

`2411.165742459929`

reduces the deviance substantially.

We create a `fit!`

method to iterate to convergence.

```
function StatsAPI.fit!(m::BernoulliIRLS, β₀=m.β; verbose::Bool=true)
= m
(; X, β, βcp, ytbl, rtbl) = ytbl
(; y, η) .= tblrow.(y, mul!(η, X, copyto!(β, β₀)))
rtbl = deviance(m)
olddev && @info 0, olddev # record the deviance at initial β
verbose for i in 1:100 # perform at most 100 iterations
= deviance(updateβ!(m))
newdev && @info i, newdev # iteration number and deviance
verbose if newdev > olddev
@warn "failure to decrease deviance"
copyto!(β, βcp) # roll back changes to β, η, and rtbl
= tblrow.(y, mul!(η, X, β))
rtbl break
elseif (olddev - newdev) < (1.0e-10 * olddev)
break # exit loop if deviance is stable
else
= newdev
olddev end
end
return m
end
```

`fit!(com05fe, β₀);`

```
[ Info: (0, 2491.1514390537254)
[ Info: (1, 2411.165742459929)
[ Info: (2, 2409.382451337132)
[ Info: (3, 2409.3774282835857)
[ Info: (4, 2409.3774281600413)
```

The IRLS algorithm has converged in 4 iterations to essentially the same deviance as the general optimizer achieved after around 500 function evaluations. Each iteration of the IRLS algorithm takes more time than a deviance evaluation, but still only a fraction of a millisecond on a laptop computer.

`@benchmark deviance(updateβ!(m)) seconds = 1 setup = (m = com05fe)`

```
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
Range (min … max): 67.528 μs … 161.118 μs ┊ GC (min … max): 0.00% … 0.00%
Time (median): 76.638 μs ┊ GC (median): 0.00%
Time (mean ± σ): 76.986 μs ± 2.640 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
▂▂▂▄▅▅▄▇██▆▅▄▄▅▅▄▃▂▂▂▂▂▁▁▁ ▂
▅▅▆▆▇▇▇▆▆▆▆▆▆▆▅▆▇▇▇██▇██████████████████████████████▇▇▇▇▇▇▇▇ █
67.5 μs Histogram: log(frequency) by time 84.5 μs <
Memory estimate: 16.12 KiB, allocs estimate: 6.
```

In Section B.7 we showed that, given a value of \({\boldsymbol{\theta}}\), which determines the relative covariance factor, \({\boldsymbol{\Lambda}}_{{\boldsymbol{\theta}}}\), of the random effects, \({\mathcal{B}}\), the *conditional mode*, \(\tilde{{\mathbf{b}}}\), of the random effects can be evaluated as the solution to a *penalized least squares* (PLS) problem. It is convenient to write the PLS problem in terms of the *spherical random effects*, \({\mathcal{U}}\sim{\mathcal{N}}({\mathbf{0}},\sigma^2{\mathbf{I}})\), with the defining relationship \({\mathcal{B}}={\boldsymbol{\Lambda}}_{{\boldsymbol{\theta}}}{\mathcal{U}}\), as in Equation B.38 \[
\tilde{{\mathbf{u}}}=\arg\min_{{\mathbf{u}}}\left(
\left\|{\mathbf{y}}-{\mathbf{X}}{\boldsymbol{\beta}}-{\mathbf{Z}}{\boldsymbol{\Lambda}}_{{\boldsymbol{\theta}}}{\mathbf{u}}\right\|^2 +
\left\|{\mathbf{u}}\right\|^2
\right) .
\]

We wrote Equation B.38 for the LMM case as minimizing the penalized sum of squared residuals with respect to both \({\boldsymbol{\beta}}\) and \({\mathbf{u}}\). Here, and in Equation C.8 below, we minimize with respect to \({\mathbf{u}}\) only while holding \({\boldsymbol{\beta}}\) fixed.

The solution of this PLS problem, \(\tilde{\mathbf{u}}\), is the *conditional mode* of \({\mathcal{U}}\), in that it maximizes the density of the conditional distribution, \(({\mathcal{U}}|{\mathcal{Y}}={\mathbf{y}})\), at the observed \({\mathbf{y}}\). (In the case of a LMM, where the conditional distributions, \(({\mathcal{B}}|{\mathcal{Y}}={\mathbf{y}})\) and \(({\mathcal{U}}|{\mathcal{Y}}={\mathbf{y}})\), are multivariate Gaussian, the solution of the PLS problem is also the *mean* of the conditional distribution, but this property doesn’t carry over to GLMMs.)

In a Bernoulli generalized linear mixed model (GLMM) the mode of the conditional distribution, \(({\mathcal{U}}|{\mathcal{Y}}={\mathbf{y}})\), minimizes the *penalized GLM deviance*, \[
\tilde{{\mathbf{u}}}=\arg\min_{{\mathbf{u}}}\left(
\left\|{\mathbf{u}}\right\|^2+\sum_{i-1}^n d(y_i,\eta_i({\mathbf{u}}))
\right) ,
\tag{C.8}\] where \(d(y_i,\eta_i),\,i=1,\dots,n\) are the unit deviances defined in Section C.2. We modify the IRLS algorithm as *penalized iteratively re-weighted least squares* (PIRLS) to determine these values.

As with IRLS, each iteration of the PIRLS algorithm involves using the current linear predictor, \({\boldsymbol{\eta}}({\mathbf{u}})={\mathbf{X}}{\boldsymbol{\beta}}+{\mathbf{Z}}{\boldsymbol{\Lambda}}_{{\boldsymbol{\theta}}}{\mathbf{u}}\), (\({\boldsymbol{\beta}}\) and \({\boldsymbol{\theta}}\) are assumed known and fixed, and \({\mathbf{X}}{\boldsymbol{\beta}}\) is an offset) to evaluate the mean, \({\boldsymbol{\mu}}={\mathbf{g}}^{-1}({\boldsymbol{\eta}})\), of the conditional distribution, \(({\mathcal{Y}}|{\mathcal{U}}={\mathbf{u}})\), as well as the unit deviances, \(d(y_i,\eta_i)\), the square roots of the working weights, which are on the diagonal of \({\mathbf{W}}^{1/2}\), and the weighted, working response, \({\mathbf{W}}^{1/2}\tilde{{\mathbf{y}}}\). The updated spherical random effects vector, \({\mathbf{u}}\), is the solution to \[ ({\boldsymbol{\Lambda}}'{\mathbf{Z}}'{\mathbf{W}}{\mathbf{Z}}{\boldsymbol{\Lambda}}+{\mathbf{I}}){\mathbf{u}}={\boldsymbol{\Lambda}}'{\mathbf{Z}}'{\mathbf{W}}\tilde{{\mathbf{y}}} \] and is evaluated using the Cholesky factor, \({\mathbf{L}}\), of \({\boldsymbol{\Lambda}}'{\mathbf{Z}}'{\mathbf{W}}{\mathbf{Z}}{\boldsymbol{\Lambda}}+{\mathbf{I}}\).

As in the solution of the PLS problem in Section B.7, the fact that \({\mathbf{Z}}\) is sparse and that the sparsity is also present in \({\mathbf{L}}\), makes it feasible to solve for \({\mathbf{u}}\) even when its dimension is large.

To illustrate the calculations we again use the `com05`

model, which has a single, scalar random-effects term, `(1 | dist & urban)`

, in its formula. The matrix \({\mathbf{Z}}\) is displayed as

```
= only(com05.reterms)
com05re Int.(collect(com05re)) # Int values for compact printing
```

```
1934×102 Matrix{Int64}:
1 0 0 0 0 0 0 0 0 0 0 0 0 … 0 0 0 0 0 0 0 0 0 0 0 0
1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
1 0 0 0 0 0 0 0 0 0 0 0 0 … 0 0 0 0 0 0 0 0 0 0 0 0
1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
⋮ ⋮ ⋮ ⋱ ⋮ ⋮ ⋮
0 0 0 0 0 0 0 0 0 0 0 0 0 … 0 0 0 0 0 0 0 0 0 0 0 1
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
0 0 0 0 0 0 0 0 0 0 0 0 0 … 0 0 0 0 0 0 0 0 0 0 0 1
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
```

but internally it is stored much more compactly because it is an *indicator matrix* (also called *one-hot* encoding), which means that all \(Z_{i,j}\in\{0,1\}\) and in each row there is exactly one value that is one (and all the others are zero). The column in which the non-zero element of each row occurs is given as an integer vector in the `refs`

property of `com05re`

.

`' # transpose for compact printing com05re.refs`

```
1×1934 adjoint(::Vector{Int32}) with eltype Int32:
1 1 1 1 1 1 1 1 1 1 1 1 1 … 102 102 102 102 102 102 102
```

We define a struct

```
struct BernoulliPIRLS{T<:AbstractFloat,S<:Integer}
::Matrix{T}
X::Vector{T}
θβ::NamedTuple{ # column-table
ytbl (:refs, :y, :η, :offset),
Tuple{Vector{S},Vector{T},Vector{T},Vector{T}},
}
::NamedTuple{ # column-table
utbl (:u, :u0, :Ldiag, :pdev, :pdev0, :aGHQ),
NTuple{6,Vector{T}},
}
::Vector{ # row-table
rtbl NamedTuple{(:μ, :dev, :rtwwt, :wwres, :wwresp),NTuple{5,T}},
}
end
```

with an external constructor

```
function BernoulliPIRLS(
::Matrix{T},
X::Vector{T},
y::Vector{S},
refswhere {T<:AbstractFloat,S<:Integer}
) # use IRLS to check X and y, obtain initial β, and establish rtbl
= fit!(BernoulliIRLS(X, y); verbose=false)
irls = irls.β
β = append!(ones(T, 1), β) # initial θ = 1
θβ = X * β
η
# refs should contain all values from 1 to maximum(refs)
= sort!(unique(refs))
refvals = length(refvals)
q if refvals ≠ 1:q
throw(ArgumentError("sort!(unique(refs)) must be 1:$q"))
end
length(refs) == length(y) ||
throw(ArgumentError("lengths of y and refs aren't equal"))
= (; refs, y, η, offset=copy(η))
ytbl
= NamedTuple(
utbl => zeros(T, q) for
nm in (:u, :u0, :Ldiag, :pdev, :pdev0, :aGHQ)
nm
)return updatetbl!(BernoulliPIRLS(X, θβ, ytbl, utbl, irls.rtbl))
end
```

Why are θ and β stored in a single vector?

The reason for storing both \({\boldsymbol{\theta}}\) and \({\boldsymbol{\beta}}\) in a single vector is to provide for their simultaneous optimization with an optimizer such as those in NLopt.jl.

The `utbl`

field in a `BernoulliPIRLS`

struct contains vectors named `u0`

, `pdev`

, `pdev0`

, and `aGHQ`

, in addition to `u`

and `Ldiag`

. These are not used in the PIRLS algorithm (other than for keeping a copy of the previous \({\mathbf{u}}\)) or for optimizing Laplace’s approximation to the objective. However, they will be used in adaptive Gauss-Hermite quadrature evaluation of the objective (Section C.6), so we keep them in the struct throughout.

The `updatetbl!`

method for this type first evaluates \({\boldsymbol{\eta}}\) via a “virtual” multiplication that forms \({\mathbf{Z}}{\boldsymbol{\Lambda}}_{{\boldsymbol{\theta}}}{\mathbf{u}}\) plus the stored `offset`

, which is \({\mathbf{X}}{\boldsymbol{\beta}}\), then updates the rowtable from \({\mathbf{y}}\), \({\boldsymbol{\eta}}\), and the offset. For this model \({\boldsymbol{\theta}}\) is one-dimensional and \({\boldsymbol{\Lambda}}_{{\boldsymbol{\theta}}}\) is a scalar multiple of \({\mathbf{I}}_q\), the identity matrix of size \(q\), and thus the matrix multiplication by \({\boldsymbol{\Lambda}}_{{\boldsymbol{\theta}}}\) can be expressed as scalar products.

```
function updatetbl!(m::BernoulliPIRLS)
= m.ytbl
(; refs, y, η, offset) = m.utbl.u
u = first(m.θβ)
θ # evaluate η = offset + ZΛu where Λ is θ * I and Z is one-hot
fill!(η, 0)
@inbounds for i in eachindex(η, refs, offset)
+= offset[i] + u[refs[i]] * θ
η[i] end
.= tblrow.(y, η, offset)
m.rtbl return m
end
```

The `pdeviance`

method returns the deviance for the GLM model plus the penalty on the squared length of `u`

.

```
function pdeviance(m::BernoulliPIRLS)
return sum(r.dev for r in m.rtbl) + sum(abs2, m.utbl.u)
end
```

The `updateu!`

method is similar to `updateβ!`

for the `BernoulliIRLS`

type except that it is based on the diagonal matrix \({\boldsymbol{\Lambda}}'{\mathbf{Z}}'{\mathbf{W}}{\mathbf{Z}}{\boldsymbol{\Lambda}}+ {\mathbf{I}}\). Only the diagonal elements of this matrix are constructed and used to solve for the updated \({\mathbf{u}}\) vector. At convergence of the PIRLS algorithm the elements of `Ldiag`

are replaced by their square roots.

```
function updateu!(m::BernoulliPIRLS)
= m.utbl
(; u, u0, Ldiag) copyto!(u0, u) # keep a copy of u
= first(m.θβ) # extract the scalar θ
θ fill!(u, 0)
if iszero(θ) # skip the update if θ == 0
fill!(Ldiag, 1) # L is the identity if θ == 0
return updatetbl!(m)
end
fill!(Ldiag, 0)
@inbounds for (ri, ti) in zip(m.ytbl.refs, m.rtbl)
= θ * ti.rtwwt # non-zero in i'th row of √WZΛ
rtWΛ += abs2(rtWΛ) # accumulate Λ'Z'WZΛ
Ldiag[ri] += rtWΛ * ti.wwresp # accumulate Λ'Z'Wỹ
u[ri] end
.+= 1 # form diagonal of Λ'Z'WZΛ + I = LL'
Ldiag ./= Ldiag # solve for u with diagonal LL'
u return updatetbl!(m) # and update η and rtbl
end
```

Create a `BernoulliPIRLS`

struct for the `com05`

model and check the penalized deviance at the initial values

```
= BernoulliPIRLS(com05.X, com05.y, only(com05.reterms).refs)
m pdeviance(m)
```

`2409.377428160041`

As with IRLS, the first iteration of PIRLS reduces the objective, which is the penalized deviance in this case, substantially.

`pdeviance(updateu!(m))`

`2233.1209476357844`

Create a `pirls!`

method for this struct.

```
function pirls!(m::BernoulliPIRLS; verbose::Bool=false)
= m.utbl
(; u, u0, Ldiag) fill!(u, 0) # start from u == 0
copyto!(u0, u) # keep a copy of u
= pdeviance(updatetbl!(m))
oldpdev && @info 0, oldpdev
verbose for i in 1:10 # maximum of 10 PIRLS iterations
= pdeviance(updateu!(m))
newpdev && @info i, newpdev
verbose if newpdev > oldpdev # PIRLS iteration failed
@warn "PIRLS iteration did not reduce penalized deviance"
copyto!(u, u0) # restore previous u
updatetbl!(m) # restore η and rtbl
break
elseif (oldpdev - newpdev) < (1.0e-8 * oldpdev)
copyto!(u0, u) # keep a copy of u
break
else
copyto!(u0, u) # keep a copy of u
= newpdev
oldpdev end
end
map!(sqrt, Ldiag, Ldiag) # replace diag(LL') by diag(L)
return m
end
```

The PIRLS iterations always start from \({\mathbf{u}}=\mathbf{0}\) so that the converged value of the penalized deviance is reproducible for given values of \(\theta\) and \({\boldsymbol{\beta}}\). If we allowed the algorithm to start at whatever values are currently stored in \({\mathbf{u}}\) then there could be slight differences in the value of the penalized deviance at convergence of PIRLS, which can cause problems when trying to optimize with respect to \(\theta\) and \({\boldsymbol{\beta}}\).

`pirls!(m; verbose=true);`

```
[ Info: (0, 2409.377428160041)
[ Info: (1, 2233.1209476357844)
[ Info: (2, 2231.605935279706)
[ Info: (3, 2231.6002198321576)
[ Info: (4, 2231.600219406561)
```

As with IRLS, PIRLS is a fast and stable algorithm for determining the mode of the conditional distribution \(({\mathcal{U}}|{\mathcal{Y}}={\mathbf{y}})\) with \({\boldsymbol{\theta}}\) and \({\boldsymbol{\beta}}\) held fixed.

`@benchmark pirls!(mm) seconds = 1 setup = (mm = m)`

```
BenchmarkTools.Trial: 4402 samples with 1 evaluation.
Range (min … max): 199.693 μs … 331.599 μs ┊ GC (min … max): 0.00% … 0.00%
Time (median): 227.186 μs ┊ GC (median): 0.00%
Time (mean ± σ): 226.091 μs ± 5.569 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
▁ ▇█
▂▂▂▂▁▂▂▂▂▂▂▂▂▂▂▁▁▂▂▂▁▂▁▂▂▁▁▂▂▂▂▁▂▁▁▁▁▄█▄▄▃▂▃▃▃▃▃██▇▆▃▃▅▄▃▂▂▂▂ ▃
200 μs Histogram: frequency by time 234 μs <
Memory estimate: 112 bytes, allocs estimate: 1.
```

The time taken for the four iterations to determine the conditional mode of \({\mathbf{u}}\) is comparable to the time taken for a single call to `updateβ!`

. Most of the time in `updateβ!`

is spent in the QR factorization to solve the weighted least squares problem, whereas in `updateu!`

and thus in `pirls!`

, we take advantage of the fact that the solution of the penalized, weighted least squares problem is based on a diagonal matrix.

The PIRLS algorithm determines the value of \({\mathbf{u}}\) that minimizes the penalized deviance \[ \tilde{{\mathbf{u}}}=\arg\min_{{\mathbf{u}}}\left(\left\|{\mathbf{u}}\right\|^2+\sum_{i=1}^n d(y_i,\eta_i)\right) , \] where \(\eta_i, i=1,\dots,n\) is the \(i\)th component of \({\boldsymbol{\eta}}={\mathbf{X}}{\boldsymbol{\beta}}+{\mathbf{Z}}{\boldsymbol{\Lambda}}{\mathbf{u}}\). A quadratic approximation to the penalized deviance at \(\tilde{{\mathbf{u}}}\) is \[ \begin{aligned} \tilde{d}({\mathbf{y}},{\mathbf{u}})&=\|{\mathbf{u}}\|^2+\sum_{i=1}^n d(y_i,\eta_i)\\ &\approx\|\tilde{{\mathbf{u}}}\|^2+\sum_{i=1}^n d(y_i,\tilde{\eta}_i)+ ({\mathbf{u}}-\tilde{{\mathbf{u}}})'({\boldsymbol{\Lambda}}'{\mathbf{Z}}'{\mathbf{W}}{\mathbf{Z}}{\boldsymbol{\Lambda}}+{\mathbf{I}})({\mathbf{u}}-\tilde{{\mathbf{u}}})\\ &=\|\tilde{{\mathbf{u}}}\|^2+\sum_{i=1}^n d(y_i,\tilde{\eta}_i)+ ({\mathbf{u}}-\tilde{{\mathbf{u}}})'{\mathbf{L}}{\mathbf{L}}'({\mathbf{u}}-\tilde{{\mathbf{u}}})\\ &=\tilde{d}({\mathbf{y}},\tilde{{\mathbf{u}}})+ ({\mathbf{u}}-\tilde{{\mathbf{u}}})'{\mathbf{L}}{\mathbf{L}}'({\mathbf{u}}-\tilde{{\mathbf{u}}}) \end{aligned} \tag{C.9}\] where \({\mathbf{L}}\) is the lower Cholesky factor of \({\boldsymbol{\Lambda}}'{\mathbf{Z}}'{\mathbf{W}}{\mathbf{Z}}{\boldsymbol{\Lambda}}+{\mathbf{I}}\). (In Equation C.9 the linear term in \(({\mathbf{u}}-\tilde{{\mathbf{u}}})\) that would normally occur in such an expression is omitted because the gradient of \(\tilde{d}({\mathbf{y}},{\mathbf{u}})\) is zero at \(\tilde{{\mathbf{u}}}\).)

Laplace’s approximation to the log-likelihood uses this quadratic approximation to the penalized deviance, which is negative one-half the logarithm of the integrand, to approximate the value of the integral.

On the deviance scale, which is negative twice the log-likelihood, the approximation is \[ \begin{aligned} -2\,\ell({\mathbf{u}}|{\mathbf{y}},{\boldsymbol{\theta}},{\boldsymbol{\beta}})&=-2\,\log\left(L({\mathbf{u}}|{\mathbf{y}},{\boldsymbol{\theta}},{\boldsymbol{\beta}})\right)\\ &=-2\,\log\left(\int_{{\mathbf{u}}}\frac{1}{\sqrt{2\pi}^q}\exp\left(\frac{\left\|{\mathbf{u}}\right\|^2+\sum_{i=1}^n d(y_i,\mu_i)}{-2}\right)\,d{\mathbf{u}}\right)\\ &\approx\tilde{d}({\mathbf{y}},\tilde{{\mathbf{u}}})-2\,\log\left( \int_{\mathbf{u}}\frac{1}{\sqrt{2\pi}^q}\exp\left(\frac{[{\mathbf{u}}-\tilde{{\mathbf{u}}}]'{\mathbf{L}}{\mathbf{L}}'[{\mathbf{u}}-\tilde{{\mathbf{u}}}]}{-2}\right)\,d{\mathbf{u}} \right)\\ &=\tilde{d}({\mathbf{y}},\tilde{{\mathbf{u}}})-2\,\log\left( \int_{\mathbf{u}}\frac{1}{\sqrt{2\pi}^q}\exp\left(\frac{\left\|{\mathbf{L}}'[{\mathbf{u}}-\tilde{{\mathbf{u}}}\right\|^2}{-2}\right)\,d{\mathbf{u}} \right)\\ &=\tilde{d}({\mathbf{y}},\tilde{{\mathbf{u}}})-2\,\log\left(|{\mathbf{L}}|^{-1}\right)\\ &=\tilde{d}({\mathbf{y}},\tilde{{\mathbf{u}}})+\log\left(|{\mathbf{L}}|^2\right) \end{aligned} \]

```
function laplaceapprox(m::BernoulliPIRLS)
return pdeviance(m) + 2 * sum(log, m.utbl.Ldiag)
end
```

`laplaceapprox(pirls!(m))`

`2373.518052752164`

The remaining step is to optimize Laplace’s approximation to the GLMM deviance with respect to \(\theta\) and \({\boldsymbol{\beta}}\), which we do using the BOBYQA optimizer from NLopt.jl

```
function StatsAPI.fit!(m::BernoulliPIRLS)
= m.θβ
θβ = length(θβ) # length(β) = p and length(θ) = 1
pp1 = Opt(:LN_BOBYQA, pp1)
opt = view(θβ, 2:pp1)
mβ function obj(x, g)
if !isempty(g)
throw(ArgumentError("gradient not provided, g must be empty"))
end
copyto!(θβ, x)
mul!(m.ytbl.offset, m.X, mβ)
return laplaceapprox(pirls!(m))
end
= obj
opt.min_objective = fill!(similar(θβ), -Inf) # vector of lower bounds
lb 1] = 0 # scalar θ must be non-negative
lb[lower_bounds!(opt, lb)
NLopt.= optimize(opt, copy(θβ))
minf, minx, ret @info (; ret, fevals=opt.numevals, minf)
return m
end
```

`fit!(m);`

`[ Info: (ret = :ROUNDOFF_LIMITED, fevals = 550, minf = 2354.4744815688055)`

```
print(
"Converged to θ = ",
first(m.θβ),
" and β =",
view(m.θβ, 2:lastindex(m.θβ)),
)
```

`Converged to θ = 0.5683043594028967 and β =[-0.3409777149845993, 0.3933796201906975, 0.6064857599227369, -0.012926172564277872, 0.03323478854784157, -0.005626184982660486]`

These estimates differ somewhat from those for model `com05`

.

```
print(
"Estimates for com05: θ = ",
only(com05.θ),
", fmin = ",
deviance(com05),
", and β =",
com05.β, )
```

`Estimates for com05: θ = 0.5761507901895634, fmin = 2353.8241980539815, and β =[-0.3414913998306781, 0.3936080536502067, 0.6064861079468472, -0.012911714642169572, 0.03321662487439253, -0.005625046845040066]`

The discrepancy in the results is because the `com05`

results are based on a more accurate approximation to the integral called *adaptive Gauss-Hermite Quadrature*, which is discussed in Section C.6.

There is an implicit assumption in the `BernoulliPIRLS`

structure that random effects in the model are simple, scalar random effects associated with a single grouping factor, which is represented by `m.ytbl.refs`

. For such models the random effects model matrix, \({\mathbf{Z}}\), is an \(n\times q\) indicator matrix, the covariance parameter, \({\boldsymbol{\theta}}\), is one-dimensional and the covariance factor, \({\boldsymbol{\Lambda}}_{{\boldsymbol{\theta}}}=\theta\,{\mathbf{I}}_q\) is a scalar multiple of the \(q\times q\) identity matrix, \({\mathbf{I}}_q\). Furthermore, \({\boldsymbol{\Lambda}}_{{\boldsymbol{\theta}}}'{\mathbf{Z}}'{\mathbf{W}}{\mathbf{Z}}{\boldsymbol{\Lambda}}_{{\boldsymbol{\theta}}}+{\mathbf{I}}_q\) is also diagonal, as is its Cholesky factor, \({\mathbf{L}}\).

We have taken advantage of the special structure of these matrices both in representations — storing \({\mathbf{L}}\) by storing only the diagonal values in the vector `m.utbl.Ldiag`

— and in some algorithms for the PIRLS iterative step.

The PIRLS algorithm to determine the conditional mode, \(\tilde{{\mathbf{u}}}\), of the random effects, \({\mathcal{U}}\), and Laplace’s approximation to the log-likelihood for GLMMs can be generalized to models with vector-valued random effects, or with random effects associated with more than one grouping factor, or with both. The more general representation of \({\mathbf{Z}}\) and \({\mathbf{L}}\) used with linear mixed models can be adapted for GLMMs as well.

We chose to specialize the representation of GLMMs in this appendix to this specific type of random effects to be able to demonstrate adaptive Gauss-Hermite quadrature in Section C.6, which, at present, is restricted to models with a single, simple, scalar, random effects term.

Recall from Equation C.9 that Laplace’s approximation to the likelihood is based on a quadratic approximation to the penalized (GLM) deviance, \(\tilde{d}({\mathbf{y}},{\mathbf{u}})\), at the conditional mode \(\tilde{{\mathbf{u}}}\). In the case of a model with a single, scalar, random effects term, like the model `com05`

, each linear predictor value, \(\eta_i,\,i=1,\dots,n\) depends on only one element of \({\mathbf{u}}\). Writing the set of indices \(i\) for which `refs[i] == j`

as \({\mathcal{I}}(j)\), we can express the penalized deviance, and its quadratic approximation, as sums of scalar contributions, \[
\begin{aligned}
\tilde{d}({\mathbf{y}},{\mathbf{u}})&=\|{\mathbf{u}}\|^2+\sum_{i=1}^n d(y_i,\mu_i)\\
&=\sum_{j=1}^q\left(u_j^2+\sum_{i\in{\mathcal{I}}(j)}d(y_i,\mu_i)\right)\\
&\approx\sum_{j=1}^q \left(\tilde{u}_j^2+\sum_{i\in{\mathcal{I}}(j)}\left(d(y_i,\tilde{\mu}_i)+\ell_j^2(u_j-\tilde{u}_j)^2\right)\right)
\end{aligned}
\] where \(\ell_j\) is the \(j\)th diagonal element of \({\mathbf{L}}\) and \(\tilde{\mu}_i\) is the value of \(\mu_i\) when \(u_j=\tilde{u}_j\).

Extending the notation of Equation C.9 we write the contribution from \(u_j\) to the penalized (GLM) deviance, and to its quadratic approximation, as \[ \begin{aligned} \tilde{d}_j(u_j)&= u_j+\sum_{i\in{\mathcal{I}}(j)}d(y_i,\mu_i)\\ &\approx \tilde{u_j} + \sum_{i\in{\mathcal{I}}(j)}\left(d(y_i,\tilde{\mu}_i)+\ell_j^2(u_j-\tilde{u}_j)\right)\\ &=\tilde{d}_j(\tilde{u}_j)+\ell_j^2(u_j-\tilde{u}_j)\quad j=1,\dots,q , \end{aligned} \] giving Laplace’s approximation to the scalar integral defining the contribution of \(u_j\) to negative twice the log-likelihood as \[ \begin{aligned} -2\log\int_{u_j}\frac{e^{-\tilde{d}_j(u_j)/2}}{\sqrt{2\pi}}\,du_j&\approx -2\log\int_{u_j}\frac{e^{\left((-\tilde{d}_j(\tilde{u_j})-\ell_j^2(u_j-\tilde{u}_j)^2)/2\right)}}{\sqrt{2\pi}}\,du_j\\ &=\tilde{d}_j(\tilde{u}_j)-2\log\int_{u_j}\frac{e^{-\ell_j^2(u_j-\tilde{u}_j)/2}}{\sqrt{2\pi}}\,du_j\\ &=\tilde{d}_j(\tilde{u}_j)-2\log\int_{z_j}\frac{e^{-z_j^2/2}}{\sqrt{2\pi}}\,\frac{dz_j}{\ell_j}\\ &=\tilde{d}_j(\tilde{u}_j)+2\log(\ell_j)-2\log\int_{z_j}\frac{e^{-z_j^2/2}}{\sqrt{2\pi}}\,dz_j\\ &=\tilde{d}_j(\tilde{u}_j)+2\log(\ell_j)-2\log(1)\\ &=\tilde{d}_j(\tilde{u}_j)+2\log(\ell_j) \end{aligned} \tag{C.10}\] using the change of variable \[ z_j=\ell_j(u_j-\tilde{u}_j)\quad j=1,\dots,q \tag{C.11}\] with inverse \[ u_j=\frac{z_j}{\ell_j}+\tilde{u}_j\quad j=1,\dots,q \tag{C.12}\] and derivative \[ \frac{du_j}{dz_j}=\frac{1}{\ell_j} . \]

The change of variable Equation C.11 allows us to express the contribution from \(u_j\) to Laplace’s approximation as a constant, \(\tilde{d}_j(\tilde{u}_j)\), plus the integral of a multiple, \(1/\ell_j\), of the density of a standard normal distribution \[ \phi(z)=\frac{e^{-z^2/2}}{\sqrt{2\pi}} . \]

The \(K\)th-order normalized Gauss-Hermite quadrature rule allows us to extend this approach to evaluate integrals of the form \[ \int_z f(z)\frac{e^{-z^2/2}}{\sqrt{2\pi}}\, dz \approx \sum_{k=1}^K w_k f(z_k) \tag{C.13}\] where the weights, \(w_k,\,k=1,\dots,K\), and the absiccae, \(z_k,\,k=1,\dots,K\) are evaluated as described in Section C.6.1. The approximation Equation C.13 is exact when \(f(z)\) is a polynomial of order \(2K-1\) or less.

We will apply a rule like Equation C.13 where \(f\) is the exponential of negative half the difference between the penalized deviance, \(\tilde{d}_j(z_j/\ell_j+\tilde{u}_j)\), and its quadratic approximation at the conditional mode, \(\tilde{u}_j\). That is, we will center the standard normal density at the conditional mode, \(\tilde{u}_j,\,j=1,\dots,q\), and scale it by the inverse of the quadratic term, \(\ell_j,\,j=1,\dots,q\), in the quadratic approximation at that value of \(u\). This is said to be an *adaptive* quadrature rule because we are shifting and scaling the evaluation points according to the current conditions in the iterative algorithm.

In other words we will first apply PIRLS to determine the conditional modes and the quadratic terms, then use a normalized Gauss-Hermite quadrature rule.

*Gaussian Quadrature rules* provide sets of `x`

values, called *abscissae*, and corresponding weights, `w`

, to approximate an integral with respect to a *weight function*, \(g(x)\). For a \(K\)th order rule the approximation is \[
\int f(x)g(x)\,dx \approx \sum_{k=1}^K w_k f(x_k)
\]

For the Gauss-Hermite rule the weight function is \[
g(x) = e^{-x^2}
\] and the domain of integration is \((-\infty, \infty)\). A slight variation of this is the *normalized Gauss-Hermite* rule for which the weight function is the standard normal density \[
g(z) = \phi(z) = \frac{e^{-z^2/2}}{\sqrt{2\pi}} .
\]

Thus, the expected value of \(f(z)\), where \(\mathcal{Z}\sim\mathscr{N}(0,1)\), is approximated as \[ \mathbb{E}[f]=\int_{-\infty}^{\infty} f(z) \phi(z)\,dz\approx\sum_{k=1}^K w_k\,f(z_k) . \]

Naturally, there is a caveat. For the approximation to be accurate the function \(f(z)\) must behave like a low-order polynomial over the range of interest. More formally, a \(K\)th order rule is exact when \(f\) is a polynomial of order \(2K-1\) or less.

In the *Golub-Welsch algorithm* the abscissae for a particular Gaussian quadrature rule are determined as the eigenvalues of a symmetric tri-diagonal matrix and the weights are derived from the squares of the first row of the matrix of eigenvectors. For a \(K\)th order normalized Gauss-Hermite rule the tridiagonal matrix has zeros on the diagonal and the square roots of `1:k-1`

on the super- and sub-diagonal, e.g.

`= SymTridiagonal(zeros(5), sqrt.(1:4)) sym5 `

```
5×5 SymTridiagonal{Float64, Vector{Float64}}:
0.0 1.0 ⋅ ⋅ ⋅
1.0 0.0 1.41421 ⋅ ⋅
⋅ 1.41421 0.0 1.73205 ⋅
⋅ ⋅ 1.73205 0.0 2.0
⋅ ⋅ ⋅ 2.0 0.0
```

```
= eigen(sym5);
ev ev.values
```

```
5-element Vector{Float64}:
-2.8569700138728
-1.3556261799742608
1.3322676295501878e-15
1.3556261799742677
2.8569700138728056
```

`abs2.(ev.vectors[1, :])`

```
5-element Vector{Float64}:
0.011257411327720667
0.2220759220056134
0.5333333333333334
0.222075922005612
0.011257411327720648
```

A function of `k`

to evaluate the abscissae and weights is

```
function gausshermitenorm(k)
= eigen(SymTridiagonal(zeros(k), sqrt.(1:(k - 1))))
ev return Table((;
=ev.values,
abscissae=abs2.(ev.vectors[1, :]),
weights
))end
```

providing

`gausshermitenorm(5)`

```
Table with 2 columns and 5 rows:
abscissae weights
┌───────────────────────
1 │ -2.85697 0.0112574
2 │ -1.35563 0.222076
3 │ 1.33227e-15 0.533333
4 │ 1.35563 0.222076
5 │ 2.85697 0.0112574
```

The weights and positions for the 9th order rule are shown in Figure C.1.

```
draw(
data(gausshermitenorm(9)) *
mapping(:abscissae => "Positions", :weights);
=(; size=(600, 450)),
figure )
```

Notice that the magnitudes of the weights drop quite dramatically away from zero, even on a logarithmic scale (Figure C.2)

```
draw(
data(gausshermitenorm(9)) * mapping(
:abscissae => "Positions",
:weights => log2 => "log₂(weight)",
);=(; size=(600, 450)),
figure )
```

The definition of `MixedModels.GHnorm`

is similar to the `gausshermitenorm`

function with some extra provisions for ensuring symmetry of the abscissae and of the weights and for caching values once they have been calculated.

```
let tbl = GHnorm(9)
Table(abscissae=tbl.z, weights=tbl.w)
end
```

```
Table with 2 columns and 9 rows:
abscissae weights
┌──────────────────────
1 │ -4.51275 2.23458e-5
2 │ -3.20543 0.00278914
3 │ -2.07685 0.0499164
4 │ -1.02326 0.244098
5 │ 0.0 0.406349
6 │ 1.02326 0.244098
7 │ 2.07685 0.0499164
8 │ 3.20543 0.00278914
9 │ 4.51275 2.23458e-5
```

In particular, when \(K\) is odd the middle abscissa, at index \((K+1)/2\), is exactly zero.

As an example of evaluation using these weights and abscissae, we consider \[ \mathbb{E}[g(x)] \approx \sum_{i=1}^k g(\mu + \sigma z_i)\,w_i \] where \(\mathcal{X}\sim\mathscr{N}(\mu, \sigma^2)\).

For example, \(\mathbb{E}[\mathcal{X}^2]\) where \(\mathcal{X}\sim\mathcal{N}(2, 3^2)\) is

```
let μ = 2, σ = 3, ghn3 = GHnorm(3)
sum(@. ghn3.w * abs2(μ + σ * ghn3.z)) # should be μ² + σ² = 13
end
```

`13.0`

(In general a dot, ‘`.`

’, after the function name in a function call, as in `abs2.(...)`

, or before an operator creates a *fused vectorized* evaluation in Julia. The macro `@.`

has the effect of vectorizing all operations in the subsequent expression.)

We have provided a `pdev`

vector in the `utbl`

field of a `BernoulliPIRLS`

object to allow for accumulation of the penalized deviance contributions for each component of \({\mathbf{u}}\) in the model.

```
function pdevcomps!(m::BernoulliPIRLS)
= m.utbl
(; u, pdev) .= abs2.(u) # initialize pdevj to square of uj
pdev @inbounds for (ri, ti) in zip(m.ytbl.refs, m.rtbl)
+= ti.dev
pdev[ri] end
return m
end
```

After PIRLS has converged, we evaluate `pdevcomps!`

and copy the `pdev`

column of the `utbl`

field to its `pdev0`

column, which will be the baseline evaluation of the penalized deviance at the conditional modes. Other evaluations of the penalized deviance components are plotted as differences from `pdev0`

.

```
pdevcomps!(pirls!(m))
copyto!(m.utbl.pdev0, m.utbl.pdev)
Table(m.utbl)
```

```
Table with 6 columns and 102 rows:
u u0 Ldiag pdev pdev0 aGHQ
┌────────────────────────────────────────────────────────
1 │ -1.02424 -1.02424 2.3596 78.2322 78.2322 0.0
2 │ -1.6554 -1.6554 1.87894 41.917 41.917 0.0
3 │ -0.0432085 -0.0432085 1.54253 21.8681 21.8681 0.0
4 │ 0.500796 0.500796 1.07154 2.68642 2.68642 0.0
5 │ 1.10928 1.10928 1.29471 8.58305 8.58305 0.0
6 │ -0.415844 -0.415844 1.49343 18.6901 18.6901 0.0
7 │ 0.030515 0.030515 1.06624 1.46897 1.46897 0.0
8 │ 0.216409 0.216409 1.87125 46.1746 46.1746 0.0
9 │ 0.433361 0.433361 1.2297 9.7717 9.7717 0.0
10 │ -0.648279 -0.648279 2.10967 58.417 58.417 0.0
11 │ -0.328134 -0.328134 1.47521 23.5301 23.5301 0.0
12 │ 0.499976 0.499976 1.06882 2.74129 2.74129 0.0
13 │ 0.139717 0.139717 1.82727 43.6776 43.6776 0.0
14 │ 0.176548 0.176548 1.10663 2.77503 2.77503 0.0
15 │ -0.471723 -0.471723 1.51184 21.2797 21.2797 0.0
16 │ -0.757145 -0.757145 1.25035 7.09335 7.09335 0.0
17 │ -1.59799 -1.59799 1.32299 8.74912 8.74912 0.0
⋮ │ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮
```

Consider the change in the penalized deviance from that at the conditional mode for a selection of groups, which, by default, we choose to be the first 5 groups.

```
function pdevdiff(
::BernoulliPIRLS{T};
m=collect(-3.5:inv(32):3.5),
zvals=1:5,
indswhere {T}
) = m.utbl
(; u, u0, Ldiag, pdev, pdev0) pdevcomps!(pirls!(m)) # assign u0
copyto!(pdev0, pdev) # and pdev0
= length(inds)
ni = length(zvals)
nz = Array{T}(undef, ni, nz)
uvals = Array{T}(undef, ni, nz)
exact for (j, z) in enumerate(zvals)
.= u0 .+ z ./ Ldiag # evaluate u from z
u pdevcomps!(updatetbl!(m)) # evaluate pdev
for (i, ind) in enumerate(inds) # store selected u and pdev
= u[ind]
uvals[i, j] = pdev[ind] - pdev0[ind]
exact[i, j] end
end
= collect(uvals') # transpose uvals
uvals = collect(exact') # and exact
exact return (; zvals, inds, uvals, exact)
end
= pdevdiff(m); m05pdevdiff
```

We begin with plots of the difference in the penalized deviance from its value at the conditional mode, \(\tilde{d}_j(u_j)-\tilde{d_j}(\tilde{u}_j)\), for \(j=1,\dots,5\) in Figure C.3

```
let (; zvals, inds, uvals, exact) = m05pdevdiff,
= Figure(; size=(600, 375)),
fig = Axis(
ax 1, 1];
fig[="u",
xlabel="Change in penalized deviance",
ylabel
)
= [
lins lines!(ax, view(uvals, :, j), view(exact, :, j)) for
in axes(uvals, 2)
j
]Legend(fig[1, 2], lins, string.(inds))
figend
```

then shift and scale the horizontal axis to the \(z\) scale for this difference in the penalized deviance (from that at the conditional mode) in Figure C.4.

```
let (; zvals, inds, uvals, exact) = m05pdevdiff,
= Figure(; size=(600, 375)),
fig = Axis(
ax 1, 1];
fig[="z",
xlabel="Change in penalized deviance",
ylabel
)
=
lins lines!(ax, zvals, view(exact, :, j)) for j in axes(uvals, 2)]
[Legend(fig[1, 2], lins, string.(inds))
figend
```

The next stage is to plot, on the \(z\) scale, the difference between the penalized deviance and its quadratic approximation at \(z=0\) in Figure C.5

```
let (; zvals, inds, uvals, exact) = m05pdevdiff,
= Figure(; size=(600, 375)),
fig = Axis(
ax 1, 1];
fig[="z",
xlabel="Penalized deviance minus quadratic approx",
ylabel
)
= [
lins lines!(ax, zvals, view(exact, :, j) .- abs2.(zvals)) for
in axes(uvals, 2)
j
]Legend(fig[1, 2], lins, string.(inds))
figend
```

and, finally, the exponential of negative half of this difference in Figure C.6

```
let (; zvals, inds, uvals, exact) = m05pdevdiff,
= Figure(; size=(600, 375)),
fig = Axis(
ax 1, 1];
fig[="z",
xlabel="Exp of half the quadratic minus penalized deviance",
ylabel
)
= [
lins lines!(
ax,
zvals,exp.(0.5 .* (abs2.(zvals) .- view(exact, :, j))),
in axes(uvals, 2)
) for j
]Legend(fig[1, 2], lins, string.(inds))
figend
```

Writing the function shown in Figure C.6 — the exponential of negative half the difference between the penalized deviance and its quadratic approximation — as \[ f_j(z_j)=\exp{\left(\frac{\tilde{d}_j(z_j/\ell_j+\tilde{u}_j)-\tilde{d}_j(\tilde{u}_j)-z_j^2}{-2}\right)} \quad j=1,\dots,q \] we can modify the scalar version of Laplace’s approximation, Equation C.10, as \[ \begin{multline*} -2\log\int_{u_j}\frac{e^{-\tilde{d}_j(u_j)/2}}{\sqrt{2\pi}}\,du_j\\ =-2\log\int_{z_j}\frac{f_j(z_j)\,e^{-(\tilde{d}_j(\tilde{u}_j)+z_j^2)/2}}{\sqrt{2\pi}}\,\frac{dz_j}{\ell_j}\\ =\tilde{d}_j(\tilde{u}_j)+2\log(\ell_j)-2\log\int_{z_j}f_j(z_j)\frac{e^{-z_j^2/2}}{\sqrt{2\pi}}\,dz_j \end{multline*} \tag{C.14}\]

A \(K\)th-order adaptive Gauss-Hermite quadrature approximation to the objective, negative twice the log-likelihood, for the GLMM model is Laplace’s approximation minus twice the logarithm of the \(K\)th order normalized Gauss-Hermite quadrature rule applied to \(f_j(z_j)\). We use the `aGHQ`

column of `m.utbl`

to accumulate these contributions and to take the logarithm then multiply the result by -2.

```
function evalGHQ!(m::BernoulliPIRLS; nGHQ::Integer=9)
= m
(; ytbl, utbl, rtbl) = utbl
(; u, u0, Ldiag, pdev, pdev0, aGHQ) = GHnorm(nGHQ)
ghqtbl pdevcomps!(pirls!(m)) # ensure that u0 and pdev0 are current
copyto!(pdev0, pdev)
fill!(aGHQ, 0)
for (z, w) in zip(ghqtbl.z, ghqtbl.w)
if iszero(z) # exp term is one when z == 0
.+= w
aGHQ else
.= u0 .+ z ./ Ldiag
u pdevcomps!(updatetbl!(m))
.+= w .* exp.((abs2(z) .+ pdev0 .- pdev) ./ 2)
aGHQ end
end
map!(log, aGHQ, aGHQ) # log.(aGHQ) in place
.*= -2
aGHQ return m
end
```

```
evalGHQ!(m)
Table(m.utbl)
```

```
Table with 6 columns and 102 rows:
u u0 Ldiag pdev pdev0 aGHQ
┌─────────────────────────────────────────────────────────────
1 │ 0.888261 -1.02424 2.3596 98.9905 78.2322 -0.00503286
2 │ 0.746346 -1.6554 1.87894 65.9282 41.917 -0.00602822
3 │ 2.88235 -0.0432085 1.54253 42.7156 21.8681 -0.0077699
4 │ 4.71226 0.500796 1.07154 22.5154 2.68642 -0.00332321
5 │ 4.5948 1.10928 1.29471 26.5466 8.58305 -0.00560788
6 │ 2.60588 -0.415844 1.49343 40.3098 18.6901 -0.00726535
7 │ 4.26289 0.030515 1.06624 21.5982 1.46897 -0.00232427
8 │ 2.62803 0.216409 1.87125 67.5008 46.1746 -0.00639679
9 │ 4.10316 0.433361 1.2297 28.6561 9.7717 -0.00680155
10 │ 1.4908 -0.648279 2.10967 80.9664 58.417 -0.00583025
11 │ 2.73092 -0.328134 1.47521 44.9692 23.5301 -0.00736271
12 │ 4.72216 0.499976 1.06882 22.6241 2.74129 -0.00283963
13 │ 2.60939 0.139717 1.82727 64.9152 43.6776 -0.00681333
14 │ 4.25447 0.176548 1.10663 22.371 2.77503 -0.00457801
15 │ 2.51322 -0.471723 1.51184 43.0725 21.2797 -0.00742932
16 │ 2.85204 -0.757145 1.25035 29.6883 7.09335 -0.00309483
17 │ 1.81302 -1.59799 1.32299 32.9677 8.74912 -0.00213836
⋮ │ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮
```

`extrema(m.utbl.aGHQ)`

`(-0.008514108854172236, -0.0004132698576107887)`

As we see, these “correction terms” relative to Laplace’s approximation are relatively small, compared to the contributions to the objective from each component of \({\mathbf{u}}\). Also, the corrections are all negative, in this case. Close examination of the individual curves in Figure C.5 shows that these curves, which are \(-2\log(f_j(z))\), are more-or-less odd functions, in the sense that the value at \(-z\) is approximately the negative of the value at \(z\). If we were integrating \(\log(f_j(z_j))\phi(z_j)\) with a normalized Gauss-Hermite rule the negative and positive values would cancel out, for the most part, and some of the integrals would be positive while others would be negative.

When we consider \(f_j(z_j)\), shown in Figure C.6, the exponential function converts from differences to ratios and stretches positive differences more than negative differences, resulting in values slightly greater than 1 for \(\int_z f_j(z)\phi(z) dz\) and, after taking negative twice the logarithm, correction terms that are slightly less than zero.

```
function fitGHQ!(m::BernoulliPIRLS; nGHQ::Integer=9)
= m.utbl
(; Ldiag, pdev0, aGHQ) = m.θβ
θβ = length(θβ) # length(β) = p and length(θ) = 1
pp1 = Opt(:LN_BOBYQA, pp1)
opt = view(θβ, 2:pp1)
mβ function obj(x, g)
if !isempty(g)
throw(ArgumentError("gradient not provided, g must be empty"))
end
copyto!(θβ, x)
mul!(m.ytbl.offset, m.X, mβ)
evalGHQ!(m; nGHQ)
return sum(pdev0) + sum(aGHQ) + 2 * sum(log, Ldiag)
end
= obj
opt.min_objective = fill!(similar(θβ), -Inf) # vector of lower bounds
lb 1] = 0 # scalar θ must be non-negative
lb[lower_bounds!(opt, lb)
NLopt.= optimize(opt, copy(θβ))
minf, minx, ret @info (; ret, fevals=opt.numevals, minf)
return m
end
```

```
fitGHQ!(m)
m.θβ
```

```
┌ Warning: PIRLS iteration did not reduce penalized deviance
└ @ Main.Notebook ~/Work/EmbraceUncertainty/aGHQ.qmd:1064
┌ Warning: PIRLS iteration did not reduce penalized deviance
└ @ Main.Notebook ~/Work/EmbraceUncertainty/aGHQ.qmd:1064
[ Info: (ret = :ROUNDOFF_LIMITED, fevals = 479, minf = 2353.82419755322)
```

```
7-element Vector{Float64}:
0.5761321679271924
-0.3414655990254175
0.39359939391066806
0.6064447618771712
-0.012909685721680265
0.03320994962034241
-0.005624606329593786
```

*This page was rendered from git revision 05a171b
.*