vignettes/fastglm-overview.Rmd

The fastglm package is intended as a fast
and stable alternative to
stats::glm() for fitting generalized linear models. It is
compatible with R’s family objects (see
?family) and produces fitted objects whose downstream
methods — summary(), vcov(),
predict(), coef(), residuals(),
logLik() — behave exactly like those for a
glm. This vignette walks through the main functionality of
the package at a higher level than the other vignettes; it is intended
as a single entry point for a new user.
The main package function is fastglm(). It takes a
numeric design matrix x, a response y, and an
R family object:
data(esoph)
x <- model.matrix(~ agegp + unclass(tobgp) + unclass(alcgp),
data = esoph)
y <- cbind(esoph$ncases, esoph$ncontrols)
fit <- fastglm(x, y, family = binomial(link = "cloglog"))
summary(fit)
#>
#> Call:
#> fastglm.default(x = x, y = y, family = binomial(link = "cloglog"))
#>
#> Coefficients:
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) -4.29199 0.29777 -14.414 < 2e-16 ***
#> agegp.L 3.30677 0.63454 5.211 1.88e-07 ***
#> agegp.Q -1.35525 0.57310 -2.365 0.018 *
#> agegp.C 0.20296 0.43333 0.468 0.640
#> agegp^4 0.15056 0.29318 0.514 0.608
#> agegp^5 -0.23425 0.17855 -1.312 0.190
#> unclass(tobgp) 0.33276 0.07285 4.568 4.93e-06 ***
#> unclass(alcgp) 0.80384 0.07538 10.664 < 2e-16 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for binomial family taken to be 1)
#>
#> Null deviance: 61.609 on 87 degrees of freedom
#> Residual deviance: 96.950 on 80 degrees of freedom
#> AIC: 228
#>
#> Number of Fisher Scoring iterations: 6

fastglm() does not accept a formula directly; it
operates on a pre-built design matrix. To use a formula and a data
frame, pass fastglm_fit as the fitting function to base
glm():
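A sketch of that pattern, reusing the esoph model above (this assumes fastglm_fit accepts the same arguments as base R's glm.fit()):

```r
## glm() handles the formula, contrasts, and data frame; the inner
## weighted least-squares solve is delegated to fastglm_fit
fit_f <- glm(cbind(ncases, ncontrols) ~ agegp + unclass(tobgp) + unclass(alcgp),
             data = esoph, family = binomial(link = "cloglog"),
             method = fastglm_fit)
coef(fit_f)
```

Because glm() does the formula handling, update() and predict() with a newdata data frame work on the result as usual.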
The IRLS algorithm reduces every iteration to a weighted least-squares problem. fastglm supports six different matrix decompositions for solving that WLS step, all from RcppEigen (Bates and Eddelbuettel, 2013); the choice trades off speed against numerical stability and rank-revealing behaviour:
| method | decomposition |
|---|---|
| 0 | column-pivoted Householder QR (default; rank-revealing) |
| 1 | unpivoted Householder QR |
| 2 | LLT Cholesky |
| 3 | LDLT Cholesky |
| 4 | full-pivoted Householder QR |
| 5 | bidiagonal divide-and-conquer SVD |
The default (method = 0) is the safe choice: it is
rank-revealing, so it handles aliased / collinear columns gracefully.
The Cholesky methods (2 and 3) are roughly
3–4× faster but assume full column rank.
set.seed(123)
n <- 5000; p <- 30
x <- matrix(rnorm(n * p), n, p)
y <- rbinom(n, 1, plogis(x %*% rnorm(p) * 0.05))
system.time(f0 <- fastglm(x, y, family = binomial())) # default
#> user system elapsed
#> 0.004 0.000 0.004
system.time(f2 <- fastglm(x, y, family = binomial(), method = 2)) # LLT
#> user system elapsed
#> 0.002 0.000 0.002
system.time(f3 <- fastglm(x, y, family = binomial(), method = 3)) # LDLT
#> user system elapsed
#> user system elapsed
#> 0.002 0.000 0.002

fastglm does not compromise computational stability for
speed. It uses a step-halving safeguard following Marschner (2011) and
starts from better-initialised values than glm() or
glm2::glm2(), so it tends to converge in cases where the
standard IRLS algorithm fails. See
vignette("fastglm", package = "fastglm") for a
Gamma-with-sqrt-link example where glm() does
not converge but fastglm() does.
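The flavour of that example can be sketched with simulated data (illustrative only; the data and coefficients here are my own, not the vignette's):

```r
set.seed(1)
n <- 200
x <- cbind(1, rnorm(n))
mu <- drop(x %*% c(2, 0.5))^2              # sqrt link: mu = eta^2
y <- rgamma(n, shape = 2, rate = 2 / mu)   # mean mu, shape 2
## With a sqrt link, IRLS steps can overshoot into invalid (negative) eta;
## fastglm halves the step whenever the deviance fails to decrease
fit <- fastglm(x, y, family = Gamma(link = "sqrt"))
```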
vcov(), robust standard errors, and predictions
The fitted object stores the unscaled covariance directly (no
refitting), and the standard vcov() /
summary() machinery works as expected:
fit <- fastglm(x, y, family = binomial(), method = 2)
V <- vcov(fit)
dim(V)
#> [1] 30 30
## standard errors agree with the diagonal of vcov()
all.equal(unname(sqrt(diag(V))), unname(coef(summary(fit))[, "Std. Error"]))
#> [1] TRUE

Heteroskedasticity-consistent and cluster-robust covariance matrices
are available via sandwich::vcovHC() and
sandwich::vcovCL() — fastglm registers methods
on those generics, so loading sandwich is all that is
required:
library(sandwich)
V_hc <- vcovHC(fit, type = "HC0")
V_hc[1:3, 1:3]
#> X1 X2 X3
#> X1 8.546454e-04 7.509395e-06 1.483571e-06
#> X2 7.509395e-06 8.321108e-04 7.951566e-07
#> X3 1.483571e-06 7.951566e-07 8.351855e-04
cluster <- rep(seq_len(20), length.out = n)
V_cl <- vcovCL(fit, cluster = cluster, type = "HC1")
V_cl[1:3, 1:3]
#> X1 X2 X3
#> X1 0.0007676800 -3.021035e-04 1.517204e-04
#> X2 -0.0003021035 8.060696e-04 8.492543e-05
#> X3 0.0001517204 8.492543e-05 7.823621e-04

Results are numerically identical to sandwich::vcovHC()
and sandwich::vcovCL() applied to a glm fit
(Zeileis, Köll, and Graham, 2020) to floating-point precision. They work
for sparse and big.matrix fits as well, since the full
design matrix is preserved on the fitted object (as a reference, not
copied).
predict() supports se.fit = TRUE:
xnew <- x[1:5, , drop = FALSE]
predict(fit, newdata = xnew, type = "response", se.fit = TRUE)
#> $fit
#> [1] 0.6087584 0.5272820 0.4245913 0.5494426 0.6192178
#>
#> $se.fit
#> [1] 0.03919065 0.03095856 0.03623790 0.03629691 0.03512202
#>
#> $residual.scale
#> [1] 1

For designs that are sparse, that live on disk, or that have to be built from a parquet / arrow / DuckDB source, fastglm provides three large-data paths that share a common streaming kernel and produce identical results:
- Matrix::dgCMatrix: pass directly to fastglm(). Useful for one-hot encoded categoricals and high-dimensional sparse designs.
- bigmemory::big.matrix: pass directly to fastglm(). The matrix is read in row-blocks and never fully materialised in memory.
- fastglm_streaming(chunk_callback, n_chunks, family): a user-supplied closure yields one row-block per call. The right path for fitting on a parquet dataset, DuckDB query, or any external columnar store.

See vignette("large-data-fastglm", package = "fastglm")
for a detailed walk through all three paths. A short example:
n <- 4000
X <- cbind(1, matrix(rnorm(n * 3), n, 3))
y <- rbinom(n, 1, plogis(X %*% c(0.2, 0.4, -0.2, 0.3)))
chunk_size <- 1000
chunks <- function(k) {
idx <- ((k - 1) * chunk_size + 1):(k * chunk_size)
list(X = X[idx, , drop = FALSE], y = y[idx])
}
fit_stream <- fastglm_streaming(chunks, n_chunks = 4, family = binomial())
fit_full <- fastglm(X, y, family = binomial(), method = 2)
max(abs(coef(fit_stream) - coef(fit_full)))
#> [1] 1.111683e-07

For the most commonly-used family/link combinations, fastglm
dispatches variance(), mu.eta(), linkinv(), and
dev.resids() to inline C++ implementations rather than
calling back into R once per IRLS iteration. The covered
combinations are:
Detection is automatic: if the family object matches one
of the above, the native path is used; otherwise fastglm falls
back to the R-callback path used in earlier versions. The
native path is meaningfully faster on large n because it
eliminates the per-iteration round-trip to R for each of
the four family functions:
library(microbenchmark)
set.seed(7)
n <- 50000; p <- 30
x <- matrix(rnorm(n * p), n, p)
y <- rpois(n, exp(x %*% rnorm(p) * 0.05 + 1))
## force the R-callback path by giving the family object an unrecognised name
disguise <- function(fam) { fam$family <- paste0(fam$family, "*"); fam }
mb <- microbenchmark(
native = fastglm(x, y, family = poisson(), method = 2),
callback = fastglm(x, y, family = disguise(poisson()), method = 2),
times = 5L
)
print(mb)
#> Unit: milliseconds
#> expr min lq mean median uq max neval cld
#> native 16.44772 16.89647 17.05270 17.11078 17.33648 17.47207 5 a
#> callback 22.69264 24.28869 24.27382 24.47671 24.71710 25.19397 5 b

The fastglm package switches transparently between the two paths; user code does not change.
A handful of regressions that come up often in practice are not standard GLMs but build on the same IRLS machinery. fastglm provides dedicated entry points for these:
- fastglm_nb() — negative-binomial regression with the dispersion theta estimated jointly with beta. Plays the same role as MASS::glm.nb() but runs the joint (beta, theta) MLE entirely in C++ (IRLS for beta, Brent for theta).
- firth = TRUE — a flag on fastglm() / fastglm_fit() that activates Firth’s bias-reducing penalty on the score. Useful for binary logistic regression under separation or in small samples; analogous to logistf::logistf().
- fastglm_hurdle() — a two-part count model: a binary regression for whether y > 0, plus a zero-truncated Poisson or NB regression on the positive subset. The two parts factorize, and both are fit by the same C++ IRLS solver. Same model as pscl::hurdle().
- fastglm_zi() — a zero-inflated Poisson or NB regression: a binary inflation component overlaid on the original count distribution, fit by an EM algorithm in C++ with closed-form posterior responsibilities and an analytical observed-information vcov. Same model as pscl::zeroinfl().

All four take the standard fastglm tuning options
(method, tol, maxit) and return
objects with the usual coef(), vcov(),
predict(), logLik() methods.
fastglm_hurdle() and fastglm_zi() use the
Formula package’s two-RHS syntax
y ~ x1 + x2 | z1 + z2 to specify different designs for the
count and zero parts. See
vignette("count-firth-fastglm", package = "fastglm") for
examples and timing comparisons against the canonical reference
packages.
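A hedged sketch of two of these entry points (the exact signature of fastglm_nb() and the dist argument name below are assumptions based on the descriptions above):

```r
set.seed(42)
n <- 2000
x <- cbind(1, rnorm(n))
mu <- exp(drop(x %*% c(0.5, 0.3)))
y <- rnbinom(n, size = 1.5, mu = mu)   # overdispersed counts, true theta = 1.5
fit_nb <- fastglm_nb(x, y)             # joint (beta, theta) MLE

dat <- data.frame(y = y, x1 = x[, 2], z1 = rnorm(n))
## two-RHS Formula syntax: count part | zero part
fit_h <- fastglm_hurdle(y ~ x1 | z1, data = dat, dist = "negbin")
```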
There are three R-side functions to fit a GLM with
fastglm. They all call into the same C++ solver, but differ in
how much of base R’s machinery they wrap around it:
- fastglm() is the top-level S3 function. It returns an object of class "fastglm" with deviance, AIC, null deviance, residuals, etc. populated — the right entry point for interactive use.
- fastglm_fit() is designed to be passed as method = fastglm_fit to base glm(). The returned object inherits from "fastglmFit" so that glm()’s formula/data handling, update(), predict(), etc. all work on top.
- fastglmPure() is the lowest-level wrapper: no dispersion, no AIC, no null-deviance computation. Use this when calling fastglm from another package and you only need the coefficients.

Marschner, I. C. (2011). glm2: Fitting generalized linear models with convergence problems. The R Journal, 3(2), 12–15. doi:10.32614/RJ-2011-012
Bates, D. and Eddelbuettel, D. (2013). Fast and elegant numerical linear algebra using the RcppEigen package. Journal of Statistical Software, 52(5), 1–24. doi:10.18637/jss.v052.i05
Zeileis, A., Köll, S., and Graham, N. (2020). Various versatile variances: An object-oriented implementation of clustered covariances in R. Journal of Statistical Software, 95(1), 1–36. doi:10.18637/jss.v095.i01