`fastglm_streaming()` runs an IRLS GLM fit where the design matrix is produced one chunk at a time by a user-supplied closure. It never holds more than a single chunk in memory at once, plus a `p x p` accumulator. This is the front-end for fitting on data sources too large to load completely into RAM (Arrow datasets, Parquet files, DuckDB query results, on-disk CSV streams, etc.).
fastglm_streaming(
chunk_callback,
n_chunks,
family = gaussian(),
start = NULL,
method = 2L,
tol = 1e-07,
maxit = 100L
)

`chunk_callback`: a function, called as `chunk_callback(k)` for `k = 1, ..., n_chunks`. It must return a list with elements:
- `X`: an `n_k x p` numeric matrix (chunk of the design matrix).
- `y`: numeric vector of length `n_k` (response).
- `weights`: optional; numeric vector of length `n_k` of prior weights.
- `offset`: optional; numeric vector of length `n_k` of offsets.
Every chunk must have the same number of columns, in the same order. The closure is called multiple times per IRLS iteration, so it should be reasonably cheap (e.g. an Arrow scanner that reads from columnar files).
`n_chunks`: integer; the number of chunks to iterate over.

`family`: a `family` object describing the error distribution and link. See [stats::family()].

`start`: optional length-`p` numeric vector of starting coefficients.

`method`: integer; `2` for LLT Cholesky (the default) or `3` for LDLT Cholesky. QR / SVD methods are not supported in streaming mode.

`tol`: convergence tolerance on the relative change in deviance.

`maxit`: maximum number of IRLS iterations.
A list with class `"fastglm"` containing the same elements as [fastglm()], including `coefficients`, `cov.unscaled`, `deviance`, `iter`, `converged`, etc. The design matrix is *not* attached, so `sandwich::vcovHC()` / `sandwich::vcovCL()` will require re-streaming the data.
The IRLS loop and step-halving (Marschner 2011) run entirely in C++; the R closure is called only to *deliver* one chunk at a time. For tier-1 families (gaussian / binomial / poisson / Gamma / inverse.gaussian on their common links) the family functions are evaluated inline in C++, so the only R round-trips per iteration are the chunk fetches.
Standard errors and `cov.unscaled` come from the final `(X' W X)` Cholesky factor, exactly as in the in-memory path.
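To make the accumulation pattern concrete, here is a minimal pure-R sketch of the streaming IRLS loop. This is an illustration only, not the package's C++ implementation: step-halving and careful initialization are omitted, and it assumes chunks with no prior weights or offsets and a zero starting vector.

```r
# Illustration only: the streaming accumulation that fastglm_streaming()
# performs in C++, written out in plain R. Assumes chunks return list(X, y)
# with no weights/offsets, and starts from beta = 0.
streaming_irls <- function(chunk_callback, n_chunks, family, p,
                           tol = 1e-7, maxit = 100L) {
  beta <- numeric(p)
  dev_old <- Inf
  for (it in seq_len(maxit)) {
    XtWX <- matrix(0, p, p)   # the p x p accumulator
    XtWz <- numeric(p)
    dev  <- 0
    for (k in seq_len(n_chunks)) {   # one full pass over the data per iteration
      ch     <- chunk_callback(k)
      eta    <- drop(ch$X %*% beta)
      mu     <- family$linkinv(eta)
      mu_eta <- family$mu.eta(eta)
      w      <- mu_eta^2 / family$variance(mu)        # IRLS working weights
      z      <- eta + (ch$y - mu) / mu_eta            # working response
      XtWX   <- XtWX + crossprod(ch$X * w, ch$X)      # accumulate X'WX
      XtWz   <- XtWz + drop(crossprod(ch$X * w, z))   # accumulate X'Wz
      dev    <- dev + sum(family$dev.resids(ch$y, mu, rep(1, nrow(ch$X))))
    }
    beta <- drop(solve(XtWX, XtWz))   # the C++ path uses an LLT/LDLT solve here
    if (abs(dev - dev_old) / (abs(dev) + 0.1) < tol) break
    dev_old <- dev
  }
  # cov.unscaled corresponds to chol2inv(chol(XtWX))
  list(coefficients = beta, deviance = dev, iter = it)
}
```

Run against the `chunks` closure from the example below (with `p = 5`), this sketch should reproduce `coef(fit_stream)` to within the IRLS tolerance.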
Wrap an Arrow scanner in a closure: pull each `RecordBatch` as a chunk, build the model matrix, and return it together with the response. `chunks(k)` opens the dataset, scans the `k`-th batch, calls `as.data.frame()`, then returns `list(X = model.matrix(~ x1 + x2, data = tbl), y = tbl$y)`. Pass that closure plus `n_chunks = length(batches)` to `fastglm_streaming()`, as sketched below.
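A hedged sketch of that recipe. For simplicity it treats each Parquet file in a directory as one chunk rather than scanning individual `RecordBatch`es (random access to the `k`-th batch is awkward); the directory path, formula, and response column are placeholders.

```r
library(arrow)

# Placeholder path and columns; each Parquet file acts as one chunk.
files <- list.files("data/big_dataset/", pattern = "\\.parquet$",
                    full.names = TRUE)

chunks <- function(k) {
  tbl <- as.data.frame(read_parquet(files[k]))
  list(X = model.matrix(~ x1 + x2, data = tbl), y = tbl$y)
}

fit <- fastglm_streaming(chunks, n_chunks = length(files),
                         family = binomial())
```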
The same recipe works for DuckDB (one `LIMIT`/`OFFSET` query per chunk; see the sketch below), CSV streamers, and custom binary formats.
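A hypothetical DuckDB variant along the same lines. The database file, table `obs`, columns `row_id`, `y`, `x1`, `x2`, and the chunk size are all placeholders; the `ORDER BY` on a stable key matters because the closure is re-called on every IRLS pass.

```r
library(DBI)

con <- dbConnect(duckdb::duckdb(), dbdir = "big.duckdb")
chunk_size <- 100000L
n_rows   <- dbGetQuery(con, "SELECT COUNT(*) AS n FROM obs")$n
n_chunks <- as.integer(ceiling(n_rows / chunk_size))

chunks <- function(k) {
  # ORDER BY a stable key so every IRLS pass sees identical chunks.
  tbl <- dbGetQuery(con, sprintf(
    "SELECT y, x1, x2 FROM obs ORDER BY row_id LIMIT %d OFFSET %d",
    chunk_size, (k - 1) * chunk_size))
  list(X = model.matrix(~ x1 + x2, data = tbl), y = tbl$y)
}

fit <- fastglm_streaming(chunks, n_chunks, family = binomial())
```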
# Simulate a "data source" that yields the design matrix in 4 row-blocks.
set.seed(1)
n <- 1000; p <- 5
X <- cbind(1, matrix(rnorm(n * (p - 1)), n, p - 1))
y <- rbinom(n, 1, plogis(X %*% c(0.2, 0.5, -0.3, 0.4, -0.2)))
chunk_size <- 250
chunks <- function(k) {
idx <- ((k - 1) * chunk_size + 1):(k * chunk_size)
list(X = X[idx, , drop = FALSE], y = y[idx])
}
fit_stream <- fastglm_streaming(chunks, n_chunks = 4, family = binomial())
fit_full <- fastglm(X, y, family = binomial(), method = 2)
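# Streaming and in-memory fits agree to numerical precision: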
max(abs(coef(fit_stream) - coef(fit_full)))
#> [1] 3.789857e-12