Zero-inflated Poisson / Negative-Binomial regression

Fits a zero-inflated count model: a binary inflation component combined with a Poisson or NB count component, where the count component is the *original* (non-truncated) distribution. Excess zeros above what the count model implies are absorbed by the inflation component.

fastglm_zi(
  formula,
  data,
  subset,
  na.action,
  weights,
  offset,
  dist = c("poisson", "negbin"),
  link = c("logit", "probit", "cloglog", "log"),
  init.theta = NULL,
  em.tol = 1e-08,
  em.maxit = 100L,
  tol = 1e-09,
  maxit = 100L,
  theta.tol = 1e-08,
  theta.maxit = 100L,
  model = TRUE,
  y = TRUE,
  x = FALSE
)

Arguments

formula: a [Formula::Formula] of the form `y ~ x1 + x2 | z1 + z2` where the right-hand side after `|` specifies the inflation design. If `|` is absent, the same RHS is used for both parts.
data: optional data frame / environment.
subset, na.action: standard model-frame arguments.
weights: optional non-negative prior weights.
offset: optional vector or matrix. If a 2-column matrix, columns are taken as `(count_offset, zero_offset)`.
dist: count component distribution: `"poisson"` or `"negbin"`.
link: character, link for the inflation binomial. One of `"logit"` (default), `"probit"`, `"cloglog"`, `"log"`.
init.theta: optional positive scalar starting value for the NB dispersion. If `NULL`, a method-of-moments pilot is used.
em.tol, em.maxit: EM convergence tolerance and iteration cap.
tol, maxit: IRLS-loop convergence tolerance and iteration cap (used inside each M-step).
theta.tol, theta.maxit: Brent-loop tolerance / iteration cap for the inner theta MLE (only used when `dist = "negbin"`).
model: logical; keep the model frame on the result.
y, x: logical; keep the response / design matrices on the result.

Value

A list of class `c("fastglm_zi", "fastglm")` with elements `coefficients` (a list with `count` and `zero`), `se` (likewise), `vcov` (full, including theta if NB), `loglik`, `theta`, `SE.theta`, `tau` (posterior P(Z=1|y)), `em_iter`, `converged`.

Details

Estimation uses an EM algorithm with all numerical loops in C++. The E-step computes the posterior `tau_i = P(Z_i = 1 | y_i)` analytically; M-steps fit the inflation logit/probit/cloglog/log and the Poisson/NB count regression via the existing native fastglm IRLS (with weights `w_i (1 - tau_i)` on the count side). For NB, an inner Brent MLE re-estimates `theta` after the count-side beta step. Final vcov comes from the numerical Jacobian of the analytical observed score at the EM fixed point (block-structured for `(gamma, beta, theta)`).

This is the fastglm analogue of [pscl::zeroinfl()] – coefficients agree to 1e-5 on standard datasets (slightly looser than hurdle to allow for the EM iteration), with all numerical loops in C++.

Examples

set.seed(1)
n <- 400
x  <- rnorm(n)
eta_count <- 0.8 + 0.4*x
lam <- exp(eta_count)
eta_zero <- -0.6 + 0.5*x
p_inflate <- plogis(eta_zero)
z <- rbinom(n, 1, p_inflate)
y <- ifelse(z == 1, 0, rpois(n, lam))
df <- data.frame(y = y, x = x)
fit <- fastglm_zi(y ~ x, data = df, dist = "poisson")
coef(fit, model = "count")
#> (Intercept)           x 
#>   0.8590558   0.3067513 
coef(fit, model = "zero")
#> (Intercept)           x 
#>  -0.5134892   0.3188130