Fits a hurdle model for count data with excess zeros: a binary (zero / non-zero) regression on the full sample combined with a zero-truncated count regression on the positive subset. The two parts have independent likelihoods and are estimated jointly. With `dist = "negbin"`, the negative-binomial dispersion `theta` is estimated by an inner one-dimensional Brent MLE inside the C++ driver, alternating with the count-side IRLS.
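The "independent likelihoods" claim follows from the standard hurdle decomposition (a sketch of the usual formulation, not taken verbatim from the package; `p_i` is the binomial probability of a positive outcome and `f` the untruncated count density):

```latex
\ell = \sum_{i:\,y_i = 0} \log(1 - p_i)
     + \sum_{i:\,y_i > 0} \Bigl[\, \log p_i
         + \log f(y_i;\,\mu_i,\theta)
         - \log\bigl(1 - f(0;\,\mu_i,\theta)\bigr) \Bigr]
```

The first sum depends only on the zero-model coefficients and the bracketed sum only on the count-model coefficients (and `theta`), so the two parts can be maximized separately yet reported as one joint fit.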

fastglm_hurdle(
  formula,
  data,
  subset,
  na.action,
  weights,
  offset,
  dist = c("poisson", "negbin"),
  zero.dist = "binomial",
  link = c("logit", "probit", "cloglog", "log"),
  init.theta = NULL,
  tol = 1e-08,
  maxit = 100L,
  outer.tol = 1e-07,
  outer.maxit = 50L,
  theta.tol = 1e-08,
  theta.maxit = 100L,
  model = TRUE,
  y = TRUE,
  x = FALSE
)

Arguments

formula

a [Formula::Formula] of the form `y ~ x1 + x2 | z1 + z2` where the right-hand side after `|` specifies the zero-model design. If `|` is absent, the same RHS is used for both parts.
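A hedged usage sketch of the two-part formula (assumes a data frame `df` with columns `y`, `x1`, `x2`, and `z1`; these names are illustrative, not from the package):

```r
# count model uses x1 + x2; zero/non-zero model uses z1 only
fit2 <- fastglm_hurdle(y ~ x1 + x2 | z1, data = df, dist = "negbin")

# no `|`: both parts share the same design, equivalent to y ~ x1 + x2 | x1 + x2
fit1 <- fastglm_hurdle(y ~ x1 + x2, data = df, dist = "negbin")
```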

data

optional data frame / environment from which to draw model matrix variables.

subset, na.action

standard model-frame arguments.

weights

optional non-negative prior weights.

offset

optional vector or matrix. If a 2-column matrix, columns are taken as `(count_offset, zero_offset)`.
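A hedged sketch of the two-column convention (assumes an exposure column `df$exposure`; the variable names are illustrative):

```r
# column 1 is the count-model offset, column 2 the zero-model offset,
# per the documented (count_offset, zero_offset) ordering
off <- cbind(log(df$exposure), rep(0, nrow(df)))
fit <- fastglm_hurdle(y ~ x1 + x2, data = df, offset = off, dist = "poisson")
```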

dist

count component distribution: `"poisson"` or `"negbin"`.

zero.dist

zero/non-zero distribution; only `"binomial"` is currently supported.

link

character; link for the zero/non-zero binomial. One of `"logit"` (default), `"probit"`, `"cloglog"`, `"log"`.

init.theta

optional positive scalar starting value for the NB dispersion. If `NULL`, a method-of-moments initializer is used.
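The package's exact initializer is not documented here; a common method-of-moments choice, shown as an assumption, solves `Var(Y) = mu + mu^2 / theta` for `theta` on the positive counts:

```r
# Hedged sketch of a method-of-moments theta initializer (toy data;
# not necessarily the formula fastglm_hurdle uses internally)
yp <- c(1, 2, 1, 4, 7, 3, 2, 9, 1, 5)        # positive counts
m  <- mean(yp); v <- var(yp)
theta0 <- if (v > m) m^2 / (v - m) else 10   # guard against underdispersion
```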

tol, maxit

IRLS-loop convergence tolerance and iteration cap.

outer.tol, outer.maxit

`(beta, theta)` outer-loop convergence tolerance and iteration cap (only used when `dist = "negbin"`).

theta.tol, theta.maxit

Brent-loop tolerance / iteration cap for the inner theta MLE (only used when `dist = "negbin"`).

model

logical; keep the model frame on the result.

y, x

logical; keep the response / design matrices on the result.

Value

A list of class `c("fastglm_hurdle", "fastglm")` with elements `coefficients` (a list with `count` and `zero`), `se` (likewise), `vcov` (block-diagonal), `loglik`, `loglik_count`, `loglik_zero`, `theta`, `SE.theta`, `iter_*`, and `converged`.

Details

This is the fastglm analogue of [pscl::hurdle()]: coefficients agree with pscl's to about 1e-6 on standard test datasets, with all numerical loops implemented in C++.

Examples

set.seed(1)
n <- 300
x1 <- rnorm(n); x2 <- rnorm(n)
eta_count <- 1.0 + 0.4*x1 - 0.2*x2
lam <- exp(eta_count)
y_count <- rpois(n, lam)
eta_zero  <- -0.5 + 0.6*x1
p_pos     <- plogis(eta_zero)
is_pos    <- rbinom(n, 1, p_pos)
# make the count component zero-truncated: redraw any zeros until positive
for (i in seq_len(n)) {
  while (y_count[i] == 0) y_count[i] <- rpois(1, lam[i])
}
y <- ifelse(is_pos == 1, y_count, 0)
df <- data.frame(y = y, x1 = x1, x2 = x2)
fit <- fastglm_hurdle(y ~ x1 + x2, data = df, dist = "poisson")
coef(fit, model = "count")
#> (Intercept)          x1          x2 
#>   0.9710285   0.3375237  -0.3219622 
coef(fit, model = "zero")
#> (Intercept)          x1          x2 
#> -0.49868804  0.57931690  0.03888585
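Continuing the example above, a hedged cross-check against pscl (assumes the pscl package is installed; per Details, the count coefficients should agree to roughly 1e-6):

```r
library(pscl)
fit_pscl <- hurdle(y ~ x1 + x2, data = df, dist = "poisson")
cbind(fastglm = coef(fit, model = "count"),
      pscl    = coef(fit_pscl, model = "count"))
```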