vignettes/fastglm.Rmd
The fastglm package is intended to be a
fast and stable alternative
to the glm() and glm2::glm2() functions for
fitting generalized linear models. fastglm is compatible with
R’s family objects (see ?family).
The package can be installed via
pak::pak("fastglm")
# pak::pak("jaredhuling/fastglm") # development version
and loaded via:
library(fastglm)
The main package function is fastglm(). Currently,
fastglm() does not accept formula-based input;
the predictors must be supplied as a design matrix. We use the following
example to demonstrate the usage of fastglm():
data(esoph)
x <- model.matrix(~ agegp + unclass(tobgp) + unclass(alcgp),
data = esoph)
y <- cbind(esoph$ncases, esoph$ncontrols)
gfit1 <- fastglm(x = x, y = y, family = binomial(link = "cloglog"))
summary(gfit1)
#>
#> Call:
#> fastglm.default(x = x, y = y, family = binomial(link = "cloglog"))
#>
#> Coefficients:
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) -4.29199 0.29777 -14.414 < 2e-16 ***
#> agegp.L 3.30677 0.63454 5.211 1.88e-07 ***
#> agegp.Q -1.35525 0.57310 -2.365 0.018 *
#> agegp.C 0.20296 0.43333 0.468 0.640
#> agegp^4 0.15056 0.29318 0.514 0.608
#> agegp^5 -0.23425 0.17855 -1.312 0.190
#> unclass(tobgp) 0.33276 0.07285 4.568 4.93e-06 ***
#> unclass(alcgp) 0.80384 0.07538 10.664 < 2e-16 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for binomial family taken to be 1)
#>
#> Null deviance: 61.609 on 87 degrees of freedom
#> Residual deviance: 96.950 on 80 degrees of freedom
#> AIC: 228
#>
#> Number of Fisher Scoring iterations: 6
Alternatively, to use glm()'s built-in handling of formulas and
data frames, we can call glm() with
method = fastglm.fit to invoke the fastglm fitting
routine:
gfit2 <- glm(cbind(ncases, ncontrols) ~ agegp + unclass(tobgp) +
unclass(alcgp), data = esoph,
family = binomial(link = "cloglog"),
method = fastglm.fit)
summary(gfit2, refit = FALSE)
#>
#> Call:
#> glm(formula = cbind(ncases, ncontrols) ~ agegp + unclass(tobgp) +
#> unclass(alcgp), family = binomial(link = "cloglog"), data = esoph,
#> method = fastglm.fit)
#>
#> Coefficients:
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) -4.29199 0.29777 -14.414 < 2e-16 ***
#> agegp.L 3.30677 0.63454 5.211 1.88e-07 ***
#> agegp.Q -1.35525 0.57310 -2.365 0.018 *
#> agegp.C 0.20296 0.43333 0.468 0.640
#> agegp^4 0.15056 0.29318 0.514 0.608
#> agegp^5 -0.23425 0.17855 -1.312 0.190
#> unclass(tobgp) 0.33276 0.07285 4.568 4.93e-06 ***
#> unclass(alcgp) 0.80384 0.07538 10.664 < 2e-16 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for binomial family taken to be NULL)
#>
#> Null deviance: 367.95 on 87 degrees of freedom
#> Residual deviance: 96.95 on 80 degrees of freedom
#> AIC: 1099.6
#>
#> Number of Fisher Scoring iterations: 6
Setting refit = FALSE in summary() uses the
standard errors computed by fastglm's internal fitting
functions. The default, refit = TRUE, re-fits the
model using glm() with method = "glm.fit",
running one IRLS iteration with the previously computed coefficients as
the starting values.
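The refit mechanism can be sketched in base R alone (this is an illustration of the idea, not fastglm's actual internals): re-run glm.fit() for a single IRLS iteration, supplying the previously fitted coefficients via start. Starting from an already converged fit, one extra iteration barely moves the estimates.

```r
data(esoph)
x <- model.matrix(~ agegp + unclass(tobgp) + unclass(alcgp), data = esoph)
y <- cbind(esoph$ncases, esoph$ncontrols)

## stand-in for a previously fitted model (here fit with glm.fit itself)
fit0 <- glm.fit(x, y, family = binomial(link = "cloglog"))

## one IRLS iteration from the previous coefficients as starting values;
## glm.fit warns about non-convergence after a single iteration
refit <- suppressWarnings(
  glm.fit(x, y, family = binomial(link = "cloglog"),
          start = coef(fit0),
          control = glm.control(maxit = 1))
)

## the single refit step leaves the converged estimates essentially unchanged
all.equal(coef(fit0), coef(refit), tolerance = 1e-6)
```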
fastglm does not compromise computational stability for
speed. In fact, for many situations where glm() and even
glm2::glm2() do not converge, fastglm() does
converge.
As an example, consider the following data scenario, where the
response distribution is (mildly) misspecified but the link function is
quite badly misspecified. In such scenarios the standard IRLS algorithm
tends to have convergence issues. The glm2 package was designed
to handle such cases; however, it can still run into convergence
problems. fastglm uses a step-halving technique similar to that of
glm2::glm2(), but it starts from better-initialized values
and thus tends to have better convergence properties in practice.
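The step-halving safeguard can be sketched as follows. This is a simplified illustration built only on R's family objects, not fastglm's actual C++ implementation: after each IRLS update, if the deviance increased (or is not finite), the new coefficients are repeatedly pulled halfway back toward the previous iterate until the deviance no longer increases.

```r
## simplified IRLS with step-halving (illustration only)
irls_sh <- function(x, y, family, maxit = 25, tol = 1e-8) {
  beta <- rep(0, ncol(x))
  ## crude starting value: intercept at the link of the mean response
  beta[1] <- family$linkfun(mean(y))
  dev_old <- Inf
  for (it in seq_len(maxit)) {
    eta <- drop(x %*% beta)
    mu  <- family$linkinv(eta)
    w   <- family$mu.eta(eta)^2 / family$variance(mu)  # IRLS weights
    z   <- eta + (y - mu) / family$mu.eta(eta)          # working response
    beta_new <- drop(solve(crossprod(x, w * x), crossprod(x, w * z)))
    dev <- sum(family$dev.resids(y, family$linkinv(drop(x %*% beta_new)), 1))
    ## step-halving: shrink toward the previous iterate until the
    ## deviance is finite and no longer increasing
    half <- 0
    while (!is.finite(dev) || dev > dev_old) {
      if ((half <- half + 1) > 20) break
      beta_new <- (beta_new + beta) / 2
      dev <- sum(family$dev.resids(y, family$linkinv(drop(x %*% beta_new)), 1))
    }
    beta <- beta_new
    if (abs(dev - dev_old) / (abs(dev) + 0.1) < tol) break
    dev_old <- dev
  }
  beta
}

## sanity check on a well-behaved Poisson example
set.seed(2)
x <- cbind(1, rnorm(500))
y <- rpois(500, exp(0.5 + 0.3 * x[, 2]))
irls_sh(x, y, poisson())
```

On well-behaved problems the step-halving loop never triggers and this reduces to plain IRLS; it only intervenes when a full Fisher scoring step overshoots.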
set.seed(1)
x <- matrix(rnorm(10000 * 100), ncol = 100)
y <- exp(0.25 * x[,1] - 0.25 * x[,3] + 0.5 * x[,4] - 0.5 * x[,5] + rnorm(10000)) + 0.1
# fastglm
system.time(gfit1 <- glm(y ~ x, family = Gamma(link = "sqrt"),
method = fastglm.fit))
#> user system elapsed
#> 0.242 0.006 0.248
# glm
system.time(gfit2 <- glm(y ~ x, family = Gamma(link = "sqrt")))
#> user system elapsed
#> 1.398 0.028 1.427
# glm2
system.time(gfit3 <- glm(y ~ x, family = Gamma(link = "sqrt"),
method = glm2::glm.fit2))
#> user system elapsed
#> 1.022 0.024 1.046
## Note that fastglm() returns estimates with the
## largest likelihood
logLik(gfit1)
#> 'log Lik.' -16030.81 (df=102)
logLik(gfit2)
#> 'log Lik.' -16704.05 (df=102)
logLik(gfit3)
#> 'log Lik.' -16046.66 (df=102)
coef(gfit1)[1:5]
#> (Intercept) x1 x2 x3 x4
#> 1.429259574 0.125871413 0.005311209 -0.129396035 0.238950809
coef(gfit2)[1:5]
#> (Intercept) x1 x2 x3 x4
#> 1.431168e+00 1.251936e-01 -6.896739e-05 -1.281857e-01 2.366473e-01
coef(gfit3)[1:5]
#> (Intercept) x1 x2 x3 x4
#> 1.426864e+00 1.242616e-01 -9.860241e-05 -1.254873e-01 2.361301e-01
## check convergence of fastglm
gfit1$converged
#> [1] TRUE
## number of IRLS iterations
gfit1$iter
#> [1] 15
## now check convergence for glm()
gfit2$converged
#> [1] FALSE
gfit2$iter
#> [1] 25
## check convergence for glm2()
gfit3$converged
#> [1] TRUE
gfit3$iter
#> [1] 19
## increasing number of IRLS iterations for glm() does not help that much
system.time(gfit2 <- glm(y ~ x, family = Gamma(link = "sqrt"),
control = list(maxit = 100)))
#> user system elapsed
#> 5.258 0.122 5.395
gfit2$converged
#> [1] FALSE
gfit2$iter
#> [1] 100
logLik(gfit1)
#> 'log Lik.' -16030.81 (df=102)
logLik(gfit2)
#> 'log Lik.' -16068.27 (df=102)