# Fast and Big Linear Model Fitting with bigmemory and RcppEigen

In a previous post, I went over the basics of linking up bigmemory and the eigen C++ library via RcppEigen. In this post I’ll take this a bit further by creating a version of the `fastLm()` function of RcppEigen that can accept bigmemory objects. By doing so, we will create a fast way to fit linear models using data which is too big to fit in RAM. With RcppEigen, fitting linear models using out-of-memory computation doesn’t have to be slow. The code for this is all on github in the bigFastlm package.

Before we even start, most of the work is already done as we’ll just need to change a few lines of the core C++ code of the `fastLm()` function so that it we can map the bigmemory pointer to data on disk to an eigen matrix object.

The core code of `fastLm` can be found here. The data object which is being loaded into C++ from R is mapped to an eigen matrix object at line 208 of `fastLm.cpp`.

``const Map<MatrixXd>  X(as<Map<MatrixXd> >(Xs));``

We need to change the above code to

``````XPtr<BigMatrix> bMPtr(Xs);

unsigned int typedata = bMPtr->matrix_type();

if (typedata != 8)
{
throw Rcpp::exception("type for provided big.matrix not available");
}

const Map<MatrixXd>  X = Map<MatrixXd>((double *)bMPtr->matrix(), bMPtr->nrow(), bMPtr->ncol()  );``````

The above modification first takes Xs as an Rcpp external pointer object (`XPtr`) and then checks to make sure it’s a double type (for now I’m ignoring all other data types (int, etc) for simplicity). Now that X is a mapped eigen matix object which points to data on disk, what else do we need to do? Well, not much! We just need to make sure that the correct object types are defined for the R-callable function. To do this, we need to change

``````// [[Rcpp::export]]
Rcpp::List fastLm_Impl(Rcpp::NumericMatrix X, Rcpp::NumericVector y, int type) {
return lmsol::fastLm(X, y, type);
}``````

To

``````// [[Rcpp::export]]
RcppExport SEXP bigLm_Impl(SEXP X, SEXP y, SEXP type)
{
BEGIN_RCPP
Rcpp::RObject __result;
Rcpp::RNGScope __rngScope;
Rcpp::traits::input_parameter< Rcpp::XPtr<BigMatrix> >::type X_(X);
Rcpp::traits::input_parameter< Rcpp::NumericVector >::type y_(y);
Rcpp::traits::input_parameter< int >::type type_(type);
__result = Rcpp::wrap(lmsol::fastLm(X_, y_, type_));
return __result;
END_RCPP
}``````

in`fastLm.cpp`. `// [[Rcpp::export]]` had some trouble doing this automatically, so the above is just what that should have created.

So now with the proper R functions to call this, we’re basically done. I had to create a few utility functions to make everything work nicely, but the main work is just the above.

One important detail: for now, we can only use the LLT and LDLT methods for computation, as the other decompositions create objects which scale with the size of the data, so for now I’m ignoring the more robust decompositions like QR. Perhaps someone else can figure out how to perform these in a memory-conscious manner.

### Comparison with biglm

Now we’ll run a (perhaps not-so-fair) comparison with the `biglm` function of biglm. Specifically, we’ll use the `biglm.big.matrix` function provided by the biganalytics which interfaces bigmemory and biglm. The following code creates a bigmemory object on disk (actually, two because biglm requires the matrix object to contain the response, whereas bigFastlm requires that the response be an R vector. It’s hard to say which is a better design choice, but I’m sticking with the approach which doesn’t allow an R formula expression).

``````suppressMessages(library(bigmemory))
suppressMessages(library(biganalytics))
suppressMessages(library(bigFastlm))

nrows <- 100000
ncols <- 100
bkFile <- "big_matrix.bk"
descFile <- "big_matrix.desc"
big_mat <- filebacked.big.matrix(nrow           = nrows,
ncol           = ncols,
type           = "double",
backingfile    = bkFile,
backingpath    = ".",
descriptorfile = descFile,
dimnames       = c(NULL, NULL))

set.seed(123)
for (i in 1:ncols) big_mat[, i] = rnorm(nrows, mean = 1/sqrt(i)) * i

bkFile <- "big_matrix2.bk"
descFile <- "big_matrix2.desc"
big_mat2 <- filebacked.big.matrix(nrow           = nrows,
ncol           = ncols + 1,
type           = "double",
backingfile    = bkFile,
backingpath    = ".",
descriptorfile = descFile,
dimnames       = c(NULL, NULL))

for (i in 1:ncols) big_mat2[, i + 1] = big_mat[, i]

y <- rnorm(nrows)
big_mat2[,1] <- y

options(bigmemory.allow.dimnames=TRUE)
colnames(big_mat2) <- c("y", paste0("V", 1:ncols))

options(bigmemory.allow.dimnames=FALSE)

## create formula for biglm
form <- as.formula(paste("y ~ -1 +", paste(paste0("V", 1:ncols), collapse = "+")))``````

Now let’s see how `biglm` and `bigLm` (eigen + bigmemory) stack up. Note that this is an unfair comparison because `biglm` requires a formula argument and `bigLm` assumes you’re passing in a design matrix already and biglm uses the QR decomposition, which is slower than LLT or LDLT.

``````library(microbenchmark)

res <- microbenchmark(biglm.obj  <- biglm.big.matrix(form, data = big_mat2),
bigLm.obj  <- bigLm(big_mat, y),
bigLm.obj2 <- bigLmPure(big_mat, y), ## a slightly faster version that doesn't check for intercept
times = 10L)
print(summary(res)[,1:7], digits = 4)``````
``````##                                                   expr     min      lq
## 1 biglm.obj <- biglm.big.matrix(form, data = big_mat2) 1387.00 1456.28
## 2                       bigLm.obj <- bigLm(big_mat, y)  104.97  108.27
## 3                  bigLm.obj2 <- bigLmPure(big_mat, y)   95.67   98.16
##     mean median     uq    max
## 1 1709.1 1678.3 1978.7 2023.8
## 2  117.8  111.6  118.9  163.3
## 3  102.1  102.2  105.2  112.0``````
``max(abs(coef(biglm.obj) - coef(bigLm.obj)))``
``##  3.339343e-17``

`bigLm` seems to be quite a bit faster than `biglm`, but how fast is `bigLm` compared with `fastLm` (which requires the data to be loaded into memory)? It turns out it’s pretty close on my computer and I don’t even have anything fancy like a solid state drive.

``````suppressMessages(library(RcppEigen))

mat.obj <- big_mat[,]

## both using the LLT decomposition
res <- microbenchmark(fastLm.obj.llt   <- fastLm(mat.obj, y, method = 2L),     # LLT Cholesky
bigLm.obj.llt    <- bigLm(big_mat, y),                   # LLT Cholesky
fastLm.obj2      <- fastLmPure(mat.obj, y, method = 2L),
bigLm.obj2       <- bigLmPure(big_mat, y),  ## a slightly faster version that doesn't check for intercept
fastLm.obj.ldlt  <- fastLmPure(mat.obj, y, method = 3L), # LDLT Cholesky
bigLm.obj.ldlt   <- bigLmPure(big_mat, y, method = 1L),  # LDLT Cholesky
fastLm.obj.qrpiv <- fastLmPure(mat.obj, y, method = 0L), # column-pivoted QR
fastLm.obj.qr    <- fastLmPure(mat.obj, y, method = 1L), # unpivoted QR
times = 25L)
print(summary(res)[,1:7], digits = 4)``````
``````##                                                      expr    min     lq
## 1       fastLm.obj.llt <- fastLm(mat.obj, y, method = 2L) 272.72 393.90
## 2                      bigLm.obj.llt <- bigLm(big_mat, y) 103.52 106.61
## 3      fastLm.obj2 <- fastLmPure(mat.obj, y, method = 2L)  87.55  90.99
## 4                     bigLm.obj2 <- bigLmPure(big_mat, y)  96.26 102.15
## 5  fastLm.obj.ldlt <- fastLmPure(mat.obj, y, method = 3L)  87.68  90.95
## 6    bigLm.obj.ldlt <- bigLmPure(big_mat, y, method = 1L)  96.29  99.36
## 7 fastLm.obj.qrpiv <- fastLmPure(mat.obj, y, method = 0L) 592.81 607.57
## 8    fastLm.obj.qr <- fastLmPure(mat.obj, y, method = 1L) 460.69 481.82
##     mean median     uq    max
## 1 390.43 398.37 405.25 472.56
## 2 110.31 109.16 112.57 119.83
## 3  92.70  92.56  94.73  99.06
## 4 103.24 103.29 105.69 109.56
## 5  94.71  94.48  96.49 112.28
## 6 102.44 102.53 104.90 108.80
## 7 616.19 615.64 621.77 651.15
## 8 493.23 494.02 505.28 516.39``````
``max(abs(coef(fastLm.obj.llt) - coef(bigLm.obj.llt)))``
``##  2.645453e-17``

Future work would be to try to figure out how to make the QR decomposition memory-feasible and also to write a function for generalized linear models.