The bigmemory package offers a set of tools for R which allow for manipulation larger-than-memory objects within R. It has some basic functions but is certainly not comprehensive. The eigen C++ linear algebra library is a highly efficient numerical linear algebra library and can be interfaced to R through RcppEigen by Douglas Bates and Dirk Eddelbuettel. If bigmemory and Eigen can be linked, then one would be able to do highly efficient linear algebra computation on data that is too big for memory (exactly what you thought R couldn’t do).

Since bigmemory works with pointers to C++ objects, it’s natural to link bigmemory objects to Eigen matrix objects. I’m not going to go too much into the details of this from the bigmemory/Rcpp side of things, as it’s well exposed here.

In this post I’ll create a colSums() function and a crossprod() function for big.matrix objects. All of the code posted below can be found in my rfunctions R package on github. big.matrix objects can have one of 4 types (1, 2, 4, 8), corresponding to (char, short, int, double), so we need to define extra Eigen matrix types like the following MatrixXi/VectorXi for ints and MatrixXd/VectorXd for doubles are already defined):

typedef Eigen::Matrix<char, Eigen::Dynamic, Eigen::Dynamic> MatrixXchar;
typedef Eigen::Matrix<short, Eigen::Dynamic, Eigen::Dynamic> MatrixXshort;
typedef Eigen::Matrix<char, Eigen::Dynamic, 1> Vectorchar;
typedef Eigen::Matrix<short, Eigen::Dynamic, 1> Vectorshort;

Then reading’’ in a big.matrix object from R to C++ and getting its data type looks like the following:

using namespace Rcpp;
using namespace RcppEigen;
#include <bigmemory/BigMatrix.h>

RcppExport SEXP colsums_big(SEXP X_)
{
BEGIN_RCPP
using Eigen::Map;
using Eigen::MatrixXd;
using Eigen::VectorXd;

typedef Eigen::Matrix<char, Eigen::Dynamic, Eigen::Dynamic> MatrixXchar;
typedef Eigen::Matrix<short, Eigen::Dynamic, Eigen::Dynamic> MatrixXshort;
typedef Eigen::Matrix<char, Eigen::Dynamic, 1> Vectorchar;
typedef Eigen::Matrix<short, Eigen::Dynamic, 1> Vectorshort;

XPtr<BigMatrix> xpMat(X_);

unsigned int type = xpMat->matrix_type();

END_RCPP
}


Then in order to associate the data from xpMat with an Eigen matrix object, we use the Eigen map (map)functionality to map the big.matrix data into an Eigen object (without copying it and hence loading it to memory). For data with the double type, this looks like:

Map<MatrixXd> bM = Map<MatrixXd>((double *)xpMat->matrix(), xpMat->nrow(), xpMat->ncol()  );

where bM is the new Eigen object pointing to the big.matrix data located on disk. Now we are basically done. Performing the column-wise sum in Eigen is straightforward:

VectorXd colSums = bM.colwise().sum();
return wrap(colSums);

Putting it altogether:

using namespace Rcpp;
using namespace RcppEigen;
#include <bigmemory/BigMatrix.h>

RcppExport SEXP colsums_big(SEXP X_)
{
BEGIN_RCPP
using Eigen::Map;
using Eigen::MatrixXd;
using Eigen::VectorXd;

typedef Eigen::Matrix<char, Eigen::Dynamic, Eigen::Dynamic> MatrixXchar;
typedef Eigen::Matrix<short, Eigen::Dynamic, Eigen::Dynamic> MatrixXshort;
typedef Eigen::Matrix<char, Eigen::Dynamic, 1> Vectorchar;
typedef Eigen::Matrix<short, Eigen::Dynamic, 1> Vectorshort;

XPtr<BigMatrix> xpMat(X_);

unsigned int type = xpMat->matrix_type();

if (type == 1)
{
Map<MatrixXchar> bM = Map<MatrixXchar>((char *)xpMat->matrix(), xpMat->nrow(), xpMat->ncol()  );
Vectorchar colSums = bM.colwise().sum();
return wrap(colSums);
} else if (type == 2)
{
Map<MatrixXshort> bM = Map<MatrixXshort>((short *)xpMat->matrix(), xpMat->nrow(), xpMat->ncol()  );
Vectorshort colSums = bM.colwise().sum();
return wrap(colSums);
} else if (type == 4)
{
Map<MatrixXi> bM = Map<MatrixXi>((int *)xpMat->matrix(), xpMat->nrow(), xpMat->ncol()  );
VectorXi colSums = bM.colwise().sum();
return wrap(colSums);
} else if (type == 8)
{
Map<MatrixXd> bM = Map<MatrixXd>((double *)xpMat->matrix(), xpMat->nrow(), xpMat->ncol()  );
VectorXd colSums = bM.colwise().sum();
return wrap(colSums);
} else {
throw Rcpp::exception("Undefined type for provided big.matrix");
}

END_RCPP
}


If we want to make a crossprod function for big.matrix objects (ie computing $$X^TX$$), then we would do this with the following:

int p = bM.cols();
return wrap(crossprod);

Now let’s run a big example to demonstrate the performance. The R function which calls colsums_big is called big.colSums() and the corresponding crossprod function is called big.crossprod(). If we have a big.matrix object big_mat, then the data can be loaded into memory as a matrix as big_mat[,], so we can compare with the standard R functions for colSums and crossprod.

suppressMessages(library(bigmemory))
suppressMessages(library(rfunctions))

nrows <- 500000
ncols <- 500
bkFile <- "big_matrix.bk"
descFile <- "big_matrix.desc"
big_mat <- filebacked.big.matrix(nrow=nrows, ncol=ncols, type="double",
backingfile=bkFile, backingpath=".",
descriptorfile=descFile,
dimnames=c(NULL,NULL))

set.seed(123)
for (i in 1:ncols) big_mat[,i] = rnorm(nrows, mean = 1/sqrt(i))*i

library(microbenchmark)

res <- microbenchmark(cs1 <- colSums(big_mat[,]),
cs2 <- big.colSums(big_mat),
times = 25L)
print(summary(res)[,1:7], digits = 5)
##                          expr     min     lq    mean  median      uq
## 1 cs1 <- colSums(big_mat[, ]) 1302.36 1502.6 1640.12 1571.61 1717.86
## 2 cs2 <- big.colSums(big_mat)  105.26  111.4  123.02  115.14  118.63
##       max
## 1 3250.77
## 2  272.96
all.equal(cs1, cs2)
## [1] TRUE

The memory usage is obviously much lower when we don’t load the big.matrix object into memory too.

## memory usage of loading the data
Rprof("base_mem.out", memory.profiling = TRUE)
# Call the function to be profiled
cs1 <- colSums(big_mat[,])
Rprof(NULL)
summaryRprof("base_mem.out", memory = "stats")[[1]][4]
## max.vsize.large
##       501377612
gc()
##           used (Mb) gc trigger   (Mb)  max used   (Mb)
## Ncells 1637506 87.5    2637877  140.9   2473902  132.2
## Vcells 2806498 21.5  579488417 4421.2 503295945 3839.9
Rprof("eigen_mem.out", memory.profiling = TRUE)
# Call the function to be profiled
cs2 <- big.colSums(big_mat)
Rprof(NULL)
summaryRprof("eigen_mem.out", memory = "stats")[[1]][4]
## max.vsize.large
##         1400398
res <- microbenchmark(cp1 <- crossprod(big_mat[,]),
cp2 <- big.crossprod(big_mat),
times = 5L)
print(summary(res)[,1:7], digits = 4)
##                            expr    min    lq   mean median     uq   max
## 1 cp1 <- crossprod(big_mat[, ]) 75.203 75.81 76.410 76.402 76.409 78.23
## 2 cp2 <- big.crossprod(big_mat)  9.559  9.60  9.725  9.657  9.747 10.06
all.equal(cp1, cp2)
## [1] TRUE

In a following post I’ll investigate fitting linear models via Eigen and bigmemory big.matrix objects and see how the speed compares with the biglm package.