
Overview

This vignette benchmarks the forestBalance pipeline to characterize how speed and memory scale with sample size ($n$) and number of trees ($B$).

The pipeline has four stages, each optimized:

  1. Forest fitting (grf::multi_regression_forest) – C++ via grf.
  2. Leaf node extraction (get_leaf_node_matrix) – C++ via Rcpp.
  3. Indicator matrix construction (leaf_node_kernel_Z) – C++ via Rcpp, building CSC sparse format directly without sorting.
  4. Weight computation (kernel_balance) – adaptive solver:
    • Direct: sparse Cholesky on treated/control sub-blocks. Preferred for smaller problems.
    • CG: conjugate gradient using the factored $Z$ representation. No kernel matrix is formed. Preferred for large $n$ or dense kernels.

Mathematical background

The kernel energy balancing system

Given a kernel matrix $K \in \mathbb{R}^{n \times n}$ and a binary treatment vector $A \in \{0,1\}^n$ with $n_1$ treated and $n_0$ control units, the kernel energy balancing weights $w$ are obtained by solving the linear system $$K_q \, w = z,$$ where $K_q$ is the modified kernel and $z$ is a right-hand side vector determined by a linear constraint.

The modified kernel is defined element-wise as $$K_q(i,j) = \frac{A_i \, A_j \, K(i,j)}{n_1^2} + \frac{(1-A_i)(1-A_j) \, K(i,j)}{n_0^2}.$$

A critical structural observation is that $K_q(i,j) = 0$ whenever $A_i \neq A_j$: the treated–control cross-blocks are identically zero. Therefore $K_q$ is block-diagonal: $$K_q = \begin{pmatrix} K_{tt} / n_1^2 & 0 \\ 0 & K_{cc} / n_0^2 \end{pmatrix},$$ where $K_{tt} = K[A{=}1,\, A{=}1]$ is the $n_1 \times n_1$ treated sub-block and $K_{cc} = K[A{=}0,\, A{=}0]$ is the $n_0 \times n_0$ control sub-block.
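As a toy illustration (standalone R, not package code), the block-diagonal structure can be checked directly from the element-wise definition:

```r
# Toy check of the block-diagonal structure of K_q (not package code).
set.seed(1)
n <- 6
A <- c(1, 1, 0, 0, 1, 0)                 # treatment indicator
n1 <- sum(A); n0 <- n - n1

X <- matrix(rnorm(n * 3), n, 3)
K <- tcrossprod(X)                       # any symmetric kernel serves here

# Element-wise definition of the modified kernel
Kq <- outer(A, A) * K / n1^2 + outer(1 - A, 1 - A) * K / n0^2

# The treated-control cross-blocks are identically zero
all(Kq[A == 1, A == 0] == 0)             # TRUE
```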

The right-hand side vector $b$ has a similarly separable structure. Writing $\mathbf{r} = K \mathbf{1}$ (the row sums of $K$), we have $$b_i = \frac{A_i \, r_i}{n_1 \, n} + \frac{(1-A_i) \, r_i}{n_0 \, n}.$$

The full system decomposes into two independent sub-problems:

  • Treated block: solve $K_{tt} \, w_t = n_1^2 \, z_t$ of size $n_1$,
  • Control block: solve $K_{cc} \, w_c = n_0^2 \, z_c$ of size $n_0$,

where $z_t$ and $z_c$ are the constraint-adjusted right-hand sides for each group (each requiring two preliminary solves to determine the Lagrange multiplier).

The kernel factorization

The proximity kernel is $K = Z Z^\top / B$, where $Z$ is a sparse $n \times L$ indicator matrix ($L = \sum_{b=1}^B L_b$, the total number of leaves across all trees). Each row of $Z$ has exactly $B$ nonzero entries (one per tree), so $Z$ has $nB$ nonzeros in total. The $Z$ matrix is constructed in C++ directly in compressed sparse column (CSC) format, avoiding the overhead of R’s triplet sorting.
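A small hand-built example (illustrative only; the package builds $Z$ in C++ via leaf_node_kernel_Z) shows the factorization:

```r
library(Matrix)

# n = 4 observations, B = 2 trees; entry [i, b] is obs i's leaf in tree b.
leaf_mat <- cbind(c(1, 1, 2, 2),        # tree 1: 2 leaves
                  c(1, 2, 2, 1))        # tree 2: 2 leaves
n <- nrow(leaf_mat); B <- ncol(leaf_mat)

# Offset each tree's leaf ids so every (tree, leaf) pair gets its own column.
offsets <- c(0, cumsum(apply(leaf_mat, 2, max)))[seq_len(B)]
Z <- sparseMatrix(i = rep(seq_len(n), B),
                  j = as.vector(sweep(leaf_mat, 2, offsets, "+")),
                  x = 1)                # n x L, exactly B ones per row

K <- tcrossprod(Z) / B                  # proximity kernel K = Z Z' / B
```

Here K[1, 2] is 0.5: observations 1 and 2 share a leaf in tree 1 but not in tree 2.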

Direct solver (block Cholesky)

For moderate $n$, we form the sub-block kernels explicitly: $$K_{tt} = Z_t Z_t^\top / B, \qquad K_{cc} = Z_c Z_c^\top / B,$$ where $Z_t = Z[A{=}1, \,\cdot\,]$ and $Z_c = Z[A{=}0, \,\cdot\,]$. Each sub-block is computed via a sparse tcrossprod. The linear systems are then solved by sparse Cholesky factorization.

Computational cost: $O(nB \bar{s})$ for the two sub-block cross-products (where $\bar{s}$ is the average leaf size), plus $O(n_1^{3/2} + n_0^{3/2})$ for the sparse Cholesky factorizations (in the best case).
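The direct path can be sketched as follows (a standalone illustration with a simulated $Z$, a random right-hand side, and a tiny ridge to keep the toy system positive definite; the actual package internals may differ):

```r
library(Matrix)

set.seed(2)
n <- 200; B <- 50; Lb <- 5               # Lb leaves per tree (toy forest)
A <- rbinom(n, 1, 0.5)

# Stand-in indicator matrix: one random leaf per (observation, tree)
leaf <- matrix(sample(Lb, n * B, replace = TRUE), n, B)
Z <- sparseMatrix(i = rep(seq_len(n), B),
                  j = as.vector(leaf + Lb * (col(leaf) - 1)),
                  x = 1)

# Treated sub-block via sparse tcrossprod, then a sparse Cholesky solve.
Zt  <- Z[A == 1, , drop = FALSE]
Ktt <- tcrossprod(Zt) / B + Diagonal(sum(A), 1e-6)  # tiny ridge (toy only)
ch  <- Cholesky(forceSymmetric(Ktt))
r   <- rnorm(sum(A))                     # placeholder right-hand side
wt  <- solve(ch, r)
```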

CG solver (matrix-free)

For large $n$, forming $K_{tt}$ and $K_{cc}$ becomes expensive. The conjugate gradient (CG) solver avoids forming any kernel matrix by operating on the factored representation. To solve $$K_{tt} \, x = r \quad \Longleftrightarrow \quad Z_t Z_t^\top x = B \, r,$$ CG only needs matrix–vector products of the form $$v \;\mapsto\; Z_t \bigl(Z_t^\top v\bigr),$$ each costing $O(n_1 B)$ via two sparse matrix–vector multiplies. The same applies to the control block with $Z_c$.

Here, memory use is $O(nB)$ for $Z$ alone, versus $O(n^2)$ for the kernel. Each CG iteration costs $O(n_g B)$ (where $n_g$ is the group size). Convergence typically requires 100–200 iterations, independent of $n$, so the total cost is $O(n_g B \cdot T_{\text{iter}})$. The six required solves (three per block) are independent, and each converges in a similar number of iterations.

Computational cost: $O\bigl((n_1 + n_0) \, B \, T_{\text{iter}}\bigr)$ with $T_{\text{iter}} \approx 100$–$200$. This scales linearly in both $n$ and $B$, making CG the preferred solver for large problems.
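A minimal matrix-free CG sketch (standalone, with a simulated $Z$ and a ridge term added so the toy system is well-conditioned; the package's internal CG operates on the unregularized block systems and may differ in detail):

```r
library(Matrix)

set.seed(3)
n <- 400; B <- 50; Lb <- 25              # Lb leaves per tree (toy forest)
leaf <- matrix(sample(Lb, n * B, replace = TRUE), n, B)
Z <- sparseMatrix(i = rep(seq_len(n), B),
                  j = as.vector(leaf + Lb * (col(leaf) - 1)),
                  x = 1)

# One product with K = Z Z' / B costs two sparse multiplies, O(nB).
# The "+ v" is a ridge term for this toy example only.
matvec <- function(v) as.numeric(Z %*% crossprod(Z, v)) / B + v

cg <- function(b, tol = 1e-10, maxit = 500) {
  x <- numeric(length(b)); r <- b; p <- r; rs <- sum(r^2)
  for (it in seq_len(maxit)) {
    Ap    <- matvec(p)
    alpha <- rs / sum(p * Ap)
    x     <- x + alpha * p
    r     <- r - alpha * Ap
    rs_new <- sum(r^2)
    if (sqrt(rs_new) < tol * sqrt(sum(b^2))) break
    p  <- r + (rs_new / rs) * p
    rs <- rs_new
  }
  x
}

b <- rnorm(n)
x <- cg(b)                               # solves matvec(x) = b, no K formed
```

No $n \times n$ matrix is ever materialized; only $Z$ and a handful of length-$n$ vectors are held in memory.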

End-to-end timing

We benchmark the full forest_balance() pipeline (forest fitting, leaf extraction, $Z$ construction, and weight computation) across a range of sample sizes up to $n = 50{,}000$ with $B = 1{,}000$ trees. To show the effect of solver choice, we run each $n$ with both solver = "direct" and solver = "cg" where feasible:

n_vals <- c(500, 1000, 2500, 5000, 10000, 25000, 50000)
B <- 1000
p <- 10

bench <- do.call(rbind, lapply(n_vals, function(nn) {
  set.seed(123)
  dat <- simulate_data(n = nn, p = p)

  # Auto (default)
  t_auto <- system.time(
    fit_auto <- forest_balance(dat$X, dat$A, dat$Y, num.trees = B)
  )["elapsed"]

  # Direct (skip for n > 10000)
  if (nn <= 10000) {
    t_dir <- system.time(
      fit_dir <- forest_balance(dat$X, dat$A, dat$Y, num.trees = B,
                                solver = "direct")
    )["elapsed"]
  } else {
    t_dir <- NA
  }

  # CG (run for all n)
  t_cg <- system.time(
    fit_cg <- forest_balance(dat$X, dat$A, dat$Y, num.trees = B,
                             solver = "cg")
  )["elapsed"]

  data.frame(n = nn, trees = B,
             auto = t_auto, direct = t_dir, cg = t_cg,
             auto_solver = fit_auto$solver)
}))
Full pipeline time by solver.

| n | Trees | Direct (s) | CG (s) | Auto picks |
|------:|------:|-----------:|-------:|:-----------|
| 500 | 1000 | 0.08 | 0.15 | cg |
| 1000 | 1000 | 0.18 | 0.40 | direct |
| 2500 | 1000 | 0.73 | 0.99 | direct |
| 5000 | 1000 | 2.33 | 2.71 | direct |
| 10000 | 1000 | 10.75 | 9.23 | direct |
| 25000 | 1000 | NA | 85.57 | cg |
| 50000 | 1000 | NA | 248.59 | cg |
[Plot of chunk timing-plot]

The circled points show which solver "auto" selects at each $n$. The switchover depends on both the per-fold sample size and the adaptive min.node.size (which creates denser kernels at higher $p$).

Scaling with number of trees

tree_vals <- c(200, 500, 1000, 2000)
n_test <- c(1000, 5000, 25000)

tree_bench <- do.call(rbind, lapply(n_test, function(nn) {
  do.call(rbind, lapply(tree_vals, function(B) {
    set.seed(123)
    dat <- simulate_data(n = nn, p = 10)
    t <- system.time(
      fit <- forest_balance(dat$X, dat$A, dat$Y, num.trees = B)
    )["elapsed"]
    data.frame(n = nn, trees = B, time = t)
  }))
}))
Pipeline time across tree counts.

| n | Trees | Time (s) |
|------:|------:|---------:|
| 1000 | 200 | 0.06 |
| 1000 | 500 | 0.10 |
| 1000 | 1000 | 0.19 |
| 1000 | 2000 | 0.37 |
| 5000 | 200 | 0.97 |
| 5000 | 500 | 1.49 |
| 5000 | 1000 | 3.01 |
| 5000 | 2000 | 5.39 |
| 25000 | 200 | 22.00 |
| 25000 | 500 | 53.07 |
| 25000 | 1000 | 85.11 |
| 25000 | 2000 | 149.72 |
[Plot of chunk tree-plot]

The pipeline scales approximately linearly in both $n$ and $B$.

Pipeline stage breakdown

The following experiment profiles each stage of the pipeline individually to show where time is spent at different scales:

breakdown <- do.call(rbind, lapply(
  list(c(1000, 500), c(5000, 500), c(5000, 2000),
       c(25000, 500), c(25000, 1000)),
  function(cfg) {
    nn <- cfg[1]; B_val <- cfg[2]
    set.seed(123)
    dat <- simulate_data(n = nn, p = 10)
    mns <- max(20L, min(floor(nn / 200) + 10, floor(nn / 50)))

    # 1. Forest fitting
    t_fit <- system.time({
      resp <- scale(cbind(dat$A, dat$Y))
      forest <- grf::multi_regression_forest(dat$X, Y = resp,
                        num.trees = B_val, min.node.size = mns)
    })["elapsed"]

    # 2. Leaf extraction (Rcpp)
    t_leaf <- system.time(
      leaf_mat <- get_leaf_node_matrix(forest, dat$X)
    )["elapsed"]

    # 3. Z construction (Rcpp)
    t_z <- system.time(
      Z <- leaf_node_kernel_Z(leaf_mat)
    )["elapsed"]

    # 4. Weight computation (kernel_balance)
    t_bal <- system.time(
      w <- kernel_balance(dat$A, Z = Z, num.trees = B_val, solver = "cg")
    )["elapsed"]

    data.frame(n = nn, trees = B_val, mns = mns,
               fit = t_fit, leaf = t_leaf, Z = t_z, balance = t_bal,
               total = t_fit + t_leaf + t_z + t_bal)
  }
))
Wall-clock time for each pipeline stage across configurations (CG solver, single fold).

| n | Trees | mns | Forest fit (s) | Leaf extract (s) | Z build (s) | Balance (s) | Total (s) |
|------:|------:|----:|---------------:|-----------------:|------------:|------------:|----------:|
| 1000 | 500 | 20 | 0.03 | 0.02 | 0.004 | 0.13 | 0.18 |
| 5000 | 500 | 35 | 0.18 | 0.10 | 0.013 | 1.24 | 1.54 |
| 5000 | 2000 | 35 | 0.73 | 0.42 | 0.055 | 4.89 | 6.10 |
| 25000 | 500 | 135 | 1.12 | 0.52 | 0.077 | 54.11 | 55.83 |
| 25000 | 1000 | 135 | 2.09 | 1.04 | 0.207 | 86.41 | 89.74 |
[Plot of chunk breakdown-plot]

At small $n$, forest fitting and the balance solver take similar time. At large $n$, the CG solver dominates: its cost grows with the number of iterations and the size of the sparse matrix–vector products. The C++ stages (leaf extraction and $Z$ construction) remain a small fraction throughout.

Memory usage

When the direct solver is used, the kernel is formed as a sparse matrix. For the CG solver, only the sparse indicator matrix $Z$ is stored. The table below shows the actual memory usage compared to the theoretical dense kernel:

mem_data <- do.call(rbind, lapply(
  c(500, 1000, 2500, 5000, 10000, 25000, 50000),
  function(nn) {
    set.seed(123)
    dat <- simulate_data(n = nn, p = 10)
    B_val <- 1000
    mns <- max(20L, min(floor(nn / 200) + ncol(dat$X), floor(nn / 50)))

    fit_forest <- grf::multi_regression_forest(
      dat$X, scale(cbind(dat$A, dat$Y)),
      num.trees = B_val, min.node.size = mns
    )
    leaf_mat <- get_leaf_node_matrix(fit_forest, dat$X)

    Z <- leaf_node_kernel_Z(leaf_mat)
    z_mb <- as.numeric(object.size(Z)) / 1e6

    # Kernel sparsity (skip forming K for very large n)
    if (nn <= 10000) {
      K <- leaf_node_kernel(leaf_mat)
      pct_nz <- round(100 * length(K@x) / as.numeric(nn)^2, 1)
      k_mb <- round(as.numeric(object.size(K)) / 1e6, 1)
    } else {
      pct_nz <- NA
      k_mb <- NA
    }

    n_fold <- nn %/% 2
    auto_solver <- if (n_fold > 5000 || mns > n_fold / 20) "cg" else "direct"
    dense_mb <- round(8 * as.numeric(nn)^2 / 1e6, 0)

    data.frame(n = nn, mns = mns, pct_nz = pct_nz,
               sparse_MB = k_mb, Z_MB = round(z_mb, 1),
               dense_MB = dense_mb, solver = auto_solver)
  }
))
Memory usage with adaptive leaf size (p = 10).

| n | mns | Solver | K % nonzero | Stored | Actual (MB) | Dense K (MB) | Actual / Dense |
|------:|----:|:-------|------------:|:-----------|------------:|-------------:|---------------:|
| 500 | 20 | cg | 39.4% | Z (factor) | 6.0 | 2 | 300% |
| 1000 | 20 | direct | 28.8% | K (sparse) | 3.5 | 8 | 43.8% |
| 2500 | 22 | direct | 20.1% | K (sparse) | 15.1 | 50 | 30.2% |
| 5000 | 35 | direct | 16.0% | K (sparse) | 48.1 | 200 | 24.0% |
| 10000 | 60 | direct | 12.6% | K (sparse) | 151.7 | 800 | 19.0% |
| 25000 | 135 | cg | NA | Z (factor) | 300.4 | 5000 | 6.0% |
| 50000 | 260 | cg | NA | Z (factor) | 600.3 | 20000 | 3.0% |
[Plot of chunk memory-plot]

At $n = 50{,}000$, a dense kernel would require roughly 20 GB. The CG solver stores only the $Z$ matrix (about 600 MB), a small fraction of this.

Summary

  • The full forest_balance() pipeline scales to $n = 50{,}000$ with 1,000 trees.
  • All computationally intensive steps are in compiled code: C++ for leaf extraction and $Z$ construction (Rcpp), BLAS for sparse matrix operations, and CHOLMOD for sparse Cholesky.
  • The adaptive min.node.size heuristic adjusts leaf size to both $n$ and $p$, creating denser kernels than the old default of 10.
  • The solver switchover between direct and CG depends on both the per-fold sample size and the kernel density. The "auto" setting handles this transparently.
  • The pipeline scales approximately linearly in both $n$ and $B$.