
The overfitting problem

The forest balance estimator uses the data twice: once to fit the random forest that defines the kernel, and again to estimate the treatment effect using that kernel. This creates a subtle overfitting bias that persists even at large sample sizes.

To see this, we compare the standard (no cross-fitting) estimator with a small leaf size against both the cross-fitted estimator with the adaptive leaf size and an oracle IPW estimator that uses the true propensity scores. We use n = 5,000 and p = 50 covariates:

library(forestBalance)

set.seed(123)
nreps <- 2

res <- matrix(NA, nreps, 3,
              dimnames = list(NULL, c("No CF (mns=10)", "CF (default)", "Oracle IPW")))

for (r in seq_len(nreps)) {
  dat <- simulate_data(n = 5000, p = 50, ate = 0)

  # Standard: no cross-fitting, small min.node.size
  fit_nocf <- forest_balance(dat$X, dat$A, dat$Y,
                              cross.fitting = FALSE, min.node.size = 10,
                              num.trees = 500)
  res[r, "No CF (mns=10)"] <- fit_nocf$ate

  # Cross-fitted with adaptive leaf size (package default)
  fit_cf <- forest_balance(dat$X, dat$A, dat$Y, num.trees = 500)
  res[r, "CF (default)"] <- fit_cf$ate

  # Oracle IPW (true propensity scores)
  ps <- dat$propensity
  w_ipw <- ifelse(dat$A == 1, 1 / ps, 1 / (1 - ps))
  res[r, "Oracle IPW"] <- weighted.mean(dat$Y[dat$A == 1], w_ipw[dat$A == 1]) -
                           weighted.mean(dat$Y[dat$A == 0], w_ipw[dat$A == 0])
}
n = 5,000, p = 50, true ATE = 0, 2 replications.

Method            Bias     SD       RMSE
No CF (mns=10)    0.1912   0.0288   0.1934
CF (default)      0.0021   0.0314   0.0315
Oracle IPW        0.0228   0.0149   0.0272

The no-cross-fitting estimator with small leaves is substantially biased. The default cross-fitted estimator with the adaptive leaf size nearly eliminates the bias, with an RMSE close to that of the oracle IPW estimator.

Cross-fitting details

The idea follows the double/debiased machine learning (DML) framework of Chernozhukov et al. (2018), adapted to kernel energy balancing.

Let K denote the n × n proximity kernel built from a random forest trained on the full sample (X, A, Y). The kernel captures which observations are “similar” in terms of confounding structure. However, because K was estimated from the same data used to compute the ATE, the kernel overfits: it encodes information about the specific realisation of outcomes, not just the confounding structure. This creates a bias that does not vanish as n → ∞.
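As an illustrative sketch (not the package’s internal implementation), a forest proximity kernel can be computed from per-tree leaf memberships: K[i, j] is the fraction of trees in which observations i and j land in the same leaf. Here the leaf-id matrix is simulated rather than taken from a fitted forest:

```r
# Proximity kernel from an n x B matrix of leaf-node ids (one column per
# tree): K[i, j] = fraction of trees where i and j share a leaf.
proximity_kernel <- function(leaf_ids) {
  n <- nrow(leaf_ids)
  B <- ncol(leaf_ids)
  K <- matrix(0, n, n)
  for (b in seq_len(B)) {
    # logical matrix: do i and j fall in the same leaf of tree b?
    same_leaf <- outer(leaf_ids[, b], leaf_ids[, b], "==")
    K <- K + same_leaf
  }
  K / B
}

set.seed(1)
# toy stand-in for the leaf memberships of 6 observations across 10 trees
leaf_ids <- matrix(sample(1:4, 6 * 10, replace = TRUE), nrow = 6)
K <- proximity_kernel(leaf_ids)
all(diag(K) == 1)  # every observation always shares its own leaf
```

By construction the kernel is symmetric with entries in [0, 1], so it can serve as a similarity measure for the balancing step.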

K-fold cross-fitting

Given K folds, the cross-fitted estimator proceeds as follows:

  1. Randomly partition the data into K roughly equal folds F_1, …, F_K.

  2. For each fold k = 1, …, K:

    1. Train a multi_regression_forest on the data in F_{-k} (all folds except k).
    2. Using this held-out forest, predict leaf node memberships for the observations in fold F_k.
    3. Build the proximity kernel K_k (n_k × n_k) from these out-of-sample leaf predictions.
    4. Compute kernel energy balancing weights w_k for the observations in fold F_k using K_k.
    5. Estimate the fold-level ATE via the Hajek estimator:
       \hat\tau_k = \frac{\sum_{i \in F_k} w_i A_i Y_i}{\sum_{i \in F_k} w_i A_i} - \frac{\sum_{i \in F_k} w_i (1-A_i) Y_i}{\sum_{i \in F_k} w_i (1-A_i)}.

  3. Average the fold-level estimates (DML1):
     \hat\tau_{\mathrm{CF}} = \frac{1}{K} \sum_{k=1}^{K} \hat\tau_k.
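The fold loop and the DML1 average can be sketched in a few lines of base R. In this sketch, uniform weights stand in for the kernel balancing weights of step 2.4 (so each fold estimate reduces to a difference in means); the forest training and weight computation are the parts forest_balance supplies:

```r
set.seed(1)
n <- 200; K <- 2
A <- rbinom(n, 1, 0.5)               # toy treatment indicator
Y <- rnorm(n)                        # toy outcome
folds <- sample(rep(seq_len(K), length.out = n))  # step 1: random partition

tau_k <- sapply(seq_len(K), function(k) {
  idx <- which(folds == k)
  # Steps 2.1-2.4 would train a forest on -idx and compute balancing
  # weights; uniform weights are used here as a placeholder.
  w <- rep(1, length(idx))
  Ak <- A[idx]; Yk <- Y[idx]
  # Step 2.5: Hajek estimator within the fold
  sum(w * Ak * Yk) / sum(w * Ak) -
    sum(w * (1 - Ak) * Yk) / sum(w * (1 - Ak))
})
tau_cf <- mean(tau_k)  # step 3: DML1 average of fold estimates
```

With uniform weights each fold-level Hajek estimate is exactly the treated-minus-control mean difference within that fold, which makes the role of the weights in steps 2.3-2.4 easy to isolate.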

The role of leaf size

Cross-fitting alone is not sufficient to eliminate bias. The minimum leaf size (min.node.size) plays a crucial role:

  • Small leaves (e.g., min.node.size = 5): the kernel is very granular, distinguishing observations at a fine scale. But with out-of-sample predictions, small leaves lead to noisy similarity estimates—two observations may share a tiny leaf by chance rather than true similarity.

  • Large leaves (e.g., min.node.size = 50–100): the kernel captures broader confounding structure. Out-of-sample predictions are more stable because large leaves better represent population-level similarity.

The optimal leaf size scales with both nn (more data supports finer leaves) and pp (more covariates require coarser leaves to avoid the curse of dimensionality). forestBalance uses an adaptive heuristic:

\mathrm{min.node.size} = \max\!\Big(20,\;\min\!\big(\lfloor n/200 \rfloor + p,\;\lfloor n/50 \rfloor\big)\Big).
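The heuristic is easy to evaluate directly. Here adaptive_mns is a hypothetical helper written out from the formula, not a package export:

```r
# Adaptive leaf-size heuristic: grows with n, shifted up by p,
# capped by n/50 and floored at 20.
adaptive_mns <- function(n, p) {
  max(20, min(floor(n / 200) + p, floor(n / 50)))
}

adaptive_mns(5000, 50)   # 75: the default used in the simulations below
adaptive_mns(2000, 10)   # 20
adaptive_mns(100, 5)     # 20: the floor of 20 binds for small n
adaptive_mns(50000, 50)  # 300
```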

set.seed(123)
nreps <- 2
node_sizes <- c(5, 10, 20, 50, 75, 100)
n <- 5000; p <- 50

leaf_res <- do.call(rbind, lapply(node_sizes, function(mns) {
  ates <- replicate(nreps, {
    dat <- simulate_data(n = n, p = p, ate = 0)
    forest_balance(dat$X, dat$A, dat$Y, num.trees = 500,
                   min.node.size = mns)$ate
  })
  # true ATE is 0, so bias = mean(ates) and RMSE^2 = bias^2 + variance
  data.frame(mns = mns, bias = mean(ates), sd = sd(ates),
             rmse = sqrt(mean(ates)^2 + var(ates)))
}))

heuristic <- max(20, min(floor(n / 200) + p, floor(n / 50)))
Cross-fitted estimator (n = 5,000, p = 50, 2 reps). Arrow marks the adaptive default (mns = 75).

min.node.size    Bias      SD       RMSE
5                 0.1716   0.0134   0.1721
10                0.1017   0.0015   0.1017
20                0.0153   0.0447   0.0472
50                0.0481   0.0063   0.0485
75               -0.0099   0.0351   0.0364   ←
100               0.0332   0.0007   0.0332

Bias broadly decreases with larger leaves until variance begins to dominate. The adaptive default balances bias reduction against variance.

Practical usage

The default call uses cross-fitting with the adaptive leaf size:

dat <- simulate_data(n = 2000, p = 10, ate = 0)
fit <- forest_balance(dat$X, dat$A, dat$Y)
fit
#> Forest Kernel Energy Balancing (cross-fitted)
#> -------------------------------------------------- 
#>   n = 2,000  (n_treated = 745, n_control = 1255)
#>   Trees: 1000
#>   Cross-fitting: 2 folds
#>   Solver: direct
#>   ATE estimate: -0.0315
#>   Fold ATEs: -0.0808, 0.0177
#>   ESS: treated = 529/745 (71%)   control = 878/1255 (70%)
#> -------------------------------------------------- 
#> Use summary() for covariate balance details.

To disable cross-fitting (e.g., for speed or to inspect the kernel):

fit_nocf <- forest_balance(dat$X, dat$A, dat$Y, cross.fitting = FALSE)
fit_nocf
#> Forest Kernel Energy Balancing
#> -------------------------------------------------- 
#>   n = 2,000  (n_treated = 745, n_control = 1255)
#>   Trees: 1000
#>   Solver: direct
#>   ATE estimate: -0.0215
#>   ESS: treated = 459/745 (62%)   control = 759/1255 (60%)
#> -------------------------------------------------- 
#> Use summary() for covariate balance details.

To manually set the leaf size:

fit_custom <- forest_balance(dat$X, dat$A, dat$Y, min.node.size = 50)
fit_custom
#> Forest Kernel Energy Balancing (cross-fitted)
#> -------------------------------------------------- 
#>   n = 2,000  (n_treated = 745, n_control = 1255)
#>   Trees: 1000
#>   Cross-fitting: 2 folds
#>   Solver: direct
#>   ATE estimate: -0.0784
#>   Fold ATEs: -0.0918, -0.065
#>   ESS: treated = 424/745 (57%)   control = 739/1255 (59%)
#> -------------------------------------------------- 
#> Use summary() for covariate balance details.

Choosing the number of folds

The default is num.folds = 2 (sample splitting). With two folds, each fold retains half the observations, producing high-quality per-fold kernels. More folds train the forest on more data but produce smaller per-fold kernels. Our experiments show that the number of folds has a modest effect compared to the leaf size. Values of 2–5 all work well:

set.seed(123)
nreps <- 2
n <- 5000

fold_res <- do.call(rbind, lapply(c(2, 3, 5, 10), function(nfolds) {
  ates <- replicate(nreps, {
    dat <- simulate_data(n = n, p = 10, ate = 0)
    forest_balance(dat$X, dat$A, dat$Y, num.trees = 500,
                   num.folds = nfolds)$ate
  })
  data.frame(folds = nfolds, bias = round(mean(ates), 4),
             sd = round(sd(ates), 4),
             rmse = round(sqrt(mean(ates)^2 + var(ates)), 4))
}))
Effect of number of folds (n = 5,000, adaptive leaf size).

Folds   Bias     SD       RMSE
2       0.0907   0.0456   0.1015
3       0.0246   0.0275   0.0369
5       0.1025   0.0096   0.1030
10      0.2065   0.1269   0.2424

References

Chernozhukov, V., Chetverikov, D., Demirer, M., Duflo, E., Hansen, C., Newey, W. and Robins, J. (2018). Double/debiased machine learning for treatment and structural parameters. The Econometrics Journal, 21(1), C1–C68.

De, S. and Huling, J.D. (2025). Data adaptive covariate balancing for causal effect estimation for high dimensional data. arXiv preprint arXiv:2512.18069.