## The overfitting problem
The forest balance estimator uses the data twice: once to fit the random forest that defines the kernel, and again to estimate the treatment effect using that kernel. This creates a subtle overfitting bias that persists even at large sample sizes.
To see this, we compare the standard (no cross-fitting) estimator with a small leaf size against the cross-fitted estimator with adaptive leaf size, and against an oracle that uses the true propensity scores. We use $n = 5000$ observations and $p = 50$ covariates:
```r
library(forestBalance)
set.seed(123)

nreps <- 2
res <- matrix(NA, nreps, 3,
              dimnames = list(NULL, c("No CF (mns=10)", "CF (default)", "Oracle IPW")))

for (r in seq_len(nreps)) {
  dat <- simulate_data(n = 5000, p = 50, ate = 0)

  # Standard: no cross-fitting, small min.node.size
  fit_nocf <- forest_balance(dat$X, dat$A, dat$Y,
                             cross.fitting = FALSE, min.node.size = 10,
                             num.trees = 500)
  res[r, "No CF (mns=10)"] <- fit_nocf$ate

  # Cross-fitted with adaptive leaf size (package default)
  fit_cf <- forest_balance(dat$X, dat$A, dat$Y, num.trees = 500)
  res[r, "CF (default)"] <- fit_cf$ate

  # Oracle IPW (true propensity scores)
  ps <- dat$propensity
  w_ipw <- ifelse(dat$A == 1, 1 / ps, 1 / (1 - ps))
  res[r, "Oracle IPW"] <- weighted.mean(dat$Y[dat$A == 1], w_ipw[dat$A == 1]) -
    weighted.mean(dat$Y[dat$A == 0], w_ipw[dat$A == 0])
}
```

| Method | Bias | SD | RMSE |
|---|---|---|---|
| No CF (mns=10) | 0.1912 | 0.0288 | 0.1934 |
| CF (default) | 0.0021 | 0.0314 | 0.0315 |
| Oracle IPW | 0.0228 | 0.0149 | 0.0272 |
The no-cross-fitting estimator with small leaves has substantial bias. The default cross-fitted estimator with adaptive leaf size achieves much lower bias and an RMSE close to that of the oracle IPW estimator.
## Cross-fitting details
The idea follows the double/debiased machine learning (DML) framework of Chernozhukov et al. (2018), adapted to kernel energy balancing.
Let $\hat{K}$ denote the proximity kernel built from a random forest trained on the full sample $\{(X_i, A_i, Y_i)\}_{i=1}^n$. The kernel captures which observations are “similar” in terms of confounding structure. However, because $\hat{K}$ was estimated from the same data used to compute the ATE, the kernel overfits: it encodes information about the specific realisation of outcomes, not just the confounding structure. This creates a bias that does not vanish as $n \to \infty$.
### K-fold cross-fitting

Given $K$ folds, the cross-fitted estimator proceeds as follows:

1. Randomly partition the data into $K$ roughly equal folds $I_1, \dots, I_K$.
2. For each fold $k = 1, \dots, K$:
   - Train a `multi_regression_forest` on the data in folds $I_{-k}$ (all folds except $I_k$).
   - Using this held-out forest, predict leaf node memberships for the observations in fold $I_k$.
   - Build the proximity kernel $\hat{K}^{(-k)}$ from these out-of-sample leaf predictions.
   - Compute kernel energy balancing weights $\hat{w}_i$ for the observations in fold $I_k$ using $\hat{K}^{(-k)}$.
   - Estimate the fold-level ATE via the Hajek estimator:
     $$\hat{\tau}_k = \frac{\sum_{i \in I_k} \hat{w}_i A_i Y_i}{\sum_{i \in I_k} \hat{w}_i A_i} - \frac{\sum_{i \in I_k} \hat{w}_i (1 - A_i) Y_i}{\sum_{i \in I_k} \hat{w}_i (1 - A_i)}.$$
3. Average the fold-level estimates (DML1):
   $$\hat{\tau} = \frac{1}{K} \sum_{k=1}^{K} \hat{\tau}_k.$$
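The fold loop and the DML1 average can be sketched in base R. This is only a schematic: uniform weights stand in for the kernel energy balancing weights of step 2, which require the held-out forest and the package internals.

```r
# Schematic of the K-fold cross-fitting loop with a Hajek estimator per fold.
# NOTE: w is a uniform-weight placeholder for the balancing weights.
set.seed(1)
n <- 1000
A <- rbinom(n, 1, 0.5)               # treatment indicator
Y <- rnorm(n)                        # outcomes with true ATE = 0
K <- 2
fold_id <- sample(rep(seq_len(K), length.out = n))  # step 1: partition

tau_k <- sapply(seq_len(K), function(k) {
  idx <- which(fold_id == k)
  w   <- rep(1, length(idx))         # placeholder for balancing weights
  a   <- A[idx]; y <- Y[idx]
  # Hajek estimator within fold k
  sum(w * a * y) / sum(w * a) - sum(w * (1 - a) * y) / sum(w * (1 - a))
})

tau_hat <- mean(tau_k)               # step 3: DML1 average
```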
## The role of leaf size

Cross-fitting alone is not sufficient to eliminate bias. The minimum leaf size (`min.node.size`) plays a crucial role:

- **Small leaves** (e.g., `min.node.size = 5`): the kernel is very granular, distinguishing observations at a fine scale. But with out-of-sample predictions, small leaves lead to noisy similarity estimates: two observations may share a tiny leaf by chance rather than through true similarity.
- **Large leaves** (e.g., `min.node.size = 50`–`100`): the kernel captures broader confounding structure. Out-of-sample predictions are more stable because large leaves better represent population-level similarity.

The optimal leaf size scales with both $n$ (more data supports finer leaves) and $p$ (more covariates require coarser leaves to avoid the curse of dimensionality). `forestBalance` uses an adaptive heuristic:
```r
set.seed(123)
nreps <- 2
node_sizes <- c(5, 10, 20, 50, 75, 100)
n <- 5000; p <- 50

leaf_res <- do.call(rbind, lapply(node_sizes, function(mns) {
  ates <- replicate(nreps, {
    dat <- simulate_data(n = n, p = p, ate = 0)
    forest_balance(dat$X, dat$A, dat$Y, num.trees = 500,
                   min.node.size = mns)$ate
  })
  data.frame(mns = mns, bias = mean(ates), sd = sd(ates),
             rmse = sqrt(mean(ates)^2 + var(ates)))
}))

# Adaptive default used by the package
heuristic <- max(20, min(floor(n / 200) + p, floor(n / 50)))
```

| min.node.size | Bias | SD | RMSE | |
|---|---|---|---|---|
| 5 | 0.1716 | 0.0134 | 0.1721 | |
| 10 | 0.1017 | 0.0015 | 0.1017 | |
| 20 | 0.0153 | 0.0447 | 0.0472 | |
| 50 | 0.0481 | 0.0063 | 0.0485 | |
| 75 | -0.0099 | 0.0351 | 0.0364 | ← adaptive default |
| 100 | 0.0332 | 0.0007 | 0.0332 |
Bias decreases with larger leaves until variance begins to dominate. The adaptive default balances bias reduction against variance.
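For reference, the adaptive heuristic from the chunk above can be wrapped as a small function and evaluated directly (`adaptive_mns` is an illustrative name, not a package export):

```r
# Adaptive min.node.size heuristic:
#   max(20, min(floor(n / 200) + p, floor(n / 50)))
adaptive_mns <- function(n, p) {
  max(20, min(floor(n / 200) + p, floor(n / 50)))
}

adaptive_mns(5000, 50)  # 75: the adaptive default for the simulation above
adaptive_mns(2000, 10)  # 20: the floor of 20 binds for small problems
```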
## Practical usage
The default call uses cross-fitting with the adaptive leaf size:
```r
dat <- simulate_data(n = 2000, p = 10, ate = 0)
fit <- forest_balance(dat$X, dat$A, dat$Y)
fit
#> Forest Kernel Energy Balancing (cross-fitted)
#> --------------------------------------------------
#> n = 2,000 (n_treated = 745, n_control = 1255)
#> Trees: 1000
#> Cross-fitting: 2 folds
#> Solver: direct
#> ATE estimate: -0.0315
#> Fold ATEs: -0.0808, 0.0177
#> ESS: treated = 529/745 (71%) control = 878/1255 (70%)
#> --------------------------------------------------
#> Use summary() for covariate balance details.
```

To disable cross-fitting (e.g., for speed or to inspect the kernel):
```r
fit_nocf <- forest_balance(dat$X, dat$A, dat$Y, cross.fitting = FALSE)
fit_nocf
#> Forest Kernel Energy Balancing
#> --------------------------------------------------
#> n = 2,000 (n_treated = 745, n_control = 1255)
#> Trees: 1000
#> Solver: direct
#> ATE estimate: -0.0215
#> ESS: treated = 459/745 (62%) control = 759/1255 (60%)
#> --------------------------------------------------
#> Use summary() for covariate balance details.
```

To manually set the leaf size:
```r
fit_custom <- forest_balance(dat$X, dat$A, dat$Y, min.node.size = 50)
fit_custom
#> Forest Kernel Energy Balancing (cross-fitted)
#> --------------------------------------------------
#> n = 2,000 (n_treated = 745, n_control = 1255)
#> Trees: 1000
#> Cross-fitting: 2 folds
#> Solver: direct
#> ATE estimate: -0.0784
#> Fold ATEs: -0.0918, -0.065
#> ESS: treated = 424/745 (57%) control = 739/1255 (59%)
#> --------------------------------------------------
#> Use summary() for covariate balance details.
```

## Choosing the number of folds
The default is `num.folds = 2` (sample splitting). With two folds, each fold retains half the observations, producing high-quality per-fold kernels. More folds train the forest on more data but produce smaller per-fold kernels. Our experiments show that the number of folds has a modest effect compared to the leaf size. Values of 2–5 all work well:
```r
set.seed(123)
nreps <- 2
n <- 5000

fold_res <- do.call(rbind, lapply(c(2, 3, 5, 10), function(nfolds) {
  ates <- replicate(nreps, {
    dat <- simulate_data(n = n, p = 10, ate = 0)
    forest_balance(dat$X, dat$A, dat$Y, num.trees = 500,
                   num.folds = nfolds)$ate
  })
  data.frame(folds = nfolds, bias = round(mean(ates), 4),
             sd = round(sd(ates), 4),
             rmse = round(sqrt(mean(ates)^2 + var(ates)), 4))
}))
```

| Folds | Bias | SD | RMSE |
|---|---|---|---|
| 2 | 0.0907 | 0.0456 | 0.1015 |
| 3 | 0.0246 | 0.0275 | 0.0369 |
| 5 | 0.1025 | 0.0096 | 0.1030 |
| 10 | 0.2065 | 0.1269 | 0.2424 |
## References
Chernozhukov, V., Chetverikov, D., Demirer, M., Duflo, E., Hansen, C., Newey, W. and Robins, J. (2018). Double/debiased machine learning for treatment and structural parameters. The Econometrics Journal, 21(1), C1–C68.
De, S. and Huling, J.D. (2025). Data adaptive covariate balancing for causal effect estimation for high dimensional data. arXiv preprint arXiv:2512.18069.