Compute random forest proximity kernel from a leaf node matrix
Source: R/leaf_node_kernel.R
leaf_node_kernel.Rd
Given a matrix of leaf node assignments (observations x trees), computes the \(n \times n\) kernel matrix where entry \((i,j)\) is the proportion of trees in which observations \(i\) and \(j\) fall into the same leaf node.
Arguments
- leaf_matrix
An integer matrix of dimension \(n \times B\), where leaf_matrix[i, b] is the leaf node ID for observation i in tree b. Typically produced by get_leaf_node_matrix.
- sparse
Logical; if TRUE (default), returns a sparse dgCMatrix from the Matrix package. If FALSE, returns a dense base R matrix.
Value
A symmetric \(n \times n\) matrix (sparse or dense depending on sparse) where entry \((i,j)\) equals
$$K(i,j) = \frac{1}{B} \sum_{b=1}^{B} \mathbf{1}(L_b(i) = L_b(j)),$$
i.e., the fraction of trees where \(i\) and \(j\) share a leaf.
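The formula can be checked by hand on a small input. The following is a minimal base R sketch that evaluates \(K(i,j)\) directly, one tree at a time; the helper name naive_leaf_kernel and the toy leaf IDs are made up for illustration and are not part of the package. It is a dense \(O(n^2 B)\) reference computation, useful only for sanity checks on small inputs.

```r
# Toy leaf matrix: 4 observations (rows) x 3 trees (columns).
# matrix() fills column by column, so each line below is one tree.
leaf_matrix <- matrix(
  c(1L, 1L, 2L, 2L,   # tree 1
    3L, 4L, 3L, 4L,   # tree 2
    5L, 5L, 5L, 6L),  # tree 3
  nrow = 4, ncol = 3
)

# Hypothetical reference helper (illustration only, not the package function).
naive_leaf_kernel <- function(leaf_matrix) {
  n <- nrow(leaf_matrix)
  B <- ncol(leaf_matrix)
  K <- matrix(0, n, n)
  for (b in seq_len(B)) {
    # outer(..., "==") is TRUE where observations i and j share a leaf in tree b
    K <- K + outer(leaf_matrix[, b], leaf_matrix[, b], "==")
  }
  K / B  # fraction of trees in which each pair shares a leaf
}

K_naive <- naive_leaf_kernel(leaf_matrix)
```

In this toy input, observations 1 and 2 share a leaf in trees 1 and 3 but not in tree 2, so K_naive[1, 2] is 2/3, while observations 1 and 4 never share a leaf, so K_naive[1, 4] is 0.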
Details
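For each tree \(b\), the leaf assignment defines a sparse \(n \times L_b\) indicator matrix \(Z_b\), where \(L_b\) here denotes the number of leaves in tree \(b\) and row \(i\) of \(Z_b\) has a single 1 in the column of the leaf containing observation \(i\). This function stacks all \(B\) indicator matrices column-wise into a single sparse matrix \(Z = [Z_1 | Z_2 | \cdots | Z_B]\) of dimension \(n \times \sum_b L_b\). Because \((Z_b Z_b^\top)_{ij} = 1\) exactly when \(i\) and \(j\) share a leaf in tree \(b\), the kernel is then obtained via a single sparse cross-product: \(K = Z Z^\top / B\). Leaf ID remapping is done in C++ for speed.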
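The stacked-indicator construction described above can be sketched in plain R with the Matrix package. This is only an illustration under stated assumptions, not the package's implementation: the helper name sketch_leaf_kernel is hypothetical, the leaf ID remapping is done with match() in R rather than in C++, and the class of the returned sparse matrix may differ from the dgCMatrix the real function documents.

```r
library(Matrix)

# Hypothetical sketch of the Z Z^T / B construction (not the package's code).
# Assumes leaf_matrix has at least two rows, so apply() returns a matrix.
sketch_leaf_kernel <- function(leaf_matrix) {
  n <- nrow(leaf_matrix)
  B <- ncol(leaf_matrix)

  # Remap the leaf IDs of each tree to consecutive integers 1..L_b.
  local_id <- apply(leaf_matrix, 2, function(col) match(col, unique(col)))

  n_leaves <- apply(local_id, 2, max)     # L_b: number of leaves seen per tree
  offset <- cumsum(c(0, n_leaves[-B]))    # first column of each Z_b block, minus 1
  col_idx <- sweep(local_id, 2, offset, "+")

  # Z = [Z_1 | ... | Z_B]: one nonzero per (observation, tree) pair.
  Z <- sparseMatrix(
    i = rep(seq_len(n), times = B),
    j = as.vector(col_idx),
    x = rep(1, n * B),
    dims = c(n, sum(n_leaves))
  )

  tcrossprod(Z) / B                       # Z %*% t(Z) / B
}

# Toy input: 4 observations x 3 trees (made-up leaf IDs, one tree per line).
leaf_matrix <- matrix(
  c(1L, 1L, 2L, 2L,
    3L, 4L, 3L, 4L,
    5L, 5L, 5L, 6L),
  nrow = 4, ncol = 3
)
K_sketch <- sketch_leaf_kernel(leaf_matrix)
```

Each row of Z has exactly B nonzeros, so Z stays very sparse even for large forests; the density of K itself depends on how often pairs of observations share a leaf.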
Examples
# \donttest{
library(grf)
n <- 100
p <- 5
X <- matrix(rnorm(n * p), n, p)
Y <- cbind(X[, 1] + rnorm(n), X[, 2] + rnorm(n))
forest <- multi_regression_forest(X, Y, num.trees = 50)
leaf_mat <- get_leaf_node_matrix(forest)
K <- leaf_node_kernel(leaf_mat) # sparse
K_dense <- leaf_node_kernel(leaf_mat, sparse = FALSE) # dense
# }