Statistical Machine Learning II

Variable relation analysis utilizing surrogate variables in random forests
Stephan Seifert1, Sven Gundlach2, Silke Szymczak3
1University of Hamburg; 2Kiel University; 3University of Lübeck

The machine learning approach random forests [1] can be successfully applied to omics data, such as gene expression data, for classification or regression. However, the interpretation of the trained prediction models is currently mainly limited to the selection of relevant variables identified based on so-called importance measurements of each individual variable. Thus, relationships between the predictor variables are not considered. We developed a new RF based variable selection method called Surrogate Minimal Depth (SMD) that incorporates variable relations into the selection process of important variables. [2] This is achieved by the exploitation of surrogate variables that have originally been introduced to deal with missing predictor variables. [3] In addition to improving variable selection, surrogate variables and their relationship to the primary split variables measured by the parameter mean adjusted agreement can also be utilized as proxy for the relations between the different variables. This relation analysis goes beyond the investigation of ordinary correlation coefficients because it takes into account the association with the outcome. I will present the basic concept of surrogate variables and mean adjusted agreement, as well as the relation analysis of simulated data as proof of concept and the investigation of experimental breast cancer gene expression datasets to show the practical applicability of this new approach.


[1] L. Breiman, Mach. Learn. 2001, 45, 5-32.

[2] S. Seifert, S. Gundlach, S. Szymczak, Bioinformatics 2019, 35, 3663-3671.

[3] L. Breiman, J. Friedman, C. J. Stone, R. A. Olshen, Classification and Regression Trees, Taylor & Francis, 1984.

Variable Importance in Random Forests in the Presence of Confounding
Robert Miltenberger, Christoph Wies, Gunter Grieser, Antje Jahn
University of applied sciences Darmstadt, Deutschland

Patients with a need for kidney transplantation suffer from a lack of available organ donors. Still, patients commonly reject an allocated kidney when they consider its quality to be insufficient [1]. Rejection is of major concern as it can reduce the organs quality due to prolonged ischemic time and thus its use for further patients. To better understand the association between organ quality and patient prognosis after transplantation, random survival forests will be applied to data on more than 50.000 kidney transplantations of the US organ transplantation registry. However, the US allocation process is allocating kidneys of high quality to patients with good prognosis. Thus confounding is of major concern and needs to be adressed.

In this talk, we investigate methods to address confounding in random forest analysis by using residuals from a generalized propensity score analysis. We show, that by considering the residuals instead of original variables the permutation variable importance measures refer to semipartial correlations between outcome and variable instead of correlations that are disturbed by confounder effects. This facilitates the interpretation of the variable importance measure. As our findings rely on linear models, we further investigate the approach for non-linear and non-additive models by the use of simulations.

The proposed method is used to analyse the impact of kidney quality on failure-free survival after transplantation based on the US registry data. Results are compared to other methods, that have been proposed for a better understanding and explainability of random forest analyses [2].

[1] Husain SA Association Between Declined Offers of Deceased Donor Kidney Allograft and Outcomes in Kidney Transplant Candidates. JAMA Netw Open. 2019; doi:10.1001/jamanetworkopen.2019.10312

[2] Paluszynska A, Przemyslaw Biecek P and Jiang Y (2020). randomForestExplainer: Explaining and Visualizing Random Forests in Terms of Variable Importance. R package version 0.10.1.

Interaction forests: Identifying and exploiting influential quantitative and qualitative interaction effects
Roman Hornung
University of Munich, Germany

Even though interaction effects are omnipresent in biomedical data and play a particularly prominent role in genetics, they are given little attention in analysis, in particular in prediction modelling. Identifying influential interaction effects is valuable, both, because they allow important insights into the interplay between the covariates and because these effects can be used to improve the prediction performance of automatic prediction rules.

Random forest is one of the most popular machine learning methods and known for its ability to capture complex non-linear dependencies between the covariates and the outcome. A key feature of random forest is that it allows to rank the considered covariates with respect to their contribution to prediction using various variable importance measures.

We developed ‚interaction forest‘, a variation of random forest for categorical, metric, and survival outcomes that explicitly considers several types of interaction effects in the splitting performed by the trees constituting the forest. The new ‚effect importance measure (EIM)‘ associated with interaction forest allows to rank the interaction effects between the covariate pairs with respect to their importance for prediction in addition to ranking the univariable effects of the covariates in this respect. Using EIM, separate importance value lists for univariable effects, quantitative interaction effects, and qualitative interaction effects are provided. In a real data study using 220 publicly available data sets it is seen that the prediction performance of interaction forest is statistically significantly better than that of random forest and competing random forest variants that, as does interaction forest, use multivariable splitting. Moreover, a simulation study suggests that EIM allows to identify consistently the relevant quantitative and qualitative interaction effects in datasets. Here, the rankings obtained from the EIM value lists for quantitative interaction effects on the one hand and qualitative interaction effects on the other are confirmed to be specific for each of these two types of interaction effects. These results indicate that interaction forest is a suitable tool for identifying and making use of relevant interaction effects in prediction modelling.

Identification of representative trees in random forests based on a new tree-based distance measure
Björn-Hergen Laabs, Inke R. König
Institut für Medizinische Biometrie und Statistik, Universität zu Lübeck, Universitätsklinikum Schleswig-Holstein, Campus Lübeck, Germany

In life sciences random forests are often used to train predictive models, but it is rather complex to gain any explanatory insight into the mechanics leading to a specific outcome, which impedes the implementation of random forests in clinical practice. Typically, variable importance measures are used, but they can neither explain how a variable influences the outcome nor find interactions between variables; furthermore, they ignore the tree structure in the forest in total. A different approach is to select a single or a set of a few trees from the ensemble which best represent the forest. It is hoped that by simplifying a complex ensemble of decision trees to a set of a few representative trees, it is possible to observe common tree structures, the importance of specific features and variable interactions. Thus, representative trees could also help to understand interactions between genetic variants.

The intuitive definition of representative trees are those with the minimal distance to all other trees, which requires a proper definition of the distance between two trees. The currently proposed tree-based distance metrics[1] compare trees regarding either the prediction, the clustering in the terminal nodes, or the variables that were used for splitting. Therefore they either need an additional data set for calculating the distances or capture only few aspects of the tree architecture. Thus, we developed a new tree-based distance measure, which does not use an additional data set and incorporates more of the tree structure, by evaluating not only whether a certain variable was used for splitting in the tree, but also where in the tree it was used. We compared our new method with the existing metrics in an extensive simulation study and show that our new distance metric is superior in depicting the differences in tree structures. Furthermore, we found that the most representative tree selected by our method has the best prediction performance on independent validation data compared to the trees selected by other metrics.

[1] Banerjee et al. (2012), Identifying representative trees from ensembles, Statistics in Medicine 31(15), 1601-16

A Machine Learning Approach to Empirical Dynamic Modeling for Biochemical Systems
Kevin Siswandi
University of Freiburg, Germany


In the biosciences, dynamic modeling plays a very important role for understanding and predicting the temporal behaviour of biochemical systems, with wide-ranging applications from bioengineering to precision medicine. Traditionally, dynamic modeling (e.g. in systems biology) is commonly done with Ordinary Differential Equations (ODEs) to predict system dynamics. Such models are typically constructed based on first-principles equations (e.g. Michaelis-Menten kinetics) that are further iteratively modified to be consistent with experiments. Consequently, it could well take several years before a model is quantitatively predictive. Moreover, such ODE models do not scale with increasing amounts of data. At the same time, the demand for high accuracy predictions is increasing in the biotechnology and synthetic biology industry. Here, we investigate a data-driven approach based on machine learning for empirical dynamic modeling that can allow for faster development relative to traditional first-principles modeling, with a particular focus on biochemical systems.


We present a numerical framework for a machine learning approach to discover dynamics from time-series data. The main workflow consists of data augmentation, model training and validation, numerical integration, and model explanation. In contrast to other works, our method does not assume any prior (biological) knowledge or governing equations.

Specifically, by posing it as a supervised learning problem, the dynamics can be reconstructed from time-series measurements through solving the resulting optimisation problem. This is done by embedding it within the classical framework of a numerical method (e.g. linear multi-step method or LMM). We evaluate this approach on canonical systems and complex biochemical systems with nonlinear dynamics.


We show that this method can discover the dynamics of our test systems given enough data. We further find that it could discover bifurcations, is robust to noise, and capable of leveraging additional data to improve its prediction accuracy at scale. Finally, we employ various explainability studies to extract mechanistic insights from the biochemical systems.


By avoiding assumptions about specific mechanisms, we are able to propose a general machine learning workflow. Thus, it can be applied to any new systems (e.g. pathways or hosts), and could be used to capture complex dynamic relationships which are still unknown in the literature. We believe that it has the potential to accelerate the development of predictive dynamic models due to its data-driven approach.