Young Statisticians Session

Predictions by random forests – confidence intervals and their coverage probabilities
Diana Kormilez, Björn-Hergen Laabs, Inke R. König
Institut für Medizinische Biometrie und Statistik, Universität zu Lübeck, Universitätsklinikum Schleswig-Holstein, Campus Lübeck, Germany

Random forests are a popular supervised learning method. Their main purpose is the robust prediction of an outcome based on a learned set of rules. To evaluate the precision of predictions, their variability and distribution are important. To quantify this, 95 % confidence intervals for the predictions can be generated using suitable variance estimators. However, these estimators may under- or overestimate the variance, so that the resulting confidence intervals are too narrow or too wide; this can be evaluated by estimating coverage probabilities through simulations. The aim of our study was to examine coverage probabilities for two popular variance estimators for predictions made by random forests, the infinitesimal jackknife according to Wager et al. (2014) and the fixed-point based variance estimator according to Mentch and Hooker (2016). We performed a simulation study considering different scenarios with varying sample sizes and various signal-to-noise ratios. Our results show that the coverage probabilities based on the infinitesimal jackknife are lower than the desired 95 % for small data sets and small random forests. In contrast, the variance estimator according to Mentch and Hooker (2016) leads to coverage probabilities above the desired 95 %. However, a growing number of trees yields decreasing coverage probabilities for both methods. A similar behavior was observed when using real data sets, where the composition of the data and the number of trees influence the coverage probabilities. In conclusion, we observed that the relative performance of one variance estimation method over the other depends on the hyperparameters used for training the random forest. Likewise, the coverage probabilities can be used to evaluate how well the hyperparameters were chosen and whether the data set requires more pre-processing.
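
The coverage probability referred to above can be estimated as the share of simulation runs in which the interval contains the true value. The following is a minimal sketch, not the authors' code: it assumes normal-approximation intervals of the form prediction ± 1.96 · sqrt(variance estimate), and the function name and the toy numbers are hypothetical.

```python
import math

def coverage_probability(true_value, predictions, variance_estimates, z=1.96):
    """Share of runs whose interval pred +/- z*sqrt(var) contains true_value."""
    hits = 0
    for pred, var in zip(predictions, variance_estimates):
        half_width = z * math.sqrt(var)
        if pred - half_width <= true_value <= pred + half_width:
            hits += 1
    return hits / len(predictions)

# Toy input (made up): four simulation runs predicting a true value of 1.0.
preds = [0.9, 1.2, 1.5, 0.4]
coverage_wide = coverage_probability(1.0, preds, [0.04] * 4)    # larger variance estimate
coverage_narrow = coverage_probability(1.0, preds, [0.01] * 4)  # smaller variance estimate
print(coverage_wide, coverage_narrow)
```

With the same predictions, a smaller variance estimate gives narrower intervals and therefore a coverage that is no higher, which is the mechanism by which an underestimated variance leads to coverage below the nominal level.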

Mentch L, Hooker G (2016): Quantifying Uncertainty in Random Forests via Confidence Intervals and Hypothesis Tests. J. Mach. Learn. Res., 17:1-41.

Wager S, Hastie T, Efron B (2014): Confidence Intervals for Random Forests: The Jackknife and the Infinitesimal Jackknife. J. Mach. Learn. Res., 15:1625-1651.


Over-optimism in benchmark studies and the multiplicity of analysis strategies when interpreting their results
Christina Nießl1, Moritz Herrmann2, Chiara Wiedemann1, Giuseppe Casalicchio2, Anne-Laure Boulesteix1
1Institute for Medical Informatics, Biometry and Epidemiology, University of Munich (Germany); 2Department of Statistics, University of Munich (Germany)

In recent years, the need for neutral benchmark studies that focus on the comparison of statistical methods has been increasingly recognized. At the interface between biostatistics and bioinformatics, benchmark studies are especially important in research fields involving omics data, where hundreds of articles present their newly introduced method as superior to other methods.

While general advice on the design of neutral benchmark studies can be found in recent literature, there is always a certain amount of flexibility that researchers have to deal with. This includes the choice of datasets and performance metrics, the handling of missing values in case of algorithm failure (e.g., due to non-convergence) and the way the performance values are aggregated over the considered datasets. Consequently, different choices in the benchmark design may lead to different results of the benchmark study.

In the best-case scenario, researchers make well-considered design choices prior to conducting a benchmark study and are aware of this issue. However, they may still be concerned about how their choices affect the results. In the worst-case scenario, researchers could (most often subconsciously) use this flexibility and modify the benchmark design until it yields a result they deem satisfactory (for example, the superiority of a certain method). In this way, a benchmark study that is intended to be neutral may become biased.

In this paper, we address this issue in the context of benchmark studies based on real datasets using an example benchmark study, which compares the performance of survival prediction methods on high-dimensional multi-omics datasets. Our aim is twofold. As a first exploratory step, we examine how variable the results of a benchmark study are by trying all possible combinations of choices and comparing the resulting method rankings. In the second step, we propose a general framework based on multidimensional unfolding that allows researchers to assess the impact of each choice and identify critical choices that substantially influence the resulting method ranking. In our example benchmark study, the most critical choices were the considered datasets and the performance metric. However, in some settings, we observed that the handling of missing values and the aggregation of performances over the datasets can also have a great influence on the results.
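
The first exploratory step, trying all combinations of design choices and comparing the resulting rankings, can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: the performance values, the method names, the two missing-value strategies ("drop" and "worst") and the two aggregation functions are hypothetical, and the choice of datasets is not varied here.

```python
from itertools import product
from statistics import mean, median

# Hypothetical performance values per dataset (higher is better);
# None marks an algorithm failure on that dataset.
performance = {
    "metric_a": {"method_1": [0.70, 0.65, None], "method_2": [0.68, 0.66, 0.60]},
    "metric_b": {"method_1": [0.80, 0.50, None], "method_2": [0.60, 0.62, 0.61]},
}
aggregators = {"mean": mean, "median": median}

def handle_missing(values, strategy):
    """Either drop failures or impute them with the method's worst observed value."""
    observed = [v for v in values if v is not None]
    if strategy == "drop":
        return observed
    worst = min(observed)
    return [worst if v is None else v for v in values]

def rank_methods(metric, missing, aggregation):
    """Return the method names ordered from best to worst for one design."""
    scores = {
        method: aggregators[aggregation](handle_missing(values, missing))
        for method, values in performance[metric].items()
    }
    return tuple(sorted(scores, key=scores.get, reverse=True))

# One ranking per combination of metric, missing-value handling and aggregation.
rankings = {
    combo: rank_methods(*combo)
    for combo in product(performance, ["drop", "worst"], aggregators)
}
distinct_rankings = set(rankings.values())
for combo, ranking in rankings.items():
    print(combo, ranking)
```

In this toy example the ranking of the two methods flips depending on how failures are handled, which illustrates why a single reported ranking can hide the influence of the benchmark design.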


Variable Importance Measures for Functional Gradient Descent Boosting Algorithm
Zeyu Ding
TU Dortmund, Germany

With the continuous growth of data dimensions and the improvement of computing power in contemporary statistics, the number of variables that can be included in a model is increasing significantly. Therefore, when designing a model, selecting the truly informative variables from a large pool of candidate variables and ranking them according to their importance becomes a core problem of statistics. Appropriate variable selection can reduce overfitting and lead to a more interpretable model. Traditional methods use the decomposition of R², step-wise variable selection based on the Akaike Information Criterion (AIC), or regularization based on lasso and ridge. In ensemble algorithms, the variable importance is often calculated separately by permutation methods. In this contribution, we propose two new stable and discriminating variable importance measures for the functional gradient descent boosting algorithm (FGDB). The first one calculates the ℓ2 norm contribution of a variable, while the second one calculates the risk reduction of the variables in every iteration. Our proposal is demonstrated in both simulation and real data examples. We show that the two new measures are more effective in automatically selecting the truly important variables than the traditional selection frequency measures used in the FGDB algorithm. This holds for both linear and non-linear models under different data scenarios.

keywords: variable importance measures, variable selection, functional gradient boosting, component-wise regression, generalized additive models.
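
To make the idea of a risk-reduction importance concrete, the following sketch runs component-wise boosting and credits each iteration's decrease in empirical risk to the selected variable, alongside the traditional selection frequency. It is an assumption-laden illustration, not the proposed method itself: it assumes squared-error loss and simple linear base learners, the function name and simulated data are hypothetical, and the ℓ2 norm measure is not sketched because the abstract does not define it.

```python
import numpy as np

def componentwise_l2_boost(X, y, n_iter=100, step=0.1):
    """Component-wise L2 boosting with one linear base learner per variable."""
    n, p = X.shape
    X = X - X.mean(axis=0)            # centre columns so base learners need no intercept
    residual = y - y.mean()           # start from the constant (offset) model
    coef = np.zeros(p)
    selection_count = np.zeros(p)
    risk_reduction = np.zeros(p)
    col_ss = (X ** 2).sum(axis=0)
    for _ in range(n_iter):
        # Least-squares slope of each single-variable learner on the current residuals.
        slopes = X.T @ residual / col_ss
        # Residual sum of squares each candidate would achieve if fitted fully.
        rss = (residual ** 2).sum() - slopes ** 2 * col_ss
        j = int(np.argmin(rss))       # best-fitting base learner in this iteration
        risk_before = np.mean(residual ** 2)
        coef[j] += step * slopes[j]
        residual = residual - step * slopes[j] * X[:, j]
        # Credit the drop in empirical risk to the selected variable.
        risk_reduction[j] += risk_before - np.mean(residual ** 2)
        selection_count[j] += 1
    return coef, selection_count / n_iter, risk_reduction

# Simulated data (made up): only the first two of five variables carry signal.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 3.0 * X[:, 0] + 1.0 * X[:, 1] + rng.normal(size=200)
coef, selection_freq, risk_reduction = componentwise_l2_boost(X, y)
print("selection frequency:", selection_freq)
print("risk reduction:     ", risk_reduction)
```

The per-variable risk reductions sum to the total decrease in empirical risk over all iterations, so they can be read as a decomposition of the fit, whereas the selection frequency only counts how often a variable was chosen.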


Quality control in genome-wide association studies revisited: a critical evaluation of the standard methods
Hanna Brudermann1, Tanja K. Rausch1,2, Inke R. König1
1Institut für Medizinische Biometrie und Statistik, Universität zu Lübeck, Universitätsklinikum Schleswig-Holstein, Campus Lübeck, Germany; 2Department of Pediatrics, Universität zu Lübeck, Universitätsklinikum Schleswig-Holstein, Campus Lübeck, Lübeck, Germany

Genome-wide association studies (GWAs) investigating the relationship between millions of genetic markers and a clinically relevant phenotype were originally based on the common disease – common variant assumption and thus aimed to identify a small number of common genetic loci as causes of common diseases. Given the enormous cost reduction in the acquisition of genomic data, it is not surprising that, since the first known GWA by Klein et al. (2005), this study type has become established as a standard method. However, since even low error frequencies can distort association results, extensive and accurate quality control of the given data is mandatory. In recent years, the focus of GWAs has shifted, and the task is no longer primarily the discovery of common genetic loci. With increasing sample sizes and (mega-)meta-analyses of GWAs, it is now hoped that loci with small effects can be identified. Furthermore, it has become popular to aggregate all genomic information, even loci with very small effects and frequencies, into genetic risk prediction scores, which increases the requirement for high-quality genetic data.

However, after extensive discussions about standards for quality control in GWAs in the early years, further work on how to control data quality and adapt data cleaning to new GWA aims has become scarce.

The aim of this study was to perform an extensive literature review to evaluate currently applied quality control criteria and their justification. Building on the findings from the literature search, a workflow was developed to include justified quality control steps, keeping in mind that a strict quality control, which removes all data with a high risk of bias, always carries the risk that the remaining data is too homogeneous to make small effects visible. This workflow is subsequently illustrated using a real data set.

Our results show that in most published GWAs, no scientific reasons for the applied quality control steps are given. Cutoffs for the most common quality measures are mostly not explained. In particular, principal component analysis and the test for deviation from Hardy-Weinberg equilibrium are frequently used as quality criteria without examining the conditions of the specific study and adjusting the quality control accordingly.
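
For readers unfamiliar with the Hardy-Weinberg criterion mentioned above, the following sketch shows one common form of the test, a Pearson chi-square test with one degree of freedom computed from the genotype counts of a single biallelic marker. The function name and the example counts are hypothetical, and no cutoff is applied because the choice of cutoff is exactly the study-specific decision discussed here.

```python
import math

def hwe_chi_square(n_aa, n_ab, n_bb):
    """Pearson chi-square test (1 df) for deviation from Hardy-Weinberg equilibrium.

    Assumes a biallelic marker with both alleles observed; a monomorphic
    marker would give expected counts of zero and is not handled here.
    """
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)          # frequency of allele A
    q = 1 - p                                # frequency of allele B
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Survival function of the chi-square distribution with 1 df.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

# Hypothetical genotype counts: one marker in equilibrium, one without heterozygotes.
chi2_eq, p_eq = hwe_chi_square(25, 50, 25)
chi2_dev, p_dev = hwe_chi_square(50, 0, 50)
print(chi2_eq, p_eq)
print(chi2_dev, p_dev)
```

Whether a given p-value should lead to the removal of a marker depends on sample size, on whether cases or controls are tested, and on the aim of the study, which is why a fixed universal cutoff needs justification.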

We point out that researchers still have to decide between universal and study-specific parameters, and therefore between optimal comparability with other analyses and optimal conditions within the specific study.


Which Test for Crossing Survival Curves? A User’s Guide
Ina Dormuth1, Tiantian Liu2,4, Jin Xu2, Menggang Yu3, Markus Pauly1, Marc Ditzhaus1
1TU Dortmund, Germany; 2East China Normal University, China; 3University of Wisconsin-Madison, USA; 4Technion ‐ Israel Institute of Technology, Israel

Knowledge transfer between statisticians who develop new data analysis methods and the users of these methods is essential. This is especially true for clinical studies with time-to-event endpoints. One of the most common problems is the comparison of survival in two-armed trials. The log-rank test is still the gold standard for answering this question. However, in the case of non-proportional hazards, it may lose power. In the meantime, several extensions have been developed to solve this problem. Since non-proportional hazards or even crossing survival curves are common in oncology, e.g. in immunotherapy studies, it is important to identify the most appropriate methods and to draw attention to their existence. Our goal is therefore to simplify the choice of a test for detecting differences in survival when the curves cross. To this end, we reviewed 1,400 recent oncological studies. Restricting our analysis to studies with crossing survival curves and non-significant log-rank tests with a sufficient number of observed events, we reconstructed the data sets using a state-of-the-art reconstruction algorithm. To ensure the quality of the reconstruction, only publications that reported numbers at risk at multiple time points and had sufficient print quality and a non-informative censoring pattern were included. After excluding papers based on these criteria, we used the p-values of the log-rank and the Peto-Peto test as references and compared them with those of nine different tests for non-proportional or even crossing hazards. We show that tests designed to detect crossing hazards are advantageous, and we provide guidance on choosing a reasonable alternative to the standard log-rank test. This is followed by a comprehensive simulation study and the generalization of one of the test methods to the multi-sample case.
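
As a point of reference for the comparison described above, the following is a minimal sketch of the standard two-sample log-rank test, which serves as the baseline and is not one of the nine alternative tests. The function name and the toy data are hypothetical, and the chi-square approximation with one degree of freedom is assumed.

```python
import math

def logrank_test(times_a, events_a, times_b, events_b):
    """Two-sample log-rank test; events are 1 (event observed) or 0 (censored).

    Assumes at least one event time at which both groups are still at risk,
    so that the variance is positive.
    """
    data = [(t, e, 0) for t, e in zip(times_a, events_a)]
    data += [(t, e, 1) for t, e in zip(times_b, events_b)]
    event_times = sorted({t for t, e, _ in data if e == 1})
    obs_minus_exp = 0.0
    variance = 0.0
    for t in event_times:
        at_risk = sum(1 for s, _, _ in data if s >= t)
        at_risk_a = sum(1 for s, _, g in data if s >= t and g == 0)
        deaths = sum(1 for s, e, _ in data if s == t and e == 1)
        deaths_a = sum(1 for s, e, g in data if s == t and e == 1 and g == 0)
        share_a = at_risk_a / at_risk
        # Observed minus expected events in group A under the null hypothesis.
        obs_minus_exp += deaths_a - deaths * share_a
        if at_risk > 1:
            # Hypergeometric variance of the number of events in group A.
            variance += deaths * share_a * (1 - share_a) * (at_risk - deaths) / (at_risk - 1)
    chi2 = obs_minus_exp ** 2 / variance
    p_value = math.erfc(math.sqrt(chi2 / 2))  # chi-square survival function, 1 df
    return chi2, p_value

# Toy data (made up): all events observed, group A fails earlier than group B.
chi2_diff, p_diff = logrank_test([1, 2, 3], [1, 1, 1], [4, 5, 6], [1, 1, 1])
# Identical groups: observed and expected events agree at every time point.
chi2_same, p_same = logrank_test([1, 2, 3], [1, 1, 1], [1, 2, 3], [1, 1, 1])
print(chi2_diff, p_diff)
print(chi2_same, p_same)
```

Because the statistic sums observed minus expected events over time with equal weights, early differences in one direction and late differences in the other can cancel when the curves cross, which is the situation the alternative tests are designed for.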