Young talent awards IBS-DR

Chairs: Werner Brannath and Annette Kopp-Schneider


Internal validation for descriptive clustering of gene expression data
Anastasiia Holovchak (Bernd-Streitberg Laureate)
LMU Munich, Germany

Clustering algorithms are often used to analyse gene expression data, partitioning the genes into homogeneous groups based on their expression levels across patients. In practice, one is confronted with a large variety of clustering algorithms, and it is often unclear which should be selected. A common procedure consists of testing different algorithms with several input parameters and evaluating them with appropriate internal cluster validation indices. However, it is again unclear which of these indices should be selected.

In this work, I conduct a study that investigates the stability of four internal cluster validation indices (Calinski-Harabasz index, Davies-Bouldin index, Dunn index, and Average Silhouette Width criterion), in particular their ability to identify clusterings that replicate on independent test data. For the purpose of this study, an example gene expression data set is repeatedly split into a training and a test data set. Several commonly used clustering algorithms, such as K-means, agglomerative clustering algorithms (Single Linkage, Complete Linkage, and Average Linkage), and spectral clustering, are applied to the training data. The resulting clusterings are assessed using the four internal validation indices under consideration. The clustering methods are then applied to the test data, and the similarity between the index values for the clusterings on the training and on the test data is assessed. I analyse whether the clustering algorithms and input parameters that are indicated as the best choices by the internal validation indices on the training data are also the best choices on the test data. Moreover, the internal validation indices are used to choose the best clustering on the training data, and the stability of this selection process is investigated by applying the selected algorithm/parameter setting to the test data (as measured through the adjusted Rand index).
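A minimal sketch of this train/test procedure, written in Python with scikit-learn on a simulated gene-by-patient matrix in place of the real data, might look as follows; the Dunn index has no scikit-learn implementation and is omitted here, and the algorithms, parameter grids and number of repetitions used in the thesis differ.

    # Sketch only: simulated data, a reduced set of algorithms and indices.
    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.cluster import KMeans, AgglomerativeClustering, SpectralClustering
    from sklearn.metrics import (adjusted_rand_score, calinski_harabasz_score,
                                 davies_bouldin_score, silhouette_score)

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 30))      # 200 genes x 30 patients (simulated)

    # Split the patients (columns) into training and test data, so that every
    # gene receives two disjoint expression profiles and can be clustered twice.
    train_cols, test_cols = train_test_split(np.arange(X.shape[1]),
                                             test_size=0.5, random_state=1)
    X_train, X_test = X[:, train_cols], X[:, test_cols]

    algorithms = {
        "kmeans_k3": KMeans(n_clusters=3, n_init=10, random_state=2),
        "average_linkage_k3": AgglomerativeClustering(n_clusters=3, linkage="average"),
        "spectral_k3": SpectralClustering(n_clusters=3, random_state=2),
    }

    for name, algo in algorithms.items():
        labels_train = algo.fit_predict(X_train)
        labels_test = algo.fit_predict(X_test)
        # Internal validation indices on training and test data ...
        indices = {
            "CH":  (calinski_harabasz_score(X_train, labels_train),
                    calinski_harabasz_score(X_test, labels_test)),
            "DB":  (davies_bouldin_score(X_train, labels_train),
                    davies_bouldin_score(X_test, labels_test)),
            "ASW": (silhouette_score(X_train, labels_train),
                    silhouette_score(X_test, labels_test)),
        }
        # ... and agreement of the two gene partitions (adjusted Rand index).
        ari = adjusted_rand_score(labels_train, labels_test)
        print(name, indices, ari)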

The results may guide the selection of appropriate indices in the considered context of gene expression data. For example, in this study the Dunn index yields very unstable results in terms of the selection of the best input parameter, which can be seen as a drawback. In conclusion, the investigated internal cluster validation indices show very different behaviours, and one should not put much confidence in a single validation index unless there is evidence – from the literature or from one's own investigations such as the one presented in this thesis – that it yields meaningful, replicable results in the considered context.


Model selection characteristics when using MCP-Mod for dose-response gene expression data
Julia Christin Duda (Bernd-Streitberg Laureate)
TU Dortmund University, Germany

Classical approaches in clinical dose-finding trials rely on pairwise comparisons between doses and placebo. A methodological improvement is the MCP-Mod (Multiple Comparison Procedure and Modeling) approach, originally developed for Phase II trials. MCP-Mod combines multiple comparisons with modeling approaches in a multistage procedure. First, for a set of pre-specified candidate models, it is tested whether any dose-response signal is present. Second, considering the models with a detected signal, either the best model is selected to fit the dose-response curve or model averaging is performed.
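For orientation, the MCP step is commonly formalised via model-specific contrast tests (a sketch of the published methodology, not a quotation from this work): for candidate model m with optimal contrast c_m derived from its standardized shape, the test statistic is

    \[ T_m = \frac{c_m^\top \bar{Y}}{S \sqrt{\sum_{i=1}^{k} c_{mi}^2 / n_i}}, \]

where \bar{Y} collects the k dose-group means, n_i are the group sample sizes and S is the pooled standard deviation; a dose-response signal is declared if \max_m T_m exceeds a multiplicity-adjusted critical value obtained from the joint multivariate t distribution of (T_1, ..., T_M).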

We extend the scope of application of MCP-Mod to in-vitro gene expression data and assess its model selection characteristics for concentration-gene expression curves. Specifically, we apply MCP-Mod to single genes of a high-dimensional gene expression data set in which human embryonic stem cells were exposed to eight concentration levels of the compound valproic acid (VPA). As candidate models we consider the sigmoid Emax (four-parameter log-logistic), linear, quadratic, Emax, exponential, and beta models. Through simulations, we investigate the impact of omitting one or more models from the candidate set, in order to uncover possibly superfluous models, as well as the precision and recall rates of the selected models. Measured by the AIC, each of the models performs best for a considerable number of genes. For less noisy cases the popular sigmoid Emax model is frequently selected. For more noisy data, simpler models such as the linear model are often selected, but mostly without a relevant performance advantage over the second-best model. Also, the commonly used Emax model shows an unexpectedly low performance.
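The following hypothetical Python sketch illustrates AIC-based selection among dose-response candidate models for a single gene, using scipy rather than the MCP-Mod/DoseFinding software actually used in this work; the model formulas follow common parameterisations, the beta model is omitted for brevity, and the concentrations and expression values are simulated.

    import numpy as np
    from scipy.optimize import curve_fit

    def linear(d, e0, delta):            return e0 + delta * d
    def quadratic(d, e0, b1, b2):        return e0 + b1 * d + b2 * d**2
    def emax(d, e0, emax_, ed50):        return e0 + emax_ * d / (ed50 + d)
    def sig_emax(d, e0, emax_, ed50, h): return e0 + emax_ * d**h / (ed50**h + d**h)
    def exponential(d, e0, e1, delta):   return e0 + e1 * (np.exp(d / delta) - 1.0)

    candidates = {
        "linear":      (linear,      [0.0, 1.0]),
        "quadratic":   (quadratic,   [0.0, 1.0, -0.1]),
        "emax":        (emax,        [0.0, 1.0, 0.5]),
        "sigEmax":     (sig_emax,    [0.0, 1.0, 0.5, 1.0]),
        "exponential": (exponential, [0.0, 0.1, 1.0]),
    }

    def gaussian_aic(y, fitted, n_par):
        # AIC under Gaussian errors, up to an additive constant shared by all models.
        rss = np.sum((y - fitted) ** 2)
        n = len(y)
        return n * np.log(rss / n) + 2 * (n_par + 1)

    # Simulated single-gene example: 8 concentration levels, 3 replicates each.
    rng = np.random.default_rng(0)
    dose = np.repeat([0.0, 0.025, 0.15, 0.35, 0.45, 0.55, 0.8, 1.0], 3)
    y = 1.0 * dose / (0.3 + dose) + rng.normal(scale=0.1, size=dose.size)

    aics = {}
    for name, (f, start) in candidates.items():
        try:
            popt, _ = curve_fit(f, dose, y, p0=start, maxfev=10000)
            aics[name] = gaussian_aic(y, f(dose, *popt), len(popt))
        except RuntimeError:          # non-convergence for this candidate
            aics[name] = np.inf

    print(min(aics, key=aics.get), aics)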


Temporal Dynamics in Generative Models
Maren Hackenberg (Bernd-Streitberg Laureate), Harald Binder
Institute of Medical Biometry and Statistics, Faculty of Medicine and Medical Center, University of Freiburg, Germany

Uncovering underlying development patterns in longitudinal biomedical data is a first step towards understanding disease processes, but it is complicated by the sparse time grid and individual-specific development patterns that often characterise such data. In epidemiological cohort studies and clinical registries, we face the question of what can be learned from the data in an early phase of the study, when only a baseline characterisation and one follow-up measurement are available. Specifically, we considered a data scenario in which an extensive characterisation is available at a baseline time point for each individual, but only a smaller subset of variables is measured again at an individually differing second time point, resulting in a very sparse (only two time points) and irregular time grid.

Inspired by recent advances that allow deep learning to be combined with dynamic modeling, we employed a generative deep learning model to capture individual dynamics in a low-dimensional latent representation as solutions of ordinary differential equations (ODEs). Here, the variables measured only at baseline are used to infer individual-specific ODE parameters.
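As a conceptual sketch (not the authors' implementation), the following Python/PyTorch code encodes baseline covariates into an initial latent state and an individual-specific decay parameter, solves the linear ODE dz/dt = -theta * z in closed form, and decodes the latent state at each individual's follow-up time; all dimensions and the simulated data are assumptions for illustration.

    import torch
    import torch.nn as nn

    class BaselineODEModel(nn.Module):
        def __init__(self, n_baseline, n_followup, latent_dim=2):
            super().__init__()
            # Encoder: baseline variables -> (initial latent state z0, log theta)
            self.encoder = nn.Sequential(
                nn.Linear(n_baseline, 16), nn.ReLU(),
                nn.Linear(16, latent_dim + 1),
            )
            # Decoder: latent state -> follow-up variables
            self.decoder = nn.Sequential(
                nn.Linear(latent_dim, 16), nn.ReLU(),
                nn.Linear(16, n_followup),
            )

        def forward(self, x_baseline, t_followup):
            h = self.encoder(x_baseline)
            z0, log_theta = h[:, :-1], h[:, -1:]
            theta = torch.exp(log_theta)                  # positive decay rate
            z_t = z0 * torch.exp(-theta * t_followup)     # closed-form ODE solution
            return self.decoder(z_t)

    # Toy usage on simulated data: 50 individuals, 20 baseline variables,
    # 5 follow-up variables, each with an individually differing follow-up time.
    torch.manual_seed(0)
    x = torch.randn(50, 20)
    t = torch.rand(50, 1) * 3.0
    y = torch.randn(50, 5)

    model = BaselineODEModel(n_baseline=20, n_followup=5)
    opt = torch.optim.Adam(model.parameters(), lr=1e-2)
    for _ in range(200):
        opt.zero_grad()
        loss = nn.functional.mse_loss(model(x, t), y)
        loss.backward()
        opt.step()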

Additionally, we enriched the information of each individual by linking groups of individuals with similar underlying trajectories, which then serve as proxy information on the common temporal dynamics. Irregular spacing in time can thus be used to gain more information on individual dynamics by leveraging individuals’ similarity. Using simulated data, we showed that the model can recover individual trajectories from linear and non-linear ODE systems with two and four unknown parameters and infer groups of individuals with similar trajectories. The results illustrate that dynamic deep learning approaches can be adapted to such small data settings to provide an individual-level understanding of the dynamics governing individuals’ developments.


Discrete Subdistribution Hazard Models
Moritz Berger (Gustav-Adolf-Lienert Laureate)
Department of Medical Biometry, Informatics and Epidemiology, Rheinische Friedrich-Wilhelms-Universität Bonn, Germany

In many clinical and epidemiological studies the interest is in the analysis of the time T until the occurrence of an event of interest j that may occur along with one or more competing events. This requires suitable techniques for competing risks regression. The key quantity for describing competing risks data is the cumulative incidence function, which is defined as the probability of experiencing event j at or before time t.
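In standard competing risks notation (added here for concreteness), with event type \varepsilon and covariate vector x, this quantity reads

    \[ F_j(t \mid x) = P(T \le t,\ \varepsilon = j \mid x). \]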

A popular modeling approach for the cumulative incidence function is the proportional subdistribution hazard model by Fine and Gray (1999), which directly models the cumulative incidence function of one specific event of interest. A limitation of the subdistribution hazard model is that it assumes continuously measured event times. In practice, however, the exact (continuous) event times are often not recorded. Instead, it may only be known that the events occurred between pairs of consecutive points in time (i.e., within pre-specified follow-up intervals). In these cases, time is measured on a discrete scale.
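Following the discrete-time competing risks literature (the exact notation may differ from the thesis), the discrete subdistribution hazard of the event of interest, say j = 1, and a binary-regression-type model for it can be sketched as

    \[ \lambda_1(t \mid x) = P\bigl(T = t,\ \varepsilon = 1 \,\bigm|\, (T \ge t) \cup (T \le t - 1 \wedge \varepsilon \ne 1),\ x\bigr), \qquad t = 1, 2, \ldots, \]
    \[ \lambda_1(t \mid x) = h\bigl(\gamma_{0t} + x^\top \gamma\bigr), \]

where h(\cdot) is a response function such as the logistic distribution function and the \gamma_{0t} are interval-specific intercepts.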

To address this issue, a technique for modeling subdistribution hazards with right-censored data in discrete time is proposed. The method is based on a weighted maximum likelihood estimation scheme for binary regression and yields consistent and asymptotically normal estimators of the model parameters. In addition, a set of tools for assessing the calibration of discrete subdistribution hazard models is developed, consisting of a calibration plot for graphical assessment as well as a recalibration model with tests on calibration-in-the-large and refinement.
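To make the binary-regression view concrete, the following hypothetical Python sketch expands toy discrete-time data to a person-period (augmented) format and fits a weighted logistic model via statsmodels; the censoring-based weights and the extended risk set for competing events that the actual estimation scheme requires are only indicated by a placeholder weight column, and a linear time effect replaces the interval-specific intercepts.

    import numpy as np
    import pandas as pd
    import statsmodels.api as sm

    # Toy data: discrete event time, event type (0 = censored, 1 = event of
    # interest, 2 = competing event), one covariate.
    df = pd.DataFrame({
        "time": [2, 3, 1, 4, 2, 3],
        "type": [1, 0, 2, 1, 2, 1],
        "x":    [0.5, -1.2, 0.3, 1.8, -0.7, 0.9],
    })

    # Person-period (augmented data) format: one row per individual and interval.
    rows = []
    for _, r in df.iterrows():
        for t in range(1, int(r["time"]) + 1):
            rows.append({
                "interval": t,
                "x": r["x"],
                # 1 only in the interval where the event of interest occurs.
                "y": int(t == r["time"] and r["type"] == 1),
                # Placeholder: the actual subdistribution scheme keeps individuals
                # with competing events in the risk set beyond their event time
                # and uses weights derived from the censoring distribution.
                "w": 1.0,
            })
    long = pd.DataFrame(rows)

    # Weighted binomial GLM; interval enters linearly here for simplicity.
    X = sm.add_constant(long[["interval", "x"]].astype(float))
    fit = sm.GLM(long["y"], X, family=sm.families.Binomial(),
                 var_weights=long["w"]).fit()
    print(fit.params)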

The modeling approach is illustrated by an analysis of nosocomial pneumonia in intensive care patients measured on a daily basis.


Netboost: Network Analysis Improves High-Dimensional Omics Analysis Through Local Dimensionality Reduction
Pascal Schlosser1,2 (Gustav-Adolf-Lienert Laureate), Jochen Knaus2, Maximilian Schmutz3, Konstanze Döhner4, Christoph Plass5, Lars Bullinger6, Rainer Claus3, Harald Binder2, Michael Lübbert7,8, Martin Schumacher2
1Institute of Genetic Epidemiology, Faculty of Medicine and Medical Center, University of Freiburg, Germany; 2Institute of Medical Biometry and Statistics, Faculty of Medicine and Medical Center, University of Freiburg, Germany; 3Department of Hematology and Oncology, Augsburg University Medical Center, Augsburg, Germany; 4Department of Internal Medicine III, University Hospital of Ulm, Germany; 5Division of Cancer Epigenomics, German Cancer Research Center, Heidelberg, Germany; 6Hematology, Oncology and Tumor Immunology, Campus Virchow Hospital, Charité University Medicine, Berlin, Germany; 7Department of Hematology-Oncology, Medical Center, Faculty of Medicine, University of Freiburg, Germany; 8German Consortium for Translational Cancer Research (DKTK), Freiburg, Germany

State-of-the-art selection methods fail to identify weak but cumulative effects of features found in many high-dimensional omics datasets. Nevertheless, these features play an important role in certain diseases. We present Netboost, a three-step dimension reduction technique. First, a boosting- or Spearman-correlation-based filter is combined with the topological overlap measure to identify the essential edges of the network. Second, sparse hierarchical clustering is applied to the selected edges to identify modules, and finally module information is aggregated by the first principal components. We demonstrate the application of the newly developed Netboost in combination with CoxBoost for survival prediction from DNA methylation and gene expression data of 180 acute myeloid leukemia (AML) patients and show, based on cross-validated prediction error curve estimates, its superior prediction performance over variable selection on the full dataset as well as over an alternative clustering approach. The identified signature, related to chromatin-modifying enzymes, was replicated in an independent dataset, the phase II AMLSG 12-09 study. In a second application, we combine Netboost with Random Forest classification and improve the disease classification error in RNA-sequencing data of Huntington's disease mice. Netboost is a freely available Bioconductor R package for dimension reduction and hypothesis generation in high-dimensional omics applications.
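As a purely conceptual Python sketch of the three steps (not the Bioconductor implementation), one could filter edges with a plain Spearman-correlation threshold, cluster the resulting network with standard average-linkage clustering, and summarise each module by its first principal component; the actual package instead uses boosting-based filtering, the topological overlap measure and sparse hierarchical clustering.

    import numpy as np
    from scipy.stats import spearmanr
    from scipy.cluster.hierarchy import linkage, fcluster
    from scipy.spatial.distance import squareform
    from sklearn.decomposition import PCA

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 40))            # 100 samples x 40 features (simulated)

    # Step 1: Spearman-correlation filter keeping only the strongest edges.
    corr, _ = spearmanr(X)                    # 40 x 40 feature correlation matrix
    adj = np.abs(corr)
    np.fill_diagonal(adj, 0.0)
    adj[adj < np.quantile(adj, 0.90)] = 0.0   # keep the top 10% of edges

    # Step 2: hierarchical clustering on the resulting dissimilarities.
    dist = 1.0 - adj
    np.fill_diagonal(dist, 0.0)
    modules = fcluster(linkage(squareform(dist, checks=False), method="average"),
                       t=4, criterion="maxclust")

    # Step 3: aggregate each module by its first principal component,
    # yielding a low-dimensional samples x modules representation.
    components = []
    for m in np.unique(modules):
        cols = X[:, modules == m]
        components.append(PCA(n_components=1).fit_transform(cols).ravel()
                          if cols.shape[1] > 1 else cols.ravel())
    Z = np.column_stack(components)
    print(Z.shape)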