Track: Track 3

Non Clinical Statistics I

Chairs: Hannes-Friedrich Ulbrich and Frank Konietschke

Can statistics save preclinical research?
Ulrich Dirnagl
Charité / Berlin Institute of Health, Germany


Meggie Danziger1,2, Ulrich Dirnagl1,2, Ulf Toelch2
1Charité – Universitätsmedizin Berlin, Germany; 2BIH QUEST Center for Transforming Biomedical Research

Low statistical power in preclinical experiments has been repeatedly pointed out as a roadblock to successful replication and translation. To increase reproducibility of preclinical experiments under ethical and budget constraints, it is necessary to devise strategies that improve the efficiency of confirmatory studies.

To this end, we simulate two preclinical research trajectories from the exploratory stage to the results of a within-lab replication study based on empirical pre-study odds. In a first step, a decision is made based on exploratory data whether to continue to a replication. One trajectory (T1) employs the conventional significance threshold for this decision. The second trajectory (T2) uses a more lenient threshold based on an a priori determined smallest effect size of interest (SESOI). The sample size of a potential replication study is calculated via a standard power analysis using the initial exploratory effect size (T1) or using a SESOI (T2). The two trajectories are compared regarding the number of experiments proceeding to replication, number of animals tested, and positive predictive value (PPV).
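
The power-analysis step that separates the two trajectories can be sketched with the standard normal-approximation sample size formula; the effect sizes below (an exploratory estimate of d = 1.2 for T1, a SESOI of d = 0.5 for T2) are hypothetical placeholders, not the values used in the study.

```python
import numpy as np
from scipy.stats import norm

def n_per_group(d, alpha=0.05, power=0.8):
    """Normal-approximation sample size per group for a two-sample
    comparison with standardized effect size d."""
    z_a = norm.ppf(1 - alpha / 2)
    z_b = norm.ppf(power)
    return int(np.ceil(2 * ((z_a + z_b) / d) ** 2))

# T1: power the replication on the (often inflated) exploratory effect size
n_t1 = n_per_group(1.2)
# T2: power the replication on an a priori smallest effect size of interest
n_t2 = n_per_group(0.5)
print(n_t1, n_t2)
```

Overestimated exploratory effects yield small replication samples, which illustrates why powering on exploratory estimates can leave replications underpowered.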

Our simulations show that under the conventional significance threshold, only 32 percent of the initial exploratory experiments progress to the replication stage. Using the decision criterion based on a SESOI, 65 percent of initial studies proceed to replication. T1 results in the lowest number of animals needed for replication (n = 7 per group) but yields a PPV that is below pre-study odds. T2 increases PPV above pre-study odds while keeping sample size at a reasonably low number (n = 23 per group).

Our results reveal that current practice, represented by T1, impedes efforts to replicate preclinical experiments. Optimizing decision criteria and experimental design by employing easily applicable variations as shown in T2 keeps tested animal numbers low while generating more robust preclinical evidence that may ultimately benefit translation.

Information sharing across genes for improved parameter estimation in concentration-response curves
Franziska Kappenberg, Jörg Rahnenführer
TU Dortmund University, Germany

Technologies for measuring high-dimensional gene expression values for tens of thousands of genes simultaneously are well established. In toxicology, such data can be used to estimate concentration-response curves and thus to understand the biological processes initiated at different concentrations. Increasing the number of concentrations or the number of replicates per concentration can improve the accuracy of the fit, but incurs substantial additional costs. A statistical approach to obtaining higher-quality fits is to exploit similarities between high-dimensional concentration-gene expression data, an idea also called information sharing across genes. Parameters of the concentration-response curves can be linked, according to a priori assumptions or estimates of the distributions of the parameters, in a Bayesian framework.

Here, we consider the special case of the sigmoidal 4pLL model for estimating the curves associated with single genes, and we are interested in the EC50 value of the curve, i.e. the concentration at which the half-maximal effect is reached. This value is a parameter of the 4pLL model and can be considered a reasonable indicator of a relevant expression effect of the corresponding gene. We introduce an empirical Bayes method for information sharing across genes in this situation, by modelling the distribution of the EC50 values across all genes. Based on this distribution, for each gene a weighted mean of the individually estimated parameter and the overall mean of the estimated parameters of all genes is calculated. In other words, parameters are shrunk towards an overall mean. We evaluate our approach in several simulation studies that differ with respect to the assumptions made about the distribution of the EC50 values. Finally, the method is also applied to a real gene expression data set to demonstrate the influence of the analysis strategy on the results.
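
The shrinkage idea can be sketched in a few lines, assuming a normal-normal model with known per-gene sampling variances; all values below are simulated placeholders, not estimates from the study.

```python
import numpy as np

rng = np.random.default_rng(0)
# hypothetical per-gene log-EC50 estimates with assumed squared standard errors
theta_hat = rng.normal(loc=0.0, scale=1.0, size=1000)
se2 = np.full(1000, 0.5)

# moment-based estimate of the across-gene (prior) distribution
mu = theta_hat.mean()
tau2 = max(theta_hat.var(ddof=1) - se2.mean(), 1e-12)

# empirical Bayes shrinkage: weighted mean of the gene-wise estimate
# and the overall mean, i.e. parameters are shrunk towards the overall mean
w = tau2 / (tau2 + se2)
theta_shrunk = w * theta_hat + (1 - w) * mu
```

The weight w grows with the estimated between-gene variance, so genes are shrunk less when the data suggest real heterogeneity in EC50 values.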

An intuitive time-dose-response model for cytotoxicity data with varying exposure times
Julia Christin Duda, Jörg Rahnenführer
TU Dortmund University, Germany

Modeling approaches for dose-response or concentration-response analyses are slowly becoming more popular in toxicological applications. For cytotoxicity assays, not only the concentration but also the exposure or incubation time of the compound administered to cells can be varied and may influence the response. A popular concentration-response model is the four-parameter log-logistic (4pLL) model or, more specifically tailored to cytotoxicity data, the two-parameter log-logistic (2pLL) model. Both models, however, describe the response as a function of the concentration only.

We propose a two-step procedure and a new time-concentration-response model for cytotoxicity data in which both concentration and exposure time are varied. The parameter of interest for the estimation is the EC50 value, i.e. the concentration at which half of the maximal effect is reached. The procedure consists of a testing step and a modeling step. In the testing step, a nested ANOVA test is performed to decide if the exposure time has an effect on the shape of the concentration-response curve. If no effect is identified then a classical 2pLL model is fitted. Otherwise, a new time-concentration-response model called td2pLL is fitted. In this model, we incorporate exposure time information into the 2pLL model by making the EC50 parameter dependent on the exposure time.
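
The core idea of making the EC50 depend on exposure time can be illustrated as follows; the power-law form of EC50(time) and all parameter values are illustrative assumptions, not the exact td2pLL parameterisation.

```python
import numpy as np

def td2pll(conc, time, h=2.0, ec50_ref=10.0, t_ref=24.0, gamma=0.5):
    """Sketch of a time-dose 2pLL surface: the 2pLL curve
    100 / (1 + (conc / EC50)^h) with a time-dependent EC50.
    The power-law EC50(time) below and all values are illustrative."""
    ec50 = ec50_ref * (t_ref / time) ** gamma  # lower EC50 at longer exposure
    return 100.0 / (1.0 + (conc / ec50) ** h)

# at conc = EC50(24 h) the response is half-maximal
print(td2pll(10.0, 24.0))
# the same concentration is more toxic after a longer exposure
print(td2pll(10.0, 48.0))
```

If gamma is estimated as (near) zero, the surface collapses to a classical 2pLL curve, mirroring the decision made in the testing step.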

In simulation studies inspired by and based on a real data set, we compare the proposed procedure against various alternatives with respect to the precision of the estimation of the EC50 value. In all simulations, the new procedure provides estimates with higher or comparable precision, which demonstrates its universal applicability in corresponding toxicological experiments. In addition, we show that the use of optimal designs for cytotoxicity experiments further improves the EC50 estimates throughout all considered scenarios while reducing numerical problems. In order to facilitate the application in toxicological practice, the developed methods will be made available to practitioners via the R package td2pLL and a corresponding vignette that demonstrates the application on an example dataset.

Nonparametric Statistics and Multivariate Analysis

Chairs: Frank Konietschke and Markus Pauly

Marginalized Frailty-Based Illness-Death Model: Application to the UK-Biobank Survival Data
Malka Gorfine
Tel Aviv University, Israel

The UK Biobank is a large-scale health resource comprising genetic, environmental, and medical information on approximately 500,000 volunteer participants in the United Kingdom, recruited at ages 40–69 during the years 2006–2010. The project monitors the health and well-being of its participants. This work demonstrates how these data can be used to yield the building blocks for an interpretable risk-prediction model, in a semiparametric fashion, based on known genetic and environmental risk factors of various chronic diseases, such as colorectal cancer. An illness-death model is adopted, which inherently is a semi-competing risks model, since death can censor the disease, but not vice versa. Using a shared-frailty approach to account for the dependence between time to disease diagnosis and time to death, we provide a new illness-death model that assumes Cox models for the marginal hazard functions. The recruitment procedure used in this study introduces delayed entry to the data. An additional challenge arising from the recruitment procedure is that information coming from both prevalent and incident cases must be aggregated. Lastly, we do not observe any deaths prior to the minimal recruitment age, 40. In this work, we provide an estimation procedure for our new illness-death model that overcomes all the above challenges.
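
The semi-competing-risks structure with a shared frailty can be illustrated by a small simulation; the gamma frailty and the exponential transition hazards below are illustrative choices, not the fitted UK Biobank model.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 20000

# shared gamma frailty inducing dependence between disease and death times
z = rng.gamma(shape=2.0, scale=0.5, size=n)

# conditional on the frailty, exponential hazards for the two transitions
# (0.05 for disease onset, 0.10 for death); the rates are illustrative
t_disease = rng.exponential(1.0 / (0.05 * z))
t_death = rng.exponential(1.0 / (0.10 * z))

# semi-competing risks: death censors disease diagnosis, not vice versa
disease_observed = t_disease < t_death
print(disease_observed.mean())  # close to 0.05 / (0.05 + 0.10) = 1/3
```

Because the frailty multiplies both hazards, it cancels in the probability that disease precedes death, while still making the two event times dependent marginally.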

Distribution-free estimation of the partial AUC in diagnostic studies
Maximilian Wechsung
Charité – Universitätsmedizin Berlin, Germany

The problem of partial area under the curve (pAUC) estimation arises in diagnostic studies in which the whole receiver operating characteristic (ROC) curve of a diagnostic test with continuous outcome cannot be evaluated. Typically, the investigator is bound by economic as well as ethical considerations to analyze only that part of the ROC curve with true positive rates above and false positive rates below certain thresholds. The pAUC is the area under this partial ROC curve. It can be used to evaluate the performance of a diagnostic test with continuous outcome. In our talk, we consider a distribution-free estimator of the pAUC and establish its asymptotic distribution. The results can be used to construct statistical tests to compare the performance of different diagnostic tests.
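
A simple distribution-free pAUC estimate can be obtained from the empirical ROC curve; the sketch below (ties ignored for brevity) is a generic construction, not the authors' estimator.

```python
import numpy as np

def partial_auc(neg, pos, fpr_max=0.2):
    """Distribution-free pAUC: trapezoidal area under the empirical ROC
    curve for false positive rates in [0, fpr_max] (ties ignored for brevity)."""
    scores = np.concatenate([neg, pos])
    labels = np.concatenate([np.zeros(len(neg)), np.ones(len(pos))])
    order = np.argsort(-scores, kind="stable")   # descending score order
    labels = labels[order]
    tpr = np.concatenate([[0.0], np.cumsum(labels) / len(pos)])
    fpr = np.concatenate([[0.0], np.cumsum(1.0 - labels) / len(neg)])
    tpr_end = np.interp(fpr_max, fpr, tpr)       # ROC height at the cut-off
    keep = fpr <= fpr_max
    grid = np.concatenate([fpr[keep], [fpr_max]])
    vals = np.concatenate([tpr[keep], [tpr_end]])
    return float(np.sum(np.diff(grid) * (vals[:-1] + vals[1:]) / 2.0))

# a perfectly separating marker attains the maximal pAUC, namely fpr_max
neg = np.array([1.0, 2.0, 3.0])     # scores of non-diseased subjects
pos = np.array([10.0, 11.0, 12.0])  # scores of diseased subjects
print(partial_auc(neg, pos))
```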

Ranking Procedures for the Factorial Repeated Measures Design with Missing Data – Estimation, Testing and Asymptotic Theory
Kerstin Rubarth, Frank Konietschke
Charité Berlin, Germany

A commonly used design in health, medical and biomedical research is the repeated measures design. Often, a parametric model is used for the analysis of such data. However, if the sample size is rather small or if the data are skewed or on an ordinal scale, a nonparametric approach fits the data better than a classic parametric approach such as linear mixed models. Another issue that naturally arises when dealing with clinical or preclinical data is the occurrence of missing data. Most methods can only use complete data sets unless an imputation technique is applied. The newly developed ranking procedure is a flexible method for general non-normal, ordinal, ordered categorical and even binary data and, in the case of missing data, uses all available information instead of only the information obtained from complete cases. The hypotheses are defined in terms of the nonparametric relative effect and can be tested by using quadratic test procedures as well as the multiple contrast test procedure. Additionally, the framework allows for the incorporation of clustered data within the repeated measurements. An example of clustered data are animal studies, where several animals share the same cage and are therefore clustered within a cage. Simulation studies indicate a good performance in terms of the type-I error rate and the power under different alternatives with missing rates of up to 30%, also under non-normal data. A real data example illustrates the application of the proposed methodology.
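
The nonparametric relative effect p = P(X < Y) + 0.5 P(X = Y) underlying the hypotheses can be estimated from midranks of the pooled sample; a minimal two-sample sketch:

```python
import numpy as np
from scipy.stats import rankdata

def relative_effect(x, y):
    """Estimate the nonparametric relative effect
    p = P(X < Y) + 0.5 * P(X = Y) via midranks of the pooled sample."""
    ranks = rankdata(np.concatenate([x, y]))  # midranks handle ties
    mean_rank_y = ranks[len(x):].mean()
    return (mean_rank_y - (len(y) + 1) / 2) / len(x)

print(relative_effect(np.array([1.0, 2.0]), np.array([3.0, 4.0])))  # 1.0
print(relative_effect(np.array([1.0, 2.0, 3.0]),
                      np.array([1.0, 2.0, 3.0])))                   # 0.5
```

A value of 0.5 indicates no tendency of either sample towards larger values, which is why the relative effect is well defined even for ordinal or binary data.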

A cautionary tale on using imputation methods for inference in a matched pairs design
Burim Ramosaj, Lubna Amro, Markus Pauly
TU Dortmund University, Germany

Imputation procedures have become common statistical practice in biomedical fields, since subsequent analyses can be conducted as if no values had been missing. In particular, nonparametric imputation schemes such as the random forest, or its combination with stochastic gradient boosting, have shown favorable imputation performance compared to the more traditionally used MICE procedure. However, their effect on valid statistical inference has not been analyzed so far. We close this gap by investigating their validity for inferring mean differences in incompletely observed pairs, contrasting them with a recent approach that works only with the observations at hand. Our findings indicate that machine learning schemes for (multiply) imputing missing values heavily inflate the type-I error in small to moderate matched pairs, even after modifying the test statistics using Rubin's multiple imputation rule. In addition to an extensive simulation study, an illustrative data example from a breast cancer gene study is considered.
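
The design of such a simulation can be sketched as follows; this toy version uses naive single mean imputation rather than the random forest or boosting schemes studied, all settings are illustrative, and it does not reproduce the study's findings.

```python
import numpy as np
from scipy.stats import ttest_rel

rng = np.random.default_rng(1)
reps, n, reject = 500, 30, 0
for _ in range(reps):
    # matched pairs under the null hypothesis: equal means, correlated arms
    z = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.7], [0.7, 1.0]], size=n)
    x, y = z[:, 0], z[:, 1].copy()
    y[rng.random(n) < 0.2] = np.nan      # roughly 20% missing in one arm
    y[np.isnan(y)] = np.nanmean(y)       # naive single mean imputation
    # the paired t-test then treats the imputed values as if observed
    if ttest_rel(x, y).pvalue < 0.05:
        reject += 1
print(reject / reps)  # empirical type-I error at nominal level 0.05
```

Comparing this empirical rejection rate to the nominal level, across imputation schemes and sample sizes, is exactly the kind of validity check the abstract describes.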


Chairs: Michael Altenbuchinger and Klaus Jung

AI Models for Multi-Modal Data Integration
Holger Fröhlich
Fraunhofer Gesellschaft e.V.

Precision medicine aims to deliver the right treatment to the right patients. One important goal is therefore to identify molecular subtypes of diseases, which opens the opportunity for better targeted therapies in the future. In this context, high-throughput omics data have been used extensively. However, analysis of one type of omics data alone provides only a very limited view of a complex and systemic disease such as cancer. Correspondingly, parallel analysis of multiple omics data types is needed and is employed more and more routinely. However, leveraging the full potential of multi-omics data requires statistical data fusion, which comes with a number of unique challenges, including differences in data types (e.g. numerical vs discrete), scale, data quality and dimension (e.g. hundreds of thousands of SNPs vs a few hundred miRNAs).

In the first part of my talk I will focus on Pathway-based Multi-modal AutoEncoders (PathME) as one possible approach for multi-omics data integration. PathME relies on a multi-modal sparse denoising autoencoder architecture to embed multiple omics types that can be mapped to the same biological pathway. We show that sparse non-negative matrix factorization applied to such embeddings results in well-discriminated disease subtypes in several cancer types, which show clinically distinct features. Moreover, each of these subtypes can be associated with subtype-specific pathways, and for each of these pathways it is possible to disentangle the influence of individual omics features, hence providing a rich interpretation.

Going one step further, in the second part of my talk I will focus on Variational Autoencoder Modular Bayesian Networks (VAMBN), a merger of Bayesian networks and variational autoencoders that models multiple data modalities (including clinical assessment scores), also in a longitudinal manner. I will specifically demonstrate the application of VAMBN to modeling entire clinical studies in Parkinson's Disease (PD). Since VAMBN is generative, the model can be used to simulate synthetic patients, also under counterfactual scenarios (e.g. an age shift by 20 years, or a modification of disease severity at baseline), which could facilitate the design of clinical studies and the sharing of data under probabilistic privacy guarantees, and eventually allow for finding "patients-like-me" within a broader, virtually merged meta-cohort.

NetCoMi: Network Construction and Comparison for Microbiome Data in R
Stefanie Peschel1, Christian L. Müller2,3,4, Erika von Mutius1,5,6, Anne-Laure Boulesteix7, Martin Depner1
1Institute of Asthma and Allergy Prevention, Helmholtz Zentrum München, German Research Center for Environmental Health, Neuherberg, Germany; 2Department of Statistics, Ludwig-Maximilians-Universität München, Munich, Germany; 3Institute of Computational Biology, Helmholtz Zentrum München, German Research Center for Environmental Health, Neuherberg, Germany; 4Center for Computational Mathematics, Flatiron Institute, New York, USA; 5Dr von Hauner Children’s Hospital, Ludwig-Maximilians-Universität München, Munich, Germany; 6Comprehensive Pneumology Center Munich (CPC-M), Member of the German Center for Lung Research, Munich, Germany; 7Institute for Medical Information Processing, Biometry and Epidemiology, Ludwig-Maximilians-Universität München, Munich, Germany


Network analysis methods are suitable for investigating the microbial interplay within a habitat. Since microbial associations may change between conditions, e.g. between health and disease state, comparing microbial association networks between groups might be useful. For this purpose, the two networks are constructed separately, and either the resulting associations themselves or the network’s properties are compared between the two groups.

Estimating associations for sequencing data is challenging due to their special characteristics – that is, sparsity with a high number of zeros, high dimensionality, and compositionality. Several association measures taking these features into account have been published during the last decade. Furthermore, several network analysis tools, methods for comparing network properties among two or more groups as well as approaches for constructing differential networks are available in the literature. However, no unifying tool for the whole process of constructing, analyzing and comparing microbial association networks between groups is available so far.


We provide the R package "NetCoMi", which implements this whole workflow: starting from a read count matrix originating from a sequencing process, through network construction, up to a statement of whether single associations, local network characteristics, the determined clusters, or even the overall network structure differ between the groups. For each of the aforementioned steps, a selection of existing methods suitable for application to microbiome data is included. In particular, the function for network construction offers many different approaches, including methods for treating zeros in the data, normalization, computing microbial associations, and sparsifying the resulting association matrix. NetCoMi can be used either for constructing, analyzing and visualizing a single network, or for comparing two networks in a graphical as well as a quantitative manner, including statistical tests.
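
The construct-then-compare workflow can be caricatured in a few lines; this is a generic sketch (Pearson correlation on relative abundances, hard thresholding), not NetCoMi's API or its recommended association measures.

```python
import numpy as np

rng = np.random.default_rng(2)

def sparse_assoc_network(counts, threshold=0.3):
    """Toy construction step: relative abundances as a crude normalization,
    Pearson correlation as the association measure, hard threshold to sparsify."""
    rel = counts / counts.sum(axis=1, keepdims=True)
    corr = np.corrcoef(rel, rowvar=False)
    adj = (np.abs(corr) >= threshold).astype(float)
    np.fill_diagonal(adj, 0.0)
    return adj

# two groups of samples (rows) measured on the same taxa (columns)
group1 = rng.poisson(5.0, size=(40, 10))
group2 = rng.poisson(5.0, size=(40, 10))
adj1 = sparse_assoc_network(group1)
adj2 = sparse_assoc_network(group2)

# compare a local network property between the groups: degree per taxon
degree_diff = adj1.sum(axis=0) - adj2.sum(axis=0)
print(degree_diff)
```

In a real analysis, each of these steps (zero treatment, normalization, association measure, sparsification, comparison test) would be replaced by a method suited to compositional sequencing data.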


We illustrate the application of our package using a real data set from the GABRIELA study [1] to compare microbial associations in settled dust from children's rooms between samples from two study centers. The examples demonstrate how our proposed graphical methods uncover genera with different characteristics (e.g. a different centrality) between the groups, similarities and differences between the clusterings, as well as differences among the associations themselves. These descriptive findings are confirmed by a quantitative output including a statement of whether the results are statistically significant.

[1] Jon Genuneit, Gisela Büchele, Marco Waser, Katalin Kovacs, Anna Debinska, Andrzej Boznanski, Christine Strunz-Lehner, Elisabeth Horak, Paul Cullinan, Dick Heederik, et al. The GABRIEL advanced surveys: study design, participation and evaluation of bias. Paediatric and Perinatal Epidemiology, 25(5):436–447, 2011.

Evaluation of augmentation techniques for high-dimensional gene expression data for the purpose of fitting artificial neural networks
Magdalena Kircher, Jessica Krepel, Babak Saremi, Klaus Jung
University of Veterinary Medicine Hannover, Foundation, Germany


High-throughput transcriptome expression data from DNA microarrays or RNA-seq are regularly checked for their ability to classify samples. However, with the further densification of transcriptomic data and a low number of observations – due to a lack of available biosamples, prohibitive costs and ethical reasons – the ratio between the number of variables and the number of available observations is usually very large. As a consequence, classifier performance estimated from training data tends to be overestimated and lacks robustness. It has been demonstrated in many applications that data augmentation can improve the robustness of artificial neural networks. Data augmentation for high-dimensional gene expression data has, however, been studied very little so far.


We investigate the applicability and capacity of two data augmentation approaches, including generative adversarial networks (GANs), which have been widely used for augmenting image datasets. The augmentation methods are compared on public example data sets from infection research. Besides neural networks, we evaluate the effect of the augmentation techniques on the performance of linear discriminant analysis and support vector machines.

Results and Outlook:

First results of a 10-fold cross-validation show increased accuracy, sensitivity, specificity and predictive values when using augmented data sets compared to classifier models based on the original data only. A simple augmentation approach based on mixed observations shows a performance similar to that of the computationally more expensive approach with GANs. Further evaluations are ongoing to better understand the detailed performance of the augmentation techniques.
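
Augmentation by mixed observations can be sketched as convex combinations of same-class expression profiles (a mixup-style scheme); the beta-distributed mixing weight and all dimensions are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)

def mixup_augment(X, y, n_new, alpha=0.2):
    """Create new profiles as convex combinations of two randomly chosen
    samples from the same class ("mixed observations")."""
    X_aug, y_aug = [], []
    for _ in range(n_new):
        cls = rng.choice(np.unique(y))
        i, j = rng.choice(np.flatnonzero(y == cls), size=2, replace=True)
        lam = rng.beta(alpha, alpha)  # mixing weight, mostly near 0 or 1
        X_aug.append(lam * X[i] + (1 - lam) * X[j])
        y_aug.append(cls)
    return np.asarray(X_aug), np.asarray(y_aug)

X = rng.normal(size=(20, 100))   # 20 samples, 100 genes (toy dimensions)
y = np.repeat([0, 1], 10)
X_new, y_new = mixup_augment(X, y, n_new=50)
print(X_new.shape, y_new.shape)
```

Unlike a GAN, this scheme needs no training, which is why it is the computationally cheap baseline in the comparison above.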

Comparison of merging strategies for building machine learning models on multiple independent gene expression data sets
Jessica Krepel, Magdalena Kircher, Moritz Kohls, Klaus Jung
University of Veterinary Medicine Hannover, Foundation, Germany


Microarray and RNA-seq experiments allow the simultaneous collection of gene expression data for several thousand genes, which can be used to address a wide range of biological questions. Nowadays, gene expression data are available in public databases for many biological and medical research questions. Oftentimes, several independent studies are performed on the same or a similar research question, and combining these studies offers several benefits compared to individual analyses. Several approaches for combining independent gene expression data sets have already been proposed in the context of differential gene expression analysis and gene set enrichment analysis. Here, we compare different strategies for combining independent data sets for the purpose of classification analysis.


We consider only the two-group design, e.g. with class labels diseased and healthy. The information of the individual studies can be aggregated at different stages of the analysis. We examined three different merging pipelines with regard to the stage of the analysis at which merging is conducted, namely the direct merging of the data sets (strategy A), the merging of the trained classification models (strategy B), and the merging of the classification results (strategy C). We combined the merging pipelines with different classification methods: linear discriminant analysis (LDA), support vector machines (SVM), and artificial neural networks (ANN). Within each merging strategy, we performed a differential gene expression analysis for dimension reduction to select a set of genes that was then used as the feature subset in the classification. We trained and evaluated the classification models on several data subsets in the form of a 10-fold cross-validation. We first performed a simulation study with purely artificial data, and secondly a study based on a real-world data set from the public data repository ArrayExpress that we artificially split into two studies.
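
Strategies A and C can be contrasted with a toy nearest-centroid classifier on simulated two-group studies; the classifier and the data-generating settings are illustrative stand-ins for the LDA/SVM/ANN pipelines of the study.

```python
import numpy as np

rng = np.random.default_rng(4)

def centroid_fit(X, y):
    # one mean expression profile (centroid) per class
    return {c: X[y == c].mean(axis=0) for c in np.unique(y)}

def centroid_predict(model, X):
    classes = sorted(model)
    dist = np.stack([np.linalg.norm(X - model[c], axis=1) for c in classes])
    return np.array(classes)[dist.argmin(axis=0)]

def simulate_study(n=40, p=30, shift=1.0):
    y = rng.integers(0, 2, n)
    X = rng.normal(size=(n, p)) + shift * y[:, None]
    return X, y

studies = [simulate_study() for _ in range(3)]
X_test, y_test = simulate_study()

# Strategy A: merge the raw data sets, then train a single classifier
X_a = np.vstack([X for X, _ in studies])
y_a = np.concatenate([y for _, y in studies])
pred_a = centroid_predict(centroid_fit(X_a, y_a), X_test)

# Strategy C: train per study, merge the predictions by majority vote
votes = np.stack([centroid_predict(centroid_fit(X, y), X_test)
                  for X, y in studies])
pred_c = (votes.mean(axis=0) >= 0.5).astype(int)

print((pred_a == y_test).mean(), (pred_c == y_test).mean())
```

Strategy B (merging the trained models themselves, e.g. by averaging their parameters) sits between these two extremes and is method-specific, which is why it is omitted from this sketch.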


With respect to classification accuracy, we found that the strategy of data merging outperformed the strategy of results merging in most of our simulation scenarios with artificial data. Especially when the number of studies is high and the differentiability between the groups is low, strategy A appears to be the best-performing one. Strategy A performed particularly better than the other two merging approaches when four independent studies were aggregated, compared to scenarios with only two independent studies.

Evaluating the quality of synthetic SNP data from deep generative models under sample size constraints
Jens Nußberger, Frederic Boesel, Stefan Lenz, Harald Binder, Moritz Hess
Universitätsklinikum Freiburg, Germany

Synthetic data, such as those generated by deep generative models, are increasingly considered for exchanging biomedical data, such as single nucleotide polymorphism (SNP) data, under privacy constraints. This requires that the employed model has learned the joint distribution of the data sufficiently well. A major limiting factor here is the number of empirical observations available for training. Until now, there is little evidence on how well the predominant generative approaches, namely variational autoencoders (VAEs), deep Boltzmann machines (DBMs) and generative adversarial networks (GANs), learn the joint distribution of the target data under sample size constraints. Using simulated SNP data and data from the 1000 Genomes Project, we here provide results from an in-depth evaluation of VAEs, DBMs and GANs. Specifically, we investigate how well pairwise co-occurrences of variables in the investigated SNP data, quantified as odds ratios (ORs), are recovered in the synthetic data generated by the approaches. For the simulated as well as the 1000 Genomes SNP data, we observe that DBMs can generally recover structure for up to 300 SNPs. However, we also observe a tendency to over-estimate ORs when the DBMs are not carefully tuned. VAEs generally get the direction and relative strength of pairwise ORs right but tend to under-estimate their magnitude. GANs perform well only when larger sample sizes are employed and when there are strong pairwise associations in the data. In conclusion, DBMs are well suited for generating synthetic observations for binary omics data, such as SNP data, under sample size constraints. VAEs perform better at smaller sample sizes but are limited with respect to learning the absolute magnitude of pairwise associations between variables. GANs require large amounts of training data and likely a careful selection of hyperparameters.
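
The evaluation criterion, pairwise ORs of co-occurrence, can be computed as below; the continuity correction and the random stand-in data are illustrative assumptions.

```python
import numpy as np

def pairwise_or(a, b, eps=0.5):
    """Odds ratio of co-occurrence for two binary variables,
    with a 0.5 continuity correction per cell of the 2x2 table."""
    n11 = np.sum((a == 1) & (b == 1)) + eps
    n10 = np.sum((a == 1) & (b == 0)) + eps
    n01 = np.sum((a == 0) & (b == 1)) + eps
    n00 = np.sum((a == 0) & (b == 0)) + eps
    return (n11 * n00) / (n10 * n01)

rng = np.random.default_rng(5)
real = rng.integers(0, 2, size=(500, 5))        # stand-in for observed SNPs
synthetic = rng.integers(0, 2, size=(500, 5))   # stand-in for generated SNPs
# compare the association structure of real and synthetic data, pair by pair
for i in range(4):
    print(pairwise_or(real[:, i], real[:, i + 1]),
          pairwise_or(synthetic[:, i], synthetic[:, i + 1]))
```

Plotting real against synthetic ORs across all SNP pairs makes the systematic over- or under-estimation described above directly visible.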


The working group (AG) meetings take place in the virtual conference building. The room designations refer to the free Gather rooms G1-G4 (see also the map of the virtual conference building).

Time: 14:50-15:30

  • Bayesian Methods (Room G1)
  • Teaching and Didactics of Biometry (Room G2)
  • Statistical Methods in Epidemiology (Room G3)
  • Statistical Methods in Medicine (Room G3)

Time: 16:50-17:30

The early-career working group (AG-Nachwuchs) holds its meeting on 18 March from 17:00-18:00 via Zoom. For further information, please contact Stefanie Peschel (



Survival and Event History Analysis II

Chairs: Annika Hoyer and Oliver Kuß

Assessment of methods to deal with delayed treatment effects in immuno-oncology trials with time-to-event endpoints
Rouven Behnisch, Johannes Krisam, Meinhard Kieser
Institute of Medical Biometry and Informatics, University of Heidelberg, Germany

In cancer drug research and development, immunotherapy plays an ever more important role. A common feature of immunotherapies is a delayed treatment effect, which is quite challenging when dealing with time-to-event endpoints [1]. For such endpoints, regulatory authorities often require a log-rank test, the standard statistical method. The log-rank test is known to be most powerful under proportional-hazards alternatives but suffers a substantial loss in power if this assumption is violated. Hence, a rather long follow-up period is required to detect a significant effect in immuno-oncology trials. For that reason, the question arises whether methods exist that are more sensitive to delayed treatment effects and that can be applied early on to generate evidence anticipating the final decision of the log-rank test, thereby reducing trial duration without inflating the type I error. Alternative methods include, for example, weighted log-rank statistics with weights that can either be fixed at the design stage of the trial [2] or chosen based on the observed data [3], as well as tests based on the restricted mean survival time [4], survival proportions, accelerated failure time (AFT) models, or additive hazards models.

We evaluate and compare these different methods systematically with regard to type I error control and power in the presence of delayed treatment effects. Our simulation study includes aspects such as different censoring rates and types, different times of delay, and different failure time distributions. First results show that most methods achieve type I error rate control and that, by construction, the weighted log-rank tests which place more weight on late time points have a greater power to detect differences when the treatment effect is delayed. It is furthermore investigated whether and to what extent these methods can be applied at an early stage of the trial to predict the decision of the log-rank test later on.
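
A Fleming-Harrington weighted log-rank statistic of the kind the trial-design weights in [2] refer to can be sketched as follows; this is a minimal two-group implementation without tie-specific refinements.

```python
import numpy as np

def fh_logrank(time, event, group, rho=0.0, gamma=1.0):
    """Fleming-Harrington G(rho, gamma) weighted log-rank z-statistic.
    gamma > 0 up-weights late event times, which suits delayed effects."""
    s = 1.0                      # left-continuous pooled Kaplan-Meier estimate
    num = den = 0.0
    for t in np.unique(time[event == 1]):
        at_risk = time >= t
        n = at_risk.sum()
        n1 = (at_risk & (group == 1)).sum()
        d = ((time == t) & (event == 1)).sum()
        d1 = ((time == t) & (event == 1) & (group == 1)).sum()
        w = s ** rho * (1.0 - s) ** gamma
        num += w * (d1 - d * n1 / n)            # weighted observed minus expected
        if n > 1:
            den += w ** 2 * d * (n1 / n) * (1 - n1 / n) * (n - d) / (n - 1)
        s *= 1.0 - d / n
    return num / np.sqrt(den)

# toy data: events in group 1 occur uniformly later than in group 0
time = np.arange(1.0, 11.0)
event = np.ones(10, dtype=int)
group = np.array([0] * 5 + [1] * 5)
print(fh_logrank(time, event, group, rho=0.0, gamma=0.0),   # standard log-rank
      fh_logrank(time, event, group, rho=0.0, gamma=1.0))   # late-weighted
```

With rho = gamma = 0 the weights are constant and the statistic reduces to the standard log-rank test.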


[1] T. Chen (2013): Statistical issues and challenges in immuno-oncology. Journal for ImmunoTherapy of Cancer 1:18.
[2] T.R. Fleming and D.P. Harrington (1991): Counting Processes and Survival Analysis. New York: Wiley-Interscience.
[3] D. Magirr and C. Burman (2019): Modestly weighted logrank tests. Statistics in Medicine 38(20):3782-3790.
[4] P. Royston and M.K.B. Parmar (2013): Restricted mean survival time: an alternative to the hazard ratio for the design and analysis of randomized trials with a time-to-event outcome. BMC Medical Research Methodology 13(1):152.

Sampling designs for rare time-dependent exposures – A comparison of the nested exposure case-control design and exposure density sampling
Jan Feifel1, Maja von Cube2, Martin Wolkewitz2, Jan Beyersmann1, Martin Schumacher2
1Institute of Statistics, Ulm University, Germany; 2Institute of Medical Biometry and Statistics, Faculty of Medicine and Medical Center University of Freiburg, Germany

Hospital-acquired infections increase both morbidity and mortality of hospitalized patients. Researchers interested in the effect of these time-dependent infections on the length-of-hospital stay, as a measure of disease burden, face large cohorts with possibly rare exposures.

For large cohort studies with rare outcomes, nested case-control designs are favorable due to the efficient use of limited resources. Here, nested case-control designs apply but do not reduce sample sizes, because the outcome is not necessarily rare, but the exposure is. Recently, exposure density sampling (EDS) [1] and the nested exposure case-control design (NECC) [2] have been proposed for sampling with respect to a rare time-dependent exposure in cohorts with a survival endpoint. The two designs differ in the time point of sampling.

Both designs enable efficient hazard ratio estimation by sampling all exposed individuals but only a small fraction of the unexposed ones. Moreover, they account for time-dependent exposure to avoid immortal time bias. We investigate and compare their performance using data from patients hospitalized in the neuro-intensive care unit at the Burdenko Neurosurgery Institute (NSI) in Moscow, Russia. The impact of different types of hospital-acquired infections with different prevalences on length-of-stay is considered. Additionally, inflation factors, a primary performance measure, are discussed. All presented methods are compared to the gold-standard Cox model on the full cohort. We enhance both designs to allow for the analysis of combined and competing endpoints. Additionally, these designs substantially reduce the amount of necessary information compared to the full-cohort approach.
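
The sampling idea common to both designs (keep every exposed patient and draw a few unexposed controls still at risk at the exposure time) can be sketched as follows; cohort size, exposure prevalence, and the number of controls per exposed patient are illustrative.

```python
import numpy as np

rng = np.random.default_rng(6)

# toy cohort: exposure time (NaN if never exposed) and end-of-stay time
n = 1000
exposure_time = np.where(rng.random(n) < 0.05,
                         rng.uniform(1.0, 10.0, n), np.nan)  # rare exposure
end_time = rng.uniform(5.0, 30.0, n)

exposed = np.flatnonzero(~np.isnan(exposure_time))
sampled = set(exposed.tolist())   # the designs keep every exposed patient
for i in exposed:
    # ...and draw a few unexposed controls still at risk at the exposure time,
    # which avoids immortal time bias from matching on future information
    at_risk = np.flatnonzero(np.isnan(exposure_time) &
                             (end_time > exposure_time[i]))
    controls = rng.choice(at_risk, size=min(2, at_risk.size), replace=False)
    sampled.update(controls.tolist())

print(len(sampled), "of", n, "patients enter the analysis")
```

The resulting risk sets feed a partial-likelihood analysis in place of the full cohort, which is where the inflation factors mentioned above come in.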

Both EDS and NECC are capable of analyzing time-to-event data by simultaneously accounting for rare time-dependent exposure and result in affordable sample sizes. EDS outperforms the NECC concerning efficiency and accuracy in most considered settings for combined endpoints. For competing risks, however, a tailored NECC shows more appealing results.

[1] K. Ohneberg, J. Beyersmann and M. Schumacher (2019): Exposure density sampling: Dynamic matching with respect to a time-dependent exposure. Statistics in Medicine, 38(22):4390-4403.

[2] J. Feifel, M. Gebauer, M. Schumacher and J. Beyersmann (2020): Nested exposure case-control sampling: a sampling scheme to analyze rare time-dependent exposures. Lifetime Data Analysis, 26:21-44.

Tumour-growth models improve progression-free survival estimation in the presence of high censoring rates
Gabriele Bleckert, Hannes Buchner
Staburo GmbH, Germany

In oncology, reliable estimates of progression-free survival (PFS) are of the highest importance because of the high failure rates of phase III trials (around 60%). However, PFS estimates based on early readouts, with less than 50% of events observed, do not use all available information from tumour measurements over time.

We project the PFS event of each censored patient using a mixed model [2] describing the tumour burden over time. The RECIST criteria are applied to the estimated patient-specific non-linear tumour trajectories to calculate the projected time-to-progression.
PFS is compared between test and reference treatment via hazard ratios (HR). Several phase III and phase II simulations were performed, with 1000 runs each, 2000 or 80 patients, 6 months of accrual, and 2 months (scenario 1) or 6 months (scenario 2) of follow-up. All simulations are based on a published optimal parameterisation [1] of tumour growth in non-small-cell lung cancer (NSCLC), which implies a time-dependent HR.
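The projection step can be illustrated with a minimal sketch: scan an estimated tumour trajectory for the first time the RECIST 1.1 progression rule (at least a 20% increase over the nadir together with an absolute increase of at least 5 mm) is met. The trajectory function and its parameters below are invented for illustration; in the actual analysis, the patient-specific mixed-model estimates would take their place.

```python
import math

def projected_ttp(trajectory, t_max=120.0, dt=0.1):
    """Scan an estimated tumour-burden trajectory (sum of target-lesion
    diameters in mm, as a function of time) and return the first time at
    which RECIST 1.1 progression is met: >=20% increase over the nadir
    and >=5 mm absolute increase."""
    nadir = trajectory(0.0)
    steps = int(t_max / dt)
    for i in range(1, steps + 1):
        t = i * dt
        y = trajectory(t)
        if y >= 1.2 * nadir and y - nadir >= 5.0:
            return t  # projected time-to-progression
        nadir = min(nadir, y)  # track the running nadir
    return None  # no projected progression within the horizon

# Hypothetical shrink-and-regrow trajectory (all parameters invented)
example = lambda t: 60.0 * (math.exp(-0.08 * t) + math.exp(0.03 * t) - 1.0)
```

Applied to `example`, the function returns the first grid time at which the regrowing tumour burden exceeds the RECIST thresholds relative to its nadir.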

The classical PFS estimation resulted in a HR of 0.34 (95%-percentiles: 0.29-0.40) for scenario 1 and 0.52 (0.47-0.58) for scenario 2, compared to a predicted HR of 0.77 for both scenarios (0.69-0.85 and 0.69-0.84), while the overall true HR (over ten years) was 0.78 (0.69-0.85). At 6, 12 and 120 months, the time-varying HRs were 0.41 (0.36-0.47), 0.60 (0.54-0.66) and 0.77 (0.69-0.85). The classical PFS estimation for phase II showed HRs from 0.52 to 0.61, compared to predicted HRs between 0.71 and 0.77.

Tumour-growth models improve PFS estimation in the presence of high censoring rates, as they consistently provide far better estimates of the overall true HR in phase III and phase II trials.

[1] M. Reck, A. Mellemgaard, S. Novello, PE. Postmus, B. Gaschler-Markefski, R. Kaiser, H. Buchner: Change in non-small-cell lung cancer tumor size in patients treated with nintedanib plus docetaxel: analyses from the Phase III LUME-Lung 1 study, OncoTargets and Therapy 2018:11 4573–4582

[2] Laird NM, Ware JH. Random-effects models for longitudinal data. Biometrics. 1982 Dec;38(4):963-74. PMID: 7168798.

Evaluation of event rate differences using stratified Kaplan-Meier difference estimates with Mantel-Haenszel weights
Hannes Buchner1, Stephan Bischofberger1, Rainer-Georg Goeldner2
1Staburo, Germany; 2Boehringer Ingelheim, Germany

The assessment of differences in event rates is a common endeavor in the evaluation of the efficacy of new treatments in clinical trials. We investigate the performance of different hypothesis tests for cumulative hospitalization or death rates of Covid-19 in order to reliably determine the efficacy of a novel treatment. The evaluation focuses on the comparison of event rates via Kaplan-Meier estimates at a pre-specified day and on the reduction of sampling error; hence we examine different stratum weights for a stratified Z-test of Kaplan-Meier differences. The simulated data are calibrated to recent research on neutralizing antibody treatment for Covid-19, with 2, 4, and 6 strata of different size and prevalence, and we investigate overall event rates ranging from 2% to 20%. We simulate 1000 patients and compare the results of 1000 simulation runs. Our simulation study shows superior performance of Mantel-Haenszel-type weights [Greenland & Robins (1985), Biometrics 41, 55-68] over inverse-variance weights, in particular for unequal stratum sizes and very low event rates in some strata, as is common in Covid-19 treatment studies. The advantage of this approach is the larger power of the test (e.g. 79% instead of 64% for an average event rate of 7%). The results are compared with those of a Cochran-Mantel-Haenszel (CMH) test, which yields lower power than the inverse-variance weights for low event rates (under 62% for an average event rate of 7%) and consistently lower power than the Z-test with Mantel-Haenszel stratum weights. Moreover, the CMH test breaks down (power reduction by 30%) in the presence of loss to follow-up of as little as 5% of the patients, since it is not designed for time-to-event data. The performance of the Z-test for Kaplan-Meier differences, on the other hand, hardly suffers from the latter (power reduction by 4%).
All investigated tests satisfy the set significance level for type I errors in our simulation study.
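A minimal sketch of the stratified Z-test with Mantel-Haenszel-type weights w_k = n1k*n0k/(n1k + n0k), assuming each stratum supplies the two group sizes together with the Kaplan-Meier difference and its (e.g. Greenwood) variance at the landmark day; the data layout is hypothetical.

```python
import math

def mh_weighted_km_difference(strata):
    """Combine per-stratum Kaplan-Meier event-rate differences with
    Mantel-Haenszel-type weights w_k = n1k*n0k/(n1k+n0k)
    (Greenland & Robins, 1985). Each stratum is a dict with group sizes
    'n1'/'n0', the KM difference 'diff' at the landmark day, and its
    variance 'var'."""
    w = [s["n1"] * s["n0"] / (s["n1"] + s["n0"]) for s in strata]
    total = sum(w)
    # Weighted average of the stratum-specific KM differences
    diff = sum(wk * s["diff"] for wk, s in zip(w, strata)) / total
    # Variance of the weighted average (independent strata)
    var = sum(wk ** 2 * s["var"] for wk, s in zip(w, strata)) / total ** 2
    z = diff / math.sqrt(var)
    return diff, var, z
```

With inverse-variance weights one would instead set w_k = 1/var_k; the comparison of the two weighting schemes is the core of the simulation study.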

Survival and Event History Analysis I

Chairs: Jan Beyersmann and Georg Zimmermann

CASANOVA: Permutation inference in factorial survival designs
Marc Ditzhaus1, Arnold Janssen2, Markus Pauly1
1TU Dortmund, Germany; 2Heinrich-Heine-University Duesseldorf

In this talk, inference procedures for general factorial designs with time-to-event endpoints are presented. As in additive Aalen models, null hypotheses are formulated in terms of cumulative hazards. Deviations are measured by quadratic forms in Nelson-Aalen-type integrals. In contrast to existing approaches, this allows one to work without restrictive model assumptions such as proportional hazards. In particular, crossing survival or hazard curves can be detected without a significant loss of power. For a distribution-free application of the method, a permutation strategy is suggested. The theoretical findings are complemented by an extensive simulation study and the discussion of a real data example.
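A toy version of the idea, assuming distinct event times and a plain sum-of-squares statistic on a time grid rather than the actual CASANOVA quadratic form: compare Nelson-Aalen estimates between two groups and permute the group labels.

```python
import random

def nelson_aalen(times, events, grid):
    """Nelson-Aalen cumulative hazard estimate evaluated on a time grid
    (distinct event times assumed for simplicity)."""
    pts = sorted(zip(times, events))
    out = []
    for g in grid:
        cum, at_risk = 0.0, len(pts)
        for t, d in pts:
            if t > g:
                break
            if d:
                cum += 1.0 / at_risk
            at_risk -= 1
        out.append(cum)
    return out

def casanova_like_test(times, events, groups, grid, n_perm=200, seed=7):
    """Two-sample permutation test based on a quadratic form in
    Nelson-Aalen differences (a simplified stand-in for the CASANOVA
    statistic; groups coded 0/1)."""
    def stat(g):
        na = []
        for label in (0, 1):
            idx = [i for i, gi in enumerate(g) if gi == label]
            na.append(nelson_aalen([times[i] for i in idx],
                                   [events[i] for i in idx], grid))
        return sum((a - b) ** 2 for a, b in zip(*na))
    rng = random.Random(seed)
    obs, g = stat(groups), list(groups)
    count = 0
    for _ in range(n_perm):
        rng.shuffle(g)  # permute group labels
        count += stat(g) >= obs
    return obs, (count + 1) / (n_perm + 1)
```

Because only the labels are permuted, no proportional-hazards assumption enters; crossing hazards simply enlarge the quadratic statistic.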

Statistical MODEling of Additive Time Effects in Survival Analysis
Annika Hoyer1, Oliver Kuss2
1Department of Statistics, Ludwig-Maximilians-University Munich, Germany; 2Institute for Biometrics and Epidemiology, German Diabetes Center, Leibniz Institute for Diabetes Research at Heinrich-Heine-University Duesseldorf, Germany

In survival analysis, there have been various efforts to model intervention or exposure effects on an additive rather than on a hazard, odds, or accelerated-life scale. Though it might be intuitively clear that additive effects are easier to understand, there is also evidence from randomized trials that this is indeed the case: treatment benefits are easier to understand if communicated as the postponement of an adverse event [1]. In clinical practice, physicians and patients tend to interpret an additive effect on the time scale as a gain in life expectancy which is added as additional time to the end of life [2]. However, as the gain in life expectancy is, from a statistical point of view, an integral, this is not a precise interpretation. As a more easily interpretable alternative, we propose to model the increasing "life span" [3] and to examine the corresponding densities instead of the survival functions. Focusing on the respective modes, their difference describes a change in life span, in particular a shift of the most probable event time. It therefore seems reasonable to model differences in lifetime in terms of mode differences instead of differences in expected times. To this end, we propose mode regression models (which we write "Statistical MODEls" to emphasize that the modes are modelled) based on parametric distributions (Gompertz, Weibull, and log-normal). We illustrate our MODEls with an example from a randomized controlled trial on the efficacy of a new glucose-lowering drug for the treatment of type 2 diabetes.
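For the Weibull case, the mode is available in closed form, mode = lambda*((k-1)/k)^(1/k) for shape k > 1, so a mode difference between arms can be computed directly. The two-scale parameterisation below is an illustrative simplification, not the authors' regression model.

```python
def weibull_mode(shape, scale):
    """Mode of a Weibull(shape k, scale lambda) density;
    the density is monotone decreasing for k <= 1, so the mode is 0."""
    if shape <= 1:
        return 0.0
    return scale * ((shape - 1) / shape) ** (1 / shape)

def mode_difference(shape, scale_ctrl, scale_trt):
    """Shift of the most probable event time between two arms, holding
    the shape fixed (illustrative version of the MODEl idea)."""
    return weibull_mode(shape, scale_trt) - weibull_mode(shape, scale_ctrl)
```

A positive mode difference then reads directly as "the most probable event time is postponed by that amount", which is the interpretation the abstract argues for.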

[1] Dahl R, Gyrd-Hansen D, Kristiansen IS, et al. Can postponement of an adverse outcome be used to present risk reductions to a lay audience? A population survey. BMC Med Inform Decis Mak 2007; 7:8

[2] Detsky AS, Redelmeier DA. Measuring health outcomes-putting gains into perspective. N Engl J Med 1998; 339:402-404

[3] Naimark D, Naglie G, Detsky AS. The meaning of life expectancy: what is a clinically significant gain? J Gen Intern Med 1994; 9:702-707

Assessment of additional benefit for time-to-event endpoints after significant phase III trials – investigation of ESMO and IQWiG approaches
Christopher Alexander Büsch, Johannes Krisam, Meinhard Kieser
University of Heidelberg, Germany

New cancer treatments are often promoted as major advances after a significant phase III trial. Therefore, clear and unbiased knowledge of the magnitude of the clinical benefit of newly approved treatments is important to assess the amount of reimbursement from public health insurance. For these evaluations, two distinct "additional benefit assessment" methods are currently used in Europe.

The European Society for Medical Oncology (ESMO) developed the Magnitude of Clinical Benefit Scale version 1.1 (ESMO-MCBS v1.1), which classifies new treatments into 5 categories using a dual rule considering relative and absolute benefit, assessed by the lower limit of the 95% HR confidence interval and the observed absolute difference in median treatment outcomes, respectively [1,2]. As an alternative, the German IQWiG compares the upper limit of the 95% HR confidence interval to specific relative-risk-scaled thresholds, classifying new treatments into 6 categories [4]. Until now, these methods have only been compared empirically [3].

We evaluate and compare the two methods in a simulation study with a focus on time-to-event outcomes. The simulation includes aspects such as different censoring rates and types, incorrect HRs assumed for sample size calculation, informative censoring, and different failure-time distributions. Since no "placebo" method reflecting a true (deserved) maximal score is available, different thresholds on the simulated treatment effects were used as alternatives. The methods' performance is assessed via ROC curves, sensitivity / specificity, and the percentage of achieved maximal scores. Our results indicate that IQWiG's method is usually more conservative than ESMO's. Moreover, in some scenarios, such as quick disease progression or an incorrectly assumed HR, IQWiG's method is too liberal compared to ESMO's. Nevertheless, further research is required, e.g. on the methods' performance under non-proportional hazards.
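The two decision rules can be caricatured as comparisons of confidence-interval limits against fixed cut-offs. All thresholds and category labels below are hypothetical placeholders for illustration, not the published ESMO-MCBS or IQWiG values.

```python
def esmo_dual_rule(hr_ci_lower, abs_gain_months,
                   hr_threshold=0.65, gain_threshold=3.0):
    """Illustrative dual rule in the spirit of ESMO-MCBS: credit is given
    if the lower 95% CI limit of the HR is small enough OR the absolute
    gain in median outcome is large enough (placeholder thresholds)."""
    return hr_ci_lower <= hr_threshold or abs_gain_months >= gain_threshold

def iqwig_category(hr_ci_upper, thresholds=(0.85, 0.95, 1.0)):
    """Illustrative IQWiG-style grading: compare the upper 95% CI limit
    of the HR against fixed thresholds (placeholder values and labels)."""
    labels = ("major", "considerable", "minor", "no added benefit")
    for label, th in zip(labels, thresholds):
        if hr_ci_upper < th:
            return label
    return labels[-1]
```

The structural difference the abstract exploits is visible here: ESMO's rule looks at the lower CI limit (and can be satisfied by an absolute gain), while IQWiG's grading depends only on the upper CI limit.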


[1] N.I. Cherny, U. Dafni et al. (2017): ESMO-Magnitude of Clinical Benefit Scale version 1.1. Annals of Oncology, 28:2340-2366

[2] N.I. Cherny, R. Sullivan et al. (2015): A standardised, generic, validated approach to stratify the magnitude of clinical benefit that can be anticipated from anti-cancer therapies: the European Society for Medical Oncology Magnitude of Clinical Benefit Scale (ESMO-MCBS). Annals of Oncology, 26:1547-1573

[3] U. Dafni, D. Karlis et al. (2017): Detailed statistical assessment of the characteristics of the ESMO Magnitude of Clinical Benefit Scale (ESMO-MCBS) threshold rules. ESMO Open, 2:e000216

[4] G. Skipka, B. Wieseler et al. (2016): Methodological approach to determine minor, considerable, and major treatment effects in the early benefit assessment of new drugs. Biometrical Journal, 58:43-58

Independent Censoring in Event-Driven Trials with Staggered Entry
Jasmin Rühl
Universitätsmedizin Göttingen, Germany

In the pharmaceutical field, randomised clinical trials with time-to-event endpoints are frequently stopped after a pre-specified number of events has been observed. This practice, however, leads to dependent data and non-random censoring, which in general cannot be resolved by conditioning on the underlying baseline information.

If the observation period starts at the same time for all subjects, the assumption of independent censoring in the counting-process sense is valid (cf. Andersen et al., 1993, p. 139), and the common methods for analysing time-to-event data can be applied. The situation is less clear, however, when staggered study entry is considered. We demonstrate that the study design at hand indeed entails general independent censoring in the sense of Andersen et al.

By means of simulations, we further investigate possible consequences of employing techniques, such as the non-parametric bootstrap, that make the more restrictive assumption of random censoring. The results indicate that the dependence in event-driven data with staggered entry is generally too weak to affect the outcomes; in settings where only few occurrences of the regarded event are observed, however, the implications become clearer.
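A minimal simulation of the design in question, with invented accrual and hazard parameters: uniform staggered entry, exponential event times, and a data cut-off at the calendar time of the target number of events, after which everyone still at risk is censored.

```python
import random

def simulate_event_driven_trial(n=100, target_events=30, seed=3):
    """Simulate staggered entry (uniform accrual over 12 time units) with
    exponential event times (rate 0.05; all parameters hypothetical).
    The trial stops at the calendar time of the target_events-th event;
    subjects still at risk are censored at the cut-off."""
    rng = random.Random(seed)
    entry = [rng.uniform(0, 12) for _ in range(n)]
    event = [rng.expovariate(0.05) for _ in range(n)]  # time since entry
    calendar = sorted(e + t for e, t in zip(entry, event))
    stop = calendar[target_events - 1]                 # data cut-off
    obs = []
    for e, t in zip(entry, event):
        if e >= stop:
            continue                                   # never entered the trial
        observed_time = min(t, stop - e)               # censor at cut-off
        obs.append((observed_time, e + t <= stop))     # (time, event indicator)
    return stop, obs
```

The event indicators of different subjects are linked through `stop`, which is itself a function of all event times; this is exactly the dependence whose practical impact the simulations assess.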

Personalised Medicine

Chairs: Tim Friede and Cynthia Huber

Learning about personalised effects: transporting anonymized information from individuals to (meta-) analysis and back
Els Goetghebeur
Ghent University, Belgium

Evidence on 'personalized' or stratified medicine draws information from subject-specific records on prognostic factors, (point) treatment, and outcome from relevant population samples. A focus on treatment-by-covariate interactions makes such studies more data-hungry than a typical trial focusing on a population-average effect. Nationwide disease registers or individual-patient meta-analysis may overcome sample-size issues but encounter new challenges, especially when targeting a risk or survival outcome. Between-study heterogeneity comes as both a curse and a blessing when aiming to transport treatment effects to new patient populations: it reveals sources of variation with specific roles in the transportation. For survival analysis, special attention must be given to calendar time and internal time. A core set of covariates is needed for the stratified analysis, ideally measured with similar precision. A shared minimum follow-up time and well understood censoring mechanisms are expected. Variation in baseline hazards under standard of care may reflect between-study variation in diagnostic criteria, in populations, in unmeasured baseline covariates, and in standard-of-care delivery and its impact.

When core-set covariates are lacking for some studies or over certain stretches of time, a range of solutions may be considered. Different assumptions on the missing covariates of survival models will affect treatment-balancing IPW and direct standardization methods differently. Alternatively, one may seek to link data from different sources to fill the gaps. It then pays to consider models with estimators that can be calculated from (iteratively) constructed summary statistics involving weighted averages over functions of the missing covariates. By thus avoiding the need for additional individually linked measures, one may open access to a range of existing covariates or biomarkers that can be merged while circumventing time-consuming confidentiality agreements.

In light of the above, we discuss the pros and cons of various methods of standardizing effects to obtain transportable answers that can be meaningfully compared between studies. We thus aim to provide relevant evidence on stratified interventions, referring to several case studies.

Precision medicine in action – the FIRE3 NGS study
Laura Schlieker1, Nicole Krämer1, Volker Heinemann2,3, Arndt Stahler4, Sebastian Stintzing3,4
1Staburo GmbH, Germany; 2Department of Medicine III, University Hospital, University of Munich, Germany; 3DKTK, German Cancer Consortium, German Cancer Research Centre (DKFZ); 4Medical Department, Division of Hematology, Oncology and Tumor Immunology (CCM), Charité Universitaetsmedizin Berlin

The choice of the right treatment for patients based on their individual genetic profile is of utmost importance in precision medicine. To identify potential signals within the large number of biomarkers, it is mandatory to define criteria for signal detection beforehand and to apply appropriate statistical models in the setting of high-dimensional data.

For the identification of predictive and prognostic genetic variants as well as tumor mutational burden (TMB) in patients with metastatic colorectal cancer, we derived the following pre-defined and hierarchical criteria for signal detection.

a) All biomarkers identified via a multivariate variable selection procedure

b) If a) reveals no signal, all biomarkers with adjusted p-value ≤ 0.157

c) If neither a) nor b) reveals a signal, the top 5 biomarkers according to the sorted adjusted p-values
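The hierarchical rule a)-c) translates directly into code. The function below is a sketch with hypothetical inputs: a list of markers returned by the multivariate selection procedure and a dict of adjusted p-values per marker.

```python
def detect_signals(selected, adj_pvals, p_cut=0.157, top_k=5):
    """Hierarchical signal-detection rule:
    a) markers chosen by the multivariate selection procedure;
    b) otherwise, markers with adjusted p-value <= p_cut;
    c) otherwise, the top_k markers by sorted adjusted p-value."""
    if selected:                                    # criterion a)
        return list(selected)
    below = [m for m, p in adj_pvals.items() if p <= p_cut]
    if below:                                       # criterion b)
        return sorted(below, key=adj_pvals.get)
    ranked = sorted(adj_pvals, key=adj_pvals.get)   # criterion c)
    return ranked[:top_k]
```

Pre-specifying the rule in this form fixes the order of the fallbacks before the data are seen, which is the point of defining the criteria beforehand.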

Regularized regression models were used for variable selection, and the stability of the selection process was quantified and visualized. Selected biomarkers were analyzed in terms of their predictive potential on a continuous scale.

With our analyses we confirmed the predictive potential of several already known biomarkers and identified additional promising candidate variants. Furthermore, we identified TMB as a potential prognostic biomarker with a trend towards prolonged survival for patients with high TMB.

Our analyses were supported by power simulations for the variable selection method, assuming different prevalences of biomarkers, numbers of truly predictive biomarkers and effect sizes.

Tree-based Identification of Predictive Factors in Randomized Trials using Weibull Regression
Wiebke Werft1, Julia Krzykalla2, Dominic Edelmann2, Axel Benner2
1Hochschule Mannheim University of Applied Sciences, Germany; 2German Cancer Research Center (DKFZ), Heidelberg, Germany

Keywords: Predictive biomarkers, Effect modification, Random forest, Time-to-event endpoint, Weibull regression

Novel high-throughput technology provides detailed information on the biomedical characteristics of each patient's disease. These biomarkers may qualify as predictive factors that distinguish patients who benefit from a particular treatment from those who do not. Hence, large numbers of biomarkers need to be tested in order to gain evidence for tailored treatment decisions ("personalized medicine"). Tree-based methods divide patients into subgroups with differential treatment effects in an automated and data-driven way without requiring extensive pre-specification. Most of these methods, however, mainly aim for a precise prediction of the individual treatment effect, thereby ignoring the interpretability of the tree or random forest.

We propose a modification of the model-based recursive partitioning (MOB) approach for subgroup analyses (Seibold, Zeileis et al. 2016), the so-called predMOB, that is able to specifically identify predictive factors (Krzykalla, Benner et al. 2020) from a potentially large number of candidate biomarkers. The original predMOB was developed for normally distributed endpoints only. To widen the field of application, particularly to time-to-event endpoints, we extended predMOB to these situations. More specifically, we use Weibull regression as the base model in the nodes of the tree, since MOB and predMOB require fully parametrized models. However, the Weibull model includes the shape parameter as a nuisance parameter, which has to be fixed in order to focus on predictive biomarkers only.
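Fixing the nuisance shape keeps the node fits simple. For instance, with the shape k held fixed and no covariates, the Weibull scale MLE under right censoring has the closed form lambda_hat = (sum of t_i^k over all subjects / number of events)^(1/k). The sketch below illustrates only this fixed-shape fit, not the predMOB implementation itself.

```python
def weibull_scale_mle(times, events, shape):
    """Closed-form MLE of the Weibull scale parameter when the shape
    (nuisance) parameter is held fixed, valid under right censoring:
    lambda_hat = (sum(t_i^k) / d)^(1/k), with d the number of events.
    `times` are observed times, `events` the 0/1 event indicators."""
    d = sum(events)
    return (sum(t ** shape for t in times) / d) ** (1.0 / shape)
```

For shape = 1 this reduces to the familiar exponential estimator, total time at risk divided by the number of events, which makes the role of the fixed shape easy to check.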

The performance of this extension of predMOB is assessed with respect to the identification of predictive factors as well as the prediction accuracy of the individual treatment effect and the predictive effects. Using simulation studies, we show that predMOB provides a targeted approach to identifying predictive factors by reducing the erroneous selection of biomarkers that are only prognostic.

Furthermore, we apply our method to a data set of primary biliary cirrhosis (PBC) patients treated with D-penicillamine or placebo in order to compare our results to those obtained by Su et al. 2008. The aim is to identify predictive factors with respect to overall survival. On the whole, similar variables are identified, but the ranking differs.


Krzykalla, J., et al. (2020). "Exploratory identification of predictive biomarkers in randomized trials with normal endpoints." Statistics in Medicine 39(7): 923-939.

Seibold, H., et al. (2016). "Model-based recursive partitioning for subgroup analyses." The International Journal of Biostatistics 12(1): 45-63.

Su, X., et al. (2008). "Interaction trees with censored survival data." The International Journal of Biostatistics 4(1).