Location: https://wwu.zoom.us/j/67089064576

Evidence Based Medicine and Meta-Analysis II

Chairs: Guido Knapp and Gerta Rücker

Investigating treatment-effect modification by a continuous covariate in IPD meta-analysis: an approach using fractional polynomials
Willi Sauerbrei1, Patrick Royston2
1Medical Center – University of Freiburg, Germany; 2MRC Clinical Trials Unit at UCL, London, UK

Context: In clinical trials, there is considerable interest in investigating whether a treatment effect is similar in all patients or whether some prognostic variable indicates a differential response to treatment. To examine this, a continuous predictor is usually categorised into groups according to one or more cutpoints. Several weaknesses of categorisation are well known.

Objectives: To avoid the disadvantages of cutpoints and to retain full information, it is preferable to keep continuous variables continuous in the analysis. The aim is to derive a statistical procedure to handle this situation when individual patient data (IPD) are available from several studies.

Methods: For continuous variables, the multivariable fractional polynomial interaction (MFPI) method provides a treatment effect function (TEF), that is, a measure of the treatment effect on the continuous scale of the covariate (Royston and Sauerbrei, Stat Med 2004, 2509–25). MFPI is applicable to most of the popular regression models, including Cox and logistic regression. A meta-analysis approach for averaging functions across several studies has been proposed (Sauerbrei and Royston, Stat Med 2011, 3341–60). A first example combining these two techniques (called metaTEFs) was published (Kasenda et al, BMJ Open 2016; 6:e011148). Another approach, called meta-STEPP, was proposed (Wang et al, Stat Med 2016, 3704–16). Using the data from Wang et al (8 RCTs in patients with breast cancer), we will illustrate various issues of our metaTEFs approach.
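
The idea of a treatment effect function can be sketched as follows: a fractional polynomial transform of the covariate is interacted with the treatment indicator, and the TEF is the estimated treatment contrast as a function of the covariate. This is only an illustrative least-squares sketch on simulated data, not the MFPI algorithm itself (which selects FP powers by a structured testing procedure within models such as Cox or logistic regression):

```python
import numpy as np

# Candidate fractional polynomial (FP) powers; by convention 0 denotes log(x)
FP_POWERS = [-2, -1, -0.5, 0, 0.5, 1, 2, 3]

def fp_term(x, p):
    """One FP transform of a positive covariate."""
    return np.log(x) if p == 0 else x ** p

def fit_tef(x, treat, y, p):
    """Fit y ~ treat + fp(x) + treat:fp(x) by least squares and return
    the estimated treatment effect function TEF(.)"""
    f = fp_term(x, p)
    X = np.column_stack([np.ones_like(x), treat, f, treat * f])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return lambda grid: beta[1] + beta[3] * fp_term(grid, p)

rng = np.random.default_rng(1)
x = rng.uniform(0.5, 5.0, 400)        # continuous biomarker (made up)
treat = rng.integers(0, 2, 400)       # randomised arm indicator
# True treatment effect declines with log(x)
y = 1.0 + 0.5 * np.log(x) + treat * (0.8 - 0.4 * np.log(x)) + rng.normal(0.0, 1.0, 400)

tef = fit_tef(x, treat, y, p=0)       # FP1 with power 0, i.e. log(x)
grid = np.linspace(0.5, 5.0, 5)
print(np.round(tef(grid), 2))         # treatment effect across the biomarker range
```

In a metaTEFs-style analysis, such a function would be estimated per study and then averaged across studies.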

Results and Conclusions: We used metaTEFs to investigate a potential treatment effect modifier in a meta‐analysis of IPD from eight RCTs. In contrast to cutpoint‐based analyses, the approach avoids several critical issues and gives more detailed insight into how the treatment effect is related to a continuous biomarker. MetaTEFs retains the full information when performing IPD meta‐analyses of continuous effect modifiers in randomised trials. Early experience suggests it is a promising approach.

Standardised mean differences from mixed model repeated measures analyses
Lars Beckmann, Ulrich Grouven, Guido Skipka
Institut für Qualität und Wirtschaftlichkeit im Gesundheitswesen (IQWiG), Germany

In clinical trials, data on health-related quality of life and symptoms are frequently collected from patients at successive time points. For the analysis of such longitudinal data, linear mixed models for repeated measures (MMRM) have been proposed in the literature. These endpoints are usually measured on scales without natural units.

For assessing clinical relevance or for conducting meta-analyses, it is natural to use standardised mean differences (SMD), such as Cohen's d or Hedges' g. However, it is unclear how the pooled standard deviation required for the SMD can be calculated from MMRM analyses. In a simulation study, we investigated several methods for estimating an SMD. The methods can be divided into approaches that use the standard errors of the mean difference (MD) estimated in the MMRM and approaches that use the individual patient data (IPD).

We simulated data from a randomised controlled trial. The longitudinal data were generated using a first-order autoregressive (AR) model for the dependencies between assessment time points. Simulation parameters were the SMD, the variance of the change from baseline, the correlation of the AR process, and the sample sizes in the treatment arms. The endpoint considered is the difference between the treatment arms in the mean change from baseline over the entire course of the study. The methods were compared with respect to coverage probability, bias, mean squared error (MSE), power and type I error, as well as the concordance of MD and SMD regarding statistical significance and coverage of the true effect.
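
A minimal sketch of such a data-generating mechanism, assuming change-from-baseline scores drawn from a multivariate normal distribution with an AR(1) correlation structure (all parameter values below are made up for illustration):

```python
import numpy as np

def simulate_changes(n, n_visits, delta, sigma2, rho, rng):
    """Change-from-baseline scores for one arm with AR(1) correlation
    between the visit time points."""
    idx = np.arange(n_visits)
    corr = rho ** np.abs(idx[:, None] - idx[None, :])   # AR(1) correlation matrix
    return rng.multivariate_normal(np.full(n_visits, delta), sigma2 * corr, size=n)

rng = np.random.default_rng(11)
treat = simulate_changes(100, 4, delta=0.5, sigma2=1.0, rho=0.7, rng=rng)
ctrl = simulate_changes(100, 4, delta=0.0, sigma2=1.0, rho=0.7, rng=rng)

# Endpoint: difference in mean change from baseline over the study course
md = treat.mean() - ctrl.mean()
print(round(md, 2))
```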

Methods that calculate the pooled standard deviation from the standard errors of the MMRM show biases that lead to a marked overestimation of the true effect. Methods that estimate the pooled standard deviation from the observed changes from baseline show considerably less bias and a smaller MSE. However, their power is lower than that of the MD.

Estimating an SMD from the standard errors of the MMRM is not appropriate. This should be kept in mind particularly when evaluating large SMDs. An appropriate estimation of an SMD requires methods by which the pooled standard deviation of the change from baseline can be estimated from IPD.
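
For illustration, the IPD-based computation of an SMD from observed change-from-baseline scores can be sketched as follows (Cohen's d with a pooled SD, plus Hedges' small-sample correction; the data are simulated, not from the study):

```python
import numpy as np

def smd_from_ipd(change_t, change_c):
    """Cohen's d and Hedges' g for the mean difference in change from
    baseline, using the pooled SD of the observed changes (IPD)."""
    n1, n2 = len(change_t), len(change_c)
    md = change_t.mean() - change_c.mean()
    sd_pooled = np.sqrt(((n1 - 1) * change_t.var(ddof=1)
                         + (n2 - 1) * change_c.var(ddof=1)) / (n1 + n2 - 2))
    d = md / sd_pooled
    g = d * (1.0 - 3.0 / (4.0 * (n1 + n2) - 9.0))  # small-sample correction
    return d, g

rng = np.random.default_rng(7)
change_t = rng.normal(1.0, 2.0, 200)   # treatment arm, true SMD = 0.5
change_c = rng.normal(0.0, 2.0, 200)   # control arm
d, g = smd_from_ipd(change_t, change_c)
print(round(d, 2), round(g, 2))
```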

Robust Covariance Estimation in Multivariate Meta-Regression
Thilo Welz
TU Dortmund University, Germany

Univariate meta-regression (MR) is an important technique for medical and psychological research and has been studied in depth. Its multivariate counterpart, however, remains less explored. Multivariate MR holds the potential to incorporate the dependency structure of multiple effect measures, as opposed to performing multiple univariate analyses. We explore the possibilities for robust estimation of the covariance of the coefficients in our multivariate MR model. More specifically, we extend heteroscedasticity-consistent (also called sandwich or HC-type) estimators from the univariate to the multivariate context. These, along with the Knapp-Hartung adjustment, proved useful in previous work (see Viechtbauer (2015) for an analysis of Knapp-Hartung and Welz & Pauly (2020) for HC estimators in univariate MR). In our simulations we focus on the bivariate case, which is important for incorporating secondary outcomes as in Copas et al. (2018), but higher dimensions are also possible. The validity of the considered robust estimators is evaluated based on the type I error and power of statistical tests based on these estimators. We compare our robust estimation approach with a classical (non-robust) procedure. Finally, we highlight some of the numerical and statistical issues we encountered and provide pointers for others wishing to employ these methods in their analyses.
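
The flavour of an HC-type estimator can be conveyed with a univariate sketch: the usual inverse-variance "bread" is combined with a residual-based "meat". The HC0-style computation below is only a schematic illustration on invented data, not the multivariate estimators studied in the talk:

```python
import numpy as np

rng = np.random.default_rng(5)
k = 20                                    # number of studies (made up)
x = rng.uniform(0, 1, k)                  # one study-level moderator
X = np.column_stack([np.ones(k), x])      # intercept + moderator
v = rng.uniform(0.05, 0.2, k)             # within-study variances
y = 0.3 + 0.5 * x + rng.normal(0, np.sqrt(v))   # observed effect sizes

# Weighted least squares fit with inverse-variance weights
W = np.diag(1 / v)
bread = np.linalg.inv(X.T @ W @ X)
beta = bread @ X.T @ W @ y
resid = y - X @ beta

# HC0 sandwich: bread @ meat @ bread, meat built from squared residuals
meat = X.T @ W @ np.diag(resid ** 2) @ W @ X
V_hc0 = bread @ meat @ bread              # robust covariance of beta
print(np.round(np.sqrt(np.diag(V_hc0)), 3))   # robust standard errors
```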

A Bayesian approach to combine rater assessments
Lorenz Uhlmann1,2, Christine Fink3, Christian Stock2, Marc Vandemeulebroecke1, Meinhard Kieser2
1Novartis Pharma AG, Basel, Switzerland; 2Institute of Medical Biometry and Informatics, University of Heidelberg, Heidelberg, Germany; 3Department of Dermatology, University Medical Center, Ruprecht-Karls University, Heidelberg, Germany

Background: Ideally, endpoints in clinical studies are objectively measurable and easy to assess. However, sometimes this is infeasible and alternative approaches based on (more subjective) rater assessments need to be considered. A Bayesian approach to combine such rater assessments and to estimate relative treatment effects is proposed.

Methods: We focus on a setting where each subject is observed under the condition of every group and where one or multiple raters assign scores that constitute the endpoints. We further assume that the raters compare the arms in a pairwise way by simply scoring them on an individual subject level. This setting is similar in principle to network meta-analysis, where groups (or treatment arms) are ranked in a probabilistic fashion. Many ideas from this field, such as heterogeneity (within raters) or inconsistency (between raters), can be directly applied. We build on Bayesian methodology used in this field and derive models for normally distributed and ordered categorical scores which take into account an arbitrary number of raters and groups.

Results: A general framework is created which is easy to implement and at the same time allows for a straightforward interpretation of the results. The method is illustrated with a real clinical study example on a computer-aided hair detection and removal algorithm in dermatoscopy. Raters assessed the image quality of pictures generated by the algorithm compared to pictures of unshaved and shaved nevi.

Conclusion: A Bayesian approach to combine rater assessments based on an ordinal or continuous scoring system to compare groups in a pairwise fashion is proposed and illustrated using a real data example. The model allows assessing all pairwise comparisons among multiple groups. Since the approach is based on the well-established network meta-analysis methodology, many characteristics can be inferred from that methodology.

Evidence Based Medicine and Meta-Analysis I

Chairs: Ralf Bender and Guido Schwarzer

Network meta-analysis for components of complex interventions
Nicky Welton
University of Bristol, UK

Meta-analysis is used to combine results from studies identified in a systematic review comparing specific interventions for a given patient population. However, the validity of the pooled estimate from a meta-analysis relies on the study results being similar enough to pool (homogeneity). Heterogeneity in study results can arise for various reasons, including differences in intervention definitions between studies. Network meta-analysis (NMA) is an extension of meta-analysis that can combine results from studies to estimate relative effects between multiple (2 or more) interventions, where each study compares some (2 or more) of the interventions of interest. NMA can reduce heterogeneity by treating each intervention definition as a distinct intervention. However, if there are many distinct interventions then evidence networks may be sparse or disconnected, so that relative effect estimates are imprecise or impossible to estimate at all. Interventions can sometimes be considered to be made up of component parts, such as some complex interventions or combination therapies.

Component network meta-analysis has been proposed for the synthesis of complex interventions that can be considered a sum of component parts. Component NMA is a form of network meta-regression that estimates the effect of the presence of particular components of an intervention. We discuss methods for categorisation of intervention components, before going on to introduce statistical models for the analysis of the relative efficacy of specific components or combinations of components. The methods respect the randomisation in the included trials and allow the analyst to explore whether the component effects are additive, or if there are interactions between them. The full interaction model corresponds to a standard NMA model.
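
The additive component model can be sketched with a toy design matrix: each intervention is a row, each component a column, and relative effects are linear combinations of component effects. The interventions, components, and effect values below are invented purely for illustration:

```python
import numpy as np

# Hypothetical interventions described by their component parts
components = ["CBT", "exercise", "medication"]
interventions = {
    "usual care": set(),
    "CBT": {"CBT"},
    "CBT+exercise": {"CBT", "exercise"},
    "CBT+medication": {"CBT", "medication"},
}

# Additive CNMA: design matrix C with one row per intervention and one
# column per component; relative effects vs reference are d = C @ beta
C = np.array([[int(c in parts) for c in components]
              for parts in interventions.values()])
print(C)

beta = np.array([-0.5, -0.3, -0.2])   # assumed component effects
d = C @ beta                          # effect of each intervention vs usual care
print(dict(zip(interventions, np.round(d, 2))))
```

Interaction models would add extra columns for selected component combinations; the full interaction model (one column per distinct combination) corresponds to a standard NMA.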

We illustrate the methods with a range of examples including CBT for depression, electronic interventions for smoking cessation, school-based interventions for anxiety and depression, and psychological interventions for patients with coronary heart disease. We discuss the benefits of component NMA for increasing precision and connecting networks of evidence, the data requirements to fit the models, and make recommendations for the design and reporting of future randomised controlled trials of complex interventions that comprise component parts.

Model selection for component network meta-analysis in disconnected networks: a case study
Maria Petropoulou, Guido Schwarzer, Gerta Rücker
Institute of Medical Biometry and Statistics, Faculty of Medicine and Medical Center – University of Freiburg, Germany

Standard network meta-analysis (NMA) synthesizes direct and indirect evidence of randomized controlled trials (RCTs), estimating the effects of several competing interventions. Many healthcare interventions are complex, consisting of multiple, possibly interacting, components. In such cases, more general models, the component network meta-analysis (CNMA) models, allow estimating the effects of components of interventions.

Standard network meta-analysis requires a connected network. However, a disconnected network (two or more subnetworks) can sometimes occur when synthesizing evidence from RCTs. Bridging the gap between subnetworks is a challenging issue. CNMA models make it possible to "reconnect" a network with multi-component interventions if there are common components in the subnetworks. Forward model selection for CNMA models, which has recently been developed, starts with a sparse CNMA model and, by adding interaction terms, ends up with a rich CNMA model. Through model selection, the best CNMA model is chosen based on a trade-off between goodness of fit (minimizing Cochran's Q statistic) and connectivity.

Our aim is to check whether CNMA models for disconnected networks can validly re-estimate the results of a standard NMA for a connected network (benchmark). We applied the methods to a case study comparing 27 interventions for any adverse event of postoperative nausea and vomiting. Starting with the connected network, we artificially constructed disconnected networks in a systematic way without dropping interventions, such that the network keeps its size. We ended up with nine disconnected networks differing in network geometry, the number of included studies, and pairwise comparisons. The forward strategy for selecting appropriate CNMA models was implemented and the best CNMA model was identified for each disconnected network.

We compared the results of the best CNMA model for each disconnected network to the corresponding results for the connected network with respect to bias and standard error. We found that the results of the best CNMA models from each disconnected network are comparable with the benchmark. Based on our findings, we conclude that CNMA models, which are entirely based on RCT evidence, are a promising tool to deal with disconnected networks if some treatments have common components in different subnetworks. Additional analyses on simulated data under several scenarios are planned to generalize the results.

Uncertainty in treatment hierarchy in network meta-analysis: making ranking relevant
Theodoros Papakonstantinou1,2, Georgia Salanti1, Dimitris Mavridis3,4, Gerta Rücker2, Guido Schwarzer2, Adriani Nikolakopoulou1,2
1Institute of Social and Preventive Medicine, University of Bern, Switzerland; 2Institute of Medical Biometry and Statistics, University of Freiburg, Germany; 3Department of Primary Education, University of Ioannina, Ioannina, Greece; 4Faculty of Medicine, Paris Descartes University, Paris, France

Network meta-analysis estimates all relative effects between competing treatments and can produce a treatment hierarchy from the least to the most desirable option. While about half of the published network meta-analyses report a ranking metric for the primary outcome, methodologists debate several issues underpinning the derivation of a treatment hierarchy. Criticisms include that ranking metrics are not accompanied by a measure of uncertainty or do not answer a clinically relevant question.

We will present a series of research questions related to network meta-analysis. For each of them, we will derive hierarchies that satisfy the set of constraints that constitute the research question and define the uncertainty of these hierarchies. We have developed an R package to calculate the treatment hierarchies.

Assuming a network of T treatments, we start by deriving the most probable hierarchies along with their probabilities. We derive the probabilities of each possible treatment hierarchy (T! permutations in total) by sampling from a multivariate normal distribution with the relative treatment effects as means and the corresponding variance-covariance matrix. Having obtained the frequency with which each treatment hierarchy arises, we define complex clinical questions: the probability that (1) a specific hierarchy occurs, (2) a given order is retained in the network (e.g. A is better than B and B is better than C), (3) a specific triplet or quadruple of interventions is the most efficacious, (4) a treatment is at a specific hierarchy position, and (5) a treatment is at a specific or higher position in the hierarchy. These criteria can also be combined so that any number of them holds simultaneously, either of them holds, or exactly one of them holds. For each defined question, we derive the hierarchies that satisfy the set criteria along with their probability. The sum of the probabilities of all hierarchies that fulfill the criterion gives the probability of the criterion holding. We extend the procedure to compare relative treatment effects against a clinically important value instead of the null effect.
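
The sampling step can be sketched as follows: draw relative effects from a multivariate normal distribution, convert each draw into a hierarchy, and tabulate frequencies. The means and covariance below are invented for illustration; this is not the talk's R package:

```python
import numpy as np
from collections import Counter

labels = np.array(["A", "B", "C", "D"])
# Hypothetical relative effects vs A (e.g. log odds ratios; smaller = better)
means = np.array([-0.2, -0.4, -0.1])               # d_AB, d_AC, d_AD
cov = 0.02 * (np.eye(3) + 0.5 * np.ones((3, 3)))   # assumed var-covariance

rng = np.random.default_rng(42)
draws = rng.multivariate_normal(means, cov, size=10_000)
# The effect of A vs itself is 0 in every draw
effects = np.hstack([np.zeros((10_000, 1)), draws])

# Each draw yields one hierarchy: ordering from best (smallest) to worst
hierarchies = Counter(tuple(labels[np.argsort(row)]) for row in effects)
best, freq = hierarchies.most_common(1)[0]
print(best, freq / 10_000)                # most probable hierarchy, its probability

# Probability of a clinical constraint, e.g. "C is ranked first"
p_c_first = sum(f for h, f in hierarchies.items() if h[0] == "C") / 10_000
print(round(p_c_first, 2))
```

Constraints such as "A better than B and B better than C" are handled the same way: sum the frequencies of all sampled hierarchies satisfying the constraint.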

We exemplify the method and its implementation using a network of four treatments for chronic obstructive pulmonary disease, where the outcome of interest is mortality, measured as an odds ratio. The most probable hierarchy has a probability of 28%.

The developed methods extend the decision-making arsenal of evidence-based health care with tools that support clinicians, policy makers and patients to make better decisions about the best treatments for a given condition.

Non Clinical Statistics III

Chairs: Michael Brendel and Timur Tug

Discussion of Design and Analysis of Animal Experiments
Edgar Brunner
Universitätsmedizin Göttingen, Germany

In this talk, some aspects of particular topics in the design and analysis of animal experiments are discussed. They are important in applications for funding. The key points are listed below.

• Replication of the Experiment

o Different laboratories

o Separate experiments vs. stratification and pooling

o Impact of the mother in case of experiments involving young animals

• Randomization and Blinding

o Onsite-randomization vs. central randomization

• Sample Size Planning

o Quite rarely used

o Discussion and definition of a ‘relevant effect’

o Effects based on observations in a preliminary study – often only a few observations

o Relation to effects in human trials may be a problem

o Type I error adjustment for multiple / co-primary endpoints

o Power adjustment for multiple / co-primary endpoints

o Switching to rank procedures in case of non-normal data

• Analysis

o Database cleaning

o Principle ‘analyze as randomized’

o Pre-testing assumptions on the same data set is not recommended


Festing, M.F. (2007). The design of animal experiments. Chapter 3 in: Handbook on Care and Management of Laboratory and Pet Animals. Ed. Y. B. Rajeshwari. ISBN 8189422987.

Exner, C., Bode, H.-J., Blumer, K., Giese, C. (2007). Animal Experiments in Research. Deutsche Forschungsgemeinschaft. ISBN 978-3-932306-87-7.

Statistical evaluation of the flow cytometric micronucleus in vitro test – same but different
Lea AI Vaas1, Robert Smith2, Jeffrey Bemis3, Javed Ahmad2, Steven Bryce3, Christine Marchand4, Roland Froetschl5, Azeddine Elhajouji6, Ulrike Hemmann7, Damian McHugh8, Julia Kenny9, Natalia Sumption9, Andreas Zeller4, Andreas Sutter10, Daniel Roberts11
1Research & Pre-Clinical Statistics Group, Bayer AG, Berlin, Germany; 2Covance Laboratories Ltd., Harrogate, North Yorkshire, UK; 3Litron Laboratories, Rochester, NY, USA; 4Pharmaceutical Sciences, pRED Innovation Center Basel, F. Hoffmann-La Roche Ltd, Basel, Switzerland; 5Federal Institute for Drugs and Medical Devices (BfArM), Bonn, Germany; 6Preclinical Safety (PCS), Novartis Institutes for BioMedical Research (NIBR), Basel, Switzerland; 7Sanofi-Aventis Deutschland GmbH, Frankfurt, Germany; 8Philip Morris Products S.A., Neuchatel, Switzerland; 9Genetic Toxicology and Photosafety, GlaxoSmithKline, Ware, Hertfordshire, UK; 10Bayer AG, Pharmaceuticals, Investigational Toxicology, Berlin, Germany; 11Genetic and In Vitro Toxicology, Charles River, Skokie, IL, USA

In vitro genotoxicity testing is part of the safety evaluation required for product registration and the initiation of clinical trials. The OECD Test Guideline 487 gives recommendations for the conduct, analysis and interpretation of the in vitro Mammalian Cell Micronucleus (MN) Test. Historically, in vitro MN data have been generated via microscopic examination of cells after exposure to a chemical, following scientifically valid, internationally accepted study designs; this is labour-intensive and time-consuming. Flow cytometry is an automated technology capable of scoring greater numbers of cells in a relatively short time span and of analysing genotoxic effects of clastogenic and/or aneugenic origin. However, when acquiring data using flow cytometry, neither the number of cells being evaluated nor the built-in relative survival metrics (cytotoxicity) have undergone critical evaluation for standardization. Herein, we address these topics, focusing on the application of the in vitro MN assay scored by flow cytometry (e.g. MicroFlow®) for regulatory purposes. To this end, an international working group comprising genetic toxicologists and statisticians from diverse industry branches, contract research organizations, academia, and regulatory agencies serves as a forum to address the regulatory and technical aspects of submitting GLP-compliant in vitro MN flow cytometry data to support product development and registration.

We will briefly present our motivation and the envisaged initial goals with a focus on the suitability of built-in cytotoxicity metrics for regulatory submissions. Based on a data set collected from multiple cross-industry laboratories the working group additionally evaluates historical control data, recommendations on appropriate study designs, and reviews statistical methods for determining positive micronucleus test results.

Mouse clinical trials of N=1: Do we reduce too much?
Hannes-Friedrich Ulbrich
Bayer AG, Deutschland

In 2015, the IMI2 7th Call for Proposals requested "a comprehensive paediatric preclinical POC platform" for the development of treatments against cancer in children; "mouse N=1 trials" had to be part of it. The project (ITCC-P4) was launched in 2017.

Four years later, the terminology has evolved to "mouse clinical trials" (MCT). These are experiments in which one PDX model (a derivative of a particular patient's tumor) is implanted into a number of mice to grow and to be treated with different substances: one mouse per substance [and occasionally more for the vehicle – ITCC-P4 plans for three]. The number of PDX models of the same human tumor type is supposed to be "large"; the series of randomized per-patient-tumor experiments forms a trial. Compared to more "classical" PDX trials, where replicates of mice (usually 6) per substance were used to explore substance differences for one PDX only, mouse clinical trials focus on the population response for the considered tumor type. This design is still quite new, "becoming widely used in pre-clinical oncology drug development, but a statistical framework is yet to be developed" (Guo et al, 2019). Not much has been published yet on whether the reduction to N=1 is reasonable as compared to an imaginable series of "classical" PDX trials.

Based on data of the already finished OncoTrack IMI project (on colon cancer) we explore the magnitude of differences between the two approaches using resampling techniques.

In this talk we will report the results of this comparison. Statistical models will be described; criteria for comparing these approaches will be discussed.


• IMI2 ITCC-P4 Project Description

• Guo S et al (2019): Mouse clinical trials in oncology drug development. BMC Cancer 19:718, DOI 10.1186/s12885-019-5907-7

• Williams JA (2017) Patient-Derived Xenografts as Cancer Models for Preclinical Drug Screening, DOI 10.1007/978-3-319-55825-7_10

Statistical Review of Animal trials in Ethics Committees – A Guideline
Sophie K. Piper1,2, Dario Zocholl1,2, Robert Röhle1,2, Andrea Stroux1,2, Ulf Tölch2, Frank Konietschke1,2
1Institute of Biometry and Clinical Epidemiology, Charité – Universitätsmedizin Berlin, Charitéplatz 1, D-10117 Berlin, Germany; 2Berlin Institute of Health (BIH), Anna-Louisa-Karsch Str. 2, 10178 Berlin, Germany

Any experiment or trial involving living organisms requires ethical review and agreement. Beyond reviewing the medical need and goals of the trial, statistical planning of the design and sample size computations are key review criteria. Errors made in the statistical planning phase can have severe consequences for both the results and the conclusions drawn from a trial. Moreover, wrong conclusions might proliferate and impact future trials – a rather unethical outcome of any research. Therefore, any trial must be efficient in both a medical and a statistical way in answering the questions of interest in order to be considered "ethically approvable".

For clinical trials, ethical review boards are well established. This is, however, not the case for pre-clinical and especially animal trials. While ethical review boards are established within each local authority of animal welfare, most of them do not have an appointed statistician. Moreover, unified standards or guidelines on statistical planning and reporting thereof are currently missing for pre-clinical trials.

It is the aim of our presentation to introduce and discuss

i) the need for proper statistical reviews of animal trials,

ii) a guideline of mandatory ethical review criteria, involving blinding and randomization, and

iii) the need to distinguish the planning of exploratory studies from confirmatory studies in pre-clinical research.

Our statistical criteria for ethical reviews of animal trials have been implemented in a form sheet that has been used by the Landesamt für Gesundheit und Soziales (local authority for animal welfare) in Berlin since 2019. It is available online at https://www.berlin.de/lageso/gesundheit/veterinaerwesen/tierversuche/.

Non Clinical Statistics II

Chairs: Katja Ickstadt and Bernd-Wolfgang Igl

On the Role of Historical Control Data in Preclinical Development
Helena Geys
Johnson & Johnson, Belgium

Historical control databases are established by many companies in order to be able to contextualize results from single studies against previous studies performed under similar conditions, to properly design studies, and/or to develop quality control instruments.

Typical preclinical experiments involve a study of a control group of untreated animals and groups of animals exposed to increasing doses. The ultimate aim is to test for a dose related trend in the response of interest. Usually one would focus on one particular experiment. However, since such experiments are conducted in genetically homogeneous animal strains, historical control data from previous similar experiments are sometimes used in interpreting results of a current study.

The use of historical control data in supporting inferences varies across different assays. For example, in genetic toxicology and safety pharmacology, a response may be considered positive in a specific experiment if the result is outside the distribution of the historical negative control data (95% control limits). In carcinogenicity studies, by contrast, historical control data are particularly useful for classifying tumors as rare or common and for evaluating disparate findings in dual concurrent controls.

Historical control data are often used to carry out an informal equivalence test, whereby a New Molecular Entity (NME) is considered to be “safe” when the results from the treatment groups fall entirely within the negative control distribution.
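
Such an informal check against historical control limits might look as follows. This is a sketch with simulated historical data; real assays would use validated, assay-specific limits:

```python
import numpy as np

# Hypothetical historical negative-control responses from past assays
rng = np.random.default_rng(3)
historical = rng.normal(10.0, 1.5, 200)

# Informal 95% control limits of the historical distribution
lo, hi = np.quantile(historical, [0.025, 0.975])

def outside_control_limits(values, lo, hi):
    """Flag treatment-group results falling outside the historical range."""
    values = np.asarray(values)
    return (values < lo) | (values > hi)

treated = [9.8, 11.2, 16.0]        # current treatment-group results (made up)
print(outside_control_limits(treated, lo, hi))   # only the last value is flagged
```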

In addition, formal statistical procedures have been proposed that incorporate historical control data and combine them with the current control group in trend tests.

Clearly, historical control data play an important role in preclinical development as quality control and interpretation instruments. Yet the issue of when and how to use historical control data is still not clear and subject to ongoing debate. In this presentation we will highlight the pros and cons and the important role a preclinical statistician can play in this.

A comparison of different statistical strategies for the analysis of data in reproductive toxicology involving historical negative controls
Bernd-Wolfgang Igl, Monika Brüning, Bernd Baier
Boehringer Ingelheim Pharma GmbH & Co. KG, Germany

A fundamental requirement of regulatory bodies for the development of new pharmaceuticals is to perform nonclinical developmental and reproductive toxicology (DART) studies to reveal any possible effect of the test item on mammalian reproduction. Usually, DART studies are performed in rats and a further (non-rodent) species and aim to support human clinical trials and market access. General recommendations are given in ICH Guideline S5, allowing various phase-dependent designs for a large number of parameters. The statistical evaluation of DART data is quite multifaceted due to more or less complex correlation structures between mother and offspring, e.g. maternal weight development, fetus weight, ossification status and number of littermates, all in dependence on different test item doses.

Initially, we will sketch a Scrum-inspired project that was set up as a cooperation between Boehringer Ingelheim's Reproductive Toxicology and Non-Clinical Statistics groups. Then, we will describe the particular role and relevance of historical control data in reproductive toxicology. This will be followed by a presentation of common statistical models and some related open problems. Finally, we will give some simulation-based results on statistical power and sample size for the detection of certain events in DART studies.

A Nonparametric Bayesian Model for Historical Control Data in Reproductive Toxicology
Ludger Sandig1, Bernd Baier2, Bernd-Wolfgang Igl3, Katja Ickstadt4
1Fakultät Statistik, Technische Universität Dortmund; 2Reproductive Toxicology, Nonclinical Drug Safety, Boehringer Ingelheim Pharma GmbH & Co. KG; 3Non-Clinical Statistics, Biostatistics and Data Sciences Europe, Boehringer Ingelheim Pharma GmbH & Co. KG; 4Lehrstuhl für mathematische Statistik und biometrische Anwendungen, Fakultät Statistik, Technische Universität Dortmund

Historical control data are of fundamental importance for the interpretation of developmental and reproductive toxicology studies. Modeling such data presents two challenges: Outcomes are observed on different measurement scales (continuous, counts, categorical) and on multiple nested levels (fetuses within a litter, litters within a group, groups within a set of experiments). We propose a nonparametric Bayesian approach to tackle both of them. By using a hierarchical Dirichlet process mixture model we can capture the dependence structure of observables both within and between litters. Additionally we can accommodate an arbitrary number of variables on arbitrary measurement scales at the fetus level, e.g. fetus weight (continuous) and malformation status (categorical). In a second step we extend the model to incorporate observables at higher levels in the hierarchy, e.g. litter size or maternal weight. Inference in these models is possible using Markov Chain Monte Carlo (MCMC) techniques which we implemented in R. We illustrate our approach on several real-world datasets.

Weightloss as Safety Indicator in Rodents
Tina Lang, Issam Ben Khedhiri
Bayer AG, Germany

In preclinical research, the assessment of animal well-being is crucial to ensure ethical standards and compliance with guidelines. It is a difficult task to define rules within which well-being is deemed acceptable, and to decide when the suffering of the animal exceeds a tolerable burden so that the animal needs to be sacrificed. Indicators are, e.g., food refusal, listlessness and, most prominently, body weight.

For rodents, a popular rule states that an animal that experiences > 20% body weight loss exceeds the limits of tolerable suffering and has to be taken out of the experiment. However, research experiments are highly diverse in nature (Talbot et al., 2020). An absolute rule for all of them can lead to unnecessary deaths of lab animals that are still within reasonable limits of well-being but, for various reasons, fall below the body weight limit.

An additional challenge is posed by studies on juvenile rodents, which are still in their growth phase. Here, a weight loss might not be observable, but a reduced weight gain could indicate complications. As a solution, their weight gain is routinely compared to the mean weight gain of a control group of animals. If the weight gain differs by a certain percentage, the animals are excluded from the experiment. In the case of frequent weighing and small weight gains in the control group, this leads to the mathematically driven exclusion of animals which are fit and healthy.

We propose a different approach to safety monitoring which firstly unifies the assessment for juvenile and adult animals and secondly compensates for the different conditions of different experiments.

If a reasonable control group can be kept within the study design, the body weight within the control group is assumed to be lognormally distributed. About 99.73% of all control animals are expected to fall within the interval of mean log body weight plus/minus three standard deviations. We conclude that this interval contains acceptable body weights. As the theoretical mean and standard deviation of log body weight are unknown, we checked how their empirical counterparts perform.
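
Assuming lognormality as described, the acceptance interval can be computed directly from the control group. A minimal sketch (the control weights below are illustrative, not study data):

```python
import math

def weight_limits(control_weights):
    """Acceptance interval for body weight based on control animals:
    mean of log weights +/- 3 empirical standard deviations, transformed
    back to the original scale. Covers ~99.73% under lognormality."""
    logs = [math.log(w) for w in control_weights]
    n = len(logs)
    m = sum(logs) / n
    sd = math.sqrt(sum((x - m) ** 2 for x in logs) / (n - 1))
    return math.exp(m - 3 * sd), math.exp(m + 3 * sd)

# Illustrative control-group body weights in grams
controls = [24.1, 25.3, 23.8, 26.0, 24.7, 25.5, 23.9, 24.8]
lo, hi = weight_limits(controls)
print(f"acceptable body weight range: {lo:.1f}-{hi:.1f} g")
```

An animal would only be flagged when its weight leaves this interval, rather than by a fixed 20% rule.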

We investigated whether the rule leaves all healthy animals in the study and only excludes suffering animals. Our data show that it outperforms the traditional rules by far. Many animals that would have been excluded by the traditional rules can now stay in the study. Thus, the new rule supports animal welfare and also increases the power of the experiment.

Non Clinical Statistics I

Chairs: Hannes-Friedrich Ulbrich and Frank Konietschke

Can statistics save preclinical research?
Ulrich Dirnagl
Charité / Berlin Institute of Health, Deutschland


Meggie Danziger1,2, Ulrich Dirnagl1,2, Ulf Toelch2
1Charité – Universitätsmedizin Berlin, Germany; 2BIH QUEST Center for Transforming Biomedical Research

Low statistical power in preclinical experiments has been repeatedly pointed out as a roadblock to successful replication and translation. To increase reproducibility of preclinical experiments under ethical and budget constraints, it is necessary to devise strategies that improve the efficiency of confirmatory studies.

To this end, we simulate two preclinical research trajectories from the exploratory stage to the results of a within-lab replication study based on empirical pre-study odds. In a first step, a decision is made based on exploratory data whether to continue to a replication. One trajectory (T1) employs the conventional significance threshold for this decision. The second trajectory (T2) uses a more lenient threshold based on an a priori determined smallest effect size of interest (SESOI). The sample size of a potential replication study is calculated via a standard power analysis using the initial exploratory effect size (T1) or using a SESOI (T2). The two trajectories are compared regarding the number of experiments proceeding to replication, number of animals tested, and positive predictive value (PPV).
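
The replication sample sizes in the two trajectories follow from a standard two-sample power analysis, differing only in the effect size plugged in: the observed exploratory estimate for T1 versus a SESOI for T2. A sketch using a normal approximation, with illustrative effect sizes rather than the values underlying the reported sample sizes:

```python
from statistics import NormalDist
from math import ceil

def n_per_group(effect_size, alpha=0.05, power=0.8):
    """Approximate per-group sample size for a two-sample comparison of a
    standardized effect size d (normal approximation to the t-test):
    n = 2 * ((z_{1-alpha/2} + z_{power}) / d)^2."""
    z = NormalDist().inv_cdf
    return ceil(2 * ((z(1 - alpha / 2) + z(power)) / effect_size) ** 2)

# T1: replication powered on the observed exploratory effect size
# T2: replication powered on the smallest effect size of interest (SESOI)
observed_d, sesoi = 1.2, 0.7  # illustrative values
print("T1 n/group:", n_per_group(observed_d))
print("T2 n/group:", n_per_group(sesoi))
```

Because exploratory effect sizes tend to be inflated, T1 yields smaller replication samples than T2, mirroring the trade-off between animal numbers and positive predictive value described above.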

Our simulations show that under the conventional significance threshold, only 32 percent of the initial exploratory experiments progress to the replication stage. Using the decision criterion based on a SESOI, 65 percent of initial studies proceed to replication. T1 results in the lowest number of animals needed for replication (n = 7 per group) but yields a PPV that is below pre-study odds. T2 increases PPV above pre-study odds while keeping sample size at a reasonably low number (n = 23 per group).

Our results reveal that current practice, represented by T1, impedes efforts to replicate preclinical experiments. Optimizing decision criteria and experimental design by employing easily applicable variations as shown in T2 keeps tested animal numbers low while generating more robust preclinical evidence that may ultimately benefit translation.

Information sharing across genes for improved parameter estimation in concentration-response curves
Franziska Kappenberg, Jörg Rahnenführer
TU Dortmund University, Germany

Technologies for measuring high-dimensional gene expression values for tens of thousands of genes simultaneously are well established. In toxicology, such data can be used for estimating concentration-response curves and thereby understanding the biological processes initiated at different concentrations. Increasing the number of concentrations or the number of replicates per concentration can improve the accuracy of the fit, but incurs substantial additional costs. A statistical approach to obtain higher-quality fits is to exploit similarities between high-dimensional concentration-gene expression data. This idea can also be called information sharing across genes. Parameters of the concentration-response curves can be linked, according to a priori assumptions or estimates of the distributions of the parameters, in a Bayesian framework.

Here, we consider the special case of the sigmoidal 4pLL model for estimating the curves associated with single genes, and we are interested in the EC50 value of the curve, i.e. the concentration at which the half-maximal effect is reached. This value is a parameter of the 4pLL model and can be considered a reasonable indicator for a relevant expression effect of the corresponding gene. We introduce an empirical Bayes method for information sharing across genes in this situation, by modelling the distribution of the EC50 values across all genes. Based on this distribution, for each gene a weighted mean of the individually estimated parameter and the overall mean of the estimated parameters of all genes is calculated. In other words, parameters are shrunk towards an overall mean. We evaluate our approach using several simulation studies that differ with respect to their degree of assumptions made for the distribution of the EC50 values. Finally, the method is also applied to a real gene expression dataset to demonstrate the influence of the analysis strategy on the results.
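
The shrinkage step can be sketched as a weighted mean of each gene's estimate and the grand mean. The James-Stein-type weight used below is one plausible choice; the exact weighting in the authors' empirical Bayes method may differ:

```python
from statistics import mean, variance

def shrink(estimates, se2):
    """Shrink per-gene parameter estimates towards the grand mean.
    se2 is an assumed common sampling variance of the individual estimates;
    the weight on the individual estimate is tau2 / (tau2 + se2), where
    tau2 is the estimated between-gene variance (illustrative choice)."""
    grand = mean(estimates)
    tau2 = max(variance(estimates) - se2, 0.0)  # between-gene variance
    w = tau2 / (tau2 + se2) if tau2 + se2 > 0 else 0.0
    return [w * x + (1 - w) * grand for x in estimates]

# Illustrative log10-EC50 estimates for five genes (made-up numbers)
ec50 = [-6.1, -5.4, -5.9, -4.8, -5.6]
sh = shrink(ec50, se2=0.2)
print(sh)
```

The shrunken estimates keep the same overall mean but show less spread than the raw per-gene fits, which is exactly the stabilizing effect sought when replicates are scarce.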

An intuitive time-dose-response model for cytotoxicity data with varying exposure times
Julia Christin Duda, Jörg Rahnenführer
TU Dortmund University, Germany

Modeling approaches for dose-response or concentration-response analyses are slowly becoming more popular in toxicological applications. For cytotoxicity assays, not only the concentration but also the exposure or incubation time of the compound administered to cells can be varied and might influence the response. A popular concentration-response model is the four-parameter log-logistic (4pLL) model or, tailored more specifically to cytotoxicity data, the two-parameter log-logistic (2pLL) model. Both models, however, describe the response as a function of the concentration only.

We propose a two-step procedure and a new time-concentration-response model for cytotoxicity data in which both concentration and exposure time are varied. The parameter of interest for the estimation is the EC50 value, i.e. the concentration at which half of the maximal effect is reached. The procedure consists of a testing step and a modeling step. In the testing step, a nested ANOVA test is performed to decide if the exposure time has an effect on the shape of the concentration-response curve. If no effect is identified then a classical 2pLL model is fitted. Otherwise, a new time-concentration-response model called td2pLL is fitted. In this model, we incorporate exposure time information into the 2pLL model by making the EC50 parameter dependent on the exposure time.

In simulation studies inspired by and based on a real data set, we compare the proposed procedure against various alternatives with respect to the precision of the estimation of the EC50 value. In all simulations, the new procedure provides estimates with higher or comparable precision, which demonstrates its universal applicability in corresponding toxicological experiments. In addition, we show that the use of optimal designs for cytotoxicity experiments further improves the EC50 estimates throughout all considered scenarios while reducing numerical problems. In order to facilitate the application in toxicological practice, the developed methods will be made available to practitioners via the R package td2pLL and a corresponding vignette that demonstrates the application on an example dataset.
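
The core idea can be sketched as a 2pLL viability curve whose EC50 parameter is a function of exposure time. The power-law decay of EC50(t) and all parameter values below are illustrative assumptions, not necessarily the parameterisation implemented in the td2pLL package:

```python
def viability_2pll(conc, ec50, h):
    """Two-parameter log-logistic curve for cytotoxicity: viability in %,
    with fixed asymptotes at 100 (no effect) and 0 (full effect)."""
    return 100.0 / (1.0 + (conc / ec50) ** h)

def viability_td2pll(conc, time, delta, gamma, c0, h):
    """Time-dose sketch: the EC50 of the 2pLL curve decays with exposure
    time, here as EC50(t) = c0 + delta * t**(-gamma). This functional
    form is an assumption made for illustration."""
    ec50_t = c0 + delta * time ** (-gamma)
    return viability_2pll(conc, ec50_t, h)

# Longer exposure -> smaller EC50 -> stronger effect at the same concentration
for t in (6, 24, 48):  # exposure times in hours, illustrative
    print(t, round(viability_td2pll(conc=1.0, time=t,
                                    delta=5.0, gamma=0.8, c0=0.3, h=2.0), 1))
```

When the nested ANOVA test finds no time effect, the time-dependent EC50 collapses to a constant and the fit reduces to the classical 2pLL model.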

Nonparametric Statistics and Multivariate Analysis

Chairs: Frank Konietschke and Markus Pauly

Marginalized Frailty-Based Illness-Death Model: Application to the UK-Biobank Survival Data
Malka Gorfine
Tel Aviv University, Israel

The UK Biobank is a large-scale health resource comprising genetic, environmental, and medical information on approximately 500,000 volunteer participants in the United Kingdom, recruited at ages 40–69 during the years 2006–2010. The project monitors the health and well-being of its participants. This work demonstrates how these data can be used to yield the building blocks for an interpretable risk-prediction model, in a semiparametric fashion, based on known genetic and environmental risk factors of various chronic diseases, such as colorectal cancer. An illness-death model is adopted, which inherently is a semi-competing risks model, since death can censor the disease, but not vice versa. Using a shared-frailty approach to account for the dependence between time to disease diagnosis and time to death, we provide a new illness-death model that assumes Cox models for the marginal hazard functions. The recruitment procedure used in this study introduces delayed entry to the data. An additional challenge arising from the recruitment procedure is that information coming from both prevalent and incident cases must be aggregated. Lastly, we do not observe any deaths prior to the minimal recruitment age, 40. In this work, we provide an estimation procedure for our new illness-death model that overcomes all the above challenges.

Distribution-free estimation of the partial AUC in diagnostic studies
Maximilian Wechsung
Charité – Universitätsmedizin Berlin, Germany

The problem of partial area under the curve (pAUC) estimation arises in diagnostic studies in which not the whole receiver operating characteristic (ROC) curve of a diagnostic test with continuous outcome can be evaluated. Typically, the investigator is bound by economical as well as ethical considerations to analyze only that part of the ROC curve which includes true positive rates and false positive rates above and below certain thresholds, respectively. The pAUC is the area under this partial ROC curve. It can be used to evaluate the performance of a diagnostic test with continuous outcome. In our talk, we consider a distribution-free estimator of the pAUC and establish its asymptotic distribution. The results can be used to construct statistical tests to compare the performance of different diagnostic tests.
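
A simple distribution-free estimator integrates the empirical ROC curve up to the false-positive-rate threshold. The sketch below illustrates this idea; the estimator studied in the talk and its asymptotic theory may differ in detail:

```python
def partial_auc(healthy, diseased, fpr_max):
    """Nonparametric pAUC: area under the empirical ROC curve restricted to
    false positive rates in [0, fpr_max], linearly interpolating the segment
    that crosses the cut-off. Higher scores indicate disease."""
    thresholds = sorted(set(healthy) | set(diseased), reverse=True)
    m, n = len(healthy), len(diseased)
    pts = [(0.0, 0.0)]
    for c in thresholds:                       # sweep the decision threshold
        fpr = sum(x >= c for x in healthy) / m
        tpr = sum(y >= c for y in diseased) / n
        pts.append((fpr, tpr))
    pts.append((1.0, 1.0))
    area = 0.0
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        if x0 >= fpr_max:
            break
        if x1 > fpr_max:                       # clip the crossing segment
            y1 = y0 + (y1 - y0) * (fpr_max - x0) / (x1 - x0)
            x1 = fpr_max
        area += (x1 - x0) * (y0 + y1) / 2.0    # trapezoidal rule
    return area

# Illustrative biomarker values for healthy and diseased subjects
healthy = [0.1, 0.2, 0.3, 0.4, 0.5]
diseased = [0.35, 0.45, 0.55, 0.65, 0.7]
print(partial_auc(healthy, diseased, fpr_max=0.2))
```

With `fpr_max=1.0` the estimator reduces to the usual Mann-Whitney estimate of the full AUC, which provides a convenient consistency check.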

Ranking Procedures for the Factorial Repeated Measures Design with Missing Data – Estimation, Testing and Asymptotic Theory
Kerstin Rubarth, Frank Konietschke
Charité Berlin, Germany

A commonly used design in health, medical and biomedical research is the repeated measures design. Often, a parametric model is used for the analysis of such data. However, if the sample size is rather small, or if the data are skewed or on an ordinal scale, a nonparametric approach fits the data better than a classic parametric approach such as linear mixed models. Another issue that naturally arises when dealing with clinical or pre-clinical data is the occurrence of missing data. Most methods can only use a complete data set if no imputation technique is applied. The newly developed ranking procedure is a flexible method for general non-normal, ordinal, ordered categorical and even binary data and, in the case of missing data, uses all available information instead of only the information obtained from complete cases. The hypotheses are defined in terms of the nonparametric relative effect and can be tested using quadratic test procedures as well as the multiple contrast test procedure. Additionally, the framework allows for the incorporation of clustered data within the repeated measurements. An example of clustered data are animal studies, where several animals share the same cage and are therefore clustered within a cage. Simulation studies indicate a good performance in terms of the type-I error rate and the power under different alternatives, with missing rates of up to 30%, also for non-normal data. A real data example illustrates the application of the proposed methodology.
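
The nonparametric relative effect underlying the hypotheses can be estimated by pairwise comparisons. A minimal sketch for two independent samples follows; the repeated measures and missing-data versions in the talk are more involved:

```python
def relative_effect(x, y):
    """Nonparametric relative effect p = P(X < Y) + 0.5 * P(X = Y),
    estimated by all pairwise comparisons (equivalent to the mid-rank
    estimator). p > 0.5 means Y tends towards larger values than X."""
    wins = sum((xi < yi) + 0.5 * (xi == yi) for xi in x for yi in y)
    return wins / (len(x) * len(y))

# Illustrative ordinal-scale measurements for two groups
x = [1.2, 2.0, 2.0, 3.5]
y = [1.8, 2.0, 4.1, 5.0]
print(relative_effect(x, y))
```

Because the estimator only uses the ordering of observations, it applies unchanged to skewed, ordinal and binary data.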

A cautionary tale on using imputation methods for inference in a matched pairs design.
Burim Ramosaj, Lubna Amro, Markus Pauly
TU Dortmund University, Germany

Imputation procedures have become standard statistical practice in biomedical fields, since subsequent analyses can be conducted as if no values had been missing. In particular, nonparametric imputation schemes like the random forest, or its combination with stochastic gradient boosting, have shown favorable imputation performance compared to the more traditionally used MICE procedure. However, their effect on valid statistical inference has not been analyzed so far. This gap is closed by investigating their validity for inferring mean differences in incompletely observed pairs, while opposing them to a recent approach that only works with the given observations at hand. Our findings indicate that machine learning schemes for (multiply) imputing missing values heavily inflate the type-I error in small to moderate matched pairs, even after modifying the test statistics using Rubin's multiple imputation rule. In addition to an extensive simulation study, an illustrative data example from a breast cancer gene study is considered.
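
Rubin's combining rule referenced above pools the per-imputation estimates and inflates the within-imputation variance by the between-imputation component. A sketch with illustrative numbers:

```python
from statistics import mean, variance

def rubin_pool(estimates, variances):
    """Rubin's rules for combining M multiply imputed analyses:
    pooled estimate = mean of the per-imputation estimates; total variance =
    within-imputation variance + (1 + 1/M) * between-imputation variance."""
    M = len(estimates)
    qbar = mean(estimates)
    ubar = mean(variances)   # within-imputation variance
    b = variance(estimates)  # between-imputation variance
    return qbar, ubar + (1 + 1 / M) * b

# Mean differences and their variances from M = 5 imputed data sets (illustrative)
est = [0.42, 0.50, 0.47, 0.39, 0.45]
u = [0.020, 0.022, 0.019, 0.021, 0.020]
qbar, tvar = rubin_pool(est, u)
print(qbar, tvar)
```

The study's point is that even with this variance inflation, test statistics built on machine-learning imputations can remain anti-conservative in small matched-pairs settings.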


Chairs: Michael Altenbuchinger and Klaus Jung

AI Models for Multi-Modal Data Integration
Holger Fröhlich
Fraunhofer Gesellschaft e.V.

Precision medicine aims for the delivery of the right treatment for the right patients. One important goal is therefore to identify molecular sub-types of diseases, which opens the opportunity for a better targeted therapy of disease in the future. In that context high throughput omics data has been extensively used. However, analysis of one type of omics data alone provides only a very limited view on a complex and systemic disease such as cancer. Correspondingly, parallel analysis of multiple omics data types is needed and employed more and more routinely. However, leveraging the full potential of multi-omics data requires statistical data fusion, which comes along with a number of unique challenges, including differences in data types (e.g. numerical vs discrete), scale, data quality and dimension (e.g. hundreds of thousands of SNPs vs few hundred miRNAs).

In the first part of my talk I will focus on Pathway based Multi-modal AutoEncoders (PathME) as one possible approach for multi-omics data integration. PathME relies on a multi-modal sparse denoising autoencoder architecture to embed multiple omics types that can be mapped to the same biological pathway. We show that sparse non-negative matrix factorization applied to such embeddings result into well discriminated disease subtypes in several cancer types, which show clinically distinct features. Moreover, each of these subtypes can be associated to subtype-specific pathways, and for each of these pathways it is possible to disentangle the influence of individual omics features, hence providing a rich interpretation.

Going one step further in the second part of my talk I will focus on Variational Autoencoder Modular Bayesian Networks (VAMBN) as merger of Bayesian Networks and Variational Autoencoders to model multiple data modalities (including clinical assessment scores), also in a longitudinal manner. I will specifically demonstrate the application of VAMBN for modeling entire clinical studies in Parkinson’s Disease (PD). Since VAMBN is generative the model can be used to simulate synthetic patients, also under counterfactual scenarios (e.g. age shift by 20 years, modification of disease severity at baseline), which could facilitate the design of clinical studies, sharing of data under probabilistic privacy guarantees and eventually allowing for finding “patients-like-me” within a broader, virtually merged meta-cohort.

NetCoMi: Network Construction and Comparison for Microbiome Data in R
Stefanie Peschel1, Christian L. Müller2,3,4, Erika von Mutius1,5,6, Anne-Laure Boulesteix7, Martin Depner1
1Institute of Asthma and Allergy Prevention, Helmholtz Zentrum München, German Research Center for Environmental Health, Neuherberg, Germany; 2Department of Statistics, Ludwig-Maximilians-Universität München, Munich, Germany; 3Institute of Computational Biology, Helmholtz Zentrum München, German Research Center for Environmental Health, Neuherberg, Germany; 4Center for Computational Mathematics, Flatiron Institute, New York, USA; 5Dr von Hauner Children’s Hospital, Ludwig-Maximilians-Universität München, Munich, Germany; 6Comprehensive Pneumology Center Munich (CPC-M), Member of the German Center for Lung Research, Munich, Germany; 7Institute for Medical Information Processing, Biometry and Epidemiology, Ludwig-Maximilians-Universität München, Munich, Germany


Network analysis methods are suitable for investigating the microbial interplay within a habitat. Since microbial associations may change between conditions, e.g. between health and disease state, comparing microbial association networks between groups might be useful. For this purpose, the two networks are constructed separately, and either the resulting associations themselves or the network’s properties are compared between the two groups.

Estimating associations for sequencing data is challenging due to their special characteristics – that is, sparsity with a high number of zeros, high dimensionality, and compositionality. Several association measures taking these features into account have been published during the last decade. Furthermore, several network analysis tools, methods for comparing network properties among two or more groups as well as approaches for constructing differential networks are available in the literature. However, no unifying tool for the whole process of constructing, analyzing and comparing microbial association networks between groups is available so far.


We provide the R package "NetCoMi" implementing this whole workflow, starting from a read count matrix originating from a sequencing process, through network construction, up to a statement of whether single associations, local network characteristics, the determined clusters, or even the overall network structure differ between the groups. For each of the aforementioned steps, a selection of existing methods suitable for application to microbiome data is included. In particular, the function for network construction offers many different approaches, including methods for treating zeros in the data, normalization, computing microbial associations, and sparsifying the resulting association matrix. NetCoMi can either be used for constructing, analyzing and visualizing a single network, or for comparing two networks in a graphical as well as a quantitative manner, including statistical tests.


We illustrate the application of our package using a real data set from the GABRIELA study [1] to compare microbial associations in settled dust from children's rooms between samples from two study centers. The examples demonstrate how our proposed graphical methods uncover genera with different characteristics (e.g. a different centrality) between the groups, similarities and differences between the clusterings, as well as differences among the associations themselves. These descriptive findings are confirmed by a quantitative output including a statement of whether the results are statistically significant.

[1] Jon Genuneit, Gisela Büchele, Marco Waser, Katalin Kovacs, Anna Debinska, Andrzej Boznanski, Christine Strunz-Lehner, Elisabeth Horak, Paul Cullinan, Dick Heederik, et al. The GABRIEL advanced surveys: study design, participation and evaluation of bias. Paediatric and Perinatal Epidemiology, 25(5):436–447, 2011.

Evaluation of augmentation techniques for high-dimensional gene expression data for the purpose of fitting artificial neural networks
Magdalena Kircher, Jessica Krepel, Babak Saremi, Klaus Jung
University of Veterinary Medicine Hannover, Foundation, Germany


High-throughput transcriptome expression data from DNA microarrays or RNA-seq are regularly checked for their ability to classify samples. However, with further densification of transcriptomic data and a low number of observations – due to a lack of available biosamples, prohibitive costs and ethical reasons – the ratio between the number of variables and the number of available observations is usually very large. As a consequence, classifier performance estimated from training data often tends to be overestimated and lacks robustness. It has been demonstrated in many applications that data augmentation can improve the robustness of artificial neural networks. Data augmentation for high-dimensional gene expression data has, however, received little attention so far.


We investigate the applicability and capacity of two data augmentation approaches including generative adversarial networks (GAN), which have been widely used for augmenting image datasets. Comparison of augmentation methods is carried out in public example data sets from infection research. Besides neural networks, we evaluate the augmentation techniques on the performance of linear discriminant analysis and support vector machines.

Results and Outlook:

First results of a 10-fold cross-validation show increased accuracy, sensitivity, specificity and predictive values when using augmented data sets compared to classifier models based on the original data only. A simple augmentation approach based on mixing observations shows a performance similar to the computationally more expensive approach with GANs. Further evaluations are currently running to better understand the detailed performance of the augmentation techniques.
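
The idea of augmenting by mixed observations can be sketched in the spirit of mixup: synthetic expression profiles as convex combinations of two training samples, with labels mixed by the same coefficient. Details of the approach evaluated in the study may differ:

```python
import random

def mixup(x1, x2, y1, y2, alpha=0.3):
    """Create a synthetic observation as a convex combination of two
    training observations (mixup-style augmentation). The mixing
    coefficient lam is drawn from a Beta(alpha, alpha) distribution."""
    lam = random.betavariate(alpha, alpha)
    x_new = [lam * a + (1 - lam) * b for a, b in zip(x1, x2)]
    y_new = lam * y1 + (1 - lam) * y2
    return x_new, y_new

random.seed(0)
# Two tiny illustrative expression profiles with binary class labels
profile_a, profile_b = [2.1, 0.3, 5.6], [1.8, 0.9, 4.2]
x_aug, y_aug = mixup(profile_a, profile_b, y1=0.0, y2=1.0)
print(x_aug, y_aug)
```

Repeating this for many random pairs yields an arbitrarily large augmented training set without generating values outside the range spanned by the original observations.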

Comparison of merging strategies for building machine learning models on multiple independent gene expression data sets
Jessica Krepel, Magdalena Kircher, Moritz Kohls, Klaus Jung
University of Veterinary Medicine Hannover, Foundation, Germany


Microarray experiments and RNA-seq experiments allow the simultaneous collection of gene expression data from several thousand genes which can be used in a wide range of biological questions. Nowadays, there are gene expression data available in public databases for many biological and medical research questions. Oftentimes, several independent studies are performed on the same or similar research question. There are several benefits of combining these studies compared to individual analyses. Several approaches for combining independent data sets of gene expression data have been proposed already in the context of differential gene expression analysis and gene set enrichment analysis. Here, we want to compare different strategies for combining independent data sets for the purpose of classification analysis.


We only considered the two-group design, e.g. with class labels diseased and healthy. The information of the individual studies can be aggregated at different stages of the analysis. We examined three different merging pipelines with regard to the stage of the analysis at which merging is conducted, namely the direct merging of the data sets (strategy A), the merging of the trained classification models (strategy B), and the merging of the classification results (strategy C). We combined the merging pipelines with different methods for classification: linear discriminant analysis (LDA), support vector machines (SVM), and artificial neural networks (ANN). Within each merging strategy, we performed a differential gene expression analysis for dimension reduction to select a set of genes that we then used as the feature subset in the classification. We trained and evaluated the classification models on several data subsets in the form of a 10-fold cross-validation. We first performed a simulation study with purely artificial data, and secondly a study based on a real-world data set from the public data repository ArrayExpress that we artificially split into two studies.
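
Strategy C, the merging of classification results, can be sketched as a majority vote across the per-study classifiers; averaging predicted class probabilities would be an alternative design choice:

```python
def merge_predictions(per_study_preds):
    """Strategy C sketch: each study's classifier predicts the test samples
    independently; the merged class label is the majority vote across
    studies (class labels coded 0/1)."""
    merged = []
    for votes in zip(*per_study_preds):
        merged.append(1 if sum(votes) > len(votes) / 2 else 0)
    return merged

# Illustrative predictions for six test samples from three study-specific classifiers
study1 = [1, 0, 1, 1, 0, 0]
study2 = [1, 1, 1, 0, 0, 0]
study3 = [0, 0, 1, 1, 0, 1]
print(merge_predictions([study1, study2, study3]))
```

Strategies A and B intervene earlier: A pools the raw expression matrices before training, while B combines the fitted models themselves.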


With respect to classification accuracy, we found that the strategy of data merging outperformed the strategy of results merging in most of our simulation scenarios with artificial data. Especially when the number of studies is high and the separability of the groups is low, strategy A appears to be the best performing one. Strategy A outperformed the other two merging approaches particularly when four independent studies were aggregated, compared to scenarios with only two independent studies.

Evaluating the quality of synthetic SNP data from deep generative models under sample size constraints
Jens Nußberger, Frederic Boesel, Stefan Lenz, Harald Binder, Moritz Hess
Universitätsklinikum Freiburg, Germany

Synthetic data generated by deep generative models are increasingly considered for exchanging biomedical data, such as single nucleotide polymorphism (SNP) data, under privacy constraints. This requires that the employed model has learned the joint distribution of the data sufficiently well. A major limiting factor here is the number of empirical observations available for training. Until now, there is little evidence on how well the predominant generative approaches, namely variational autoencoders (VAEs), deep Boltzmann machines (DBMs) and generative adversarial networks (GANs), learn the joint distribution of the target data under sample size constraints. Using simulated SNP data and data from the 1000 Genomes Project, we here provide results from an in-depth evaluation of VAEs, DBMs and GANs. Specifically, we investigate how well pairwise co-occurrences of variables in the investigated SNP data, quantified as odds ratios (ORs), are recovered in the synthetic data generated by the approaches. For simulated as well as the 1000 Genomes SNP data, we observe that DBMs can generally recover structure for up to 300 SNPs. However, we also observe a tendency to over-estimate ORs when the DBMs are not carefully tuned. VAEs generally get the direction and relative strength of pairwise ORs right but under-estimate their magnitude. GANs perform well only when larger sample sizes are employed and when there are strong pairwise associations in the data. In conclusion, DBMs are well suited for generating synthetic observations for binary omics data, such as SNP data, under sample size constraints. VAEs are superior at smaller sample sizes but are limited with respect to learning the absolute magnitude of pairwise associations between variables. GANs require large amounts of training data and likely a careful selection of hyperparameters.
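
The evaluation criterion, recovery of pairwise ORs, can be sketched as follows: compute the 2x2 co-occurrence odds ratio for a SNP pair in the real and in the synthetic data and compare the two. A continuity correction of 0.5 per cell keeps the estimates finite (the data below are tiny and purely illustrative):

```python
def pairwise_or(snp_a, snp_b):
    """Odds ratio for the 2x2 co-occurrence table of two binary SNP
    variables, with a 0.5 continuity correction in every cell so that
    the estimate is always finite."""
    n11 = n10 = n01 = n00 = 0.5
    for a, b in zip(snp_a, snp_b):
        if a and b:
            n11 += 1
        elif a:
            n10 += 1
        elif b:
            n01 += 1
        else:
            n00 += 1
    return (n11 * n00) / (n10 * n01)

# Compare the OR of one SNP pair in 'real' vs 'synthetic' observations
real_a, real_b = [1, 1, 0, 0, 1, 0, 1, 0], [1, 1, 0, 1, 1, 0, 0, 0]
synth_a, synth_b = [1, 0, 0, 1, 1, 0, 1, 0], [1, 0, 0, 1, 1, 1, 0, 0]
print(pairwise_or(real_a, real_b), pairwise_or(synth_a, synth_b))
```

Applying this to all SNP pairs and plotting real against synthetic ORs reveals the systematic over- or under-estimation patterns reported for the three model classes.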

Survival and Event History Analysis II

Chairs: Annika Hoyer and Oliver Kuß

Assessment of methods to deal with delayed treatment effects in immunooncology trials with time-to-event endpoints
Rouven Behnisch, Johannes Krisam, Meinhard Kieser
Institute of Medical Biometry and Informatics, University of Heidelberg, Germany

In cancer drug research and development, immunotherapy plays an ever more important role. A common feature of immunotherapies is a delayed treatment effect, which is quite challenging when dealing with time-to-event endpoints [1]. For time-to-event endpoints, regulatory authorities often require a log-rank test, the standard statistical method. The log-rank test is known to be most powerful under proportional-hazards alternatives but suffers a substantial loss in power if this assumption is violated. Hence, a rather long follow-up period is required to detect a significant effect in immunooncology trials. For that reason, the question arises whether methods exist that are more sensitive to delayed treatment effects and that can be applied early on to generate evidence anticipating the final decision of the log-rank test, reducing the trial duration without inflation of the type I error. Alternative methods include, for example, weighted log-rank statistics with weights that can either be fixed at the design stage of the trial [2] or chosen based on the observed data [3], as well as tests based on the restricted mean survival time [4], survival proportions, accelerated failure time (AFT) models or additive hazard models.

We evaluate and compare these different methods systematically with regard to type I error control and power in the presence of delayed treatment effects. Our simulation study includes aspects such as different censoring rates and types, different times of delay, and different failure time distributions. First results show that most methods achieve type I error rate control and that, by construction, the weighted log-rank tests which place more weight on late time points have a greater power to detect differences when the treatment effect is delayed. It is furthermore investigated whether and to what extent these methods can be applied at an early stage of the trial to predict the decision of the log-rank test later on.
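
A Fleming-Harrington weighted log-rank test, one of the fixed-weight alternatives [2], can be sketched as follows; with rho = 0 and gamma = 1 late events receive more weight, matching the delayed-effect setting (toy data, no ties with censoring handled beyond the basics):

```python
def weighted_logrank(times, events, groups, rho=0.0, gamma=1.0):
    """Fleming-Harrington G(rho, gamma) weighted log-rank statistic.
    Weights w(t) = S(t-)**rho * (1 - S(t-))**gamma, where S is the pooled
    left-continuous Kaplan-Meier estimate; rho = 0, gamma > 0 up-weights
    late events. Returns a standardized statistic, approximately N(0, 1)
    under the null hypothesis of equal hazards."""
    data = sorted(zip(times, events, groups))
    s_minus, num, var = 1.0, 0.0, 0.0
    i, n = 0, len(data)
    while i < n:
        t = data[i][0]
        at_risk = n - i
        at_risk1 = sum(g for _, _, g in data[i:])  # at risk in group 1
        d = d1 = 0
        while i < n and data[i][0] == t:           # collect tied times
            if data[i][1]:
                d += 1
                d1 += data[i][2]
            i += 1
        if d:
            w = s_minus ** rho * (1.0 - s_minus) ** gamma
            num += w * (d1 - d * at_risk1 / at_risk)
            if at_risk > 1:
                p1 = at_risk1 / at_risk
                var += w * w * d * p1 * (1 - p1) * (at_risk - d) / (at_risk - 1)
            s_minus *= 1.0 - d / at_risk           # update pooled KM estimate
    return num / var ** 0.5 if var > 0 else 0.0

# Toy data: group 1 events occur later, so the statistic is negative
times = [1, 2, 3, 4, 5, 6, 7, 8]
events = [1] * 8
groups = [0, 0, 0, 1, 0, 1, 1, 1]
print(weighted_logrank(times, events, groups, rho=0.0, gamma=1.0))
```

Setting rho = gamma = 0 recovers the ordinary log-rank test, which makes the weighting scheme easy to compare against the regulatory standard.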


[1] T. Chen (2013): Statistical issues and challenges in immuno-oncology. Journal for ImmunoTherapy of Cancer 1:18
[2] T.R. Fleming and D.P. Harrington (1991): Counting Processes and Survival Analysis. New York: Wiley.
[3] D. Magirr and C. Burman (2019): Modestly weighted logrank tests. Statistics in Medicine 38(20):3782-3790.
[4] P. Royston and M.K.B. Parmar (2013): Restricted mean survival time: an alternative to the hazard ratio for the design and analysis of randomized trials with a time-to-event outcome. BMC Medical Research Methodology 13(1):152.

Sampling designs for rare time-dependent exposures – A comparison of the nested exposure case-control design and exposure density sampling
Jan Feifel1, Maja von Cube2, Martin Wolkewitz2, Jan Beyersmann1, Martin Schumacher2
1Institute of Statistics, Ulm University, Germany; 2Institute of Medical Biometry and Statistics, Faculty of Medicine and Medical Center University of Freiburg, Germany

Hospital-acquired infections increase both morbidity and mortality of hospitalized patients. Researchers interested in the effect of these time-dependent infections on the length-of-hospital stay, as a measure of disease burden, face large cohorts with possibly rare exposures.

For large cohort studies with rare outcomes, nested case-control designs are favorable due to the efficient use of limited resources. Here, nested case-control designs apply but do not lead to reduced sample sizes, because the outcome is not necessarily rare, but the exposure is. Recently, exposure density sampling (EDS) [1] and the nested exposure case-control design (NECC) [2] have been proposed to sample for a rare time-dependent exposure in cohorts with a survival endpoint. The two designs differ in the time point of sampling.

Both designs enable efficient hazard ratio estimation by sampling all exposed individuals but only a small fraction of the unexposed ones. Moreover, they account for time-dependent exposure to avoid immortal time bias. We investigate and compare their performance using data of patients hospitalized in the neuro-intensive care unit at the Burdenko Neurosurgery Institute (NSI) in Moscow, Russia. The impact of different types of hospital-acquired infections with different prevalence on length-of-stay is considered. Additionally, inflation factors, a primary performance measure, are discussed. All presented methods will be compared to the gold-standard Cox model on the full cohort. We enhance both designs to allow for a competitive analysis of combined and competing endpoints. Additionally, these designs substantially reduce the amount of necessary information compared to the full cohort approach.

Both EDS and the NECC are capable of analyzing time-to-event data while simultaneously accounting for a rare time-dependent exposure, and they result in affordable sample sizes. EDS outperforms the NECC with respect to efficiency and accuracy in most settings considered for combined endpoints. For competing risks, however, a tailored NECC shows more appealing results.

[1] K. Ohneberg, J. Beyersmann and M. Schumacher (2019): Exposure density sampling: Dynamic matching with respect to a time-dependent exposure. Statistics in Medicine, 38(22):4390-4403.

[2] J. Feifel, M. Gebauer, M. Schumacher and J. Beyersmann (2020): Nested exposure case-control sampling: a sampling scheme to analyze rare time-dependent exposures. Lifetime Data Analysis, 26:21-44.

Tumour-growth models improve progression-free survival estimation in the presence of high censoring rates
Gabriele Bleckert, Hannes Buchner
Staburo GmbH, Germany

In oncology, reliable estimates of progression-free survival (PFS) are of the highest importance because of the high failure rates of phase III trials (around 60%). However, PFS estimation at early readouts, with less than 50% of events observed, does not use all available information from tumour measurements over time.

We project the PFS event of each censored patient using a mixed model [2] describing the tumour burden over time. RECIST criteria are applied to the estimated patient-specific non-linear tumour trajectories to calculate the projected time to progression.
PFS is compared between test and reference by hazard ratios (HR). Several phase III and phase II simulations were performed, with 1000 runs each, 2000 or 80 patients, 6 months of accrual, and 2 (scenario 1) or 6 months (scenario 2) of follow-up. All simulations are based on a published optimal parameterisation [1] of tumour growth in non-small-cell lung cancer (NSCLC), which implies a time-dependent HR.
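
The projection step can be sketched as follows. The bi-exponential decay-regrowth curve below is one common parametrisation of tumour-growth models, not necessarily the authors' exact mixed-model specification, and the simplified progression rule (tumour burden exceeding 120% of the nadir) ignores the absolute-increase and new-lesion components of RECIST; all parameter values are hypothetical:

```python
import math

def tumour_size(t, ts0, d, g):
    """Bi-exponential decay-regrowth curve: baseline burden ts0,
    shrinkage rate d, regrowth rate g."""
    return ts0 * (math.exp(-d * t) + math.exp(g * t) - 1)

def projected_progression_time(ts0, d, g, horizon=120.0, step=0.1):
    """First time at which the fitted trajectory exceeds 1.2 x nadir
    (simplified RECIST progression); None if no progression is
    projected before `horizon`."""
    nadir = ts0
    for i in range(1, int(horizon / step) + 1):
        t = i * step
        s = tumour_size(t, ts0, d, g)
        nadir = min(nadir, s)
        if s >= 1.2 * nadir:
            return t
    return None
```

In the actual method, the patient-specific parameters would be predicted from the fitted mixed model before the projection is applied to each censored patient.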

The classical PFS estimation resulted in an HR of 0.34 (95% percentiles: 0.29-0.40) for scenario 1 and 0.52 (0.47-0.58) for scenario 2, compared to a predicted HR of 0.77 for both scenarios (0.69-0.85 and 0.69-0.84), while the overall true HR (over ten years) was 0.78 (0.69-0.85). At 6, 12 and 120 months, the time-varying HRs were 0.41 (0.36-0.47), 0.60 (0.54-0.66) and 0.77 (0.69-0.85), respectively. The classical PFS estimation for phase II showed HRs from 0.52 to 0.61, compared to predicted HRs between 0.71 and 0.77.

Tumour-growth models improve PFS estimation in the presence of high censoring rates, as they consistently provide far better estimates of the overall true HR in phase III and phase II trials.

[1] M. Reck, A. Mellemgaard, S. Novello, PE. Postmus, B. Gaschler-Markefski, R. Kaiser, H. Buchner: Change in non-small-cell lung cancer tumor size in patients treated with nintedanib plus docetaxel: analyses from the Phase III LUME-Lung 1 study, OncoTargets and Therapy 2018:11 4573–4582

[2] Laird NM, Ware JH. Random-effects models for longitudinal data. Biometrics. 1982 Dec;38(4):963-74. PMID: 7168798.

Evaluation of event rate differences using stratified Kaplan-Meier difference estimates with Mantel-Haenszel weights
Hannes Buchner1, Stephan Bischofberger1, Rainer-Georg Goeldner2
1Staburo, Germany; 2Boehringer Ingelheim, Germany

The assessment of differences in event rates is a common endeavor in the evaluation of the efficacy of new treatments in clinical trials. We investigate the performance of different hypothesis tests for cumulative hospitalization or death rates of Covid-19 in order to reliably determine the efficacy of a novel treatment. The evaluation focuses on the comparison of event rates via Kaplan-Meier estimates at a pre-specified day, with the aim of reducing sampling error; hence we examine different stratum weights for a stratified Z-test for Kaplan-Meier differences. The simulated data are calibrated from recent research on neutralizing-antibody treatment for Covid-19, with 2, 4, and 6 strata of different sizes and prevalences, and we investigate the effects of overall event rates ranging from 2% to 20%. We simulate 1000 patients and compare the results of 1000 simulation runs.

Our simulation study shows superior performance of Mantel-Haenszel-type weights [Greenland & Robins (1985), Biometrics 41, 55-68] over inverse-variance weights, in particular for unequal stratum sizes and very low event rates in some strata, as is common in Covid-19 treatment studies. The advantage of this approach is the larger power of the test (e.g. 79% instead of 64% for an average event rate of 7%). The results are compared with those of a Cochran-Mantel-Haenszel (CMH) test, which yields lower power than the inverse-variance weights for low event rates (under 62% for an average event rate of 7%) and consistently lower power than the Z-test with Mantel-Haenszel stratum weights. Moreover, the CMH test breaks down (power reduction by 30%) when as little as 5% of the patients are lost to follow-up, because it is not designed for time-to-event data. The performance of the Z-test for Kaplan-Meier differences, on the other hand, is hardly affected (power reduction by 4%).
All investigated tests satisfy the set significance level for type-I errors in our simulation study.
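
A minimal sketch of the pooled estimator, assuming per-stratum Kaplan-Meier event rates and variance estimates at the landmark day have already been computed; the weights w_h = n1*n0/(n1+n0) are the Mantel-Haenszel-type weights of Greenland & Robins:

```python
import math

def stratified_km_difference(strata):
    """Combine per-stratum differences in Kaplan-Meier event rates at a
    landmark day using Mantel-Haenszel-type weights w = n1*n0/(n1+n0).
    Each stratum is a tuple (rate_treat, rate_ctrl, var_of_difference,
    n_treat, n_ctrl). Returns the pooled difference and Z statistic."""
    num = den = var = 0.0
    for r1, r0, v, n1, n0 in strata:
        w = n1 * n0 / (n1 + n0)
        num += w * (r1 - r0)
        den += w
        var += w ** 2 * v          # Var(sum w*d / sum w) numerator
    diff = num / den
    se = math.sqrt(var) / den
    return diff, diff / se
```

Replacing the weight formula by w = 1/v gives the inverse-variance-weighted competitor examined in the abstract.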

Survival and Event History Analysis I

Chairs: Jan Beyersmann and Georg Zimmermann

CASANOVA: Permutation inference in factorial survival designs
Marc Ditzhaus1, Arnold Janssen2, Markus Pauly1
1TU Dortmund, Germany; 2Heinrich-Heine-University Duesseldorf

In this talk, inference procedures for general factorial designs with time-to-event endpoints are presented. Similar to additive Aalen models, null hypotheses are formulated in terms of cumulative hazards. Deviations are measured by quadratic forms in Nelson–Aalen-type integrals. In contrast to existing approaches, this makes it possible to work without restrictive model assumptions such as proportional hazards. In particular, crossing survival or hazard curves can be detected without a significant loss of power. For a distribution-free application of the method, a permutation strategy is suggested. The theoretical findings are complemented by an extensive simulation study and the discussion of a real data example.
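
The permutation idea can be sketched for the two-sample case as follows. The quadratic distance between group-wise Nelson–Aalen curves used here is a simplified stand-in for the actual CASANOVA quadratic-form statistic, and tied event times are ignored:

```python
import random

def nelson_aalen(times, events, grid):
    """Nelson-Aalen cumulative hazard evaluated on a time grid
    (assumes distinct observation times for simplicity)."""
    pts = sorted(zip(times, events))
    H, at_risk, i, out = 0.0, len(pts), 0, []
    for g in grid:
        while i < len(pts) and pts[i][0] <= g:
            if pts[i][1]:
                H += 1.0 / at_risk
            at_risk -= 1
            i += 1
        out.append(H)
    return out

def casanova_style_test(times, events, groups, grid, B=200, seed=7):
    """Permutation p-value for a quadratic distance between the two
    group-wise Nelson-Aalen curves on `grid`."""
    def stat(labels):
        curves = []
        for lab in sorted(set(labels)):
            idx = [k for k, x in enumerate(labels) if x == lab]
            curves.append(nelson_aalen([times[k] for k in idx],
                                       [events[k] for k in idx], grid))
        a, b = curves
        return sum((x - y) ** 2 for x, y in zip(a, b))

    obs = stat(groups)
    rng = random.Random(seed)
    perm, hits = list(groups), 0
    for _ in range(B):
        rng.shuffle(perm)          # permute group labels
        if stat(perm) >= obs:
            hits += 1
    return (hits + 1) / (B + 1)
```

Because the statistic compares cumulative hazards on a grid rather than a proportionality parameter, crossing hazards still produce a large observed distance, which is the property the abstract emphasizes.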

Statistical MODEling of Additive Time Effects in Survival Analysis
Annika Hoyer1, Oliver Kuss2
1Department of Statistics, Ludwig-Maximilians-University Munich, Germany; 2Institute for Biometrics and Epidemiology, German Diabetes Center, Leibniz Institute for Diabetes Research at Heinrich-Heine-University Duesseldorf, Germany

In survival analysis, there have been various efforts to model intervention or exposure effects on an additive rather than on a hazard, odds or accelerated-life scale. Though it might be intuitively clear that additive effects are easier to understand, there is also evidence from randomized trials that this is indeed the case: treatment benefits are easier to understand if communicated as postponement of an adverse event [1]. In clinical practice, physicians and patients tend to interpret an additive effect on the time scale as a gain in life expectancy which is added as additional time to the end of life [2]. However, as the gain in life expectancy is, from a statistical point of view, an integral, this is not a precise interpretation. As a more easily interpretable alternative, we propose to model the increasing "life span" [3] and to examine the corresponding densities instead of the survival functions. Focusing on the respective modes, the difference between them describes a change in life span, in particular a shift of the most probable event time. It therefore seems reasonable to model differences in life time in terms of mode differences instead of differences in expected times. To this end, we propose mode regression models (which we write "Statistical MODEls" to emphasize that the modes are modelled) based on parametric distributions (Gompertz, Weibull and log-normal). We illustrate our MODEls with an example from a randomized controlled trial on the efficacy of a new glucose-lowering drug for the treatment of type 2 diabetes.
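
For the parametric distributions mentioned, the mode is available in closed form; for a Weibull distribution with shape k > 1 and scale lambda it is lambda*((k-1)/k)^(1/k). A minimal sketch of a mode difference between two arms sharing the shape parameter (the shared-shape assumption and parameter values are illustrative, not the authors' fitted model):

```python
def weibull_mode(shape, scale):
    """Mode of a Weibull(shape k, scale lambda) density; requires k > 1,
    otherwise the density is monotone and the mode sits at 0."""
    if shape <= 1:
        raise ValueError("mode formula requires shape > 1")
    return scale * ((shape - 1) / shape) ** (1 / shape)

def mode_difference(shape, scale_treat, scale_ctrl):
    """Shift of the most probable event time between two arms that
    share the (nuisance) shape parameter."""
    return weibull_mode(shape, scale_treat) - weibull_mode(shape, scale_ctrl)
```

In a full MODEl, the scale (and hence the mode) would be regressed on treatment and covariates via maximum likelihood rather than plugged in directly.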

[1] Dahl R, Gyrd-Hansen D, Kristiansen IS, et al. Can postponement of an adverse outcome be used to present risk reductions to a lay audience? A population survey. BMC Med Inform Decis Mak 2007; 7:8

[2] Detsky AS, Redelmeier DA. Measuring health outcomes-putting gains into perspective. N Engl J Med 1998; 339:402-404

[3] Naimark D, Naglie G, Detsky AS. The meaning of life expectancy: what is a clinically significant gain? J Gen Intern Med 1994; 9:702-707

Assessment of additional benefit for time-to-event endpoints after significant phase III trials – investigation of ESMO and IQWiG approaches
Christopher Alexander Büsch, Johannes Krisam, Meinhard Kieser
University of Heidelberg, Germany

New cancer treatments are often promoted as major advances after a significant phase III trial. Clear and unbiased knowledge about the magnitude of the clinical benefit of newly approved treatments is therefore important for assessing the amount of reimbursement from public health insurance. To perform these evaluations, two distinct "additional benefit assessment" methods are currently used in Europe.

The European Society for Medical Oncology (ESMO) developed the Magnitude of Clinical Benefit Scale version 1.1 (ESMO-MCBS v1.1), which classifies new treatments into 5 categories using a dual rule considering the relative and absolute benefit, assessed by the lower limit of the 95% confidence interval of the hazard ratio (HR) or the observed absolute difference in median treatment outcomes, respectively [1,2]. As an alternative, the German IQWiG compares the upper limit of the 95% HR confidence interval to specific thresholds on the relative-risk scale, classifying new treatments into 6 categories [4]. Until now, these methods have only been compared empirically [3].

We evaluate and compare the two methods in a simulation study with a focus on time-to-event outcomes. The simulation includes aspects such as different censoring rates and types, incorrect HRs assumed for the sample size calculation, informative censoring, and different failure-time distributions. Since no "placebo" method reflecting a true (deserved) maximal score is available, different thresholds on the simulated treatment effects were used as alternatives. The methods' performance is assessed via ROC curves, sensitivity/specificity, and the methods' percentage of achieved maximal scores. Our results indicate that IQWiG's method is usually more conservative than ESMO's. Moreover, in some scenarios, such as quick disease progression or an incorrectly assumed HR, IQWiG's method is too liberal compared to ESMO's. Nevertheless, further research is required, e.g. on the methods' performance under non-proportional hazards.
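
The structural difference between the two decision rules can be caricatured in a few lines; the thresholds and category labels below are illustrative placeholders, not the official ESMO-MCBS v1.1 or IQWiG values:

```python
def iqwig_category(hr_ci_upper, thresholds=(0.85, 0.95, 1.0)):
    """IQWiG-style rule: grade by comparing the upper 95% CI limit of
    the HR against fixed thresholds (illustrative values only)."""
    labels = ("major", "considerable", "minor")
    for thr, lab in zip(thresholds, labels):
        if hr_ci_upper < thr:
            return lab
    return "no added benefit proven"

def esmo_top_grade(hr_ci_lower, abs_gain_months,
                   hr_thr=0.65, gain_thr=3.0):
    """ESMO-style dual rule, caricatured: the top grade requires a
    relative benefit (lower CI limit of the HR below hr_thr) combined
    with an absolute gain in median outcome (thresholds illustrative)."""
    return hr_ci_lower <= hr_thr and abs_gain_months >= gain_thr
```

The asymmetry, lower CI limit for ESMO versus upper CI limit for IQWiG, is one driver of the conservative/liberal behaviour observed in the simulations.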


[1] N.I. Cherny, U. Dafni et al. (2017): ESMO-Magnitude of Clinical Benefit Scale version 1.1. Annals of Oncology, 28:2340-2366

[2] N.I. Cherny, R. Sullivan et al. (2015): A standardised, generic, validated approach to stratify the magnitude of clinical benefit that can be anticipated from anti-cancer therapies: the European Society for Medical Oncology Magnitude of Clinical Benefit Scale (ESMO-MCBS). Annals of Oncology, 26:1547-1573

[3] U. Dafni, D. Karlis et al. (2017): Detailed statistical assessment of the characteristics of the ESMO Magnitude of Clinical Benefit Scale (ESMO-MCBS) threshold rules. ESMO Open, 2:e000216

[4] G. Skipka, B. Wieseler et al. (2016): Methodological approach to determine minor, considerable, and major treatment effects in the early benefit assessment of new drugs. Biometrical Journal, 58:43-58

Independent Censoring in Event-Driven Trials with Staggered Entry
Jasmin Rühl
Universitätsmedizin Göttingen, Germany

In the pharmaceutical field, randomised clinical trials with time-to-event endpoints are frequently stopped after a pre-specified number of events has been observed. This practice, however, leads to dependent data and non-random censoring, which can generally not be resolved by conditioning on the underlying baseline information.

If the observation period starts at the same time for all subjects, the assumption of independent censoring in the counting-process sense is valid (cf. Andersen et al., 1993, p. 139), and the common methods for analysing time-to-event data can be applied. The situation is less clear, however, when staggered study entry is considered. We demonstrate that the study design at hand indeed entails general independent censoring in the sense of Andersen et al.

By means of simulations, we further investigate the possible consequences of employing techniques, such as the non-parametric bootstrap, that make the more restrictive assumption of random censoring. The results indicate that the dependence in event-driven data with staggered entry is generally too weak to affect the outcomes; in settings where only few occurrences of the regarded event are observed, however, the implications become more apparent.

Personalised Medicine

Chairs: Tim Friede and Cynthia Huber

Learning about personalised effects: transporting anonymized information from individuals to (meta-) analysis and back
Els Goetghebeur
Ghent University, Belgium

Evidence on 'personalised' or stratified medicine draws information from subject-specific records on prognostic factors, (point) treatment and outcome from relevant population samples. A focus on treatment-by-covariate interactions makes such studies more data-hungry than a typical trial focusing on a population-average effect. Nationwide disease registers or individual patient data meta-analysis may overcome sample size issues, but encounter new challenges, especially when targeting a risk or survival outcome. Between-study heterogeneity comes as a curse and a blessing when aiming to transport treatment effects to new patient populations: it reveals sources of variation with specific roles in the transportation. For survival analysis, special attention must be given to calendar time and internal time. A core set of covariates is needed for the stratified analysis, ideally measured with similar precision. A shared minimum follow-up time and well-understood censoring mechanisms are expected. Variation in baseline hazards under standard of care may reflect between-study variation in diagnostic criteria, in populations, in unmeasured baseline covariates, and in standard-of-care delivery and its impact.

When core-set covariates are lacking for some studies or over certain stretches of time, a range of solutions may be considered. Different assumptions on the missing covariates of survival models will affect treatment-balancing IPW and direct standardization methods differently. Alternatively, one may seek to link data from different sources to fill the gaps. It then pays to consider models with estimators that can be calculated from (iteratively) constructed summary statistics involving weighted averages over functions of the missing covariates. By thus avoiding the need for additional individually linked measures, one may open access to a range of existing covariates or biomarkers that can be merged while circumventing time-consuming confidentiality agreements.

In light of the above we discuss pros and cons of various methods of standardizing effects to obtain transportable answers that are meaningfully compared between studies. We thus aim to provide relevant evidence on stratified interventions referring to several case studies.

Precision medicine in action – the FIRE3 NGS study
Laura Schlieker1, Nicole Krämer1, Volker Heinemann2,3, Arndt Stahler4, Sebastian Stintzing3,4
1Staburo GmbH, Germany; 2Department of Medicine III, University Hospital, University of Munich, Germany; 3DKTK, German Cancer Consortium, German Cancer Research Centre (DKFZ); 4Medical Department, Division of Hematology, Oncology and Tumor Immunology (CCM), Charité Universitaetsmedizin Berlin

The choice of the right treatment for patients based on their individual genetic profile is of utmost importance in precision medicine. To identify potential signals among the large number of biomarkers, it is mandatory to define criteria for signal detection beforehand and to apply appropriate statistical models in the setting of high-dimensional data.

For the identification of predictive and prognostic genetic variants as well as tumor mutational burden (TMB) in patients with metastatic colorectal cancer, we derived the following pre-defined and hierarchical criteria for signal detection.

a) All biomarkers identified via a multivariate variable selection procedure

b) If a) reveals no signal, all biomarkers with adjusted p-value ≤ 0.157

c) If neither a) nor b) reveals signals, the top 5 biomarkers according to sorted, adjusted p-value
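
The hierarchy a)-c) can be written down directly; the biomarker names and p-values used below are invented for illustration:

```python
def detect_signals(selected, adj_pvals, alpha=0.157, top=5):
    """Apply the hierarchical criteria: a) hits from the variable
    selection procedure; otherwise b) biomarkers with adjusted
    p-value <= alpha; otherwise c) the `top` biomarkers with the
    smallest adjusted p-values."""
    if selected:                                        # criterion a)
        return list(selected)
    hits = [m for m, p in adj_pvals.items() if p <= alpha]
    if hits:                                            # criterion b)
        return hits
    return sorted(adj_pvals, key=adj_pvals.get)[:top]   # criterion c)
```

Pre-specifying the fallback chain in this way keeps the signal detection reproducible even when the regularized selection step returns an empty model.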

Regularized regression models were used for variable selection, and the stability of the selection process was quantified and visualized. Selected biomarkers were analyzed in terms of their predictive potential on a continuous scale.

With our analyses we confirmed the predictive potential of several already known biomarkers and identified additional promising candidate variants. Furthermore, we identified TMB as a potential prognostic biomarker with a trend towards prolonged survival for patients with high TMB.

Our analyses were supported by power simulations for the variable selection method, assuming different prevalences of biomarkers, numbers of truly predictive biomarkers and effect sizes.

Tree-based Identification of Predictive Factors in Randomized Trials using Weibull Regression
Wiebke Werft1, Julia Krzykalla2, Dominic Edelmann2, Axel Benner2
1Hochschule Mannheim University of Applied Sciences, Germany; 2German Cancer Research Center (DKFZ), Heidelberg, Germany

Keywords: Predictive biomarkers, Effect modification, Random forest, Time-to-event endpoint, Weibull regression

Novel high-throughput technology provides detailed information on the biomedical characteristics of each patient's disease. These biomarkers may qualify as predictive factors that distinguish patients who benefit from a particular treatment from patients who do not. Hence, large numbers of biomarkers need to be tested in order to gain evidence for tailored treatment decisions ("personalized medicine"). Tree-based methods divide patients into subgroups with differential treatment effects in an automated and data-driven way without requiring extensive pre-specification. Most of these methods aim mainly at a precise prediction of the individual treatment effect, thereby ignoring the interpretability of the tree or random forest.

We propose a modification of the model-based recursive partitioning (MOB) approach for subgroup analyses (Seibold, Zeileis et al. 2016), the so-called predMOB, that is able to specifically identify predictive factors (Krzykalla, Benner et al. 2020) from a potentially large number of candidate biomarkers. The original predMOB was developed for normally distributed endpoints only. To widen the field of application, particularly to time-to-event endpoints, we extend predMOB to these situations. More specifically, we use Weibull regression as the base model in the nodes of the tree, since MOB and predMOB require fully parametrized models. However, the Weibull model includes the shape parameter as a nuisance parameter, which has to be fixed in order to focus on predictive biomarkers only.
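
To illustrate what fixing the shape parameter means, the sketch below writes down the log-likelihood of a Weibull node model with the shape held constant; the two-parameter linear predictor (intercept plus treatment effect on the log-scale parameter) is a hypothetical simplification of the actual node models:

```python
import math

def weibull_loglik(params, times, events, treat, shape=1.5):
    """Log-likelihood of a Weibull survival model with the shape
    (nuisance) parameter fixed, as required for the predMOB node
    models. params = (log-scale intercept, treatment effect);
    `events` holds 1 for an observed event, 0 for censoring."""
    b0, b1 = params
    ll = 0.0
    for t, d, z in zip(times, events, treat):
        lam = math.exp(b0 + b1 * z)       # subject-specific scale
        # event: log hazard log(k/lam) + (k-1) log(t/lam);
        # all subjects: log survivor -(t/lam)^k
        if d:
            ll += math.log(shape / lam) + (shape - 1) * math.log(t / lam)
        ll -= (t / lam) ** shape
    return ll
```

The partitioning step then tests the stability of the score contributions of (b0, b1) across candidate biomarkers, with the fixed shape keeping the instability attributable to the predictive effect.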

The performance of this extension of predMOB is assessed with respect to the identification of predictive factors as well as the prediction accuracy of the individual treatment effect and the predictive effects. Using simulation studies, we are able to show that predMOB provides a targeted approach to predictive factors by reducing the erroneous selection of biomarkers that are only prognostic.

Furthermore, we apply our method to a data set of primary biliary cirrhosis (PBC) patients treated with D-penicillamine or placebo in order to compare our results to those obtained by Su et al. 2008. The aim is to identify predictive factors with respect to overall survival. On the whole, similar variables are identified, but the ranking differs.


Krzykalla, J., et al. (2020). "Exploratory identification of predictive biomarkers in randomized trials with normal endpoints." Statistics in Medicine 39(7): 923-939.

Seibold, H., et al. (2016). "Model-based recursive partitioning for subgroup analyses." The International Journal of Biostatistics 12(1): 45-63.

Su, X., et al. (2008). "Interaction trees with censored survival data." The International Journal of Biostatistics 4(1).