Track: Track 3

Evidence Based Medicine and Meta-Analysis I

Chairs: Ralf Bender and Guido Schwarzer

Network meta-analysis for components of complex interventions
Nicky Welton
University of Bristol, UK

Meta-analysis is used to combine results from studies identified in a systematic review comparing specific interventions for a given patient population. However, the validity of the pooled estimate from a meta-analysis relies on the study results being similar enough to pool (homogeneity). Heterogeneity in study results can arise for various reasons, including differences in intervention definitions between studies. Network meta-analysis (NMA) is an extension of meta-analysis that can combine results from studies to estimate relative effects between multiple (2 or more) interventions, where each study compares some (2 or more) of the interventions of interest. NMA can reduce heterogeneity by treating each intervention definition as a distinct intervention. However, if there are many distinct interventions, then evidence networks may be sparse or disconnected, so that relative effect estimates are imprecise or impossible to estimate at all. Interventions can sometimes be considered to be made up of component parts, as with some complex interventions or combination therapies.

Component network meta-analysis has been proposed for the synthesis of complex interventions that can be considered a sum of component parts. Component NMA is a form of network meta-regression that estimates the effect of the presence of particular components of an intervention. We discuss methods for categorisation of intervention components, before going on to introduce statistical models for the analysis of the relative efficacy of specific components or combinations of components. The methods respect the randomisation in the included trials and allow the analyst to explore whether the component effects are additive, or if there are interactions between them. The full interaction model corresponds to a standard NMA model.
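The additive model sketched above can be illustrated as a weighted regression of study-level relative effects on component indicators. The following is a minimal sketch with entirely hypothetical data (interventions built from three components A, B and C, each study reporting an effect versus a common control); a real component NMA would additionally handle multi-arm trials, heterogeneity and interaction terms, e.g. via the netmeta package in R.

```python
import numpy as np

# Hypothetical study-level estimates: effects (e.g. log-odds ratios) of
# each intervention vs. a common control, with standard errors.
# X[i, j] = 1 if component j (A, B, C) is present in study i's intervention.
X = np.array([
    [1, 0, 0],   # A vs control
    [0, 1, 0],   # B vs control
    [1, 1, 0],   # A+B vs control
    [1, 0, 1],   # A+C vs control
    [0, 1, 1],   # B+C vs control
], dtype=float)
y = np.array([-0.30, -0.20, -0.55, -0.70, -0.45])  # observed effects (made up)
se = np.array([0.15, 0.18, 0.20, 0.25, 0.22])      # standard errors (made up)

W = np.diag(1.0 / se**2)                 # inverse-variance weights
# Additive component model: weighted least squares for component effects
beta = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)
cov = np.linalg.inv(X.T @ W @ X)
for name, b, v in zip("ABC", beta, np.diag(cov)):
    print(f"component {name}: {b:+.3f} (SE {np.sqrt(v):.3f})")
```

Under additivity, the predicted effect of a combination such as A+B is simply `beta[0] + beta[1]`; the full interaction model mentioned in the abstract would instead give every combination its own parameter, recovering a standard NMA.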

We illustrate the methods with a range of examples including CBT for depression, electronic interventions for smoking cessation, school-based interventions for anxiety and depression, and psychological interventions for patients with coronary heart disease. We discuss the benefits of component NMA for increasing precision and connecting networks of evidence, the data requirements to fit the models, and make recommendations for the design and reporting of future randomised controlled trials of complex interventions that comprise component parts.

Model selection for component network meta-analysis in disconnected networks: a case study
Maria Petropoulou, Guido Schwarzer, Gerta Rücker
Institute of Medical Biometry and Statistics, Faculty of Medicine and Medical Center – University of Freiburg, Germany

Standard network meta-analysis (NMA) synthesizes direct and indirect evidence of randomized controlled trials (RCTs), estimating the effects of several competing interventions. Many healthcare interventions are complex, consisting of multiple, possibly interacting, components. In such cases, more general models, the component network meta-analysis (CNMA) models, allow estimating the effects of components of interventions.

Standard network meta-analysis requires a connected network. However, a disconnected network (two or more subnetworks) can sometimes occur when synthesizing evidence from RCTs. Bridging the gap between subnetworks is a challenging issue. CNMA models make it possible to “reconnect” a network with multi-component interventions if there are common components in the subnetworks. Forward model selection for CNMA models, which has recently been developed, starts with a sparse CNMA model and, by adding interaction terms, ends up with a rich CNMA model. Through model selection, the best CNMA model is chosen based on a trade-off between goodness of fit (minimizing Cochran’s Q statistic) and connectivity.

Our aim is to check whether CNMA models for disconnected networks can validly re-estimate the results of a standard NMA for a connected network (benchmark). We applied the methods to a case study comparing 27 interventions for any adverse event of postoperative nausea and vomiting. Starting with the connected network, we artificially constructed disconnected networks in a systematic way without dropping interventions, such that the network keeps its size. We ended up with nine disconnected networks differing in network geometry, the number of included studies, and pairwise comparisons. The forward strategy for selecting appropriate CNMA models was implemented and the best CNMA model was identified for each disconnected network.

We compared the results of the best CNMA model for each disconnected network to the corresponding results for the connected network with respect to bias and standard error. We found that the results of the best CNMA models from each disconnected network are comparable with the benchmark. Based on our findings, we conclude that CNMA models, which are entirely based on RCT evidence, are a promising tool for dealing with disconnected networks if some treatments have common components in different subnetworks. Additional analyses of simulated data under several scenarios are planned to generalize these results.

Uncertainty in treatment hierarchy in network meta-analysis: making ranking relevant
Theodoros Papakonstantinou1,2, Georgia Salanti1, Dimitris Mavridis3,4, Gerta Rücker2, Guido Schwarzer2, Adriani Nikolakopoulou1,2
1Institute of Social and Preventive Medicine, University of Bern, Switzerland; 2Institute of Medical Biometry and Statistics, University of Freiburg, Germany; 3Department of Primary Education, University of Ioannina, Ioannina, Greece; 4Faculty of Medicine, Paris Descartes University, Paris, France

Network meta-analysis estimates all relative effects between competing treatments and can produce a treatment hierarchy from the least to the most desirable option. While about half of the published network meta-analyses report a ranking metric for the primary outcome, methodologists debate several issues underpinning the derivation of a treatment hierarchy. Criticisms include that ranking metrics are not accompanied by a measure of uncertainty or do not answer a clinically relevant question.

We will present a series of research questions related to network meta-analysis. For each of them, we will derive hierarchies that satisfy the set of constraints that constitute the research question and define the uncertainty of these hierarchies. We have developed an R package to calculate the treatment hierarchies.

Assuming a network of T treatments, we start by deriving the most probable hierarchies along with their probabilities. We derive the probabilities of each possible treatment hierarchy (T! permutations in total) by sampling from a multivariate normal distribution with relative treatment effects as means and corresponding variance-covariance matrix. Having obtained the frequency of each treatment hierarchy, we define complex clinical questions: the probability that (1) a specific hierarchy occurs, (2) a given order is retained in the network (e.g. A is better than B and B is better than C), (3) a specific triplet or quadruple of interventions is the most efficacious, (4) a treatment is at a specific hierarchy position and (5) a treatment is at a specific or higher position in the hierarchy. These criteria can also be combined so that any number of them simultaneously holds, either of them holds or exactly one of them holds. For each defined question, we derive the hierarchies that satisfy the set criteria along with their probability. The sum of the probabilities of all hierarchies that fulfill the criterion gives the probability that the criterion holds. We extend the procedure to compare relative treatment effects against a clinically important value instead of the null effect.
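The sampling step described above can be sketched in a few lines. The effect estimates and covariance matrix below are made up purely for illustration (T = 4 treatments, lower effect = better); the actual analyses use estimates from a fitted NMA model.

```python
import numpy as np
from collections import Counter

rng = np.random.default_rng(2021)

# Hypothetical relative effects vs. treatment A (lower = better) and a
# toy variance-covariance matrix for T = 4 treatments A, B, C, D.
means = np.array([0.0, -0.10, -0.25, -0.15])
cov = 0.02 * np.eye(4) + 0.005        # illustrative covariance structure

draws = rng.multivariate_normal(means, cov, size=20_000)
# Each draw induces one hierarchy: rank treatments by sampled effect.
hierarchies = Counter(tuple(np.argsort(d)) for d in draws)
n = draws.shape[0]

# Question type (1): probability of the single most probable hierarchy
best, count = hierarchies.most_common(1)[0]
print("most probable hierarchy:", [" ABCD"[i + 1] for i in best], count / n)

# Question type (4): probability that C occupies the first position
p_C_first = sum(c for h, c in hierarchies.items() if h[0] == 2) / n
print("P(C is best):", round(p_C_first, 3))
```

Probabilities of composite criteria follow the same pattern: filter the T! observed permutations by the constraint and sum their relative frequencies.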

We exemplify the method and its implementation using a network of four treatments for chronic obstructive pulmonary disease, where the outcome of interest is mortality, measured as an odds ratio. The most probable hierarchy has a probability of 28%.

The developed methods extend the decision-making arsenal of evidence-based health care with tools that support clinicians, policy makers and patients to make better decisions about the best treatments for a given condition.

IBS-DR Mitgliederversammlung (General Assembly)

Chair: IBS-DR Executive Board

Agenda of the 2021 General Assembly

TOP 1 Adoption of the agenda Brannath
TOP 2 Approval of the minutes of the General Assembly of 09.09.2020 Scharpenberg
TOP 3 Report of the President Brannath
TOP 4 Young researcher awards Brannath
TOP 5 Reports from the international committees Bretz, Ickstadt, Kieser, Kübler, Pigeot, Ziegler
TOP 6 Report of the Secretary Scharpenberg
TOP 7 Report from the administrative office Scharpenberg
TOP 8 Report of the Treasurer Knapp
TOP 9 Report of the auditors Dierig, Tuğ
TOP 10 Resolutions on reserves and membership fees for 2022 Knapp
TOP 11 Reports from the working groups Asendorf
TOP 12 Summer schools, continuing education Brannath
TOP 13 Future colloquia Brannath
TOP 14 Biometrical Journal Bathke, Schmid
TOP 15 Report of the election officer on the advisory board election Gerß
TOP 16 Miscellaneous Brannath

Non Clinical Statistics III

Chairs: Michael Brendel and Timur Tug

Discussion of Design and Analysis of Animal Experiments
Edgar Brunner
Universitätsmedizin Göttingen, Germany

In this talk, some aspects of particular topics in the design and analysis of animal experiments are discussed. They are important in applications for funding. The key points are listed below.

• Replication of the Experiment

o Different laboratories

o Separate experiments vs. stratification and pooling

o Impact of the mother in case of experiments involving young animals

• Randomization and Blinding

o Onsite-randomization vs. central randomization

• Sample Size Planning

o Quite rarely used

o Discussion and definition of a ‘relevant effect’

o Effects based on observations in a preliminary study – often only a few observations

o Relation to effects in human trials may be a problem

o Type-I Error adjusting for multiple / co-primary endpoints

o Power adjusting for multiple / co-primary endpoints

o Switching to rank procedures in the case of non-normal data

• Analysis

o Data base cleaning

o Principle ‘analyze as randomized’

o Pre-testing assumptions on the same data set is not recommended


Festing, M.F. (2007). The design of animal experiments. Chapter 3 in: Handbook on Care and Management of Laboratory and Pet Animals. Ed. Y. B. Rajeshwari. ISBN 8189422987.

Exner, C., Bode, H.-J., Blumer, K., Giese, C. (2007). Animal Experiments in Research. Deutsche Forschungsgemeinschaft. ISBN 978-3-932306-87-7.

Statistical evaluation of the flow cytometric micronucleus in vitro test – same but different
Lea AI Vaas1, Robert Smith2, Jeffrey Bemis3, Javed Ahmad2, Steven Bryce3, Christine Marchand4, Roland Froetschl5, Azeddine Elhajouji6, Ulrike Hemmann7, Damian McHugh8, Julia Kenny9, Natalia Sumption9, Andreas Zeller4, Andreas Sutter10, Daniel Roberts11
1Research & Pre-Clinical Statistics Group, Bayer AG, Berlin, Germany; 2Covance Laboratories Ltd., Harrogate, North Yorkshire, UK; 3Litron Laboratories, Rochester, NY, USA; 4Pharmaceutical Sciences, pRED Innovation Center Basel, F. Hoffmann-La Roche Ltd, Basel, Switzerland; 5Federal Institute for Drugs and Medical Devices (BfArM), Bonn, Germany; 6Preclinical Safety (PCS), Novartis Institutes for BioMedical Research (NIBR), Basel, Switzerland; 7Sanofi-Aventis Deutschland GmbH, Frankfurt, Germany; 8Philip Morris Products S.A., Neuchatel, Switzerland; 9Genetic Toxicology and Photosafety, GlaxoSmithKline, Ware, Hertfordshire, UK; 10Bayer AG, Pharmaceuticals, Investigational Toxicology, Berlin, Germany; 11Genetic and In Vitro Toxicology, Charles River, Skokie, IL, USA

In vitro genotoxicity testing is part of the safety evaluation required for product registration and the initiation of clinical trials. The OECD Test Guideline 487 gives recommendations for the conduct, analysis and interpretation of the in vitro Mammalian Cell Micronucleus (MN) Test. Historically, in vitro MN data have been generated via microscopic examination of cells after exposure to a chemical, following scientifically valid, internationally accepted study designs; this approach is labour-intensive and time-consuming. Flow cytometry is an automated technology capable of scoring greater numbers of cells in a relatively short time span and of analysing genotoxic effects of clastogenic and/or aneugenic origin. However, when acquiring data using flow cytometry, neither the number of cells being evaluated nor the built-in relative survival metrics (cytotoxicity) have undergone critical evaluation for standardization. Herein, we address these topics, focusing on the application of the in vitro MN assay scored by flow cytometry (e.g. MicroFlow®) for regulatory purposes. To this end, an international working group comprising genetic toxicologists and statisticians from diverse industry branches, contract research organizations, academia, and regulatory agencies serves as a forum to address the regulatory and technical aspects of submitting GLP-compliant in vitro MN flow cytometry data to support product development and registration.

We will briefly present our motivation and the envisaged initial goals, with a focus on the suitability of built-in cytotoxicity metrics for regulatory submissions. Based on a data set collected from multiple cross-industry laboratories, the working group additionally evaluates historical control data, develops recommendations on appropriate study designs, and reviews statistical methods for determining positive micronucleus test results.

Mouse clinical trials of N=1: Do we reduce too much?
Hannes-Friedrich Ulbrich
Bayer AG, Deutschland

In 2015, the IMI2 7th Call for Proposals requested ‘A comprehensive paediatric preclinical POC platform’ for the development of treatments against cancer in children; ‘mouse N=1 trials’ had to be part of it. The project (ITCC-P4) was launched in 2017.

Four years later, the terminology has evolved to ‘mouse clinical trials’ (MCT). These are experiments where one PDX model (a derivative of a particular patient’s tumor) gets implanted into a number of mice to grow and to be treated with different substances: one mouse per substance [and occasionally more for the vehicle; ITCC-P4 plans for three]. The number of PDX models of the same human tumor type is supposed to be ‘large’; the series of randomized per-patient-tumor experiments forms a trial. Compared to more ‘classical’ PDX trials, where replicates of mice (usually 6) per substance were used to explore substance differences for one PDX only, mouse clinical trials focus on the population response for the considered tumor type. This design is still quite new, “becoming widely used in pre-clinical oncology drug development, but a statistical framework is yet to be developed” (Guo et al, 2019). Not much has been published yet on whether the reduction to N=1 is reasonable as compared to an imaginable series of ‘classical’ PDX trials.

Based on data from the already finished OncoTrack IMI project (on colon cancer), we explore the magnitude of the differences between the two approaches using resampling techniques.

In this talk we will report the results of this comparison. Statistical models will be described; criteria for comparing these approaches will be discussed.


• IMI2 ITCC-P4 Project Description

• Guo S et al (2019): Mouse clinical trials in oncology drug development. BMC Cancer 19:718, DOI 10.1186/s12885-019-5907-7

• Williams JA (2017) Patient-Derived Xenografts as Cancer Models for Preclinical Drug Screening, DOI 10.1007/978-3-319-55825-7_10

Statistical Review of Animal trials in Ethics Committees – A Guideline
Sophie K. Piper1,2, Dario Zocholl1,2, Robert Röhle1,2, Andrea Stroux1,2, Ulf Tölch2, Frank Konietschke1,2
1Institute of Biometry and Clinical Epidemiology, Charité – Universitätsmedizin Berlin, Charitéplatz 1, D-10117 Berlin, Germany; 2Berlin Institute of Health (BIH), Anna-Louisa-Karsch Str. 2, 10178 Berlin, Germany

Any experiment or trial involving living organisms requires ethical review and agreement. Beyond reviewing the medical need and goals of the trial, statistical planning of the design and sample size computations are key review criteria. Errors made in the statistical planning phase can have severe consequences for both the results and the conclusions drawn from a trial. Moreover, wrong conclusions might proliferate and impact future trials, a rather unethical outcome of any research. Therefore, any trial must be efficient in both a medical and a statistical way in answering the questions of interest to be considered “ethically approvable”.

For clinical trials, ethical review boards are well established. This is, however, not the case for pre-clinical and especially animal trials. While ethical review boards are established within each local authority of animal welfare, most of them do not have an appointed statistician. Moreover, unified standards or guidelines on statistical planning and reporting thereof are currently missing for pre-clinical trials.

It is the aim of our presentation to introduce and discuss

i) the need for proper statistical reviews of animal trials,

ii) a guideline of mandatory ethical review criteria, involving blinding and randomization, and

iii) the need to distinguish the planning of exploratory studies from confirmatory studies in pre-clinical research.

Our statistical criteria for ethical reviews of animal trials have been implemented in a form sheet that has been used by the Landesamt für Gesundheit und Soziales (local authority of animal welfare) in Berlin since 2019. It is available online at

Non Clinical Statistics II

Chairs: Katja Ickstadt and Bernd-Wolfgang Igl

On the Role of Historical Control Data in Preclinical Development
Helena Geys
Johnson & Johnson, Belgium

Historical control databases are established by many companies in order to be able to contextualize results from single studies against previous studies performed under similar conditions, to properly design studies and/or to come up with quality control instruments.

Typical preclinical experiments involve a control group of untreated animals and groups of animals exposed to increasing doses. The ultimate aim is to test for a dose-related trend in the response of interest. Usually one would focus on one particular experiment. However, since such experiments are conducted in genetically homogeneous animal strains, historical control data from previous similar experiments are sometimes used in interpreting the results of a current study.

The use of historical control data in supporting inferences varies across different assays. For example, in genetic toxicology and safety pharmacology, a response may be considered positive in a specific experiment if the result is outside the distribution of the historical negative control data (95% control limits). In carcinogenicity studies, by contrast, historical control data are particularly useful for classifying tumors as rare or common and for evaluating disparate findings in dual concurrent controls.
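The control-limit rule mentioned above can be sketched as follows. The historical values are invented, and the symmetric 1.96-SD normal approximation is an illustrative assumption; individual assays may define their 95% control limits differently (e.g. via percentiles or tolerance intervals).

```python
import numpy as np

# Hypothetical historical negative-control values (e.g. % micronucleated
# cells in past vehicle-control groups of the same assay).
historical = np.array([0.8, 1.1, 0.9, 1.3, 1.0, 0.7, 1.2, 1.0, 0.9, 1.1])

mean, sd = historical.mean(), historical.std(ddof=1)
# Approximate 95% control limits under a normal assumption
lower, upper = mean - 1.96 * sd, mean + 1.96 * sd

def outside_control_limits(value: float) -> bool:
    """Flag a response that falls outside the historical control range."""
    return value < lower or value > upper

print(f"95% control limits: [{lower:.2f}, {upper:.2f}]")
print(outside_control_limits(2.5))   # a clearly elevated response
```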

Historical control data are often used to carry out an informal equivalence test, whereby a New Molecular Entity (NME) is considered to be “safe” when the results from the treatment groups fall entirely within the negative control distribution.

In addition, formal statistical procedures have been proposed that allow historical control data to be incorporated and combined with the current control group in tests for trend identification.

Clearly, historical control data play an important role in preclinical development as a quality control and interpretation instrument. Yet the issue of when and how to use historical control data is still not clear and subject to ongoing debate. In this presentation we will highlight the pros and cons and the important role a preclinical statistician can play in this.

A comparison of different statistical strategies for the analysis of data in reproductive toxicology involving historical negative controls
Bernd-Wolfgang Igl, Monika Brüning, Bernd Baier
Boehringer Ingelheim Pharma GmbH & Co. KG, Germany

A fundamental requirement of regulatory bodies for the development of new pharmaceuticals is to perform nonclinical developmental and reproductive toxicology (DART) studies to reveal any possible effect of the test item on mammalian reproduction. Usually, DART studies are performed in rats and a further (non-rodent) species and aim to support human clinical trials and market access. General recommendations are given in ICH Guideline S5, allowing various phase-dependent designs for a large number of parameters. The statistical evaluation of DART data is quite multifaceted due to more or less complex correlation structures between mother and offspring, e.g. maternal weight development, fetal weight, ossification status and number of littermates, all depending on the test item dose.

Initially, we will sketch a Scrum-inspired project that was set up as a cooperation between Boehringer Ingelheim’s Reproductive Toxicology and Non-Clinical Statistics groups. Then, we will describe the particular role and relevance of historical control data in reproductive toxicology. This will be followed by a presentation of common statistical models and some related open problems. Finally, we will give some simulation-based results on statistical power and sample size for the detection of certain events in DART studies.

A Nonparametric Bayesian Model for Historical Control Data in Reproductive Toxicology
Ludger Sandig1, Bernd Baier2, Bernd-Wolfgang Igl3, Katja Ickstadt4
1Fakultät Statistik, Technische Universität Dortmund; 2Reproductive Toxicology, Nonclinical Drug Safety, Boehringer Ingelheim Pharma GmbH & Co. KG; 3Non-Clinical Statistics, Biostatistics and Data Sciences Europe, Boehringer Ingelheim Pharma GmbH & Co. KG; 4Lehrstuhl für mathematische Statistik und biometrische Anwendungen, Fakultät Statistik, Technische Universität Dortmund

Historical control data are of fundamental importance for the interpretation of developmental and reproductive toxicology studies. Modeling such data presents two challenges: Outcomes are observed on different measurement scales (continuous, counts, categorical) and on multiple nested levels (fetuses within a litter, litters within a group, groups within a set of experiments). We propose a nonparametric Bayesian approach to tackle both of them. By using a hierarchical Dirichlet process mixture model we can capture the dependence structure of observables both within and between litters. Additionally we can accommodate an arbitrary number of variables on arbitrary measurement scales at the fetus level, e.g. fetus weight (continuous) and malformation status (categorical). In a second step we extend the model to incorporate observables at higher levels in the hierarchy, e.g. litter size or maternal weight. Inference in these models is possible using Markov Chain Monte Carlo (MCMC) techniques which we implemented in R. We illustrate our approach on several real-world datasets.

Weightloss as Safety Indicator in Rodents
Tina Lang, Issam Ben Khedhiri
Bayer AG, Germany

In preclinical research, the assessment of animal well-being is crucial to ensure ethical standards and compliance with guidelines. It is a tough task to define rules for when well-being is deemed acceptable and when to conclude that the suffering of the animal exceeds a tolerable burden, so that the animal needs to be sacrificed. Indicators are, e.g., food refusal, listlessness and, most prominently, body weight.

For rodents, a popular rule states that an animal that experiences > 20% body weight loss exceeds the limits of tolerable suffering and has to be taken out of the experiment. However, research experiments are highly diverse in nature (Talbot et al., 2020). An absolute rule for all of them can lead to unnecessary deaths of lab animals that are still within reasonable limits of well-being but, for various reasons, fall below the body weight limit.

Studies on juvenile rodents, which are still in their growth phase, pose an additional challenge. Here, a weight loss might not be observable, but a reduced weight gain could indicate complications. As a solution, their weight gain is routinely compared to the mean weight gain of a control group of animals. If the weight gain differs by a certain percentage, the animals are excluded from the experiment. In the case of frequent weighing and small weight gains in the control group, this leads to the mathematically driven exclusion of animals that are fit and healthy.

We propose a different approach to safety monitoring which firstly unifies the assessment for juvenile and adult animals and secondly compensates for the different conditions of different experiments.

If a reasonable control group can be kept within the study design, the body weight within the control group is assumed to be lognormally distributed. Within the interval of mean log body weight plus/minus three standard deviations, about 99.73% of all control animals are expected to lie. We conclude that this interval contains the acceptable body weights. As the theoretical mean and standard deviation of log body weight are unknown, we checked how their empirical counterparts perform.
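The interval rule described above can be sketched in a few lines, with simulated control weights standing in for real data (the weights, sample size and lognormal parameters below are illustrative assumptions).

```python
import numpy as np

rng = np.random.default_rng(7)

# Hypothetical control-group body weights (g); body weight within the
# control group is assumed to be lognormally distributed.
control = rng.lognormal(mean=np.log(25), sigma=0.05, size=12)

log_w = np.log(control)
m, s = log_w.mean(), log_w.std(ddof=1)
# Acceptable range: mean log body weight +/- 3 empirical standard deviations,
# covering about 99.73% of control animals under the lognormal assumption.
lower, upper = np.exp(m - 3 * s), np.exp(m + 3 * s)

def acceptable(weight_g: float) -> bool:
    """True if a treated animal's weight lies in the control-based interval."""
    return lower <= weight_g <= upper

print(f"acceptable weight range: {lower:.1f} g to {upper:.1f} g")
```

Because the bounds move with the control group, the same rule applies to growing juveniles and to adults, without an absolute percentage cut-off.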

We investigated whether the rule leaves all healthy animals in the study and only excludes suffering animals. Our data show that it outperforms the traditional rules by far. Many animals that would have been excluded by the traditional rules can now stay in the study. Thus, the new rule supports animal welfare and also increases the power of the experiment.