Track: Track 2

Adaptive Designs II

Chairs: Geraldine Rauch and Gernot Wassmer

Opportunities and limits of optimal group-sequential designs
Maximilian Pilz1, Carolin Herrmann2,3, Geraldine Rauch2,3, Meinhard Kieser1
1Institute of Medical Biometry and Informatics – University of Heidelberg, Germany; 2Institute of Biometry and Clinical Epidemiology – Charité University Medicine Berlin, Germany; 3Berlin Institute of Health

Multi-stage designs for clinical trials are becoming increasingly popular. There are two main reasons for this development. The first is the flexibility to modify the study design during the ongoing trial. This possibility is highly beneficial to avoid the failure of trials whose planning assumptions were badly wrong. However, an unplanned design modification mid-course can also be performed in a clinical trial that has initially been planned without adaptive elements as long as the conditional error principle is applied. The second reason for the popularity of adaptive designs is the performance improvement that arises by applying a multi-stage design. For instance, an adaptive two-stage design can substantially reduce the expected sample size of a trial compared to a single-stage design.

With regard to this performance reason, a two-stage design can entirely be pre-specified before the trial starts. While this still leaves open the option to modify the design, it is preferred by regulatory authorities. Recent work treats the topic of optimal adaptive designs. While those show the best possible performance, it may be difficult to communicate them to a practitioner and to outline them in a study protocol.

To overcome this problem, simpler optimal group-sequential designs may be an option worth considering. These consist of only two sample sizes (stage one and stage two) and three critical values (early futility, early efficacy, final analysis). Thus, they can easily be described and communicated.
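To illustrate how few numbers such a design needs, the following minimal sketch computes the expected sample size of a two-stage group-sequential design. It assumes a two-arm comparison with a normally distributed stage-one z-statistic and equal per-group stage sizes; the function name and all design values are hypothetical and not taken from the talk.

```python
from statistics import NormalDist


def expected_sample_size(n1, n2, futility, efficacy, effect):
    """Expected per-group sample size of a two-stage group-sequential design.

    Assumption: the stage-one statistic is Z1 ~ N(effect * sqrt(n1 / 2), 1),
    where `effect` is the standardized mean difference. The trial proceeds
    to stage two only if futility < Z1 < efficacy.
    """
    z1 = NormalDist(mu=effect * (n1 / 2) ** 0.5, sigma=1.0)
    prob_continue = z1.cdf(efficacy) - z1.cdf(futility)
    return n1 + n2 * prob_continue


# Hypothetical design: two sample sizes and the two interim critical values.
design = dict(n1=50, n2=60, futility=0.0, efficacy=2.5)
print(expected_sample_size(effect=0.0, **design))  # under the null
print(expected_sample_size(effect=0.4, **design))  # under an alternative
```

The third critical value (final analysis) does not enter the expected sample size; it only affects type I error rate and power.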

In this talk, we present a variety of examples to investigate whether optimal group-sequential designs are a valid approximation of optimal adaptive designs. We elaborate design properties that can be fulfilled by optimal group-sequential designs without considerable performance deterioration and describe situations where an optimal adaptive design may be more appropriate. Furthermore, we give recommendations on how to specify an optimal two-stage design in the study protocol in order to motivate the application of such designs in clinical trials.

Group Sequential Methods for Nonparametric Relative Effects
Claus Peter Nowak1, Tobias Mütze2, Frank Konietschke1
1Charité – Universitätsmedizin Berlin, Germany; 2Novartis Pharma AG, Basel, Switzerland

Late phase clinical trials are occasionally planned with one or more interim analyses to allow for early termination or adaptation of the study. While extensive theory and software have been developed for normal, binary and survival endpoints, there has been comparatively little discussion in the group sequential literature on nonparametric methods outside the time-to-event setting. Focussing on the comparison of two parallel treatment arms, we show that the Wilcoxon-Mann-Whitney test, the Brunner-Munzel test, as well as a test procedure based on the log win odds, a modification of the win ratio, asymptotically follow the canonical joint distribution. Consequently, standard group sequential theory can be applied to plan, analyse and adapt clinical trials based on nonparametric efficacy measures. In addition, simulation studies examining type I error rates confirm the adequacy of the proposed methods for a range of scenarios. Lastly, we apply our methodology to the FREEDOMS clinical trial (Identifier: NCT00289978), analysing relapse in patients with relapsing-remitting multiple sclerosis.
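As a rough illustration of the effect measures involved, the sketch below estimates the nonparametric relative effect and the log win odds from two samples. It assumes the common definitions theta = P(X < Y) + 0.5 P(X = Y) and win odds = theta / (1 - theta); the function names are hypothetical, and the group sequential test statistics themselves are not reproduced here.

```python
import math


def relative_effect(x, y):
    """Estimate theta = P(X < Y) + 0.5 * P(X = Y) by comparing all pairs."""
    score = sum(1.0 if xi < yj else 0.5 if xi == yj else 0.0
                for xi in x for yj in y)
    return score / (len(x) * len(y))


def log_win_odds(x, y):
    """Log win odds, assuming win odds = theta / (1 - theta).

    Undefined when the estimate of theta is exactly 0 or 1.
    """
    theta = relative_effect(x, y)
    return math.log(theta / (1.0 - theta))


# Hypothetical ordinal outcomes in two arms (ties are allowed).
control = [1, 2, 2, 3, 4]
treated = [2, 3, 3, 4, 5]
print(relative_effect(control, treated), log_win_odds(control, treated))
```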

Optimal futility stops in two-stage group-sequential gold-standard designs
Jan Meis, Maximilian Pilz, Meinhard Kieser
Institute of Medical Biometry and Informatics, University of Heidelberg, Heidelberg, Germany

A common critique of non-inferiority trials comparing an experimental treatment to an active control is that they may lack assay sensitivity. This denotes the ability of a trial to distinguish an effective treatment from an ineffective one. The “gold-standard” non-inferiority trial design circumvents this concern by comparing three groups in a hierarchical testing procedure. First, the experimental treatment is compared to a placebo group in an effort to show superiority. Only if this succeeds is the experimental treatment tested for non-inferiority against an active control group. Ethical and practical considerations require sample sizes of clinical trials to be as large as necessary, but as small as possible. These considerations become especially pressing in the gold-standard design, as patients are exposed to placebo doses while the control treatment is already known to be effective.

Group sequential trial designs are known to reduce the expected sample size under the alternative hypothesis. In their pioneering work, Schlömer and Brannath (2013) show that the gold-standard design is no exception to this rule. In their paper, they calculate approximately optimal rejection boundaries for the gold-standard design given sample size allocation ratios of the optimal single-stage design. We extend their work by relaxing the constraints put on the group allocation ratios and allowing for futility stops at interim. The futility boundaries and the sample size allocation ratios will be considered as optimization parameters, together with the efficacy boundaries. This allows the investigation of the efficiency gain by including the option to stop for futility. Allowing discontinuation of a trial when faced with underwhelming results at an interim analysis has very practical implications in saving resources and sparing patients from being exposed to ineffective treatment. In the gold-standard design, these considerations are especially pronounced. There is a large incentive to prevent further patients from being exposed to placebo treatment when interim results suggest that a confirmatory result in the final analysis has become unlikely.

Besides the extended design options, we analyse different choices of optimality criteria. The above considerations suggest that the null hypothesis also plays an important role in the judgement of the gold-standard design. Therefore, optimality criteria that incorporate the design performance under the alternative and the null hypothesis are introduced. The results of our numerical optimization procedure for this extended design will be discussed and compared to the findings of Schlömer and Brannath.

Blinded sample size re-estimation in a paired diagnostic study
Maria Stark, Antonia Zapf
University Medical Center Hamburg-Eppendorf, Germany

In a paired confirmatory diagnostic accuracy study, a new experimental test is compared within the same patients to an already existing comparator test. The gold standard defines the true disease status. Hence, each patient undergoes three diagnostic procedures. If feasible and ethically acceptable, regulatory agencies prefer this study design to an unpaired design (CHMP, 2009). The initial sample size calculation is based on assumptions about, among others, the prevalence of the disease and the proportion of discordant test results between the experimental and the comparator test (Miettinen, 1968).

To adjust these assumptions during the study period, an adaptive design for a paired confirmatory diagnostic accuracy study is introduced. This adaptive design is used to re-estimate the prevalence and the proportion of discordant test results to finally re-calculate the sample size. It is a blinded adaptive design as the sensitivity and the specificity of the experimental and comparator test are not re-estimated. Due to the blinding, the type I error rates are not inflated.

An example and a simulation study illustrate the adaptive design. The type I error rate, the power and the sample size of the adaptive design are compared to those of a fixed design. Both designs hold the type I error rate. The adaptive design reaches the advertised power. The fixed design can be either over- or underpowered, depending on a possibly wrong assumption in the sample size calculation.

The adaptive design compensates for inefficiencies of the sample size calculation and thereby helps to reach the desired study aim.


[1] Committee for Medicinal Products for Human Use, Guideline on clinical evaluation of diagnostic agents. Available at: http://www. 2009, 1-19.

[2] O. S. Miettinen, The matched pairs design in the case of all-or-none responses. 24 2 1968, 339-352.

Sample size calculation and blinded re-estimation for diagnostic accuracy studies considering missing or inconclusive test results
Cordula Blohm1, Peter Schlattmann2, Antonia Zapf3
1Universität Heidelberg; 2Universitätsklinikum Jena; 3Universitätsklinikum Hamburg-Eppendorf


For diagnostic accuracy studies the two independent co-primary endpoints, sensitivity and specificity, are relevant. Both parameters are calculated based on the disease status and the test result, which is evaluated as either positive or negative. Sometimes the test result is neither positive nor negative but inconclusive or even missing. There are four frequently used methods of handling such results: they are counted as missing, as positive, as negative, or as false positive and false negative. The first three approaches may lead to an overestimation of both parameters, or of either sensitivity or specificity, respectively. In the fourth approach, the intention to diagnose principle (ITD), both parameters decrease and a more realistic picture of the clinical potential of diagnostic tests is provided (Schuetz et al. 2012).
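The four handling methods can be made concrete with a small sketch. It assumes that inconclusive results are counted separately for diseased and non-diseased subjects; the function name, dictionary keys and counts are hypothetical.

```python
def accuracy_estimates(tp, fn, inc_d, tn, fp, inc_nd):
    """Return (sensitivity, specificity) under four handling methods.

    tp, fn, inc_d:  true positives, false negatives and inconclusive
                    results among diseased subjects.
    tn, fp, inc_nd: true negatives, false positives and inconclusive
                    results among non-diseased subjects.
    """
    n_d = tp + fn + inc_d      # all diseased subjects
    n_nd = tn + fp + inc_nd    # all non-diseased subjects
    return {
        # Inconclusive results are excluded from both denominators.
        "missing": (tp / (tp + fn), tn / (tn + fp)),
        # Inconclusive results count as test-positive.
        "positive": ((tp + inc_d) / n_d, tn / n_nd),
        # Inconclusive results count as test-negative.
        "negative": (tp / n_d, (tn + inc_nd) / n_nd),
        # Intention to diagnose: false negative if diseased,
        # false positive if non-diseased.
        "itd": (tp / n_d, tn / n_nd),
    }


# Hypothetical counts for illustration only.
for method, (sens, spec) in accuracy_estimates(80, 10, 10, 170, 20, 10).items():
    print(f"{method:8s} sensitivity={sens:.3f} specificity={spec:.3f}")
```

Under ITD neither estimate can exceed the corresponding estimate from any of the other three methods, which matches the statement that both parameters decrease.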

Sensitivity and specificity are also key parameters in sample size calculation of diagnostic accuracy studies and a realistic estimate of them is mandatory for the success of a trial. Therefore, the consideration of inconclusive results in the initial sample size calculation and, especially, in a blinded sample size re-calculation based on an interim analysis could improve trial design.


For sample size calculation, the minimum sensitivity and specificity, the type I error rate and the power are defined. In addition, the expected sensitivity and specificity of the experimental test, the prevalence, and the proportion of inconclusive results are assumed. For the simulation study different scenarios are chosen by varying these parameters. The optimal sample size is calculated according to Stark and Zapf (2020). The inconclusive results are generated independently of disease status and randomly distributed over diseased and non-diseased subjects. The sensitivity and specificity of the experimental test are estimated while considering the four different methods, mentioned above, to handle inconclusive results.

The sample size re-calculation is performed with a blinded one-time re-estimation of the proportion of inconclusive results. The power, the type I error rate, and the bias of estimated sensitivity and specificity are used as performance measures.


The simulation study aims to evaluate the influence of inconclusive results on the evaluation of diagnostic test accuracy in an adaptive study design. The performance difference of the four methods to handle inconclusive results will be discussed.

Adaptive Designs I

Chairs: Thomas Asendorf and Rene Schmidt

Statistical Issues in Confirmatory Platform Trials
Martin Posch, Elias Meyer, Franz König
Center for Medical Statistics, Informatics and Intelligent Systems, Medical University of Vienna, Austria

Adaptive platform trials provide a framework to simultaneously study multiple treatments in a disease. They are multi-armed trials where interventions can enter and leave the platform based on interim analyses as well as external events, for example, if new treatments become available [3]. The attractiveness of platform trials compared to separate parallel group trials is not only due to operational aspects such as a joint trial infrastructure and more efficient patient recruitment, but also results from the possibility to share control groups, to efficiently prune non-efficacious treatments, and to allow for direct comparisons between experimental treatment arms [2]. However, the flexibility of the framework also comes with challenges for statistical inference and interpretation of trial results, such as the adaptivity of platform trials (decisions on the addition or dropping of arms cannot be fully pre-specified and may have an impact on the recruitment for the current trial arms), multiplicity issues (due to multiple interventions, endpoints, subgroups and interim analyses) and the use of shared controls (which may be non-concurrent controls or control groups where the control treatment changes over time). We will discuss current controversies and the proposed statistical methodology to address these issues [1,3,4]. Furthermore, we give an overview of the IMI project EU-PEARL (Grant Agreement no. 853966) that aims to establish a general framework for platform trials, including the necessary statistical and methodological tools.

[1] Collignon O., Gartner C., Haidich A.-B., Hemmings R.J., Hofner B., Pétavy F., Posch M., Rantell K., Roes K., Schiel A. Current Statistical Considerations and Regulatory Perspectives on the Planning of Confirmatory Basket, Umbrella, and Platform Trials. Clinical Pharmacology & Therapeutics 107(5), 1059–1067, (2020)

[2] Collignon O., Burman C.F., Posch M., Schiel A. Collaborative platform trials to fight COVID-19: methodological and regulatory considerations for a better societal outcome. Clinical Pharmacology & Therapeutics (to appear)

[3] Meyer E.L., Mesenbrink P., Dunger-Baldauf C., Fülle H.-J., Glimm E., Li Y., Posch M., König F. The Evolution of Master Protocol Clinical Trial Designs: A Systematic Literature Review. Clinical Therapeutics 42(7), 1330–1360, (2020)

[4] Posch M., König F. Are p-values Useful to Judge the Evidence Against the Null Hypotheses in Complex Clinical Trials? A Comment on “The Role of p-values in Judging the Strength of Evidence and Realistic Replication Expectations”. Statistics in Biopharmaceutical Research, 1–3, (2020)

Type X Error: Is it time for a new concept?
Cornelia Ursula Kunz
Boehringer Ingelheim Pharma GmbH & Co. KG, Germany

A fundamental principle of how we decide between different trial designs and different test statistics is the control of error rates as defined by Neyman and Pearson, namely the type I error rate and the type II error rate. The first is the probability of rejecting a true null hypothesis, and the second is the probability of not rejecting a false null hypothesis. When Neyman and Pearson first introduced the concepts of type I and type II error, they could not have predicted the increasing complexity of many trials conducted today and the problems that arise with them.

Modern clinical trials often try to address several clinical objectives at once, hence testing more than one hypothesis. In addition, trial designs are becoming more and more flexible, allowing an ongoing trial to be adapted by changing, for example, the number of treatment arms, target populations, sample sizes and so on. It is also known that in some cases the adaptation of the trial leads to a change of the hypothesis being tested, as happens, for example, when the primary endpoint of the trial is changed at an interim analysis.

While the focus of Neyman and Pearson was on finding the most powerful test for a given hypothesis, we nowadays often face the problem of finding the right trial design in the first place, before even attempting to find the most powerful test or, in some cases, any test at all. Furthermore, when more than one hypothesis is being tested, family-wise type I error control in the weak or strong sense also has to be addressed, with different opinions on when we need to control for it and when we might not need to.

Based on some trial examples, we show that the more complex the clinical trial objectives, the more difficult it might be to establish a trial that is actually able to answer the research question. Often it is not sufficient or even possible to translate the trial objectives into simple hypotheses that are then being tested by some most powerful test statistic. However, when the clinical trial objectives cannot completely be addressed by a set of null hypotheses, type I and type II error might not be sufficient anymore to decide on the admissibility of a trial design or test statistic. Hence, we raise the question whether a new kind of error should be introduced.

Control of the population-wise error rate in group sequential trials with multiple populations
Charlie Hillner, Werner Brannath
Competence Center for Clinical Trials, Germany

In precision medicine one is often interested in clinical trials that investigate the efficacy of treatments that are targeted to specific sub-populations defined by genetic and/or clinical biomarkers. When testing hypotheses in multiple populations, multiplicity adjustments are needed. First, we propose a new multiple type I error criterion for clinical trials with multiple intersecting populations, which is based on the observation that not all type I errors are relevant to all patients in the overall population. If the sub-populations are disjoint, no adjustment for multiplicity appears necessary, since a claim in one sub-population does not affect patients in the other ones. For intersecting sub-populations we suggest controlling the probability that a randomly selected patient will be exposed to an inefficient treatment, which is an average multiple type I error that we refer to as the population-wise error rate (PWER). We propose group sequential designs that control the PWER where possibly multiple treatments are investigated in multiple populations. To this end, an error spending approach that ensures PWER-control is introduced. We exemplify this approach for a setting of two intersecting sub-populations and discuss how the number of different treatments to be tested in each sub-population affects the critical boundaries needed for PWER-control. Lastly, we apply this error spending approach to a group sequential design example from Magnusson & Turnbull (2013), where the efficacy of one treatment is to be tested after a certain sub-population that is likely to benefit from the treatment is found. We compare our PWER-controlling method with their FWER-controlling method in terms of critical boundaries and the resulting rejection probabilities and expected information.
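The difference between the two error criteria can be sketched for a single-stage test in two intersecting populations. The simulation below is an illustration under strong assumptions, not the authors' group sequential procedure: one treatment per population, independent standard normal stratum-wise z-statistics under the global null, information proportional to stratum prevalence, and a common one-sided critical value. All names and numbers are hypothetical.

```python
import random


def simulate_error_rates(pi_1, pi_2, pi_12, crit, n_sim=100_000, seed=1):
    """Monte Carlo estimates of (PWER, FWER) under the global null.

    pi_1, pi_2: prevalence of patients belonging only to population 1 or 2.
    pi_12:      prevalence of the intersection (the three must sum to 1).
    """
    rng = random.Random(seed)
    pwer = fwer = 0.0
    for _ in range(n_sim):
        # Independent stratum-wise z-statistics under the null.
        x1, x2, x12 = (rng.gauss(0.0, 1.0) for _ in range(3))
        # Population-wise z-statistics pool the strata they contain.
        z1 = (pi_1 ** 0.5 * x1 + pi_12 ** 0.5 * x12) / (pi_1 + pi_12) ** 0.5
        z2 = (pi_2 ** 0.5 * x2 + pi_12 ** 0.5 * x12) / (pi_2 + pi_12) ** 0.5
        rej1, rej2 = z1 > crit, z2 > crit
        # A false rejection in population k only affects patients in it.
        pwer += pi_1 * rej1 + pi_2 * rej2 + pi_12 * (rej1 or rej2)
        fwer += rej1 or rej2
    return pwer / n_sim, fwer / n_sim


pwer, fwer = simulate_error_rates(0.4, 0.4, 0.2, crit=1.96)
print(f"PWER ~ {pwer:.4f}, FWER ~ {fwer:.4f}")
```

Because the PWER weights each false rejection by the share of patients it concerns, it can never exceed the FWER for the same critical value, so PWER-control permits less conservative boundaries.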


Magnusson, B.P. and Turnbull, B.W. (2013), Group sequential enrichment design incorporating subgroup selection. Statist. Med., 32: 2695-2714.

Adaptive group sequential designs for phase II trials with multiple time-to-event endpoints
Moritz Fabian Danzer1, Tobias Terzer2, Andreas Faldum1, Rene Schmidt1
1Institute of Biostatistics and Clinical Research, University of Münster, Germany; 2Division of Biostatistics, German Cancer Research Center, Heidelberg, Germany

Existing methods concerning the assessment of long-term survival outcomes in one-armed trials are commonly restricted to one primary endpoint. Corresponding adaptive designs suffer from limitations regarding the use of information from other endpoints in interim design changes. Here we provide adaptive group sequential one-sample tests for testing hypotheses on the multivariate survival distribution derived from multi-state models, while making provision for data-dependent design modifications based on all involved time-to-event endpoints. We explicitly elaborate application of the methodology to one-sample tests for the joint distribution of (i) progression-free survival (PFS) and overall survival (OS) in the context of an illness-death model, and (ii) time to toxicity and time to progression while accounting for death as a competing event. Large sample distributions are derived using a counting process approach. Small sample properties and sample size planning are studied by simulation. An already established multi-state model for non-small cell lung cancer is used to illustrate the adaptive procedure.

Poster Session

Sample Size in Bioequivalence Cross-Over Trials with Balanced Incomplete Block Design
Lina Hahn1, Gerhard Nehmiz2, Jan Beyersmann1, Salome Mack2
1Universität Ulm, Institut für Statistik, Germany; 2Boehringer Ingelheim Pharma GmbH & Co. KG, Biberach

In cross-over trials, all subjects receiving the same sequence of treatments form one sequence group, so there are s sequence groups of equal size n/s. If, due to limitations, the number of periods (p) is smaller than the number of treatments (t), we have an Incomplete Block Design. If the allocation of treatments to sequence groups is balanced, it is a Balanced Incomplete Block Design (BIBD).

Necessary conditions are: If r is the number of different sequences each treatment appears in, and if lambda is the number of different sequences in which each treatment pair occurs, balance implies r = p * s / t and lambda = r * (p-1) / (t-1). BIBDs exist for any t and p but can become large (Finney 1963). We investigate two examples with reasonable size, an internal one and the example of Senn (1997).
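The two balance conditions can be checked mechanically for a candidate design. The sketch below uses exact rational arithmetic and additionally assumes that r and lambda, being counts of sequences, must be integers; the function name and the example values of t, p and s are hypothetical and are not the two examples investigated here.

```python
from fractions import Fraction


def bibd_parameters(t, p, s):
    """Return (r, lam) for t treatments, p periods and s sequence groups.

    r   = p * s / t             sequences in which each treatment appears
    lam = r * (p - 1) / (t - 1) sequences in which each treatment pair occurs
    Raises ValueError when either value is not an integer.
    """
    r = Fraction(p * s, t)
    lam = r * (p - 1) / (t - 1)
    if r.denominator != 1 or lam.denominator != 1:
        raise ValueError(f"no balance possible: r={r}, lambda={lam}")
    return int(r), int(lam)


print(bibd_parameters(t=4, p=3, s=4))  # four treatments in three periods
```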

Furthermore, period effects are likely in medical trials, so in the BIBD the allocation of treatments to periods also has to be balanced in order for treatment contrasts to be estimated without bias. It is sufficient that s is an integer multiple of t, or equivalently that r is an integer multiple of p (Hartley 1953). Our two examples fulfil this.

Let the linear model for the measurements y be as usual, with i.i.d. random subject effects and fixed terms for treatment, period and sequence group (absorbed by the subject effect). All treatment contrasts can then be estimated in an unbiased manner, and if all error terms are i.i.d. N(0,sigma_e^2) and the subject effects are independent of these, the variance of the contrasts can be estimated as well. While in a complete cross-over the contrast variance is 2*sigma_e^2, in a BIBD it is generically b_k*sigma_e^2, where b_k is the “design factor”. We obtain b_k = (2*p*s) / (lambda*t).
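As a quick plausibility check, the design factor can be evaluated directly from t, p and s using the formulas stated above. This is a minimal sketch; the function name and example values are hypothetical.

```python
def design_factor(t, p, s):
    """Design factor b_k = 2 * p * s / (lam * t) of a balanced design.

    lam is derived from r = p * s / t and lam = r * (p - 1) / (t - 1).
    """
    r = p * s / t
    lam = r * (p - 1) / (t - 1)
    return 2 * p * s / (lam * t)


# Complete cross-over (p == t) versus an incomplete design with p < t.
print(design_factor(t=4, p=4, s=4))
print(design_factor(t=4, p=3, s=4))
```

For a complete cross-over the factor reduces to 2, in agreement with the contrast variance 2*sigma_e^2; with fewer periods than treatments it is larger, reflecting the loss of within-subject information.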

Bioequivalence is investigated through the two one-sided tests procedure (TOST) for a treatment contrast (Schuirmann 1987). We investigate the power of the TOST in the two examples, considering the t distribution (Shen 2015, Labes 2020) and comparing it with a previous normal approximation, which leads to slightly underpowered trials.

Investigating the quality of reporting in RCT abstracts on COVID-19 according to CONSORT (CoCo study) – interim report of a review
Sabrina Tulka, Christine Baulig, Stephanie Knippschild
Lehrstuhl für Medizinische Biometrie und Epidemiologie, Universität Witten/Herdecke, Germany

Background: In 2020, the global COVID-19 crisis, because of its explosive nature and urgency, led to accelerated research activity and peer-review procedures. Although full texts are currently freely available, they are not automatically also freely accessible (e.g. when not written in English). In addition, high time pressure often forces medical staff to gain a first overview of specific topics solely from abstracts. Abstracts therefore play a key role and frequently form the basis for decisions. The CONSORT statement for abstracts provides all authors with a guideline for ensuring the quality (completeness and transparency) of the reporting of medical research (including in abstracts). The aim of this study was to examine the completeness of the abstracts of all COVID-19 RCTs published to date.

Methods: By means of a literature search in PubMed and Embase, all publications up to 29 October 2020 were retrieved and checked with regard to their topic (reports results of coronavirus studies) and their study design (RCT). Eligible publications were then examined, first, for completeness of information (information can be found at all) and, second, for correctness (information reported in accordance with CONSORT for abstracts). The basis was the CONSORT checklist for RCT abstracts with a total of 16 items. The assessment was carried out independently by two raters and subsequently resolved by consensus. The primary endpoint of the study was the proportion of correctly implemented CONSORT items. As a secondary endpoint, the frequency of correct reporting of each individual item was examined.

Results: Of a total of 88 publications, 30 could be included in the analysis as publications of an RCT. In the median, the abstracts examined reported 63% of the required criteria (interquartile range: 44% to 88%, minimum: 25%, maximum: 100%). A median of 50% of the criteria were implemented correctly (interquartile range: 31% to 70%, minimum: 12.5%, maximum: 87.5%). The “number of patients analysed” (20%) and “complete results for the primary endpoint” (37%) were reported least often. Information on the intervention was found in 97% of the abstracts, but was correct (complete) in only 43% of them.

Discussion: Half of all abstracts contained at most half of all necessary information. Particularly striking were the incomplete reporting of the final number of patients, of the presentation of results, and of the therapeutic approaches used in each group. Because in a fast-moving crisis such as the COVID-19 pandemic there is little time for a complete and critical review of all full texts (despite their frequent availability), information in study abstracts must be reported completely and transparently. Our investigation shows that there is a clear need for action regarding the reporting quality of abstracts and constitutes an appeal to all authors.

A Scrum related interdisciplinary project between Reproductive Toxicology and Nonclinical Statistics to improve data transfer, statistical strategy and knowledge generation
Monika Brüning1, Bernd Baier2, Eugen Fischer2, Gaurav Berry2, Bernd-Wolfgang Igl1
1Nonclinical Statistics, Biostatistics and Data Sciences, Boehringer Ingelheim Pharma GmbH & Co. KG, Biberach an der Riss, Germany; 2Reproductive Toxicology, Nonclinical Drug Safety, Boehringer Ingelheim Pharma GmbH & Co. KG, Biberach an der Riss, Germany

The development of a new drug is a long journey full of important milestones that have to be reached successfully. After identification of a pharmaceutical candidate substance, nonclinical developmental and reproductive toxicology (DART) studies are one important element to assess the safety of the future drug. Recommendations on study design and conduct are given in ICH Guideline S5 to support human clinical trials and market access of new pharmaceuticals involving various phase-dependent designs incl. a huge number of parameters.

DART studies have to be performed in animal models and aim to detect any effect of the test item on mammalian reproduction relevant for human risk assessment.

In general, reproductive toxicology study data involve complex correlation structures between mother and offspring, e.g. maternal weight development, fetus weight, ossification status and number of littermates all in dependence of different test item doses. Thus, from a statistical point of view, DART studies are highly demanding and interesting. This complexity is not reflected in statistical approaches implemented in standard lab software.

To this end, we have developed a Scrum-inspired project to intensify the cooperation between Reproductive Toxicology and Nonclinical Statistics and to work according to agile principles. Within it, we defined, for example, processes for data transfer and analysis, including a sophisticated and scientifically state-of-the-art statistical methodology.

In this work, we will mainly focus on technical aspects for constructing an Analysis Data Set (ADS) involving regulatory requirements by CDISC SEND (Standard for Exchange of Nonclinical Data), but also sketch new concepts for visualization and statistical analysis.

Statistical Cure of Cancer in Schleswig-Holstein
Johann Mattutat1, Nora Eisemann2, Alexander Katalinic1,2
1Institute for Cancer Epidemiology, University of Lübeck, Germany; 2Institute of Social Medicine and Epidemiology, University of Lübeck, Germany

Cancer patients who have survived their treatment and who have been released into remission still live with the uncertainty of late recurrences of their disease. Yet, studies have shown that, for most cancer entities, the observed mortality in the patient group converges to the overall population mortality after some time. The amount of excess mortality and its time course can be estimated. The time point at which it falls below a defined threshold can be interpreted as “statistical cancer cure”.

This contribution shall focus on the workflow estimating the time point of statistical cancer cure. We briefly explain each step, describe design choices, and report some exemplary results for colorectal cancer. The calculations are based on data provided by the cancer registry of Schleswig-Holstein. First, a threshold for “statistical cancer cure” is defined. Then, missing information on tumor stage at diagnosis is imputed using multiple imputation. The net survival depending on cancer entity, sex, tumor stage at diagnosis (UICC) and age is estimated using a flexible excess hazard regression model implemented in the R package “mexhaz”. The excess mortality is derived using Gauss-Legendre quadrature. Finally, survival rates are estimated conditionally on the survival time since diagnosis and the defined thresholds are applied to get the estimated time point of cure.

We focus on the probability of belonging to the group that will suffer future excess mortality and define the time point at which this probability falls below 5% as time of “statistical cancer cure”. For the example of colorectal cancer (C18-C21) diagnosed in a local stage (UICC II), this probability amounts to approximately 16% at the time of diagnosis and statistical cure is reached after 4.2 years.

Results like the ones described above may support cancer patients by removing uncertainty regarding their future prognosis. As a subsequent step, comprehensive data covering most of the common cancer entities are to be generated.

Biometrical challenges of the Use Case of the Medical Informatics Initiative (MI-I) on “POLypharmacy, drug interActions, Risks” (POLAR_MI)
Miriam Kesselmeier1, Martin Boeker2, Julia Gantner1, Markus Löffler3, Frank Meineke3, Thomas Peschel3, Jens Przybilla3, André Scherag1, Susann Schulze4, Judith Schuster3, Samira Zeynalova3, Daniela Zöller2
1Institute of Medical Statistics, Computer and Data Sciences, Jena University Hospital, Jena, Germany; 2Institute of Medical Biometry and Statistics, University of Freiburg, Freiburg, Germany; 3Institute for Medical Informatics, Statistics and Epidemiology, University Leipzig, Leipzig, Germany; 4Geschäftsbereich Informationstechnologie, Universitätsklinikum Hamburg-Eppendorf, Hamburg, Germany

Introduction: The aim of POLAR_MI is to use (and, where necessary, adapt/develop) methods and processes of the MI-I to contribute to the detection of health risks in patients with polypharmacy. Polypharmacy occurs especially in elderly patients with multi-morbidity. It is associated with an increased risk for medication errors and drug-drug or drug-disease interactions, which either reduce or intensify the desired effect of individual active substances or lead to undesired adverse drug effects. The project involves an interdisciplinary team ranging from medical informatics to pharmacy and clinical pharmacology with experts from 21 institutions, among them 13 university hospitals and their data integration centres (DICs). Here we focus on some of the biometrical challenges of POLAR_MI.

Material and methods: POLAR_MI relies on the infrastructure of the DICs. The tasks of a DIC include the transfer of data from a wide range of data-providing systems, their interoperable integration and processing while ensuring data quality and data protection. Ultimately, DICs should contribute to a data sharing culture in medicine along the FAIR principles. POLAR_MI is designed to utilize (anonymous) data conforming to the MI-I core data set specification. The generic biometrical concept foresees a two-step procedure: 1) Aggregation (including analysis) of individual patient-level data locally at each DIC using distributed computing mechanisms and shared algorithms, because the security of personal data cannot be guaranteed by anonymisation alone. 2) Afterwards, combination of aggregated data across all contributing DICs. The formulation of these steps requires continuous feedback from the other working groups of POLAR_MI.

Results: To practically implement the biometric concept, we have developed multiple, small iterative steps that alternate between pharma and DIC team. These steps are addressed by pilot data use projects initially limited to single DICs. The steps cover definitions of potentially drug-related events (like falls, delirium or acute renal insufficiency), active substances, outcomes and related value ranges or requirements regarding data privacy issues. Additionally, the handling of missing data, possibly non-ignorable heterogeneity between the DICs as well as inclusion and exclusion criteria for the different research questions of POLAR_MI were discussed.

Conclusion: Based on the results of the pilot data use projects, analysis approaches for the main hypotheses of POLAR_MI will be developed. The iterative workflow and the necessary steps presented here may serve as a blueprint for other projects using real-world data.

DNT: An R package for differential network testing, with an application to intensive care medicine
Roman Schefzik, Leonie Boland, Bianka Hahn, Thomas Kirschning, Holger Lindner, Manfred Thiel, Verena Schneider-Lindner
Medical Faculty Mannheim, Heidelberg University, Germany

Statistical network analyses have become popular in many scientific disciplines, where a specific and important task is to test for significant differences between two networks. In our R package DNT, which will be made publicly available, we implement a general framework for differential network testing procedures that differ with respect to (1) the network estimation method (typically based on specific concepts of association) and (2) the network characteristic employed to measure the difference. Using permutation-based tests with variants for paired and unpaired settings, our approach is general and applicable to various overall, node-specific or edge-specific network difference characteristics. Moreover, tools for visual comparison of two networks are implemented. Along with the package, we provide a corresponding user-friendly R Shiny application.
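The permutation idea for the unpaired setting can be illustrated with a minimal Python sketch. It is not the DNT package and does not reproduce its interface: the choice of Pearson correlation as the network estimator, the Frobenius norm as the overall difference characteristic, and all function names are assumptions made for illustration only.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def correlation_network(samples):
    """Estimate a network as the matrix of pairwise correlations.

    `samples` is a list of observations, each a list of p variable values.
    """
    p = len(samples[0])
    cols = [[row[j] for row in samples] for j in range(p)]
    return [[pearson(cols[i], cols[j]) for j in range(p)] for i in range(p)]


def frobenius_difference(net_a, net_b):
    """Overall network difference: Frobenius norm of the matrix difference."""
    return math.sqrt(sum((a - b) ** 2
                         for ra, rb in zip(net_a, net_b)
                         for a, b in zip(ra, rb)))


def permutation_test(group_a, group_b, n_perm=500, seed=1):
    """Unpaired permutation p-value for the overall network difference."""
    rng = random.Random(seed)
    observed = frobenius_difference(correlation_network(group_a),
                                    correlation_network(group_b))
    pooled = group_a + group_b
    n_a = len(group_a)
    exceed = 0
    for _ in range(n_perm):
        # Reassign group labels at random and recompute the statistic.
        rng.shuffle(pooled)
        stat = frobenius_difference(correlation_network(pooled[:n_a]),
                                    correlation_network(pooled[n_a:]))
        if stat >= observed:
            exceed += 1
    # Add-one correction so the p-value is never exactly zero.
    return observed, (exceed + 1) / (n_perm + 1)
```

Node-specific or edge-specific characteristics would replace `frobenius_difference` by a statistic on a single row or entry; a paired variant would permute labels within pairs rather than across the pooled sample.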

As an example, we demonstrate the usefulness of our package in a novel application to a specific issue in intensive care medicine. In particular, we show that statistical network comparisons based on parameters representing the main organ systems are beneficial for evaluating the prognosis of critically ill patients in the intensive care unit (ICU), using patient data from the surgical ICU of the University Medical Centre Mannheim, Germany. We consider both cross-sectional comparisons between a non-survivor and a survivor group (identified from the electronic medical records by combining risk set sampling and propensity score matching) and longitudinal comparisons at two different, clinically relevant time points during the ICU stay: first, after admission, and second, at an event stage prior to death in non-survivors or a matching time point in survivors. While specific outcomes depend on the network estimation method and network difference characteristic considered, some observations hold across methods. For instance, we find relevant changes in organ system interactions in critically ill patients over the course of ICU treatment: while the network structures at the admission stage tend to look fairly similar among survivors and non-survivors, the corresponding networks at the event stage differ substantially. In particular, organ system interactions appear to stabilize for survivors, whereas for non-survivors they do not stabilize or even deteriorate. Moreover, at the edge-specific level, a positive association between creatinine and C-reactive protein is typically present in all the considered networks except for the non-survivor networks at the event stage.

Timur Tug1, Annette Bitsch2, Frank Bringezu3, Steffi Chang4, Julia Duda1, Martina Dammann5, Roland Frötschl6, Volker Harm7, Bernd-Wolfgang Igl8, Marco Jarzombek13, Rupert Kellner2, Fabian Kriegel13, Jasmin Lott8, Stefan Pfuhler9, Ulla Plappert-Helbig10, Markus Schulz4, Lea Vaas7, Marie Vasquez12, Dietmar Zellner8, Christina Ziemann2, Verena Ziegler11, Katja Ickstadt1
1Department of Statistics, TU Dortmund University, Dortmund, Germany; 2Fraunhofer Institute for Toxicology and Experimental Medicine ITEM, Hannover, Germany; 3Merck KGaA, Biopharma and Non-Clinical Safety, Darmstadt, Germany; 4ICCR-Roßdorf GmbH, Rossdorf, Germany; 5BASF SE, Ludwigshafen am Rhein, Germany; 6Federal Institute for Drugs and Medical Devices (BfArM), Bonn, Germany; 7Bayer AG, Berlin, Germany; 8Boehringer Ingelheim Pharma GmbH & Co. KG, Biberach an der Riss, Germany; 9Procter & Gamble, Cincinnati, Ohio, USA; 10Lörrach, Germany; 11Bayer AG, Wuppertal, Germany; 12Helix3 Inc, Morrisville, NC, USA; 13NUVISAN ICB GmbH, Preclinical Compound Profiling, Germany

The in vivo alkaline Comet assay, also known as the single cell gel electrophoresis assay, is a standard test in genetic toxicology for measuring DNA damage and repair at the level of individual cells. It is a sensitive, fast and simple method for detecting single or double strand breaks and is therefore widely used in several regulatory frameworks today. In 2016, several nonclinical statisticians and toxicologists from academia, industry and one regulatory body founded a working group “Statistics” within the “Gesellschaft für Umwelt-Mutationsforschung e.V.” (GUM). So far, this interdisciplinary group has collected data from more than 200 experiments performed in various companies to take a closer look at various aspects of the statistical analysis of Comet data.

In this work, we sketch the assay itself and the related data processing strategies. Moreover, we briefly describe the effect of different summarizing techniques for transferring data from the cell level to the slide or animal level, which can influence the final outcome of the test dramatically. Finally, we present results of various inferential statistical models, including comparisons between them, with a special focus on the incorporation of historical control data.

Meta-Cox-regression in DataSHIELD – Federated time-to-event-analysis under data protection constraints
Ghislain N. Sofack1, Daniela Zöller1, Saskia Kiefer1, Denis Gebele1, Sebastian Fähndrich2, Friedrich Kadgien2, Dennis Hasenpflug3
1Institut für Medizinische Biometrie und Statistik (IMBI), Universität Freiburg; 2Department Innere Medizin, Universitätsklinikum Freiburg; 3Datenintegrationszentrum, Philipps-Universität Marburg


Studies published so far suggest that Chronic Obstructive Pulmonary Disease (COPD) may be associated with higher rates of mortality in patients with coronavirus disease 2019 (COVID-19). However, the number of cases at a single site is often rather small, making statistical analysis challenging. To address this problem, the data from several sites from the MIRACUM consortium shall be combined. Due to the sensitivity of individual-level data, ethical and practical considerations related to data transmission, and institutional policies, individual-level data cannot be shared. As an alternative, the DataSHIELD framework based on the statistical programming language R can be used. Here, the individual-level data remain within each site and only anonymous aggregated data are shared.

Problem statement

Up to now, no time-to-event analysis methods are implemented in DataSHIELD. We aim at implementing a meta-regression approach based on the Cox-model in DataSHIELD where only anonymous aggregated data are shared, while simultaneously allowing for explorative, interactive modelling. The approach will be exemplarily applied to explore differences in survival between COVID-19 patients with and those without COPD.


First, we present the development of a server-side and a client-side DataSHIELD package for calculating survival objects and fitting the Cox proportional hazards regression model to the individual-level data at each site. The sensitive patient-level data stored on each server are processed locally in RStudio, and only less sensitive intermediate statistics, such as the coefficient matrices and the variance-covariance matrices, are exchanged and combined via study-level meta-analysis (SLMA) regression techniques to obtain a global analysis. We will demonstrate the process of evaluating the output of the local Cox regressions for data protection breaches. As an example, we will show results comparing the survival of COVID-19 patients with and without COPD using the COVID-19 data distributed across different sites of the MIRACUM consortium.
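The combination step can be sketched for a single coefficient. The following Python snippet is an illustration only and not the DataSHIELD implementation: it assumes fixed-effect inverse-variance pooling of site-level log hazard ratios, and the function name and input values are hypothetical.

```python
import math


def pool_fixed_effect(estimates, std_errors):
    """Inverse-variance (fixed-effect) pooling of site-level log hazard ratios."""
    weights = [1.0 / se ** 2 for se in std_errors]
    total = sum(weights)
    pooled = sum(w * b for w, b in zip(weights, estimates)) / total
    pooled_se = math.sqrt(1.0 / total)
    return pooled, pooled_se


# Hypothetical site-level results (not study data):
# log hazard ratio for COPD and its standard error at three sites.
site_log_hr = [0.40, 0.25, 0.55]
site_se = [0.20, 0.30, 0.25]

log_hr, se = pool_fixed_effect(site_log_hr, site_se)
ci = (math.exp(log_hr - 1.96 * se), math.exp(log_hr + 1.96 * se))
print(f"pooled HR = {math.exp(log_hr):.2f}, 95% CI = ({ci[0]:.2f}, {ci[1]:.2f})")
```

Only the coefficient and its standard error leave each site in this sketch. With several covariates, the same idea applies to the coefficient vectors and variance-covariance matrices, and a random-effects model may be preferable when sites are heterogeneous.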


In conclusion, we provide an implementation of SLMA Cox regression in the DataSHIELD framework to enable explorative and interactive modelling of distributed survival data under data protection constraints. We demonstrate its applicability using data from the MIRACUM consortium. By demonstrating the process of evaluating the output of the Cox regression for data protection breaches, we raise awareness of the problem.

Quantification of severity of alcohol harms from others’ drinking items using item response theory (IRT)
Ulrike Grittner1, Kim Bloomfield1,2,3,4, Sandra Kuntsche5, Sarah Callinan5, Oliver Stanesby5, Gerhard Gmel6,7,8,9
1Institute of Biometry and Clinical Epidemiology, Charité – Universitätsmedizin Berlin, Germany; 2Centre for Alcohol and Drug Research, Aarhus University, Denmark; 3Research Unit for Health Promotion, University of Southern Denmark, Denmark; 4Alcohol Research Group, Emeryville, CA, USA; 5Centre for Alcohol Policy Research, La Trobe University, Melbourne, Australia; 6Alcohol Treatment Centre, Lausanne University Hospital CHUV, Lausanne, Switzerland; 7Addiction Switzerland, Research Department, Lausanne, Switzerland; 8Centre for Addiction and Mental Health, Institute for Mental Health Policy Research, Toronto, Ontario, Canada; 9University of the West of England, Faculty of Health and Applied Science, Bristol, United Kingdom

Background: Others’ heavy drinking might negatively affect quality of life, mental and physical health, as well as work and family situation. However, little is known so far about which of these experiences are seen as most or least harmful, and who is most affected.

Methods: Data stem from large population-based surveys from 10 countries of the GENAHTO project (GENAHTO: Gender & Alcohol’s Harms to Others). Questions about harms from others’ heavy drinking concern verbal and physical harm, damage to clothing, belongings or property, traffic accidents, harassment, threatening behaviour, family problems, problems with friends, problems at work, and financial problems. We used item response theory (IRT) methods (two-parameter logistic (2PL) model) to scale the aforementioned items for each country separately. To acknowledge culturally related sensitivities to experiences of harm in different countries, we also examined differential item functioning (DIF). This resulted in country-wise standardised person parameters for each individual of each country, indicating a quantified measure of the load of alcohol’s harms to others (AHTO). In multiple linear mixed models (random intercept for country) we analysed how the load of AHTO was related to sex, age, own drinking and education.
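As a brief illustration of the 2PL model mentioned above, the probability of reporting a harm item can be written as a logistic function of the person parameter, with one discrimination and one severity parameter per item. The item names and parameter values below are hypothetical and are not estimates from the GENAHTO data.

```python
import math


def irf_2pl(theta, a, b):
    """Probability of endorsing an item under the 2PL model.

    theta: person parameter (here: load of harm),
    a: item discrimination, b: item severity (difficulty).
    """
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


# Hypothetical item parameters (discrimination, severity), for illustration only.
items = {"verbal harm": (1.2, 0.5), "physical harm": (1.8, 2.0)}
for name, (a, b) in items.items():
    probs = [round(irf_2pl(theta, a, b), 3) for theta in (-1.0, 0.0, 1.0, 2.0)]
    print(name, probs)
```

An item with a larger severity parameter b is endorsed with probability 0.5 only at a higher person parameter, which is how the model orders items by severity.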

Results: Younger age, female sex and higher level of own drinking were related to a higher load of AHTO. However, interaction of age and own drinking indicated that only for younger age did own drinking level play a role.

Conclusions: Using IRT, we were able to evaluate differing grades of severity in the experiences of harm from others’ heavy drinking.

A note on Rogan-Gladen estimate of the prevalence
Barbora Kessel, Berit Lange
Helmholtz Zentrum für Infektionsforschung, Germany

When estimating prevalence based on data obtained by an imperfect diagnostic test, an adjustment for the sensitivity and specificity of the test is desired. The test characteristics are usually determined in a validation study and are known only with uncertainty, which should be accounted for as well. The classical Rogan-Gladen correction [4] comes with an approximate confidence interval based on normality and the delta method. However, in the literature it was found to have lower than nominal coverage when prevalence is low and both sensitivity and specificity are close to 1 [2]. In a recent simulation study [1], the empirical coverage of a nominal 95% Rogan-Gladen confidence interval was mostly below 90% and as low as 70% over a wide range of setups. These results are much worse than those reported in [2] and make the Rogan-Gladen interval inadvisable in practice. Since we are interested in applying the Rogan-Gladen method to estimate the seroprevalence of SARS-CoV-2 infections, as was done e.g. in [5], we will present detailed simulation results clarifying the properties of the Rogan-Gladen method in setups with low true prevalences and high specificities, as seen in the current seroprevalence studies of SARS-CoV-2 infections (see e.g. the overview [3]). We will also take into account that in actual studies the final estimate is often a weighted average of prevalences in subgroups of the population. We will make recommendations on when the modification of the procedure suggested by Lang and Reiczigel [2] is necessary. To conclude, we note that since the uncertainties in the sensitivity and specificity estimates used for the correction influence the uncertainty of the corrected prevalence, it is highly desirable to always state not only the values of the test characteristics but also the values of their uncertainties used for the correction. Reporting the crude uncorrected prevalences as well facilitates future re-use of the results, e.g. in meta-analyses.
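For orientation, the classical correction and its delta-method interval can be written down in a few lines. The sketch below assumes that sensitivity and specificity were estimated from independent validation samples of sizes n_se and n_sp; the function name and argument names are chosen for illustration, and the modification of Lang and Reiczigel [2] is not included.

```python
import math


def rogan_gladen(p_obs, n, se, n_se, sp, n_sp, z=1.96):
    """Rogan-Gladen corrected prevalence with a delta-method Wald interval.

    p_obs: apparent prevalence in a sample of size n.
    se, sp: sensitivity and specificity estimated from n_se diseased and
            n_sp non-diseased validation subjects.
    """
    j = se + sp - 1.0  # Youden index; must be positive for the correction
    if j <= 0:
        raise ValueError("sensitivity + specificity must exceed 1")
    prev = (p_obs + sp - 1.0) / j
    # Delta method: binomial variances of p_obs, se and sp, propagated to prev.
    var = (p_obs * (1 - p_obs) / n
           + prev ** 2 * se * (1 - se) / n_se
           + (1 - prev) ** 2 * sp * (1 - sp) / n_sp) / j ** 2
    half = z * math.sqrt(var)

    def clip(v):
        # Truncate to [0, 1]; this does not repair the coverage problems.
        return min(1.0, max(0.0, v))

    return clip(prev), (clip(prev - half), clip(prev + half))
```

The second and third variance terms carry the uncertainty of the test characteristics, which is why the validation sample sizes should be reported together with the sensitivity and specificity values.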

[1] Flor M, Weiß M, Selhorst T et al. (2020). BMC Public Health 20:1135, doi: 10.1186/s12889-020-09177-4

[2] Lang Z, Reiczigel J (2014). Preventive Veterinary Medicine 113, pp. 13–22, doi: 10.1016/j.prevetmed.2013.09.015

[3] Neuhauser H, Thamm R, Buttmann-Schweiger N et al. (2020). Epid Bull 50, pp. 3–6, doi: 10.25646/7728

[4] Rogan WJ, Gladen B (1978). American Journal of Epidemiology 107(1), pp. 71–76, doi: 10.1093/oxfordjournals.aje.a112510

[5] Santos-Hövener C, Neuhauser HK, Schaffrath Rosario A et al. (2020). Euro Surveill 25(47):pii=2001752, doi: 10.2807/1560-7917.ES.2020.25.47.2001752

On variance estimation for the one-sample log-rank test
Moritz Fabian Danzer, Andreas Faldum, Rene Schmidt
Institute of Biostatistics and Clinical Research, University of Münster, Germany

Time-to-event endpoints are increasingly popular in phase II cancer trials. The standard statistical tool for such endpoints in one-armed trials is the one-sample log-rank test. It is widely known that the asymptotics justifying this test do not fully take effect for small sample sizes. There have already been some attempts to solve this problem. While some do not allow easy power and sample size calculations, others lack a clear theoretical motivation and require further considerations. The problem itself can partly be attributed to the dependence between the compensated counting process and its variance estimator. We provide a framework in which the variance estimator can be flexibly adapted to the situation at hand while maintaining its asymptotic properties. As an example, we suggest a variance estimator which is uncorrelated with the compensated counting process. Furthermore, we provide sample size and power calculations for any approach fitting into our framework. Finally, we compare several methods via simulation studies and the hypothetical setup of a phase II trial based on real-world data.
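For reference, the classical form of the test can be sketched as follows. This is the textbook version, in which the expected number of events E also serves as the variance estimate; it is not the alternative variance estimator proposed in the abstract. The data and the exponential reference hazard are hypothetical.

```python
import math
from statistics import NormalDist


def one_sample_logrank(times, events, cum_hazard):
    """Classical one-sample log-rank statistic Z = (O - E) / sqrt(E).

    times: follow-up time per patient; events: 1 = event, 0 = censored;
    cum_hazard: cumulative hazard function of the reference (null) distribution.
    """
    observed = sum(events)
    expected = sum(cum_hazard(t) for t in times)
    z = (observed - expected) / math.sqrt(expected)
    # One-sided p-value for "fewer events than expected under the reference".
    p_lower = NormalDist().cdf(z)
    return observed, expected, z, p_lower


# Hypothetical data and an exponential reference with hazard rate 0.5 per year.
times = [0.4, 1.1, 2.0, 2.0, 0.7, 1.5, 2.0, 0.9]
events = [1, 0, 0, 0, 1, 1, 0, 0]
obs, exp_, z, p = one_sample_logrank(times, events, lambda t: 0.5 * t)
print(f"O = {obs}, E = {exp_:.2f}, Z = {z:.2f}, one-sided p = {p:.3f}")
```

Here O - E plays the role of the compensated counting process at the end of follow-up, and the denominator is the place where a different variance estimator would be substituted.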

Visualizing uncertainty in diagnostic accuracy studies using comparison regions
Werner Vach1, Maren Eckert2
1Basel Academy for Quality and Research in Medicine, Switzerland; 2Institute of Medical Biometry and Statistics, University of Freiburg, Germany

The analysis of diagnostic accuracy can often be seen as a two-dimensional estimation problem. The interest is in pairs such as sensitivity and specificity, positive and negative predictive value, or positive and negative likelihood ratio. In visualizing the joint uncertainty in the two-parameter estimate, confidence regions are an obvious choice.

However, Eckert and Vach (2020) recently pointed out that this is a suboptimal approach. Two-dimensional confidence regions support the post-hoc testing of point hypotheses, whereas the evaluation of diagnostic accuracy is related to testing hypotheses on linear combinations of parameters (Vach et al. 2012). Consequently, Eckert and Vach suggest the use of comparison regions, which support such post-hoc tests.

In this poster we illustrate the use of comparison regions in visualizing uncertainty using the results of a published paired diagnostic accuracy study (Ng et al. 2008) and contrast it with the use of confidence regions. Both LR-test based and Wald-test based regions are considered. The regions are supplemented by (reference) lines that allow judging possible statements about certain weighted averages of the parameters of interest. We consider the change in sensitivity and specificity as well as the change in the relative frequency of true positive and false positive test results. As the prevalence of the disease state of interest is low in this study, the two approaches give very different results.

Finally, we give some recommendations on the use of comparison regions in analysing diagnostic accuracy studies.


Eckert M, Vach W. On the use of comparison regions in visualizing stochastic uncertainty in some two-parameter estimation problems. Biometrical Journal. 2020; 62: 598–609.

Vach W, Gerke O, Høilund-Carlsen PF. Three principles to define the success of a diagnostic study could be identified. J Clin Epidemiol. 2012; 65:293-300.

Ng SH, Chan SC, Liao CT, Chang JT, Ko SF, Wang HM, Chin SC, Lin CY, Huang SF, Yen TC. Distant metastases and synchronous second primary tumors in patients with newly diagnosed oropharyngeal and hypopharyngeal carcinomas: evaluation of (18)F-FDG PET and extended-field multi-detector row CT. Neuroradiology 2008; 50:969-79.

Panel Discussion: Networking – Better Career Opportunities Through Networking?

Chairs: Bjoern-Hergen Laabs and Stefanie Peschel

Whether in industry or academia, many professionals would probably not be in their current position without a good professional network. But is an extensive network really necessary, or is good performance enough? Might the same career doors have opened even without a good network? We want to explore these and many other questions about networking in the panel discussion at the 67th Biometric Colloquium.

We are pleased to welcome the following guests to our virtual panel:

  • Dr. Ralph Brinks (Universität Witten/Herdecke)
  • Dr. Ronja Foraita (Leibniz-Institut für Präventionsforschung und Epidemiologie – BIPS, Bremen)
  • Dr. Anke Huels (Emory University, Atlanta, USA)
  • Minh-Anh Le (Munich RE, München)
  • Prof. Dr. Christian L. Müller (Helmholtz Zentrum München; LMU München; Simons Foundation, New York, USA)
  • Jessica Rohmann (Charité – Universitätsmedizin Berlin)

Moderated by Stefanie Peschel (Helmholtz Zentrum München), we would like to examine the topic of networking from various angles together with our guests. Among other things, the following topics will be discussed:

  • Twitter and co.: How do I use social media channels properly?
  • LinkedIn and Xing: How do I present myself well?
  • Membership in professional societies: What networking opportunities does it offer?
  • Networking in virtual meetings: Is that even possible?
  • Further options: Are there other ways to build a good network?

Since these topics will certainly not cover all of our audience’s questions, we warmly invite our listeners to put questions to the panellists. There will be a question round during the session, and questions can also be asked at any time via our chat. After the event, there will still be an opportunity to get in touch with our guests. Questions may also be submitted in advance by e-mail.

The event will be held in German. The panel discussion is organised by the early career working group (AG Nachwuchs) of the IBS-DR.

Data Sharing and Reproducible Research

Chairs: Ronja Foraita and Iris Pigeot

The Statistical Assessment of Replication Success
Leonhard Held
Epidemiology, Biostatistics and Prevention Institute (EBPI) and Center for Reproducible Science (CRS), University of Zurich

Replicability of research findings is crucial to the credibility of all empirical domains of science. However, there is no established standard for assessing replication success, and in practice many different approaches are used. Requiring statistical significance of both the original and the replication study is known as the two-trials rule in drug regulation, but it does not take the corresponding effect sizes into account.

We compare the two-trials rule with the sceptical p-value (Held, 2020), an attractive compromise between hypothesis testing and estimation. This approach penalizes shrinkage of the replication effect estimate compared to the original one, while ensuring that both are also statistically significant to some extent. We describe a recalibration of the procedure as proposed in Held et al. (2020), the golden level. The golden level guarantees that borderline significant original studies can only be replicated successfully if the replication effect estimate is larger than the original one. The recalibrated sceptical p-value offers uniform gains in project power compared to the two-trials rule and controls the Type-I error rate except for very small replication sample sizes. An application to data from four large replication projects shows that the new approach leads to more appropriate inferences, as it penalizes shrinkage of the replication estimate compared to the original one, while ensuring that both effect estimates are sufficiently convincing on their own. Finally, we describe how the approach can also be used to design the replication study based on specification of the minimum relative effect size to achieve replication success.
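A rough numerical sketch of the sceptical p-value follows. It rests on our reading of the definition in Held (2020), which is an assumption here and should be checked against the paper: with z-values z_o and z_r of the original and replication study and variance ratio c = σ_o²/σ_r², the squared sceptical z-value is taken as the positive solution x of (z_o²/x − 1)(z_r²/x − 1) = c, and the two-sided p-value is 2[1 − Φ(√x)]. The sketch ignores the direction of the effects and does not implement the golden-level recalibration; both z-values must be nonzero.

```python
import math
from statistics import NormalDist


def two_sided_p(z):
    """Ordinary two-sided p-value of a z-value."""
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


def sceptical_p_value(z_o, z_r, c):
    """Two-sided sceptical p-value from the z-values of both studies.

    c is the variance ratio sigma_o^2 / sigma_r^2 (assumed definition).
    z_S^2 is the positive root x of (z_o^2/x - 1) * (z_r^2/x - 1) = c.
    """
    zo2, zr2 = z_o ** 2, z_r ** 2
    z_a2 = (zo2 + zr2) / 2.0              # arithmetic mean of squared z-values
    z_h2 = 2.0 / (1.0 / zo2 + 1.0 / zr2)  # harmonic mean of squared z-values
    if abs(c - 1.0) < 1e-12:
        z_s2 = z_h2 / 2.0
    else:
        z_s2 = (math.sqrt(z_a2 * (z_a2 + (c - 1.0) * z_h2)) - z_a2) / (c - 1.0)
    return 2.0 * (1.0 - NormalDist().cdf(math.sqrt(z_s2)))


# Hypothetical z-values: both studies individually significant at the 5% level.
print(sceptical_p_value(2.5, 2.5, 1.0), two_sided_p(2.5))
```

Under this definition the sceptical p-value is never smaller than either of the two ordinary p-values, which reflects the requirement that both studies be convincing on their own.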

Held, Leonhard (2020) A new standard for the analysis and design of replication studies (with discussion). Journal of the Royal Statistical Society, Series A, 183:431–469.

Held, Leonhard and Micheloud, Charlotte and Pawel, Samuel (2020). The assessment of replication success based on relative effect size.

Multivariate regression modelling with global and cohort-specific effects in a federated setting with data protection constraints
Max Behrens, Daniela Zöller
University of Freiburg, Germany

Multi-cohort studies are an important tool to study effects on a large sample size and to identify cohort-specific effects. Thus, researchers would like to share information between cohorts and research institutes. However, data protection constraints forbid the exchange of individual-level data between different research institutes. To circumvent this problem, only non-disclosive aggregated data is exchanged, which is often done manually and requires explicit permission before transfer. The framework DataSHIELD enables automatic exchange in iterative calls, but methods for performing more complex tasks such as federated optimisation and boosting techniques are missing.

We propose an iterative optimization of multivariate regression models which combines global (cohort-unspecific) and local (cohort-specific) predictors. This approach will be based solely on non-disclosive aggregated data from different institutions. The approach should be applicable in a setting with high-dimensional data with complex correlation structures. Nonetheless, the amount of transferred data should be limited to enable manual confirmation of data protection compliance.

Our approach implements an iterative optimization between local and global model estimates. Herein, the linear predictor of the global model will act as a covariate in the local model estimation. Subsequently, the linear predictor of the updated local model is included in the global model estimation. The procedure is repeated until no further model improvement is observed for the local model estimates. In case of an unknown variable structure, our approach can be extended with an iterative boosting procedure performing variable selection for both the global and local model.

In a simulation study, we aim to show that our approach improves both global and local model estimates while preserving the globally found effect structure. Furthermore, we want to demonstrate how the approach can grant protected access to a multi-cohort data pool for gender-sensitive studies. Specifically, we aim to apply the approach to improve upon cohort-specific model estimates by incorporating a global model based on multiple cohorts. We will apply the method to real data obtained in the GESA project, where we combined data from the three large German population-based cohorts GHS, SHIP, and KORA to identify potential predictors for mental health trajectories.

In general, all gradient-based methods can be adapted easily to a federated setting under data protection constraints. The method presented here can be used in this setting to perform iterative optimisation and can thus aid in the process of understanding cohort-specific estimates. We provide an implementation in the DataSHIELD framework.
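The general remark about gradient-based methods can be illustrated with a toy example. The sketch below is not the proposed method and not DataSHIELD code: it fits a simple linear regression with one covariate by gradient descent, where each site returns only aggregated gradient sums and its sample size. All names and data are hypothetical.

```python
def site_gradient(xs, ys, intercept, slope):
    """Run locally at one site: return only aggregated gradient sums and n."""
    g0 = sum((intercept + slope * x - y) for x, y in zip(xs, ys))
    g1 = sum((intercept + slope * x - y) * x for x, y in zip(xs, ys))
    return g0, g1, len(xs)


def federated_linear_fit(sites, lr=0.1, n_iter=2000):
    """Coordinator: combine per-site aggregates into a global gradient step."""
    intercept, slope = 0.0, 0.0
    for _ in range(n_iter):
        parts = [site_gradient(xs, ys, intercept, slope) for xs, ys in sites]
        n = sum(p[2] for p in parts)
        intercept -= lr * sum(p[0] for p in parts) / n
        slope -= lr * sum(p[1] for p in parts) / n
    return intercept, slope


# Hypothetical toy data held at three sites; y = 1 + 2x holds exactly.
sites = [([0.0, 1.0, 2.0], [1.0, 3.0, 5.0]),
         ([0.5, 1.5], [2.0, 4.0]),
         ([3.0, 2.5, 1.0], [7.0, 6.0, 3.0])]
print(federated_linear_fit(sites))
```

Because the gradient of a sum of losses is the sum of the per-site gradients, the coordinator obtains the same update as with pooled data while individual-level records stay at the sites. Whether such aggregates are non-disclosive in a given setting still has to be checked separately.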

A replication crisis in methodological statistical research?
Anne-Laure Boulesteix1, Stefan Buchka1, Alethea Charlton1, Sabine Hoffmann1, Heidi Seibold2, Rory Wilson2
1LMU Munich, Germany; 2Helmholtz Zentrum Munich, Germany

Statisticians are often keen to analyze the statistical aspects of the so-called “replication crisis”. They condemn fishing expeditions and publication bias across empirical scientific fields applying statistical methods. But what about good practice issues in their own – methodological – research, i.e. research considering statistical methods as research objects? When developing and evaluating new statistical methods and data analysis tools, do statisticians adhere to the good practice principles they promote in fields which apply statistics? I argue that statisticians should make substantial efforts to address what may be called the replication crisis in the context of methodological research in statistics and data science. In the first part of my talk, I will discuss topics such as publication bias, the design and necessity of neutral comparison studies and the importance of appropriate reporting and research synthesis in the context of methodological research.

In the second part of my talk I will empirically illustrate a specific problem which affects research articles presenting new data analysis methods. Most of these articles claim that “the new method performs better than existing methods”, but the veracity of such statements is questionable. An optimistic bias may arise during the evaluation of novel data analysis methods resulting from, for example, selection of datasets or competing methods; better ability to fix bugs in a preferred method; and selective reporting of method variants. This bias is quantitatively investigated using a topical example from epigenetic analysis: normalization methods for data generated by the Illumina HumanMethylation450K BeadChip microarray.

Reproducible bioinformatics workflows: A case study with software containers and interactive notebooks
Anja Eggert, Pal O Westermark
Leibniz Institute for Farm Animal Biology, Germany

We foster transparent and reproducible workflows in bioinformatics, which is challenging given their complexity. We developed a new statistical method in the field of circadian rhythmicity, which allows one to rigorously determine whether measured quantities such as gene expressions are not rhythmic. Knowledge of no or at most weak rhythmicity may significantly simplify studies, aid detection of abolished rhythmicity, and facilitate selection of non-rhythmic reference genes or compounds, among other applications. We present our solution to this problem in the form of a precisely formulated mathematical statistic accompanied by a software called SON (Statistics Of Non-rhythmicity). The statistical method itself is implemented in the R package “HarmonicRegression”, available on the CRAN repository. However, the bioinformatics workflow is much larger than the statistical test. For instance, to ensure the applicability and validity of the statistical method, we simulated data sets of 20,000 gene expressions over two days, with a large range of parameter combinations (e.g. sampling interval, fraction of rhythmicity, number of outliers, detection limit of rhythmicity, etc.). Here we describe and demonstrate the use of a Jupyter notebook to document, specify, and distribute our new statistical method and its application to both simulated and experimental data sets. Jupyter notebooks combine text documentation with dynamically editable and executable code and are an implementation of the concept of literate programming. Thus, parameters and code can be modified, allowing both verification of results and instant experimentation by peer reviewers and other users in the scientific community. Our notebook runs inside a Docker software container, which mirrors the original software environment. This approach avoids the need to install any software and ensures complete long-term reproducibility of the workflow. This bioinformatics workflow allows full reproducibility of our computational work.

Young Statisticians Session

Chairs: Bjoern-Hergen Laabs and Janine Witte

Predictions by random forests – confidence intervals and their coverage probabilities
Diana Kormilez, Björn-Hergen Laabs, Inke R. König
Institut für Medizinische Biometrie und Statistik, Universität zu Lübeck, Universitätsklinikum Schleswig-Holstein, Campus Lübeck, Germany

Random forests are a popular supervised learning method. Their main purpose is the robust prediction of an outcome based on a learned set of rules. To evaluate the precision of predictions, their scatter and distribution are important. To quantify this, 95 % confidence intervals for the predictions can be generated using suitable variance estimators. However, these variance estimators may under- or overestimate the true variance, so that the confidence intervals are either too narrow or too wide, which can be evaluated by estimating coverage probabilities through simulations. The aim of our study was to examine coverage probabilities for two popular variance estimators for predictions made by random forests, the infinitesimal jackknife according to Wager et al. (2014) and the fixed-point based variance estimator according to Mentch and Hooker (2016). We performed a simulation study considering different scenarios with varying sample sizes and various signal-to-noise ratios. Our results show that the coverage probabilities based on the infinitesimal jackknife are lower than the desired 95 % for small data sets and small random forests. On the other hand, the variance estimator according to Mentch and Hooker (2016) leads to coverage probabilities that are too high. However, a growing number of trees yields decreasing coverage probabilities for both methods. A similar behavior was observed when using real datasets, where the composition of the data and the number of trees influence the coverage probabilities. In conclusion, we observed that the relative performance of one variance estimation method over the other depends on the hyperparameters used for training the random forest. Likewise, the coverage probabilities can be used to evaluate how well the hyperparameters were chosen and whether the data set requires more pre-processing.
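How a coverage probability is estimated by simulation can be shown with a deliberately simple stand-in. The sketch below does not involve random forests or either of the two variance estimators; it assumes a normal-approximation 95 % interval for a sample mean and counts how often the interval contains the known true value.

```python
import math
import random
from statistics import mean, stdev


def estimate_coverage(n=30, true_mean=0.0, n_sim=2000, seed=1):
    """Fraction of simulated 95% intervals that contain the true value."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sim):
        sample = [rng.gauss(true_mean, 1.0) for _ in range(n)]
        centre = mean(sample)
        # Interval based on the estimated variance and a normal quantile.
        half = 1.96 * stdev(sample) / math.sqrt(n)
        if centre - half <= true_mean <= centre + half:
            hits += 1
    return hits / n_sim


for n in (5, 30, 100):
    print(n, estimate_coverage(n=n))
```

For a random forest, the sample mean would be replaced by the forest prediction at a fixed test point and the standard error by the chosen variance estimator, with the true regression function value as the target.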

Mentch L, Hooker G (2016): Quantifying Uncertainty in Random Forests via Confidence Intervals and Hypothesis Tests. J. Mach. Learn. Res., 17:1-41.

Wager S, Hastie T, Efron B (2014): Confidence Intervals for Random Forests: The Jackknife and the Infinitesimal Jackknife. J. Mach. Learn. Res., 15:1625-1651.

Over-optimism in benchmark studies and the multiplicity of analysis strategies when interpreting their results
Christina Nießl1, Moritz Herrmann2, Chiara Wiedemann1, Giuseppe Casalicchio2, Anne-Laure Boulesteix1
1Institute for Medical Informatics, Biometry and Epidemiology, University of Munich (Germany); 2Department of Statistics, University of Munich (Germany)

In recent years, the need for neutral benchmark studies that focus on the comparison of statistical methods has been increasingly recognized. At the interface between biostatistics and bioinformatics, benchmark studies are especially important in research fields involving omics data, where hundreds of articles present their newly introduced method as superior to other methods.

While general advice on the design of neutral benchmark studies can be found in recent literature, there is always a certain amount of flexibility that researchers have to deal with. This includes the choice of datasets and performance metrics, the handling of missing values in case of algorithm failure (e.g., due to non-convergence) and the way the performance values are aggregated over the considered datasets. Consequently, different choices in the benchmark design may lead to different results of the benchmark study.

In the best-case scenario, researchers make well-considered design choices prior to conducting a benchmark study and are aware of this issue. However, they may still be concerned about how their choices affect the results. In the worst-case scenario, researchers could (most often subconsciously) use this flexibility and modify the benchmark design until it yields a result they deem satisfactory (for example, the superiority of a certain method). In this way, a benchmark study that is intended to be neutral may become biased.

In this paper, we address this issue in the context of benchmark studies based on real datasets using an example benchmark study, which compares the performance of survival prediction methods on high-dimensional multi-omics datasets. Our aim is twofold. As a first exploratory step, we examine how variable the results of a benchmark study are by trying all possible combinations of choices and comparing the resulting method rankings. In the second step, we propose a general framework based on multidimensional unfolding that allows researchers to assess the impact of each choice and identify critical choices that substantially influence the resulting method ranking. In our example benchmark study, the most critical choices were the considered datasets and the performance metric. However, in some settings, we observed that the handling of missing values and the aggregation of performances over the datasets can also have a great influence on the results.
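The first, exploratory step (trying all combinations of design choices and comparing the resulting rankings) can be sketched as follows. This is only an illustration under stated assumptions: the scores, method names, dataset names, and metric names are made up, the handling of missing values is omitted, and the multidimensional unfolding step is not shown.

```python
import itertools
import statistics
from collections import Counter

# Hypothetical scores (higher is better); not results from the study.
SCORES = {
    "metric_1": {
        "data_A": {"method_1": 0.70, "method_2": 0.65, "method_3": 0.60},
        "data_B": {"method_1": 0.55, "method_2": 0.68, "method_3": 0.58},
        "data_C": {"method_1": 0.62, "method_2": 0.61, "method_3": 0.75},
    },
    "metric_2": {
        "data_A": {"method_1": 0.80, "method_2": 0.82, "method_3": 0.71},
        "data_B": {"method_1": 0.66, "method_2": 0.64, "method_3": 0.69},
        "data_C": {"method_1": 0.73, "method_2": 0.70, "method_3": 0.72},
    },
}
METHODS = ["method_1", "method_2", "method_3"]
AGGREGATIONS = {"mean": statistics.fmean, "median": statistics.median}


def rank_methods(metric, datasets, aggregate):
    """Rank methods (best first) for one combination of design choices."""
    summary = {
        m: aggregate([SCORES[metric][d][m] for d in datasets]) for m in METHODS
    }
    # Ties are broken alphabetically so that the ranking is deterministic.
    return tuple(sorted(METHODS, key=lambda m: (-summary[m], m)))


def all_rankings():
    """Count how often each ranking occurs over all combinations of choices."""
    datasets = sorted(SCORES["metric_1"])
    subsets = [
        c
        for k in range(2, len(datasets) + 1)
        for c in itertools.combinations(datasets, k)
    ]
    counts = Counter()
    for metric, subset, agg in itertools.product(SCORES, subsets, AGGREGATIONS):
        counts[rank_methods(metric, subset, AGGREGATIONS[agg])] += 1
    return counts


ranking_counts = all_rankings()
for ranking, count in ranking_counts.most_common():
    print(count, " > ".join(ranking))
```

If several distinct rankings appear, the benchmark result depends on the design choices, which is the variability the exploratory step is meant to reveal.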

Variable Importance Measures for Functional Gradient Descent Boosting Algorithm
Zeyu Ding
TU Dortmund, Germany

With the continuous growth of data dimensions and the improvement of computing power in contemporary statistics, the number of variables that can be included in a model is increasing significantly. Therefore, when designing a model, selecting the truly informative variables from a large number of candidate variables and ranking them according to their importance becomes a core problem of statistics. Appropriate variable selection can reduce overfitting and lead to a more interpretable model. Traditional methods use the decomposition of R², step-wise Akaike Information Criterion (AIC)-based variable selection, or regularization based on lasso and ridge. In ensemble algorithms, the variable importance is often calculated separately by permutation methods. In this contribution, we propose two new stable and discriminating variable importance measures for the functional gradient descent boosting algorithm (FGDB). The first one calculates the l2 norm contribution of a variable, while the second one calculates the risk reduction of the variables in every iteration. Our proposal is demonstrated in both simulation and real data examples. We show that the two new methods are more effective in automatically selecting the truly important variables than the traditional selection frequency measures used in the FGDB algorithm. This holds for both linear and non-linear models under different data scenarios.

Keywords: variable importance measures, variable selection, functional gradient boosting, component-wise regression, generalized additive models
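To make the contrast between selection frequency and a risk-reduction-based importance concrete, the following minimal sketch runs component-wise L2 boosting with simple linear base learners and records both quantities per variable. It is one plausible reading of "risk reduction of the variables in every iteration", not the measure proposed in the abstract; the function name, the step size, and the simulated data are assumptions made for illustration.

```python
import random


def componentwise_l2_boost(X, y, n_iter=100, step=0.1):
    """Component-wise L2 boosting; returns coefficients and two importances."""
    n, p = len(X), len(X[0])
    offset = sum(y) / n
    resid = [yi - offset for yi in y]          # residuals of the offset model
    cols = [[row[j] for row in X] for j in range(p)]
    sq_norms = [sum(v * v for v in col) for col in cols]
    coef = [0.0] * p
    risk_reduction = [0.0] * p                 # accumulated drop in empirical risk
    selection_count = [0] * p                  # traditional selection frequency
    for _ in range(n_iter):
        risk_before = sum(r * r for r in resid) / n
        best_j, best_beta, best_gain = None, 0.0, -1.0
        for j in range(p):
            if sq_norms[j] == 0.0:
                continue
            dot = sum(x * r for x, r in zip(cols[j], resid))
            gain = dot * dot / sq_norms[j]     # RSS drop of a full least-squares step
            if gain > best_gain:
                best_j, best_beta, best_gain = j, dot / sq_norms[j], gain
        if best_j is None:
            break
        beta = step * best_beta                # shrunken update of the best learner
        coef[best_j] += beta
        resid = [r - beta * x for r, x in zip(resid, cols[best_j])]
        risk_after = sum(r * r for r in resid) / n
        risk_reduction[best_j] += risk_before - risk_after
        selection_count[best_j] += 1
    return coef, risk_reduction, selection_count


# Simulated data (assumption): only the first two of five variables matter.
rng = random.Random(0)
X = [[rng.gauss(0, 1) for _ in range(5)] for _ in range(200)]
y = [2.0 * row[0] + 0.5 * row[1] + rng.gauss(0, 1) for row in X]
coef, risk_reduction, selection_count = componentwise_l2_boost(X, y)
print(risk_reduction)
print(selection_count)
```

Selection counts treat every iteration equally, whereas the accumulated risk reduction weights each selection by how much it actually improved the fit.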

Quality control in genome-wide association studies revisited: a critical evaluation of the standard methods
Hanna Brudermann1, Tanja K. Rausch1,2, Inke R. König1
1Institut für Medizinische Biometrie und Statistik, Universität zu Lübeck, Universitätsklinikum Schleswig-Holstein, Campus Lübeck, Germany; 2Department of Pediatrics, Universität zu Lübeck, Universitätsklinikum Schleswig-Holstein, Campus Lübeck, Lübeck, Germany

Genome-wide association studies (GWAs) investigating the relationship between millions of genetic markers and a clinically relevant phenotype were originally based on the common disease – common variant assumption, thus aiming at identifying a small number of common genetic loci as causes of common diseases. Given the enormous cost reduction in the acquisition of genomic data, it is not surprising that since the first known GWA by Klein et al. (2005), this study type has become established as a standard method. However, since even low error frequencies can distort association results, extensive and accurate quality control of the given data is mandatory. In recent years, the focus of GWAs has shifted, and the task is no longer primarily the discovery of common genetic loci. Also, with increasing sample sizes and (mega-)meta-analyses of GWAs, it is hoped that loci with small effects can be identified. Furthermore, it has become popular to aggregate all genomic information, even loci with very small effects and frequencies, into genetic risk prediction scores, thus increasing the requirement for high-quality genetic data.

However, after extensive discussions about standards for quality control in GWAs in the early years, further work on how to control data quality and adapt data cleaning to new GWA aims has become scarce.

The aim of this study was to perform an extensive literature review to evaluate currently applied quality control criteria and their justification. Building on the findings from the literature search, a workflow was developed to include justified quality control steps, keeping in mind that a strict quality control, which removes all data with a high risk of bias, always carries the risk that the remaining data is too homogeneous to make small effects visible. This workflow is subsequently illustrated using a real data set.

Our results show that in most published GWAs, no scientific reasons for the applied quality control steps are given. Cutoffs for the most common quality measures are mostly not explained. In particular, principal component analysis and the test for deviation from Hardy-Weinberg equilibrium are frequently used as quality criteria without examining the conditions of the specific study and adjusting the quality control accordingly.
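For readers less familiar with the Hardy-Weinberg criterion, the following minimal sketch shows the textbook one-degree-of-freedom chi-square test for a single biallelic marker. It is not the workflow developed in the study; the function name and the genotype counts are hypothetical. The exclusion cutoff is deliberately left out, since the choice of cutoff is exactly what the abstract argues needs justification.

```python
import math


def hwe_chi_square(n_aa, n_ab, n_bb):
    """Chi-square test (1 df) for deviation from Hardy-Weinberg equilibrium."""
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)            # frequency of allele A
    q = 1.0 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    # Survival function of the chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


# Hypothetical genotype counts: the first marker is exactly in equilibrium,
# the second shows a strong deficit of heterozygotes.
print(hwe_chi_square(100, 200, 100))
print(hwe_chi_square(150, 100, 150))
```

The chi-square approximation can be unreliable for rare alleles with small expected counts, which is one reason why a fixed cutoff may not suit every study.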

It is pointed out that researchers still have to decide between universal and individual parameters and therefore between optimal comparability to other analyses and optimal conditions within the specific study.

Which Test for Crossing Survival Curves? A User’s Guide
Ina Dormuth1, Tiantian Liu2,4, Jin Xu2, Menggang Yu3, Markus Pauly1, Marc Ditzhaus1
1TU Dortmund, Germany; 2East China Normal University, China; 3University of Wisconsin-Madison, USA; 4Technion - Israel Institute of Technology, Israel

Knowledge transfer between statisticians developing new data analysis methods and users is essential. This is especially true for clinical studies with time-to-event endpoints. One of the most common problems is the comparison of survival in two-armed trials. The log-rank test is still the gold standard for answering this question. However, in the case of non-proportional hazards, its informative value may decrease. In the meantime, several extensions have been developed to solve this problem. Since non-proportional or even intersecting survival curves are common in oncology, e.g. in immunotherapy studies, it is important to identify the most appropriate methods and to draw attention to their existence. Therefore, it is our goal to simplify the choice of a test to detect differences in survival when the curves cross. To this end, we reviewed 1,400 recent oncological studies. Limiting our analysis to intersecting survival curves and non-significant log-rank tests for a sufficient number of observed events, we reconstructed the data sets using a state-of-the-art reconstruction algorithm. To ensure the quality of the reconstruction, only publications with published numbers at risk at multiple points in time, sufficient print quality, and a non-informative censoring pattern were included. After elimination of papers on the basis of the exclusion criteria mentioned above, we took the p-values of the log-rank and the Peto-Peto tests as references and compared them with those of nine different tests for non-proportional or even crossing hazards. We show that tests designed to detect crossing hazards are advantageous, and we provide guidance in choosing a reasonable alternative to the standard log-rank test. This is followed by a comprehensive simulation study and the generalization of one of the test methods to the multi-sample case.
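As a point of reference for the comparison described above, the following minimal sketch computes the standard two-sample log-rank statistic. It does not reproduce any of the nine alternative tests or the reconstruction algorithm; the function name and the toy data are hypothetical. The statistic sums observed minus expected events in one arm over all event times, so early and late differences of opposite sign can cancel, which is why the test may lose power when curves cross.

```python
import math


def logrank_test(times_a, events_a, times_b, events_b):
    """Two-sample log-rank test; events are 1 (observed) or 0 (censored)."""
    data = [(t, e, 0) for t, e in zip(times_a, events_a)]
    data += [(t, e, 1) for t, e in zip(times_b, events_b)]
    event_times = sorted({t for t, e, _ in data if e})
    obs_minus_exp = 0.0
    variance = 0.0
    for t in event_times:
        n_a = sum(1 for u, _, g in data if u >= t and g == 0)   # at risk, arm A
        n_b = sum(1 for u, _, g in data if u >= t and g == 1)   # at risk, arm B
        d_a = sum(1 for u, e, g in data if u == t and e and g == 0)
        d_b = sum(1 for u, e, g in data if u == t and e and g == 1)
        n, d = n_a + n_b, d_a + d_b
        obs_minus_exp += d_a - d * n_a / n
        if n > 1:
            # Hypergeometric variance of the number of events in arm A.
            variance += d * (n_a / n) * (n_b / n) * (n - d) / (n - 1)
    chi2 = obs_minus_exp ** 2 / variance if variance > 0 else 0.0
    # Survival function of the chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


# Toy data (assumption): all events observed, arm A fails clearly earlier.
chi2_sep, p_sep = logrank_test([1, 2, 3, 4, 5], [1] * 5, [6, 7, 8, 9, 10], [1] * 5)
print(chi2_sep, p_sep)
```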