Location: https://wwu.zoom.us/j/66316877425

Adaptive Designs III

Chairs: Tobias Mütze and Martin Posch

Adaptive group sequential survival comparisons based on log-rank and pointwise test statistics
Jannik Feld, Andreas Faldum, Rene Schmidt
Institute of Biostatistics and Clinical Research, University of Münster

Whereas the theory of confirmatory adaptive designs is well understood for uncensored data, implementation of adaptive designs in the context of survival trials remains challenging. Commonly used adaptive survival tests are based on the independent increments structure of the log-rank statistic. These designs suffer from the limitation that effectively only the interim log-rank statistic may be used for design modifications (such as data-dependent sample size recalculation). Alternative approaches based on the patient-wise separation principle have the common disadvantage that the test procedure may either neglect part of the observed survival data or tend to be conservative. Here, we instead propose an extension of the independent increments approach to adaptive survival tests. We present a confirmatory adaptive two-sample log-rank test of no difference in a survival analysis setting, where provision is made for interim decision making based on the interim log-rank statistic and/or pointwise survival rates, while avoiding the aforementioned problems. The possibility of including pointwise survival rates eases the clinical interpretation of interim decision making and is a straightforward choice for seamless phase II/III designs. We will show by simulation studies that performance does not suffer when the pointwise survival rates are used, and we consider, as an example, the application of the methodology to a two-sample log-rank test with a binding futility criterion based on the observed short-term survival rates and sample size recalculation based on conditional power. The methodology is motivated by the LOGGIC Europe Trial from pediatric oncology. Distributional properties are derived using martingale techniques in the large sample limit. Small sample properties are studied by simulation.

Single-stage, three-arm, adaptive test strategies for non-inferiority trials with an unstable reference
Werner Brannath, Martin Scharpenberg, Sylvia Schmidt
University of Bremen, Germany

For indications where only unstable reference treatments are available and use of placebo is ethically justified, three-arm 'gold standard' designs with an experimental, reference and placebo arm are recommended for non-inferiority trials. In such designs, the demonstration of efficacy of the reference or experimental treatment is a requirement. They have the disadvantage that only little can be concluded from the trial if the reference fails to be efficacious. To overcome this, we investigate novel single-stage, adaptive test strategies where non-inferiority is tested only if the reference shows sufficient efficacy and otherwise delta-superiority of the experimental treatment over placebo is tested. With a properly chosen superiority margin, delta-superiority indirectly shows non-inferiority. We optimize the sample size for several decision rules and find that the natural, data-driven test strategy, which tests non-inferiority if the reference’s efficacy test is significant, leads to the smallest overall and placebo sample sizes. Under specific constraints on the sample sizes, this procedure controls the family-wise error rate. All optimal sample sizes are found to meet these constraints. We finally show how to account for a relevant placebo drop-out rate in an efficient way and apply the new test strategy to a real-life data set.

Sample size re-estimation based on the prevalence in a randomized test-treatment study
Amra Hot, Antonia Zapf
Institute of Medical Biometry and Epidemiology, University Medical Center Hamburg-Eppendorf, Hamburg

Patient benefit should be the primary criterion in evaluating diagnostic tests. If a new test has shown sufficient accuracy, its application in clinical practice should lead to a patient benefit. Randomized test-treatment studies are needed to assess the clinical utility of a diagnostic test as part of a broader management regimen in which test-treatment strategies are compared in terms of their impact on patient-relevant outcomes [1]. Due to their increased complexity compared to common intervention trials, the implementation of such studies poses practical challenges which might affect the validity of the study. One important aspect is the sample size determination. It is a special feature of these designs that they combine information on the disease prevalence and accuracy of the diagnostic tests, i.e. sensitivity and specificity of the investigated tests, with assumptions on the expected treatment effect. Due to the lack of empirical information or uncertainty regarding these parameters, sample size considerations will always be based on a rather weak foundation, which can lead to an over- or underpowered trial. Therefore, it is reasonable to consider adaptations in earlier phases of the trial based on a pre-specified interim analysis in order to solve this problem. A blinded sample size re-estimation based on the disease prevalence in a randomized test-treatment study was performed as part of a simulation study. The type I error, the empirical overall power as well as the bias of the estimated prevalence are assessed and presented.


[1] J. G. Lijmer, P.M. Bossuyt. Diagnostic testing and prognosis: the randomized controlled trial in test evaluation research. In: The evidence base of clinical diagnosis. Blackwell Oxford, 2009, 63-82.

Performance evaluation of a new “diagnostic-efficacy-combination trial design” in the context of telemedical interventions
Mareen Pigorsch1, Martin Möckel2, Jan C. Wiemer3, Friedrich Köhler4, Geraldine Rauch1
1Charité – Universitätsmedizin Berlin, Institute of Biometry and clinical Epidemiology; 2Charité – Universitätsmedizin Berlin, Division of Emergency and Acute Medicine, Cardiovascular Process Research; 3Clinical Diagnostics, Thermo Fisher Scientific; 4Charité – Universitätsmedizin Berlin, Centre for Cardiovascular Telemedicine, Department of Cardiology and Angiology


Telemedical interventions in heart failure patients aim to avoid unfavourable, treatment-related events through early, individualized care that responds to the patient’s current needs. However, telemedical support is an expensive intervention and only patients with a high risk for unfavourable follow-up events will profit from telemedical care. Möckel et al. therefore adapted a “diagnostic-efficacy-combination design” which allows validating a biomarker and investigating a biomarker-selected population within the same study. For this, cut-off values for the biomarkers were determined based on the observed outcomes in the control group to define a high-risk subgroup. This defines the diagnostic design step. These cut-offs were subsequently applied to the intervention and the control group to identify the high-risk subgroup. The intervention effect is then evaluated by comparison of these subgroups. This defines the efficacy design step. So far, it has not been evaluated whether this double use of the control group for biomarker validation and efficacy comparison leads to a bias in treatment effect estimation. In this methodological research work, we therefore want to evaluate whether the “diagnostic-efficacy-combination design” leads to biased treatment effect estimates. If there is a bias, we further want to analyse its impact and the parameters influencing its size.


We perform a systematic Monte-Carlo simulation study to investigate potential bias in various realistic trial scenarios that mimic and vary the true data of the published TIM‐HF2 Trial. In particular we vary the event rates, the sample sizes and the biomarker distributions.


The results show that the proposed design indeed leads to some bias in the effect estimators, indicating an overestimation of the effect. However, this bias is relatively small in most scenarios. The larger the sample size, the more the event rates differ between the control and the intervention group, and the better the biomarker separates the high-risk from the low-risk patients, the smaller the resulting relative bias.


The “diagnostic-efficacy-combination design” can be recommended for clinical applications. We recommend ensuring a sufficiently large sample size.


Möckel M, Koehler K, Anker SD, Vollert J, Moeller V, Koehler M, Gehrig S, Wiemer JC, von Haehling S, Koehler F. Biomarker guidance allows a more personalized allocation of patients for remote patient management in heart failure: results from the TIM-HF2 trial. Eur J Heart Fail. 2019;21(11):1445-58.

Valid sample size re-estimation at interim
Nilufar Akbari
Charité – Institute of Biometry and Clinical Epidemiology, Germany

Throughout this work, we consider the situation of a two-arm controlled clinical trial based on time-to-event data.

The aim of this thesis is to fit a meaningful survival model to the data observed at an interim analysis in a robust way, in order to carry out a valid sample size recalculation.

Adaptive designs provide an attractive possibility of changing study design parameters in an ongoing trial. There are still many open questions with respect to adaptive designs for time-to-event data. Among other things, this is because survival data, unlike continuous or binary data, involve a follow-up phase, so that the outcome is not directly observed after a patient’s treatment.

Evaluating survival data at interim analyses leads to a patient overrun since the recruitment is usually not stopped at the same time. Another problem is that there must be an interim analysis during this recruitment phase to save patients. Moreover, the timing of the interim analysis is crucial for basing decisions on a reasonable level of information.

A general issue with time-to-event data is that at an interim analysis one can only calculate the updated required number of events. However, there is usually a greater need to determine the sample size required to achieve that number of events. For this, the underlying event-time distribution is needed, which may possibly be estimated from the interim data.
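The distinction between the required number of events and the required number of patients can be illustrated with a minimal sketch. It assumes Schoenfeld's approximation for a two-sided log-rank test with 1:1 allocation, which is a standard textbook formula and not necessarily the approach used in this work. The function names, the hazard ratio of 0.7, and the overall event probability of 0.6 are hypothetical.

```python
from math import ceil, log
from statistics import NormalDist


def required_events(hazard_ratio, alpha=0.05, power=0.8):
    """Number of events by Schoenfeld's approximation.

    Assumes a two-sided log-rank test and 1:1 allocation (an assumption
    of this sketch, not a statement about the abstract's method).
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    return ceil(4 * (z_alpha + z_beta) ** 2 / log(hazard_ratio) ** 2)


def required_patients(events, event_probability):
    """Patients needed so that the expected number of events reaches `events`.

    `event_probability` is the probability that a recruited patient has an
    event before the final analysis; it depends on the event-time
    distribution, accrual, and follow-up.
    """
    return ceil(events / event_probability)


events = required_events(hazard_ratio=0.7)                    # hypothetical effect
patients = required_patients(events, event_probability=0.6)   # hypothetical value
print(events, patients)
```

The second step is where the event-time distribution enters: the event probability cannot be read off the number of events alone and would have to come from a fitted survival model.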

This, however, is a difficult task for the following reasons: the number of observed events at interim is limited, and the survival curve at interim is truncated by the interim time point.

The goal of this research work is to fit a reasonable survival model to the observed data in a robust way. The fitted curve has the following advantages: The underlying hazards per group can be estimated which allows updating the required number of patients for achieving the respective number of events. Finally, the impact of overrun can be directly assessed and quantified.

The following problems were additionally evaluated in detail. How much do the hazards deviate if the wrong event-time distribution was estimated? At which point in time is a sample size re-estimation useful, or rather how many events are required, for a valid sample size re-estimation at interim?

Multiple Testing

Chairs: Arne Bathke and Robert Kwiecien

Analysis and sample size calculation for a conditional survival model with a binary surrogate endpoint
Samuel Kilian, Johannes Krisam, Meinhard Kieser
Institute of Medical Biometry and Informatics; University Heidelberg; Heidelberg, Germany

The primary endpoint in oncology is usually overall survival, where differences between therapies may only be observable after many years. To avoid withholding of a promising therapy, preliminary approval based on a surrogate endpoint is possible in certain situations (Wallach et al., 2018). The approval has to be confirmed later when overall survival can be assessed. When this is done within the same study, the correlation between surrogate endpoint and overall survival has to be taken into account for sample size calculation and analysis. This relation can be modeled by means of a conditional survival model which was proposed by Xia et al. (2014). They investigated the correlation and assessed power of the logrank test but did not develop methods for statistical testing, parameter estimation, and sample size calculation.

In this talk, a new statistical testing procedure based on the conditional model and Maximum Likelihood (ML) estimators for its parameters will be presented. An asymptotic test for survival difference will be given and an approximate sample size formula will be derived. Furthermore, an exact test for survival difference and an algorithm for exact sample size determination will be provided. Type I error rate, power, and required sample size for both newly developed tests will be determined exactly. Sample sizes will be compared to those required for the logrank test.

It will be shown that for small sample sizes the asymptotic parametric test and the logrank test exceed the nominal significance level under the conditional model. For a given sample size, the power of the asymptotic and the exact parametric test is similar, whereas the power of the logrank test is considerably lower in many situations. Conversely, the sample size needed to attain a prespecified power is comparable for the asymptotic and the exact parametric test, but considerably higher for the logrank test in many situations.

We conclude that the presented exact test performs very well under the assumptions of the conditional model and is a better choice than either the asymptotic parametric test or the logrank test. Furthermore, the talk will give some insights into performing exact calculations for parametric survival time models. This provides a fast and powerful method to evaluate parametric tests for survival difference, thus facilitating the planning, conduct, and analysis of oncology trials with the option of accelerated approval.

The max-t Test in High-Dimensional Repeated Measures and Multivariate Designs
Frank Konietschke
Charite Berlin, Germany

Repeated measures (and multivariate) designs occur in a variety of different research areas. These designs might be high-dimensional, i.e. more (possibly) dependent than independent replications of the trial are observed. In recent years, several global testing procedures (studentized quadratic forms) have been proposed for the analysis of such data. Testing global null hypotheses, however, usually does not answer the main question of practitioners, which is the specific localization of significant time points or group*time interactions. The use of max-t tests, on the contrary, can provide this important information. In this talk, we discuss their applicability in such designs. In particular, we approximate the distribution of the max-t test statistic using innovative resampling strategies. Extensive simulation studies show that the test is particularly suitable for the analysis of data sets with small sample sizes. A real data set illustrates the application of the method.

Graphical approaches for the control of generalized error rates
Frank Bretz1, David Robertson2, James Wason3
1Novartis, Switzerland; 2University of Cambridge, UK; 3Newcastle University, UK

When simultaneously testing multiple hypotheses, the usual approach in the context of confirmatory clinical trials is to control the familywise error rate (FWER), which bounds the probability of making at least one false rejection. In many trial settings, these hypotheses will additionally have a hierarchical structure that reflects the relative importance and links between different clinical objectives. The graphical approach of Bretz et al (2009) is a flexible and easily communicable way of controlling the FWER while respecting complex trial objectives and multiple structured hypotheses. However, the FWER can be a very stringent criterion that leads to procedures with low power, and may not be appropriate in exploratory trial settings. This motivates controlling generalized error rates, particularly when the number of hypotheses tested is no longer small. We consider the generalized familywise error rate (k-FWER), which is the probability of making k or more false rejections, as well as the tail probability of the false discovery proportion (FDP), which is the probability that the proportion of false rejections is greater than some threshold. We also consider asymptotic control of the false discovery rate, which is the expectation of the FDP. In this presentation, we show how to control these generalized error rates when using the graphical approach and its extensions. We demonstrate the utility of the resulting graphical procedures on clinical trial case studies.
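For reference, the error rates named above can be written compactly. The symbols are notation introduced here for illustration and do not come from the abstract: V is the number of false rejections, R the total number of rejections, k a positive integer, and γ a threshold in [0, 1).

```latex
\begin{align*}
\text{FWER} &= P(V \ge 1), \\
k\text{-FWER} &= P(V \ge k), \\
\text{FDP} &= \frac{V}{\max(R, 1)}, \\
\text{tail probability of the FDP} &= P(\text{FDP} > \gamma), \\
\text{FDR} &= E[\text{FDP}].
\end{align*}
```

Setting k = 1 recovers the usual FWER, which is why the k-FWER is a relaxation of it for k > 1.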

Statistical Inference for Diagnostic Test Accuracy Studies with Multiple Comparisons
Max Westphal1, Antonia Zapf2
1Fraunhofer Institute for Digital Medicine MEVIS, Bremen, Germany; 2Institute of Medical Biometry and Epidemiology, UKE Hamburg, Hamburg, Germany

Diagnostic accuracy studies are usually designed to assess the sensitivity and specificity of an index test in relation to a reference standard or established comparative test. This so-called co-primary endpoint analysis has recently been extended to the case that multiple index tests are investigated [1]. Such a design is relevant in modern applications where many different (machine-learned) classification rules based on high dimensional data are considered initially as the final model selection can (partially) be based on data from the diagnostic accuracy study.

In this talk, we motivate the corresponding hypothesis problem and propose different multiple test procedures for it. Besides classical parametric corrections (Bonferroni, maxT), we also consider bootstrap approaches and a Bayesian procedure. We will present early findings from a simulation study to compare the (family-wise) error rate and power of all procedures.

A general observation from the simulation study is the wide variability of rejection rates under different (realistic and least-favorable) parameter configurations. We discuss these findings and possible future extensions of our numerical experiments. All methods have been implemented in a new R package which will also be introduced briefly.


[1] Westphal M, Zapf A, Brannath W. A multiple testing framework for diagnostic accuracy studies with co-primary endpoints. arXiv preprint arXiv:1911.02982 (2019).

Real World Evidence

Chairs: Sigrid Behr and Irene Schmidtmann

RCT versus RWE: Good versus evil or yin and yang?
Almut Winterstein
University of Florida, USA

Clinicians, researchers and policy makers have been raised in a paradigm that places randomized clinical trials on top of a hierarchy of evidence or that dichotomizes study designs into randomized, which is equated to valid, and not randomized, which is equated to invalid or highly dubious. Major efforts to enhance drug safety research infrastructure have shifted our acceptance of observational designs, especially in instances where the adverse event is not anticipated and unrelated to a drug’s indication, resulting in limited confounding. Other instances where evidence from non-randomized studies is accepted include situations where randomization is not feasible. The most recent evolution of real-world evidence as main source of evidence for approval of new molecular entities or indications further challenges our historic understanding of the hierarchy of evidence and the scientific method.

Through randomization and blinding, comparison groups are largely balanced on both measured and unmeasured factors if the trial has sufficient sample size. Protocol-based outcomes ascertainment ensures unbiased, structured assessments regardless of exposure status or baseline characteristics. Used jointly, RCTs can mitigate both selection and measurement biases and support causal inferences. However, besides the escalating cost of RCTs and other feasibility issues, various problems arise that require supplemental methodological approaches to inform regulatory and clinical decision-making, including poor generalizability resulting in inductive fallacy; limited ability to explore effect modification; and significant delays in evidence generation.

Legislative action to address some of these shortcomings was formalized in the United States in the 21st Century Cures Act from 2016, which is designed to help accelerate medical product development. One central component is the concept of real-world evidence, i.e., evidence about the safety and effectiveness of medications derived from real-world data. Importantly, the Cures Act formalizes the concept that valid and actionable evidence can be derived from non-experimental settings using observational study designs and advanced analytic methods. In this presentation we aim to illustrate that dichotomous approaches that contrast RCTs and RWE are limited in their understanding of the full range of methodological challenges in making causal inferences and then generalizing such inferences for real-world decision-making. Those challenges are discussed across the spectrum of traditional RCTs, pragmatic RCTs that rely on RWD or hybrid designs, and observational studies that rely on RWD. The presentation will end with specific challenges for RWE research in the era of increasing data availability and artificial intelligence.

Diagnostic accuracy of claims data from 70 million people in the German statutory health insurance: Type 2 diabetes in men
Ralph Brinks1,2,3, Thaddaeus Toennies1, Annika Hoyer2
1Deutsches Diabetes-Zentrum, Germany; 2Department of Statistics, Ludwig-Maximilians-University Munich; 3Department of Rheumatology, University Hospital Duesseldorf

During estimation of excess mortality in people with type 2 diabetes in Germany based on aggregated claims data from about 70 million people in the statutory health insurance, we experienced and reported problems in the age groups below 60 years of age [1]. We hypothesized that diagnostic accuracy (sensitivity and specificity) might be the reason for those problems [1].

In the first part of this work, we ran a simulation study to assess the impact of the diagnostic accuracy on the estimation of excess mortality. It turns out that the specificity in the younger age groups has the greatest effect on the estimate in terms of bias of the excess mortality while the sensitivity has a much lower impact.

In the second part, we apply these findings to estimate the diagnostic accuracy of type 2 diabetes in men aged 20-90 based on the approach and data from [1]. We find that, irrespective of the sensitivity, the false positive ratio (FPR) increases linearly from 0.5 to 2 per mil from age 20 to 50. At ages 50 to 70, the FPR is likely to drop to 0.5 per mil, followed by a steep linear increase to 5 per mil at age 90.

Our examination demonstrates the crucial impact of diagnostic accuracy on estimates based on secondary data. While for other epidemiological measures sensitivity might be more important, estimation of excess mortality crucially depends on the specificity of the data. We use this fact to estimate the age-specific FPR of diagnoses of type 2 diabetes in aggregated claims data.
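Why specificity matters most where the disease is rare can be illustrated with the standard relation between true and apparent prevalence. This is a minimal sketch and not the authors' simulation. The function names and all numeric inputs (prevalences, a sensitivity of 0.9, an FPR of 2 per mil) are hypothetical.

```python
def apparent_prevalence(true_prevalence, sensitivity, specificity):
    """Proportion of people carrying a diagnosis in imperfect data."""
    false_positive_ratio = 1.0 - specificity
    return (sensitivity * true_prevalence
            + false_positive_ratio * (1.0 - true_prevalence))


def share_false_positives(true_prevalence, sensitivity, specificity):
    """Fraction of all diagnosed people who do not have the disease."""
    false_positives = (1.0 - specificity) * (1.0 - true_prevalence)
    return false_positives / apparent_prevalence(
        true_prevalence, sensitivity, specificity)


# Hypothetical inputs: same accuracy, low vs. high true prevalence.
young = share_false_positives(0.005, sensitivity=0.9, specificity=0.998)
old = share_false_positives(0.20, sensitivity=0.9, specificity=0.998)
print(round(young, 3), round(old, 3))
```

With these made-up numbers, false positives make up a large share of the diagnosed group when the true prevalence is low and a negligible share when it is high, so mortality among the diagnosed is diluted far more in the low-prevalence setting.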


[1] Brinks R, Tönnies T, Hoyer A (2020) DOI 10.1186/s13104-020-05046-w

Coronary artery calcification in the middle-aged and elderly population of Denmark
Oke Gerke1,2, Jes Sanddal Lindholt1,3, Barzan Haj Abdo1, Axel Cosmus Pyndt Diederichsen1,4
1Dept. of Clinical Research, University of Southern Denmark, DK; 2Dept. of Nuclear Medicine, Odense University Hospital, DK; 3Dept. of Cardiothoracic and Vascular Surgery, Odense University Hospital, DK; 4Dept. of Cardiology, Odense University Hospital, DK

Aims: Coronary artery calcification (CAC) measured on cardiac CT is an important risk marker for cardiovascular disease (CVD), and has been included in the prevention guidelines. The aim of this study was to describe CAC score reference values and to develop a freely available CAC calculator for the middle-aged and elderly population. This work updates two previously published landmark studies on CAC score reference values, the American MESA study and the German HNR study [1,2]. Differences in curve derivation compared to a recently published pooled analysis are discussed [3].

Methods: 17,252 participants from two population-based cardiac CT screening cohorts (DanRisk and DANCAVAS) were included [4,5]. The CAC score was measured as part of a screening session. Positive CAC scores were log-transformed and nonparametrically regressed on age for each gender, and percentile curves were transposed according to proportions of zero CAC scores.
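One plausible reading of the transposition step can be sketched as follows: a percentile of the whole population is mapped to the corresponding percentile among participants with a positive score, given the proportion of zero scores at that age and sex. This is an assumption about the general idea, not the authors' exact procedure. The function name and the example proportions are hypothetical.

```python
def positive_quantile_level(overall_level, proportion_zero):
    """Map an overall percentile level to a level among positive scores.

    Returns None when the overall percentile falls within the zero scores,
    in which case the percentile of the CAC score is simply 0.
    """
    if overall_level <= proportion_zero:
        return None
    return (overall_level - proportion_zero) / (1.0 - proportion_zero)


# Hypothetical example: 50% of the group has a CAC score of zero.
print(positive_quantile_level(0.75, 0.5))  # level to read off the positive-score curve
print(positive_quantile_level(0.25, 0.5))  # falls among the zero scores
```

Under this reading, the regression on positive scores supplies the shape of the curve and the proportion of zeros shifts which level of that curve corresponds to a given population percentile.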

Results: Men had higher CAC scores than women, and the prevalence and extent of CAC increased steadily with age. An online CAC calculator was developed, http://flscripts.dk/cacscore/. After entering sex, age and CAC score, the CAC score percentile and the coronary age are depicted, including a figure with the specific CAC score and the 25%, 50%, 75% and 90% percentiles. The specific CAC score can be compared to the entire background population or only to those without prior CVD.

Conclusion: This study provides modern population-based reference values of CAC scores in men and women, and a freely accessible online CAC calculator. Physicians and patients are very familiar with blood pressure and lipids, but unfamiliar with CAC scores. Using the calculator makes it easy to see whether a CAC value is low, moderate or high when a physician communicates and discusses a CAC score with a patient in the future.


[1] Schmermund A et al. Population-based assessment of subclinical coronary atherosclerosis using electron-beam computed tomography. Atherosclerosis 2006;185(1):177-182.

[2] McClelland RL et al. Distribution of coronary artery calcium by race, gender, and age: results from the Multi-Ethnic Study of Atherosclerosis (MESA). Circulation 2006;113(1):30-37.

[3] de Ronde MWJ et al. A pooled-analysis of age and sex based coronary artery calcium scores percentiles. J Cardiovasc Comput Tomogr. 2020;14(5):414-420.

[4] Diederichsen AC et al. Discrepancy between coronary artery calcium score and HeartScore in middle-aged Danes: the DanRisk study. Eur J Prev Cardiol 2012;19(3):558-564.

[5] Diederichsen AC et al. The Danish Cardiovascular Screening Trial (DANCAVAS): study protocol for a randomized controlled trial. Trials 2015;16:554.

Adaptive Designs II

Chairs: Geraldine Rauch and Gernot Wassmer

Opportunities and limits of optimal group-sequential designs
Maximilian Pilz1, Carolin Herrmann2,3, Geraldine Rauch2,3, Meinhard Kieser1
1Institute of Medical Biometry and Informatics – University of Heidelberg, Germany; 2Institute of Biometry and Clinical Epidemiology – Charité University Medicine Berlin, Germany; 3Berlin Institute of Health

Multi-stage designs for clinical trials are becoming increasingly popular. There are two main reasons for this development. The first is the flexibility to modify the study design during the ongoing trial. This possibility is highly beneficial to avoid the failure of trials whose planning assumptions were enormously wrong. However, an unplanned design modification mid-course can also be performed in a clinical trial that has initially been planned without adaptive elements as long as the conditional error principle is applied. The second reason for the popularity of adaptive designs is the performance improvement that arises by applying a multi-stage design. For instance, an adaptive two-stage design can enormously reduce the expected sample size of a trial compared to a single-stage design.

With regard to this performance reason, a two-stage design can entirely be pre-specified before the trial starts. While this still leaves open the option to modify the design, it is preferred by regulatory authorities. Recent work treats the topic of optimal adaptive designs. While those show the best possible performance, it may be difficult to communicate them to a practitioner and to outline them in a study protocol.

To overcome this problem, simpler optimal group-sequential designs may be an option worth considering. Those only consist of two sample sizes (stage one and stage two) and three critical values (early futility, early efficacy, final analysis). Thus, they can easily be described and communicated.
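The five design parameters just listed are enough to write down the full decision rule, which is what makes such designs easy to communicate. The sketch below assumes one-sided tests on a z-scale where large values favour the experimental treatment. The class name, field names, and all numeric values are hypothetical and are not optimized or calibrated to any significance level.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TwoStageDesign:
    n1: int           # sample size of stage one
    n2: int           # sample size of stage two
    futility: float   # stop early for futility if z1 < futility
    efficacy: float   # stop early for efficacy if z1 > efficacy
    final: float      # reject H0 at the final analysis if z_final > final


def decide(design: TwoStageDesign, z1: float,
           z_final: Optional[float] = None) -> str:
    """Apply the group-sequential decision rule to observed test statistics."""
    if z1 < design.futility:
        return "stop for futility"
    if z1 > design.efficacy:
        return "stop for efficacy"
    if z_final is None:
        return "continue to stage two"
    return "reject H0" if z_final > design.final else "do not reject H0"


# Placeholder values for illustration only.
design = TwoStageDesign(n1=60, n2=60, futility=0.0, efficacy=2.8, final=1.98)
print(decide(design, z1=1.2))
print(decide(design, z1=1.2, z_final=2.1))
```

An optimal adaptive design would replace the fixed `n2` and `final` by functions of the interim result, which is the additional complexity that has to be described in a study protocol.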

In this talk, we present a variety of examples to investigate whether optimal group-sequential designs are a valid approximation of optimal adaptive designs. We elaborate design properties that can be fulfilled by optimal group-sequential designs without considerable performance deterioration and describe situations where an optimal adaptive design may be more appropriate. Furthermore, we give recommendations on how to specify an optimal two-stage design in the study protocol in order to motivate its application in clinical trials.

Group Sequential Methods for Nonparametric Relative Effects
Claus Peter Nowak1, Tobias Mütze2, Frank Konietschke1
1Charité – Universitätsmedizin Berlin, Germany; 2Novartis Pharma AG, Basel, Switzerland

Late phase clinical trials are occasionally planned with one or more interim analyses to allow for early termination or adaptation of the study. While extensive theory and software have been developed for normal, binary and survival endpoints, there has been comparatively little discussion in the group sequential literature on nonparametric methods outside the time-to-event setting. Focussing on the comparison of two parallel treatment arms, we show that the Wilcoxon-Mann-Whitney test, the Brunner-Munzel test, as well as a test procedure based on the log win odds, a modification of the win ratio, asymptotically follow the canonical joint distribution. Consequently, standard group sequential theory can be applied to plan, analyse and adapt clinical trials based on nonparametric efficacy measures. In addition, simulation studies examining type I error rates confirm the adequacy of the proposed methods for a range of scenarios. Lastly, we apply our methodology to the FREEDOMS clinical trial (ClinicalTrials.gov Identifier: NCT00289978), analysing relapse in patients with relapsing-remitting multiple sclerosis.
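The nonparametric effect measures mentioned above can be illustrated with plain point estimates. The sketch below computes the relative effect p = P(X < Y) + 0.5 P(X = Y) from two samples and the log win odds log(p / (1 - p)). It covers estimation only and leaves out the variance estimators and the group sequential machinery. Function names and data are hypothetical.

```python
from math import log


def relative_effect(x, y):
    """Estimate P(X < Y) + 0.5 * P(X = Y) over all pairs of observations."""
    score = sum((xi < yj) + 0.5 * (xi == yj) for xi in x for yj in y)
    return score / (len(x) * len(y))


def log_win_odds(x, y):
    """Log of p / (1 - p), where p is the estimated relative effect.

    Undefined when the samples are completely separated (p equal to 0 or 1).
    """
    p = relative_effect(x, y)
    return log(p / (1.0 - p))


# Hypothetical data with ties between the two arms.
control = [1, 2, 3]
treatment = [2, 3, 4]
print(relative_effect(control, treatment))
print(log_win_odds(control, treatment))
```

A relative effect of 0.5, equivalently log win odds of 0, corresponds to no tendency towards larger values in either arm. Ties count half, which is what distinguishes the win odds from the win ratio.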

Optimal futility stops in two-stage group-sequential gold-standard designs
Jan Meis, Maximilian Pilz, Meinhard Kieser
Institute of Medical Biometry and Informatics, University of Heidelberg, Heidelberg, Germany

A common critique of non-inferiority trials comparing an experimental treatment to an active control is that they may lack assay sensitivity. This denotes the ability of a trial to distinguish an effective treatment from an ineffective one. The 'gold-standard' non-inferiority trial design circumvents this concern by comparing three groups in a hierarchical testing procedure. First, the experimental treatment is compared to a placebo group in an effort to show superiority. Only if this succeeds is the experimental treatment tested for non-inferiority against an active control group. Ethical and practical considerations require sample sizes of clinical trials to be as large as necessary, but as small as possible. These considerations become especially pressing in the gold-standard design, as patients are exposed to placebo doses while the control treatment is already known to be effective.

Group sequential trial designs are known to reduce the expected sample size under the alternative hypothesis. In their pioneering work, Schlömer and Brannath (2013) show that the gold-standard design is no exception to this rule. In their paper, they calculate approximately optimal rejection boundaries for the gold-standard design given sample size allocation ratios of the optimal single-stage design. We extend their work by relaxing the constraints put on the group allocation ratios and allowing for futility stops at interim. The futility boundaries and the sample size allocation ratios will be considered as optimization parameters, together with the efficacy boundaries. This allows the investigation of the efficiency gain from including the option to stop for futility. Allowing discontinuation of a trial when faced with underwhelming results at an interim analysis has very practical implications in saving resources and sparing patients from being exposed to ineffective treatment. In the gold-standard design, these considerations are especially pronounced. There is a large incentive to prevent further patients from being exposed to placebo treatment when interim results suggest that a confirmatory result in the final analysis has become unlikely.

Besides the extended design options, we analyse different choices of optimality criteria. The above considerations suggest that the null hypothesis also plays an important role in the judgement of the gold-standard design. Therefore, optimality criteria that incorporate the design performance under the alternative and the null hypothesis are introduced. The results of our numerical optimization procedure for this extended design will be discussed and compared to the findings of Schlömer and Brannath.

Blinded sample size re-estimation in a paired diagnostic study
Maria Stark, Antonia Zapf
University Medical Center Hamburg-Eppendorf, Germany

In a paired confirmatory diagnostic accuracy study, a new experimental test is compared within the same patients to an already existing comparator test. The gold standard defines the true disease status. Hence, each patient undergoes three diagnostic procedures. If feasible and ethically acceptable, regulatory agencies prefer this study design to an unpaired design (CHMP, 2009). The initial sample size calculation is based on assumptions about, among others, the prevalence of the disease and the proportion of discordant test results between the experimental and the comparator test (Miettinen, 1968).

To adjust these assumptions during the study period, an adaptive design for a paired confirmatory diagnostic accuracy study is introduced. This adaptive design is used to re-estimate the prevalence and the proportion of discordant test results to finally re-calculate the sample size. It is a blinded adaptive design as the sensitivity and the specificity of the experimental and comparator test are not re-estimated. Due to the blinding, the type I error rates are not inflated.
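A minimal sketch of the blinded re-calculation idea, restricted to the prevalence: the interim estimate uses only the gold-standard disease status, so no information on the accuracy of the two tests is unblinded. The function name and the simple rule (inflate the required numbers of diseased and non-diseased patients by the estimated prevalence) are illustrative assumptions; the re-estimation of the proportion of discordant results mentioned in the abstract is not covered here.

```python
def reestimate_total_sample_size(n_diseased_required, n_nondiseased_required,
                                 interim_n, interim_diseased):
    """Blinded update of the total sample size from the interim prevalence.

    Only the disease status is used; integer arithmetic avoids rounding issues.
    """
    if not 0 < interim_diseased < interim_n:
        raise ValueError("interim data must contain diseased and non-diseased patients")
    interim_nondiseased = interim_n - interim_diseased
    # Ceiling division: total needed so that enough diseased patients are expected.
    total_for_sens = -(-n_diseased_required * interim_n // interim_diseased)
    # Same for the non-diseased patients needed for the specificity comparison.
    total_for_spec = -(-n_nondiseased_required * interim_n // interim_nondiseased)
    # Never go below the number of patients already recruited.
    return max(total_for_sens, total_for_spec, interim_n)
```

For example, if 100 diseased patients are required and 20 of the first 200 patients are diseased, the re-calculated total is 1000, twice what a planning prevalence of 20 % would have suggested.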

An example and a simulation study illustrate the adaptive design. The type I error rate, the power and the sample size of the adaptive design are compared to those of a fixed design. Both designs hold the type I error rate. The adaptive design reaches the targeted power. The fixed design can be either over- or underpowered, depending on a possibly wrong assumption in the sample size calculation.

The adaptive design compensates for inefficiencies of the sample size calculation and therefore helps to achieve the desired study aim.


[1] Committee for Medicinal Products for Human Use, Guideline on clinical evaluation of diagnostic agents. Available at: http://www.ema.europa.eu. 2009, 1-19.

[2] O. S. Miettinen, The matched pairs design in the case of all-or-none responses. 24(2), 1968, 339-352.

Sample size calculation and blinded re-estimation for diagnostic accuracy studies considering missing or inconclusive test results
Cordula Blohm1, Peter Schlattmann2, Antonia Zapf3
1Universität Heidelberg; 2Universitätsklinikum Jena; 3Universitätsklinikum Hamburg-Eppendorf


For diagnostic accuracy studies the two independent co-primary endpoints, sensitivity and specificity, are relevant. Both parameters are calculated based on the disease status and the test result, which is evaluated as either positive or negative. Sometimes the test result is neither positive nor negative but inconclusive or even missing. There are four frequently used methods of handling such results: they are counted as missing, as positive, as negative, or as false positive and false negative. The first three approaches may lead to an overestimation of both parameters, or of either sensitivity or specificity. In the fourth approach, the intention to diagnose principle (ITD), both parameters decrease and a more realistic picture of the clinical potential of diagnostic tests is provided (Schuetz et al. 2012).
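The four handling methods can be made concrete with a small sketch. The function and argument names are hypothetical; `inc_d` and `inc_nd` denote the numbers of inconclusive or missing results among diseased and non-diseased patients.

```python
def accuracy_with_inconclusives(tp, fn, fp, tn, inc_d, inc_nd, method):
    """Return (sensitivity, specificity) under one of four handling methods."""
    if method == "missing":      # inconclusive results are excluded
        sens = tp / (tp + fn)
        spec = tn / (tn + fp)
    elif method == "positive":   # inconclusive results count as test positive
        sens = (tp + inc_d) / (tp + fn + inc_d)
        spec = tn / (tn + fp + inc_nd)
    elif method == "negative":   # inconclusive results count as test negative
        sens = tp / (tp + fn + inc_d)
        spec = (tn + inc_nd) / (tn + fp + inc_nd)
    elif method == "itd":        # intention to diagnose: always counted as wrong
        sens = tp / (tp + fn + inc_d)
        spec = tn / (tn + fp + inc_nd)
    else:
        raise ValueError(f"unknown method: {method}")
    return sens, spec
```

With any positive number of inconclusive results, the ITD estimates are never larger than those of the other three methods, which is the decrease in both parameters described above.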

Sensitivity and specificity are also key parameters in sample size calculation of diagnostic accuracy studies and a realistic estimate of them is mandatory for the success of a trial. Therefore, the consideration of inconclusive results in the initial sample size calculation and, especially, in a blinded sample size re-calculation based on an interim analysis could improve trial design.


For sample size calculation, the minimum sensitivity and specificity, the type I error rate and the power are defined. In addition, the expected sensitivity and specificity of the experimental test, the prevalence, and the proportion of inconclusive results are assumed. For the simulation study different scenarios are chosen by varying these parameters. The optimal sample size is calculated according to Stark and Zapf (2020). The inconclusive results are generated independently of disease status and randomly distributed over diseased and non-diseased subjects. The sensitivity and specificity of the experimental test are estimated while considering the four different methods, mentioned above, to handle inconclusive results.

The sample size re-calculation is performed with a blinded one-time re-estimation of the proportion of inconclusive results. The power, the type I error rate, and the bias of estimated sensitivity and specificity are used as performance measures.


The simulation study aims to evaluate the influence of inconclusive results on the evaluation of diagnostic test accuracy in an adaptive study design. The performance difference of the four methods to handle inconclusive results will be discussed.

Adaptive Designs I

Chairs: Thomas Asendorf and Rene Schmidt

Statistical Issues in Confirmatory Platform Trials
Martin Posch, Elias Meyer, Franz König
Center for Medical Statistics, Informatics and Intelligent Systems, Medical University of Vienna, Austria

Adaptive platform trials provide a framework to simultaneously study multiple treatments in a disease. They are multi-armed trials where interventions can enter and leave the platform based on interim analyses as well as external events, for example, if new treatments become available [3]. The attractiveness of platform trials compared to separate parallel group trials is not only due to operational aspects such as a joint trial infrastructure and more efficient patient recruitment, but results also from the possibility to share control groups, to efficiently prune non-efficacious treatments, and to allow for direct comparisons between experimental treatment arms [2]. However, the flexibility of the framework also comes with challenges for statistical inference and interpretation of trial results such as the adaptivity of platform trials (decisions on the addition or dropping of arms cannot be fully pre-specified and may have an impact on the recruitment for the current trial arms), multiplicity issues (due to multiple interventions, endpoints, subgroups and interim analyses) and the use of shared controls (which may be non-concurrent controls or control groups where the control treatment changes over time). We will discuss current controversies and the proposed statistical methodology to address these issues [1,3,4]. Furthermore, we give an overview of the IMI project EU-PEARL (Grant Agreement no. 853966) that aims to establish a general framework for platform trials, including the necessary statistical and methodological tools.

[1] Collignon O., Gartner C., Haidich A.-B., Hemmings R.J., Hofner B., Pétavy F., Posch M., Rantell K., Roes K., Schiel A. Current Statistical Considerations and Regulatory Perspectives on the Planning of Confirmatory Basket, Umbrella, and Platform Trials. Clinical Pharmacology & Therapeutics 107(5), 1059–1067, (2020)

[2] Collignon O., Burman C.F., Posch M., Schiel A. Collaborative platform trials to fight COVID-19: methodological and regulatory considerations for a better societal outcome. Clinical Pharmacology & Therapeutics (to appear)

[3] Meyer E.L., Mesenbrink P., Dunger-Baldauf C., Fülle H.-J., Glimm E., Li Y., Posch M., König F. The Evolution of Master Protocol Clinical Trial Designs: A Systematic Literature Review. Clinical Therapeutics 42(7), 1330–1360, (2020)

[4] Posch M., König F. Are p-values Useful to Judge the Evidence Against the Null Hypotheses in Complex Clinical Trials? A Comment on “The Role of p-values in Judging the Strength of Evidence and Realistic Replication Expectations”. Statistics in Biopharmaceutical Research, 1–3, (2020)

Type X Error: Is it time for a new concept?
Cornelia Ursula Kunz
Boehringer Ingelheim Pharma GmbH & Co. KG, Germany

A fundamental principle of how we decide between different trial designs and different test statistics is the control of error rates as defined by Neyman and Pearson, namely the type I error rate and the type II error rate. The first is the probability of rejecting a true null hypothesis and the second is the probability of not rejecting a false null hypothesis. When Neyman and Pearson first introduced the concepts of type I and type II error, they could not have predicted the increasing complexity of many trials conducted today and the problems that arise with them.

Modern clinical trials often try to address several clinical objectives at once, hence testing more than one hypothesis. In addition, trial designs are becoming more and more flexible, allowing an ongoing trial to be adapted by changing, for example, the number of treatment arms, target populations, sample sizes and so on. It is also known that in some cases the adaptation of the trial leads to a change of the hypothesis being tested, as happens, for example, when the primary endpoint of the trial is changed at an interim analysis.

While the focus of Neyman and Pearson was on finding the most powerful test for a given hypothesis, we nowadays often face the problem of finding the right trial design in the first place, before even attempting to find the most powerful test or, in some cases, any test at all. Furthermore, when more than one hypothesis is being tested, family-wise type I error control in the weak or strong sense also has to be addressed, with different opinions on when we need to control for it and when we might not need to.

Based on some trial examples, we show that the more complex the clinical trial objectives, the more difficult it might be to establish a trial that is actually able to answer the research question. Often it is not sufficient or even possible to translate the trial objectives into simple hypotheses that are then being tested by some most powerful test statistic. However, when the clinical trial objectives cannot completely be addressed by a set of null hypotheses, type I and type II error might not be sufficient anymore to decide on the admissibility of a trial design or test statistic. Hence, we raise the question whether a new kind of error should be introduced.

Control of the population-wise error rate in group sequential trials with multiple populations
Charlie Hillner, Werner Brannath
Competence Center for Clinical Trials, Germany

In precision medicine one is often interested in clinical trials that investigate the efficacy of treatments that are targeted to specific sub-populations defined by genetic and/or clinical biomarkers. When testing hypotheses in multiple populations, multiplicity adjustments are needed. First, we propose a new multiple type I error criterion for clinical trials with multiple intersecting populations, the population-wise error rate (PWER), which is based on the observation that not all type I errors are relevant to all patients in the overall population. If the sub-populations are disjoint, no adjustment for multiplicity appears necessary, since a claim in one sub-population does not affect patients in the other ones. For intersecting sub-populations we suggest controlling the probability that a randomly selected patient will be exposed to an inefficient treatment, which is an average multiple type I error. We propose group sequential designs that control the PWER where possibly multiple treatments are investigated in multiple populations. To this end, an error spending approach that ensures PWER-control is introduced. We exemplify this approach for a setting of two intersecting sub-populations and discuss how the number of different treatments to be tested in each sub-population affects the critical boundaries needed for PWER-control. Lastly, we apply this error spending approach to a group sequential design example from Magnusson & Turnbull (2013), where the efficacy of one treatment is to be tested after a certain sub-population that is likely to benefit from the treatment is found. We compare our PWER-controlling method with their FWER-controlling method in terms of critical boundaries and the resulting rejection probabilities and expected information.
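To illustrate the averaging idea for two intersecting sub-populations, the following sketch estimates the PWER and the FWER by Monte Carlo under the global null hypothesis for a single-stage design. It rests on assumptions not taken from the abstract: one treatment per population, z statistics that pool stratum-wise statistics with weights proportional to the stratum prevalences `pi_1`, `pi_2` (patients in only one population) and `pi_12` (patients in both), and a common one-sided critical value `crit`.

```python
import random


def simulate_error_rates(pi_1, pi_2, pi_12, crit, n_sim=20000, seed=1):
    """Monte Carlo estimates of (PWER, FWER) under the global null hypothesis."""
    rng = random.Random(seed)
    # Weights that combine stratum-wise z statistics into population-wise ones.
    a1, c1 = (pi_1 / (pi_1 + pi_12)) ** 0.5, (pi_12 / (pi_1 + pi_12)) ** 0.5
    a2, c2 = (pi_2 / (pi_2 + pi_12)) ** 0.5, (pi_12 / (pi_2 + pi_12)) ** 0.5
    rej_1 = rej_2 = rej_any = 0
    for _ in range(n_sim):
        z_only1, z_only2, z_both = (rng.gauss(0.0, 1.0) for _ in range(3))
        r1 = a1 * z_only1 + c1 * z_both > crit   # false rejection in population 1
        r2 = a2 * z_only2 + c2 * z_both > crit   # false rejection in population 2
        rej_1 += r1
        rej_2 += r2
        rej_any += r1 or r2
    # A patient is affected only by errors in the populations they belong to.
    pwer = (pi_1 * rej_1 + pi_2 * rej_2 + pi_12 * rej_any) / n_sim
    fwer = rej_any / n_sim
    return pwer, fwer
```

Since only patients in the intersection are affected by both hypotheses, the PWER is at most the FWER, so critical values that control the PWER can be less strict than FWER-controlling ones.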


Magnusson, B.P. and Turnbull, B.W. (2013), Group sequential enrichment design incorporating subgroup selection. Statist. Med., 32: 2695-2714. https://doi.org/10.1002/sim.5738

Adaptive group sequential designs for phase II trials with multiple time-to-event endpoints
Moritz Fabian Danzer1, Tobias Terzer2, Andreas Faldum1, Rene Schmidt1
1Institute of Biostatistics and Clinical Research, University of Münster, Germany; 2Division of Biostatistics, German Cancer Research Center, Heidelberg, Germany

Existing methods concerning the assessment of long-term survival outcomes in one-armed trials are commonly restricted to one primary endpoint. Corresponding adaptive designs suffer from limitations regarding the use of information from other endpoints in interim design changes. Here we provide adaptive group sequential one-sample tests for testing hypotheses on the multivariate survival distribution derived from multi-state models, while making provision for data-dependent design modifications based on all involved time-to-event endpoints. We explicitly elaborate on the application of the methodology to one-sample tests for the joint distribution of (i) progression-free survival (PFS) and overall survival (OS) in the context of an illness-death model, and (ii) time to toxicity and time to progression while accounting for death as a competing event. Large sample distributions are derived using a counting process approach. Small sample properties and sample size planning are studied by simulation. An already established multi-state model for non-small cell lung cancer is used to illustrate the adaptive procedure.

Panel Discussion: Networking – Better Career Opportunities Through Networking?

Chairs: Bjoern-Hergen Laabs and Stefanie Peschel

Whether in industry or in academia – many professionals would probably not be in their current position without a good professional network. But is an extensive network really necessary, or is good performance sufficient? Would the same career doors perhaps have opened without a good network? We want to explore these and many other questions about networking in the panel discussion at the 67th Biometric Colloquium.

We are pleased to welcome the following guests to our virtual panel:

  • Dr. Ralph Brinks (Universität Witten/Herdecke)
  • Dr. Ronja Foraita (Leibniz-Institut für Präventionsforschung und Epidemiologie – BIPS, Bremen)
  • Dr. Anke Huels (Emory University, Atlanta, USA)
  • Minh-Anh Le (Munich RE, München)
  • Prof. Dr. Christian L. Müller (Helmholtz Zentrum München; LMU München; Simons Foundation, New York, USA)
  • Jessica Rohmann (Charité – Universitätsmedizin Berlin)

Moderated by Stefanie Peschel (Helmholtz Zentrum München), we would like to examine the topic of networking from different angles together with our guests. The topics will include, among others:

  • Twitter and Co. – How do I use social media channels properly?
  • LinkedIn and Xing – How do I present myself well?
  • Membership in professional societies – Which networking opportunities does it offer?
  • Networking in virtual meetings – Is that even possible?
  • Further options – Are there other ways to build a good network?

Since these topics will certainly not cover all of our audience's questions, we cordially invite the audience to put questions to the panelists. There will be a question round during the session, and questions can also be asked at any time via our chat. After the event there will also be an opportunity to get in touch with our guests. Questions are also welcome in advance by e-mail to ag-nachwuchs@googlegroups.com.

The event will be held in German. The panel discussion is organized by the early career working group (AG Nachwuchs) of the IBS-DR.

Data Sharing and Reproducible Research

Chairs: Ronja Foraita and Iris Pigeot

The Statistical Assessment of Replication Success
Leonhard Held
Epidemiology, Biostatistics and Prevention Institute (EBPI) and Center for Reproducible Science (CRS), University of Zurich

Replicability of research findings is crucial to the credibility of all empirical domains of science. However, there is no established standard for how to assess replication success, and in practice many different approaches are used. Statistical significance of both the original and replication study is known as the two-trials rule in drug regulation but does not take the corresponding effect sizes into account.

We compare the two-trials rule with the sceptical p-value (Held, 2020), an attractive compromise between hypothesis testing and estimation. This approach penalizes shrinkage of the replication effect estimate compared to the original one, while ensuring that both are also statistically significant to some extent. We describe a recalibration of the procedure as proposed in Held et al. (2020), the golden level. The golden level guarantees that borderline significant original studies can only be replicated successfully if the replication effect estimate is larger than the original one. The recalibrated sceptical p-value offers uniform gains in project power compared to the two-trials rule and controls the Type-I error rate except for very small replication sample sizes. An application to data from four large replication projects shows that the new approach leads to more appropriate inferences, as it penalizes shrinkage of the replication estimate compared to the original one, while ensuring that both effect estimates are sufficiently convincing on their own. Finally, we describe how the approach can also be used to design the replication study based on specification of the minimum relative effect size to achieve replication success.

Held, Leonhard (2020) A new standard for the analysis and design of replication studies (with discussion). Journal of the Royal Statistical Society, Series A, 183:431–469.

Held, Leonhard and Micheloud, Charlotte and Pawel, Samuel (2020). The assessment of replication success based on relative effect size. https://arxiv.org/abs/2009.07782

Multivariate regression modelling with global and cohort-specific effects in a federated setting with data protection constraints
Max Behrens, Daniela Zöller
University of Freiburg, Germany

Multi-cohort studies are an important tool to study effects on a large sample size and to identify cohort-specific effects. Thus, researchers would like to share information between cohorts and research institutes. However, data protection constraints forbid the exchange of individual-level data between different research institutes. To circumvent this problem, only non-disclosive aggregated data is exchanged, which is often done manually and requires explicit permission before transfer. The framework DataSHIELD enables automatic exchange in iterative calls, but methods for performing more complex tasks such as federated optimisation and boosting techniques are missing.

We propose an iterative optimization of multivariate regression models which condenses global (cohort-unspecific) and local (cohort-specific) predictors. This approach will be solely based on non-disclosive aggregated data from different institutions. The approach should be applicable in a setting with high-dimensional data with complex correlation structures. Nonetheless, the amount of transferred data should be limited to enable manual confirmation of data protection compliance.

Our approach implements an iterative optimization between local and global model estimates. Herein, the linear predictor of the global model will act as a covariate in the local model estimation. Subsequently, the linear predictor of the updated local model is included in the global model estimation. The procedure is repeated until no further model improvement is observed for the local model estimates. In case of an unknown variable structure, our approach can be extended with an iterative boosting procedure performing variable selection for both the global and local model.
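The alternation between global and local estimation can be sketched as follows. This is a deliberately reduced illustration, not the DataSHIELD implementation: it assumes a linear model without intercept with one global predictor `x` and one cohort-specific predictor `w`, enters the other model's linear predictor as a fixed offset rather than as a covariate, and runs a fixed number of iterations instead of a convergence check. Only aggregated sums leave each cohort.

```python
def local_aggregates(cohort, gamma):
    """Computed inside a cohort; only two non-disclosive sums are returned."""
    sxr = sum(x * (y - gamma * w)
              for x, w, y in zip(cohort["x"], cohort["w"], cohort["y"]))
    sxx = sum(x * x for x in cohort["x"])
    return sxr, sxx


def local_update(cohort, beta):
    """Refit the cohort-specific coefficient with the global predictor as offset."""
    swr = sum(w * (y - beta * x)
              for x, w, y in zip(cohort["x"], cohort["w"], cohort["y"]))
    sww = sum(w * w for w in cohort["w"])
    return swr / sww


def fit_federated(cohorts, n_iter=25):
    """Alternate between the pooled global fit and the cohort-specific fits."""
    beta = 0.0
    gammas = {name: 0.0 for name in cohorts}
    for _ in range(n_iter):
        parts = [local_aggregates(c, gammas[name]) for name, c in cohorts.items()]
        beta = sum(p[0] for p in parts) / sum(p[1] for p in parts)  # global step
        for name, c in cohorts.items():                             # local steps
            gammas[name] = local_update(c, beta)
    return beta, gammas


# Hypothetical noise-free toy data: global effect 2.0, cohort effects 1.0 and -0.5.
cohorts = {
    "A": {"x": [1.0, 2.0, 3.0, 4.0], "w": [1.0, -1.0, 1.0, -1.0]},
    "B": {"x": [0.0, 1.0, 2.0, 5.0], "w": [2.0, 0.0, -1.0, 1.0]},
}
true_beta, true_gamma = 2.0, {"A": 1.0, "B": -0.5}
for name, c in cohorts.items():
    c["y"] = [true_beta * x + true_gamma[name] * w for x, w in zip(c["x"], c["w"])]

beta_hat, gamma_hat = fit_federated(cohorts)
```

On this toy data the alternating updates recover the generating coefficients; with real, noisy data they would approach the joint least-squares fit instead.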

In a simulation study, we aim to show that our approach improves both global and local model estimates while preserving the globally found effect structure. Furthermore, we want to demonstrate how the approach grants protected access to a multi-cohort data pool for gender-sensitive studies. Specifically, we aim to apply the approach to improve upon cohort-specific model estimates by incorporating a global model based on multiple cohorts. We will apply the method to real data obtained in the GESA project, where we combined data from the three large German population-based cohorts GHS, SHIP, and KORA to identify potential predictors for mental health trajectories.

In general, all gradient-based methods can be adapted easily to a federated setting under data protection constraints. The method presented here can be used in this setting to perform iterative optimisation and can thus aid in the process of understanding cohort-specific estimates. We provide an implementation in the DataSHIELD framework.

A replication crisis in methodological statistical research?
Anne-Laure Boulesteix1, Stefan Buchka1, Alethea Charlton1, Sabine Hoffmann1, Heidi Seibold2, Rory Wilson2
1LMU Munich, Germany; 2Helmholtz Zentrum Munich, Germany

Statisticians are often keen to analyze the statistical aspects of the so-called “replication crisis”. They condemn fishing expeditions and publication bias across empirical scientific fields applying statistical methods. But what about good practice issues in their own – methodological – research, i.e. research considering statistical methods as research objects? When developing and evaluating new statistical methods and data analysis tools, do statisticians adhere to the good practice principles they promote in fields which apply statistics? I argue that statisticians should make substantial efforts to address what may be called the replication crisis in the context of methodological research in statistics and data science. In the first part of my talk, I will discuss topics such as publication bias, the design and necessity of neutral comparison studies and the importance of appropriate reporting and research synthesis in the context of methodological research.

In the second part of my talk I will empirically illustrate a specific problem which affects research articles presenting new data analysis methods. Most of these articles claim that “the new method performs better than existing methods”, but the veracity of such statements is questionable. An optimistic bias may arise during the evaluation of novel data analysis methods resulting from, for example, selection of datasets or competing methods; better ability to fix bugs in a preferred method; and selective reporting of method variants. This bias is quantitatively investigated using a topical example from epigenetic analysis: normalization methods for data generated by the Illumina HumanMethylation450K BeadChip microarray.

Reproducible bioinformatics workflows: A case study with software containers and interactive notebooks
Anja Eggert, Pal O Westermark
Leibniz Institute for Farm Animal Biology, Germany

We foster transparent and reproducible workflows in bioinformatics, which is challenging given their complexity. We developed a new statistical method in the field of circadian rhythmicity, which allows one to rigorously determine whether measured quantities such as gene expressions are not rhythmic. Knowledge of no or at most weak rhythmicity may significantly simplify studies, aid detection of abolished rhythmicity, and facilitate selection of non-rhythmic reference genes or compounds, among other applications. We present our solution to this problem in the form of a precisely formulated mathematical statistic accompanied by software called SON (Statistics Of Non-rhythmicity). The statistical method itself is implemented in the R package “HarmonicRegression”, available on the CRAN repository. However, the bioinformatics workflow is much larger than the statistical test. For instance, to ensure the applicability and validity of the statistical method, we simulated data sets of 20,000 gene expressions over two days, with a large range of parameter combinations (e.g. sampling interval, fraction of rhythmicity, amount of outliers, detection limit of rhythmicity, etc.). Here we describe and demonstrate the use of a Jupyter notebook to document, specify, and distribute our new statistical method and its application to both simulated and experimental data sets. Jupyter notebooks combine text documentation with dynamically editable and executable code and are an implementation of the concept of literate programming. Thus, parameters and code can be modified, allowing both verification of results and instant experimentation by peer reviewers and other users in the scientific community. Our notebook runs inside a Docker software container, which mirrors the original software environment. This approach avoids the need to install any software and ensures complete long-term reproducibility of the workflow. This bioinformatics workflow allows full reproducibility of our computational work.

Young Statisticians Session

Chairs: Bjoern-Hergen Laabs and Janine Witte

Predictions by random forests – confidence intervals and their coverage probabilities
Diana Kormilez, Björn-Hergen Laabs, Inke R. König
Institut für Medizinische Biometrie und Statistik, Universität zu Lübeck, Universitätsklinikum Schleswig-Holstein, Campus Lübeck, Germany

Random forests are a popular supervised learning method. Their main purpose is the robust prediction of an outcome based on a learned set of rules. To evaluate the precision of predictions, their scattering and distributions are important. In order to quantify this, 95 % confidence intervals for the predictions can be generated using suitable variance estimators. However, these estimators may under- or overestimate the variance, so that the confidence intervals cover ranges that are either too small or too large, which can be evaluated by estimating coverage probabilities through simulations. The aim of our study was to examine coverage probabilities for two popular variance estimators for predictions made by random forests, the infinitesimal jackknife according to Wager et al. (2014) and the fixed-point based variance estimator according to Mentch and Hooker (2016). We performed a simulation study considering different scenarios with varying sample sizes and various signal-to-noise ratios. Our results show that the coverage probabilities based on the infinitesimal jackknife are lower than the desired 95 % for small data sets and small random forests. On the other hand, the variance estimator according to Mentch and Hooker (2016) leads to coverage probabilities above the desired level. However, a growing number of trees yields decreasing coverage probabilities for both methods. A similar behavior was observed when using real datasets, where the composition of the data and the number of trees influence the coverage probabilities. In conclusion, we observed that the relative performance of one variance estimation method over the other depends on the hyperparameters used for training the random forest. Likewise, the coverage probabilities can be used to evaluate how well the hyperparameters were chosen and whether the data set requires more pre-processing.
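How coverage probabilities are estimated by simulation can be shown on a much simpler estimator than a random forest. The sketch below uses the sample mean of normal data and a deliberately mis-scaled variance estimate (`variance_scale`, a hypothetical knob) to mimic a variance estimator that is too small or too large; it does not implement the infinitesimal jackknife or the estimator of Mentch and Hooker.

```python
import random
from statistics import NormalDist, mean, stdev


def estimate_coverage(n, variance_scale=1.0, level=0.95, n_sim=2000, seed=42):
    """Share of simulated confidence intervals that contain the true mean."""
    rng = random.Random(seed)
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    true_mean = 0.0
    hits = 0
    for _ in range(n_sim):
        sample = [rng.gauss(true_mean, 1.0) for _ in range(n)]
        # variance_scale < 1 underestimates, > 1 overestimates the variance.
        se = (variance_scale ** 0.5) * stdev(sample) / n ** 0.5
        centre = mean(sample)
        if centre - z * se <= true_mean <= centre + z * se:
            hits += 1
    return hits / n_sim
```

Calling the function with the same seed and different values of `variance_scale` shows the pattern described above: underestimated variances give intervals that are too narrow and cover too rarely, overestimated variances give intervals that cover too often.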

Mentch L, Hooker G (2016): Quantifying Uncertainty in Random Forests via Confidence Intervals and Hypothesis Tests. J. Mach. Learn. Res., 17:1-41.

Wager S, Hastie T, Efron B (2014): Confidence Intervals for Random Forests: The Jackknife and the Infinitesimal Jackknife. J. Mach. Learn. Res., 15:1625-1651.

Over-optimism in benchmark studies and the multiplicity of analysis strategies when interpreting their results
Christina Nießl1, Moritz Herrmann2, Chiara Wiedemann1, Giuseppe Casalicchio2, Anne-Laure Boulesteix1
1Institute for Medical Informatics, Biometry and Epidemiology, University of Munich (Germany); 2Department of Statistics, University of Munich (Germany)

In recent years, the need for neutral benchmark studies that focus on the comparison of statistical methods has been increasingly recognized. At the interface between biostatistics and bioinformatics, benchmark studies are especially important in research fields involving omics data, where hundreds of articles present their newly introduced method as superior to other methods.

While general advice on the design of neutral benchmark studies can be found in recent literature, there is always a certain amount of flexibility that researchers have to deal with. This includes the choice of datasets and performance metrics, the handling of missing values in case of algorithm failure (e.g., due to non-convergence) and the way the performance values are aggregated over the considered datasets. Consequently, different choices in the benchmark design may lead to different results of the benchmark study.

In the best-case scenario, researchers make well-considered design choices prior to conducting a benchmark study and are aware of this issue. However, they may still be concerned about how their choices affect the results. In the worst-case scenario, researchers could (most often subconsciously) use this flexibility and modify the benchmark design until it yields a result they deem satisfactory (for example, the superiority of a certain method). In this way, a benchmark study that is intended to be neutral may become biased.

In this paper, we address this issue in the context of benchmark studies based on real datasets using an example benchmark study, which compares the performance of survival prediction methods on high-dimensional multi-omics datasets. Our aim is twofold. As a first exploratory step, we examine how variable the results of a benchmark study are by trying all possible combinations of choices and comparing the resulting method rankings. In the second step, we propose a general framework based on multidimensional unfolding that allows researchers to assess the impact of each choice and identify critical choices that substantially influence the resulting method ranking. In our example benchmark study, the most critical choices were the considered datasets and the performance metric. However, in some settings, we observed that the handling of missing values and the aggregation of performances over the datasets can also have a great influence on the results.
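The first, exploratory step (trying all combinations of design choices and comparing the resulting rankings) can be sketched with `itertools.product`. The performance values, method names and the two choices considered here (metric and aggregation) are invented for illustration and are not results from the study.

```python
from itertools import product
from statistics import mean, median

# Hypothetical per-dataset performance values (higher is better).
results = {
    "metric_a": {"method_1": [0.70, 0.72, 0.50], "method_2": [0.68, 0.69, 0.69]},
    "metric_b": {"method_1": [0.60, 0.61, 0.62], "method_2": [0.55, 0.56, 0.90]},
}
aggregators = {"mean": mean, "median": median}


def rank_methods(metric, aggregator):
    """Rank methods (best first) for one combination of design choices."""
    scores = {m: aggregators[aggregator](v) for m, v in results[metric].items()}
    return tuple(sorted(scores, key=scores.get, reverse=True))


# One ranking per combination of choices.
rankings = {
    (metric, agg): rank_methods(metric, agg)
    for metric, agg in product(results, aggregators)
}
n_distinct_rankings = len(set(rankings.values()))
```

In this toy table the winning method changes with the aggregation function alone, which is the kind of sensitivity the exploratory step is meant to reveal.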

Variable Importance Measures for Functional Gradient Descent Boosting Algorithm
Zeyu Ding
TU Dortmund, Germany

With the continuous growth of data dimensions and the improvement of computing power in contemporary statistics, the number of variables that can be included in a model is increasing significantly. Therefore, when designing a model, selecting the truly informative variables from a large set of noisy candidates and ranking them according to their importance becomes a core problem of statistics. Appropriate variable selection can reduce overfitting and lead to a more interpretable model. Traditional methods use the decomposition of R², step-wise Akaike Information Criterion (AIC)-based variable selection, or regularization based on lasso and ridge. In ensemble algorithms, the variable importance is often calculated separately by permutation methods. In this contribution, we propose two new stable and discriminating variable importance measures for the functional gradient descent boosting algorithm (FGDB). The first one calculates the ℓ2 norm contribution of a variable, while the second one calculates the risk reduction of the variables in every iteration. Our proposal is demonstrated in both simulation and real data examples. We show that the two new methods are more effective in automatically selecting the truly important variables than the traditional selection frequency measures used in the FGDB algorithm. This holds for both linear and non-linear models under different data scenarios.

keywords: variable importance measures, variable selection, functional gradient boosting, component-wise regression, generalized additive models.
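
As an illustration of the risk-reduction idea described in the abstract above, the following is a minimal sketch and not the author's implementation. It assumes squared-error loss, simple linear base learners fitted component-wise, and a fixed step length; the function names `componentwise_l2_boost` and `simulate` are hypothetical. In every iteration the drop in empirical risk is credited to the selected variable, and the selection count is tracked alongside for comparison.

```python
import random


def componentwise_l2_boost(columns, y, n_iter=200, nu=0.1):
    """Component-wise L2 boosting with simple linear base learners.

    columns: list of p covariate columns (each a list of n floats)
    y:       list of n responses
    Returns (coefficients, risk_reduction, selection_count), each of length p.
    """
    n, p = len(y), len(columns)
    # Centre covariates and response so that no intercept term is needed.
    cols = []
    for col in columns:
        mean = sum(col) / n
        cols.append([v - mean for v in col])
    y_mean = sum(y) / n
    resid = [v - y_mean for v in y]

    coef = [0.0] * p
    risk_reduction = [0.0] * p   # accumulated drop in empirical risk per variable
    selection_count = [0] * p    # selection-frequency measure, for comparison

    def risk(r):
        return sum(v * v for v in r) / n

    for _ in range(n_iter):
        risk_before = risk(resid)
        best = None  # (risk after a full least-squares fit, index, slope)
        for j, col in enumerate(cols):
            sxx = sum(v * v for v in col)
            if sxx == 0.0:
                continue  # constant column carries no information
            slope = sum(v * r for v, r in zip(col, resid)) / sxx
            fit_risk = risk([r - slope * v for v, r in zip(col, resid)])
            if best is None or fit_risk < best[0]:
                best = (fit_risk, j, slope)
        if best is None:
            break
        _, j, slope = best
        step = nu * slope  # shrunken update of the selected component only
        coef[j] += step
        resid = [r - step * v for v, r in zip(cols[j], resid)]
        risk_reduction[j] += risk_before - risk(resid)
        selection_count[j] += 1
    return coef, risk_reduction, selection_count


def simulate(n=200, seed=1):
    """Toy data (an assumption for illustration): only the first covariate matters."""
    rng = random.Random(seed)
    x0 = [rng.gauss(0, 1) for _ in range(n)]
    x1 = [rng.gauss(0, 1) for _ in range(n)]
    x2 = [rng.gauss(0, 1) for _ in range(n)]
    y = [2.0 * a + rng.gauss(0, 0.1) for a in x0]
    return [x0, x1, x2], y
```

Calling `componentwise_l2_boost(*simulate())` returns the accumulated risk reduction per variable. In this toy setting the first covariate should receive almost all of it, whereas noise variables may still be selected in late iterations while contributing very little risk reduction.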

Quality control in genome-wide association studies revisited: a critical evaluation of the standard methods
Hanna Brudermann1, Tanja K. Rausch1,2, Inke R. König1
1Institut für Medizinische Biometrie und Statistik, Universität zu Lübeck, Universitätsklinikum Schleswig-Holstein, Campus Lübeck, Germany; 2Department of Pediatrics, Universität zu Lübeck, Universitätsklinikum Schleswig-Holstein, Campus Lübeck, Lübeck, Germany

Genome-wide association studies (GWAs) investigating the relationship between millions of genetic markers and a clinically relevant phenotype were originally based on the common disease – common variant assumption, thus aiming at identifying a small number of common genetic loci as causes of common diseases. Given the enormous cost reduction in the acquisition of genomic data, it is not surprising that, since the first known GWA by Klein et al. (2005), this study type has become established as a standard method. However, since even low error frequencies can distort association results, extensive and accurate quality control of the given data is mandatory. In recent years, the focus of GWAs has shifted, and the task is no longer primarily the discovery of common genetic loci. With increasing sample sizes and (mega-)meta-analyses of GWAs, it is also hoped that loci with small effects can be identified. Furthermore, it has become popular to aggregate all genomic information, even loci with very small effects and frequencies, into genetic risk prediction scores, thus increasing the requirement for high-quality genetic data.

However, after extensive discussions about standards for quality control in GWAs in the early years, further work on how to control data quality and adapt data cleaning to the new aims of GWAs has become scarce.

The aim of this study was to perform an extensive literature review to evaluate currently applied quality control criteria and their justification. Building on the findings from the literature search, a workflow was developed to include justified quality control steps, keeping in mind that a strict quality control, which removes all data with a high risk of bias, always carries the risk that the remaining data is too homogeneous to make small effects visible. This workflow is subsequently illustrated using a real data set.

Our results show that in most published GWAs, no scientific reasons for the applied quality control steps are given. Cutoffs for the most common quality measures are mostly not explained. In particular, principal component analysis and the test for deviation from Hardy-Weinberg equilibrium are frequently used as quality criteria without the underlying conditions being examined and the quality control being adjusted accordingly.

It is pointed out that researchers still have to decide between universal and individual parameters and therefore between optimal comparability to other analyses and optimal conditions within the specific study.
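
For readers unfamiliar with the Hardy-Weinberg criterion mentioned in the abstract above, the following is a minimal sketch of the textbook one-degree-of-freedom chi-square test for a single biallelic marker. It is an illustration only, not the workflow developed by the authors; the function name `hwe_chi_square` is hypothetical, and no exclusion cutoff is included because that choice is study-specific.

```python
import math


def hwe_chi_square(n_aa, n_ab, n_bb):
    """Chi-square test for deviation from Hardy-Weinberg equilibrium.

    n_aa, n_ab, n_bb: observed genotype counts (AA, AB, BB) at one marker.
    Returns (chi2, p_value) based on a chi-square distribution with 1 df.
    """
    n = n_aa + n_ab + n_bb
    if n == 0:
        raise ValueError("no genotypes observed")
    p = (2 * n_aa + n_ab) / (2 * n)  # estimated frequency of allele A
    q = 1.0 - p
    if p == 0.0 or p == 1.0:
        return 0.0, 1.0  # monomorphic marker: the test is not informative
    observed = (n_aa, n_ab, n_bb)
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Survival function of a chi-square variable with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value
```

The chi-square approximation can be poor for rare alleles or small samples, where an exact test is usually preferred. Which p-value threshold, if any, justifies excluding a marker is precisely the kind of choice the abstract argues should be justified per study.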

Which Test for Crossing Survival Curves? A User’s Guide
Ina Dormuth1, Tiantian Liu2,4, Jin Xu2, Menggang Yu3, Markus Pauly1, Marc Ditzhaus1
1TU Dortmund, Germany; 2East China Normal University, China; 3University of Wisconsin-Madison, USA; 4Technion ‐ Israel Institute of Technology, Israel

Knowledge transfer between statisticians developing new data analysis methods and users is essential. This is especially true for clinical studies with time-to-event endpoints. One of the most common problems is the comparison of survival in two-armed trials. The log-rank test is still the gold standard for answering this question. However, in the case of non-proportional hazards, its informative value may decrease. In the meantime, several extensions have been developed to solve this problem. Since non-proportional or even intersecting survival curves are common in oncology, e.g. in immunotherapy studies, it is important to identify the most appropriate methods and to draw attention to their existence. Therefore, it is our goal to simplify the choice of a test to detect differences in survival when the curves cross. To this end, we reviewed 1,400 recent oncological studies. Limiting our analysis to studies with intersecting survival curves and non-significant log-rank tests based on a sufficient number of observed events, we reconstructed the data sets using a state-of-the-art reconstruction algorithm. To ensure reconstruction quality, only publications with published numbers at risk at multiple points in time, sufficient print quality, and a non-informative censoring pattern were included. After eliminating papers on the basis of these exclusion criteria, we computed the p-values of the log-rank and the Peto-Peto test as references and compared them with those of nine different tests for non-proportional or even crossing hazards. It is shown that tests designed to detect crossing hazards are advantageous, and we provide guidance in choosing a reasonable alternative to the standard log-rank test. This is followed by a comprehensive simulation study and the generalization of one of the test methods to the multi-sample case.
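
As a point of reference for the comparison described above, the following is a minimal sketch of the standard two-sample log-rank test, written from the textbook definition rather than taken from the authors' code. The function name `logrank_test` is hypothetical. The test weights all event times equally, which is why its power can drop when the survival curves cross.

```python
import math


def logrank_test(times1, events1, times2, events2):
    """Two-sample log-rank test.

    times*:  observed times per group
    events*: 1 if the event was observed, 0 if the observation was censored
    Returns (chi2, p_value) based on a chi-square distribution with 1 df.
    """
    data = [(t, e, 1) for t, e in zip(times1, events1)]
    data += [(t, e, 2) for t, e in zip(times2, events2)]
    event_times = sorted({t for t, e, _ in data if e})

    observed1 = expected1 = variance = 0.0
    for t in event_times:
        n = sum(1 for s, _, _ in data if s >= t)                 # at risk, both groups
        n1 = sum(1 for s, _, g in data if s >= t and g == 1)     # at risk, group 1
        d = sum(1 for s, e, _ in data if s == t and e)           # events at time t
        d1 = sum(1 for s, e, g in data if s == t and e and g == 1)
        observed1 += d1
        expected1 += d * n1 / n
        if n > 1:
            # Hypergeometric variance of d1 given the margins at time t.
            variance += d * (n1 / n) * (1 - n1 / n) * (n - d) / (n - 1)

    if variance == 0.0:
        return 0.0, 1.0
    chi2 = (observed1 - expected1) ** 2 / variance
    return chi2, math.erfc(math.sqrt(chi2 / 2.0))
```

Weighted variants such as the Peto-Peto test multiply each term of the sums by a weight that depends on the time point; that extension is not shown here.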