Track: Track 2

Adaptive Designs III

Chairs: Tobias Mütze and Martin Posch

Adaptive group sequential survival comparisons based on log-rank and pointwise test statistics
Jannik Feld, Andreas Faldum, Rene Schmidt
Institute of Biostatistics and Clinical Research, University of Münster

Whereas the theory of confirmatory adaptive designs is well understood for uncensored data, implementation of adaptive designs in the context of survival trials remains challenging. Commonly used adaptive survival tests are based on the independent increments structure of the log-rank statistic. These designs suffer from the limitation that effectively only the interim log-rank statistic may be used for design modifications (such as data-dependent sample size recalculation). Alternative approaches based on the patient-wise separation principle have the common disadvantage that the test procedure either neglects part of the observed survival data or tends to be conservative. Here, we instead propose an extension of the independent increments approach to adaptive survival tests. We present a confirmatory adaptive two-sample log-rank test of no difference in a survival analysis setting, where provision is made for interim decision making based on the interim log-rank statistic and/or pointwise survival rates, while avoiding the aforementioned problems. The possibility to include pointwise survival rates eases the clinical interpretation of interim decision making and is a straightforward choice for seamless phase II/III designs. We show by simulation studies that performance does not suffer when pointwise survival rates are used and, as an example, consider application of the methodology to a two-sample log-rank test with a binding futility criterion based on the observed short-term survival rates and sample size recalculation based on conditional power. The methodology is motivated by the LOGGIC Europe Trial from pediatric oncology. Distributional properties are derived using martingale techniques in the large sample limit. Small sample properties are studied by simulation.
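Sample size recalculation based on conditional power, as mentioned in the abstract, can be illustrated with a small sketch. It uses the standard Schoenfeld approximation for the log-rank statistic under 1:1 allocation; this is a generic textbook illustration, not the authors' procedure, and all parameter values below are hypothetical.

```python
import math
from statistics import NormalDist

_N = NormalDist()

def conditional_power(z_t, t, theta, d_total, alpha=0.025):
    """Conditional power of a one-sided level-alpha log-rank test, given the
    interim standardized statistic z_t at information fraction t, assuming a
    log hazard ratio theta and d_total events in total (Schoenfeld
    approximation, 1:1 allocation)."""
    b = z_t * math.sqrt(t)                   # B-value at information fraction t
    drift = theta * math.sqrt(d_total / 4.0) # expected final B-value under theta
    return 1.0 - _N.cdf((_N.inv_cdf(1 - alpha) - b - drift * (1 - t))
                        / math.sqrt(1 - t))

def recalculate_events(z_t, d_interim, d_planned, theta,
                       target=0.8, alpha=0.025, cap=4):
    """Smallest total event number (capped at cap * d_planned) whose
    conditional power reaches the target, given d_interim observed events."""
    for d in range(d_planned, cap * d_planned + 1):
        if conditional_power(z_t, d_interim / d, theta, d, alpha) >= target:
            return d
    return cap * d_planned

# hypothetical interim look: 100 of 200 planned events observed, z = 1.0
d_new = recalculate_events(z_t=1.0, d_interim=100, d_planned=200,
                           theta=math.log(1.5))
```

With these illustrative inputs the interim conditional power is around 60%, so the required event number is increased moderately above the planned 200.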

Single-stage, three-arm, adaptive test strategies for non-inferiority trials with an unstable reference
Werner Brannath, Martin Scharpenberg, Sylvia Schmidt
University of Bremen, Germany

For indications where only unstable reference treatments are available and use of placebo is ethically justified, three-arm "gold standard" designs with an experimental, reference and placebo arm are recommended for non-inferiority trials. In such designs, demonstration of efficacy of the reference or experimental treatment is a requirement. They have the disadvantage that little can be concluded from the trial if the reference fails to be efficacious. To overcome this, we investigate novel single-stage, adaptive test strategies where non-inferiority is tested only if the reference shows sufficient efficacy and otherwise delta-superiority of the experimental treatment over placebo is tested. With a properly chosen superiority margin, delta-superiority indirectly shows non-inferiority. We optimize the sample size for several decision rules and find that the natural, data-driven test strategy, which tests for non-inferiority if the reference's efficacy test is significant, leads to the smallest overall and placebo sample sizes. Under specific constraints on the sample sizes, this procedure controls the family-wise error rate. All optimal sample sizes are found to meet this constraint. We finally show how to account for a relevant placebo drop-out rate in an efficient way and apply the new test strategy to a real-life data set.

Sample size re-estimation based on the prevalence in a randomized test-treatment study
Amra Hot, Antonia Zapf
Institute of Medical Biometry and Epidemiology, University Medical Center Hamburg-Eppendorf, Hamburg

Patient benefit should be the primary criterion in evaluating diagnostic tests. If a new test has shown sufficient accuracy, its application in clinical practice should yield a patient benefit. Randomized test-treatment studies are needed to assess the clinical utility of a diagnostic test as part of a broader management regimen in which test-treatment strategies are compared in terms of their impact on patient-relevant outcomes [1]. Due to their increased complexity compared to common intervention trials, the implementation of such studies poses practical challenges which might affect the validity of the study. One important aspect is the sample size determination. It is a special feature of these designs that they combine information on the disease prevalence and the accuracy of the diagnostic tests, i.e. sensitivity and specificity of the investigated tests, with assumptions on the expected treatment effect. Due to the lack of empirical information or uncertainty regarding these parameters, sample size considerations will often rest on a rather weak foundation, potentially leading to an over- or underpowered trial. Therefore, it is reasonable to consider adaptations in earlier phases of the trial, based on a pre-specified interim analysis, in order to address this problem. A blinded sample size re-estimation based on the disease prevalence in a randomized test-treatment study was performed as part of a simulation study. The type I error, the empirical overall power, and the bias of the estimated prevalence are assessed and presented.


[1] J. G. Lijmer, P.M. Bossuyt. Diagnostic testing and prognosis: the randomized controlled trial in test evaluation research. In: The evidence base of clinical diagnosis. Blackwell Oxford, 2009, 63-82.
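A stylized sketch of how a blinded prevalence estimate can feed into a sample size re-estimation is given below. The dilution formula (effect only in patients whom the two tests classify differently) and all parameter values are illustrative assumptions for exposition, not the design actually used in the simulation study above.

```python
import math
from statistics import NormalDist

_N = NormalDist()

def n_per_arm(delta, sd, alpha=0.05, power=0.8):
    # standard two-sample normal approximation for a continuous outcome
    z = _N.inv_cdf(1 - alpha / 2) + _N.inv_cdf(power)
    return math.ceil(2 * (sd * z / delta) ** 2)

def discordance_rate(prev, se1, sp1, se2, sp2):
    # illustrative simplification: rate at which the two test-treatment
    # strategies lead to different management, driven by the difference
    # in sensitivities (diseased) and specificities (non-diseased)
    return prev * abs(se1 - se2) + (1 - prev) * abs(sp1 - sp2)

def reestimated_n(prev_blinded, delta_discordant, sd, se1, sp1, se2, sp2):
    # blinded (pooled) prevalence estimate dilutes the treatment effect
    # to the discordantly managed patients
    delta_overall = (discordance_rate(prev_blinded, se1, sp1, se2, sp2)
                     * delta_discordant)
    return n_per_arm(delta_overall, sd)
```

The key point of the sketch is that the re-estimated sample size reacts strongly to the prevalence observed at interim, which is exactly the parameter the blinded re-estimation targets.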

Performance evaluation of a new “diagnostic-efficacy-combination trial design” in the context of telemedical interventions
Mareen Pigorsch1, Martin Möckel2, Jan C. Wiemer3, Friedrich Köhler4, Geraldine Rauch1
1Charité – Universitätsmedizin Berlin, Institute of Biometry and clinical Epidemiology; 2Charité – Universitätsmedizin Berlin, Division of Emergency and Acute Medicine, Cardiovascular Process Research; 3Clinical Diagnostics, Thermo Fisher Scientific; 4Charité – Universitätsmedizin Berlin, Centre for Cardiovascular Telemedicine, Department of Cardiology and Angiology


Telemedical interventions in heart failure patients intend to avoid unfavourable, treatment-related events through early, individualized care that reacts to the patient's current needs. However, telemedical support is an expensive intervention, and only patients with a high risk of unfavourable follow-up events will profit from telemedical care. Möckel et al. therefore adapted a “diagnostic-efficacy-combination design” which allows validating a biomarker and investigating a biomarker-selected population within the same study. For this, cut-off values for the biomarkers were determined based on the observed outcomes in the control group to define a high-risk subgroup. This constitutes the diagnostic design step. These cut-offs were subsequently applied to the intervention and the control group to identify the high-risk subgroups. The intervention effect is then evaluated by comparison of these subgroups. This constitutes the efficacy design step. So far, it has not been evaluated whether this double use of the control group for biomarker validation and efficacy comparison leads to a bias in treatment effect estimation. In this methodological research work, we therefore want to evaluate whether the “diagnostic-efficacy-combination design” leads to biased treatment effect estimates. If there is a bias, we further want to analyse its impact and the parameters influencing its size.


We perform a systematic Monte-Carlo simulation study to investigate potential bias in various realistic trial scenarios that mimic and vary the true data of the published TIM-HF2 trial. In particular, we vary the event rates, the sample sizes and the biomarker distributions.


The results show that the proposed design does indeed lead to some bias in the effect estimators, indicating an overestimation of the effect, but this bias is relatively small in most scenarios. The larger the sample size, the more the event rates differ between the control and the intervention group, and the better the biomarker separates high-risk from low-risk patients, the smaller the resulting relative bias.


The “diagnostic-efficacy-combination design” can be recommended for clinical applications. We recommend ensuring a sufficiently large sample size.


Möckel M, Koehler K, Anker SD, Vollert J, Moeller V, Koehler M, Gehrig S, Wiemer JC, von Haehling S, Koehler F. Biomarker guidance allows a more personalized allocation of patients for remote patient management in heart failure: results from the TIM-HF2 trial. Eur J Heart Fail. 2019;21(11):1445-58.

Valid sample size re-estimation at interim
Nilufar Akbari
Charité – Institute of Biometry and Clinical Epidemiology, Germany

Throughout this work, we consider the situation of a two-arm controlled clinical trial based on time-to-event data.

The aim of this thesis is to fit a meaningful survival model in a robust way to the data observed at an interim analysis, in order to carry out a valid sample size recalculation.

Adaptive designs provide an attractive possibility of changing study design parameters in an ongoing trial. There are still many open questions with respect to adaptive designs for time-to-event data. Among other things, this is because survival data, unlike continuous or binary data, involve a follow-up phase, so that the outcome is not observed directly after a patient's treatment.

Evaluating survival data at interim analyses leads to patient overrun, since recruitment is usually not stopped at the same time. Another problem is that the interim analysis must take place during this recruitment phase in order to save patients. Moreover, the timing of the interim analysis is crucial for basing decisions upon a reasonable level of information.

A general issue with time-to-event data is that at an interim analysis one can only calculate the updated number of required events. Usually, however, one needs to determine the sample size required to achieve that number of events. For this, the underlying event-time distribution is needed, which may possibly be estimated from the interim data.

This, however, is a difficult task for the following reasons: the number of observed events at interim is limited, and the survival curve at interim is truncated by the interim time point.

The goal of this research work is to fit a reasonable survival model to the observed data in a robust way. The fitted curve has the following advantages: The underlying hazards per group can be estimated which allows updating the required number of patients for achieving the respective number of events. Finally, the impact of overrun can be directly assessed and quantified.
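Under the simplest parametric choice, an exponential event-time model, the interim estimation step described above reduces to a one-line maximum likelihood estimate, and the expected number of additional events follows directly. This is only a special-case sketch of the idea; the thesis aims at a more robust model fit.

```python
import numpy as np

def exp_hazard_mle(time_at_risk, event):
    """MLE of a constant hazard from right-censored interim data:
    number of observed events divided by total follow-up time."""
    return np.sum(event) / np.sum(time_at_risk)

def expected_additional_events(n_at_risk, lam, horizon):
    """Expected further events among patients still at risk if the trial
    continues for another `horizon` time units (exponential model,
    ignoring new accrual)."""
    return n_at_risk * (1.0 - np.exp(-lam * horizon))
```

Fitting the hazard per group this way allows updating the number of patients needed to reach the required event count, and quantifying the events accumulating in the overrun period.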

The following problems were additionally evaluated in detail. How much do the hazards deviate if the wrong event-time distribution was estimated? At which point in time is a sample size re-estimation useful, or rather how many events are required, for a valid sample size re-estimation at interim?

Multiple Testing

Chairs: Arne Bathke and Robert Kwiecien

Analysis and sample size calculation for a conditional survival model with a binary surrogate endpoint
Samuel Kilian, Johannes Krisam, Meinhard Kieser
Institute of Medical Biometry and Informatics; University Heidelberg; Heidelberg, Germany

The primary endpoint in oncology is usually overall survival, where differences between therapies may only be observable after many years. To avoid withholding a promising therapy, preliminary approval based on a surrogate endpoint is possible in certain situations (Wallach et al., 2018). The approval has to be confirmed later when overall survival can be assessed. When this is done within the same study, the correlation between the surrogate endpoint and overall survival has to be taken into account for sample size calculation and analysis. This relation can be modeled by means of the conditional survival model proposed by Xia et al. (2014). They investigated the correlation and assessed the power of the logrank test but did not develop methods for statistical testing, parameter estimation, and sample size calculation.

In this talk, a new statistical testing procedure based on the conditional model and Maximum Likelihood (ML) estimators for its parameters will be presented. An asymptotic test for survival difference will be given and an approximate sample size formula will be derived. Furthermore, an exact test for survival difference and an algorithm for exact sample size determination will be provided. Type I error rate, power, and required sample size for both newly developed tests will be determined exactly. Sample sizes will be compared to those required for the logrank test.

It will be shown that for small sample sizes the asymptotic parametric test and the logrank test exceed the nominal significance level under the conditional model. For a given sample size, the power of the asymptotic and the exact parametric test is similar, whereas the power of the logrank test is considerably lower in many situations. The other way round, the sample size needed to attain a prespecified power is comparable for the asymptotic and the exact parametric test, but considerably higher for the logrank test in many situations.

We conclude that the presented exact test performs very well under the assumptions of the conditional model and is a better choice than the asymptotic parametric test or the logrank test, respectively. Furthermore, the talk will give some insights in performing exact calculations for parametric survival time models. This provides a fast and powerful method to evaluate parametric tests for survival difference, thus facilitating the planning, conduct, and analysis of oncology trials with the option of accelerated approval.

The max-t Test in High-Dimensional Repeated Measures and Multivariate Designs
Frank Konietschke
Charite Berlin, Germany

Repeated measures (and multivariate) designs occur in a variety of different research areas. Hereby, the designs might be high-dimensional, i.e. more (possibly) dependent than independent replications of the trial are observed. In recent years, several global testing procedures (studentized quadratic forms) have been proposed for the analysis of such data. Testing global null hypotheses, however, usually does not answer the main question of practitioners, which is the specific localization of significant time points or group*time interactions. The use of max-t tests, on the contrary, can provide this important information. In this talk, we discuss its applicability in such designs. In particular, we approximate the distribution of the max-t test statistic using innovative resampling strategies. Extensive simulation studies show that the test is particularly suitable for the analysis of data sets with small sample sizes. A real data set illustrates the application of the method.
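One simple way to approximate the null distribution of a max-t statistic in a one-sample repeated measures setting is sign-flipping resampling. The sketch below is a generic variant for H0: E[X_j] = 0 at every time point j, not necessarily the resampling scheme proposed in the talk.

```python
import numpy as np

def max_t(x):
    # largest absolute studentized mean across time points (columns)
    n = x.shape[0]
    return np.max(np.abs(np.sqrt(n) * x.mean(axis=0) / x.std(axis=0, ddof=1)))

def max_t_test(x, n_resamples=2000, seed=0):
    """Sign-flipping approximation of the max-t null distribution:
    multiply each subject's centered profile by a random sign and
    recompute the statistic."""
    rng = np.random.default_rng(seed)
    centered = x - x.mean(axis=0)
    t_obs = max_t(x)
    signs = rng.choice([-1.0, 1.0], size=(n_resamples, x.shape[0], 1))
    null = np.array([max_t(s * centered) for s in signs])
    p = (1 + np.sum(null >= t_obs)) / (1 + n_resamples)
    return t_obs, p
```

Comparing each individual |t_j| against the resampled max-t quantile then flags the specific time points driving the rejection, which is the localization property the abstract emphasizes.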

Graphical approaches for the control of generalized error rates
Frank Bretz1, David Robertson2, James Wason3
1Novartis, Switzerland; 2University of Cambridge, UK; 3Newcastle University, UK

When simultaneously testing multiple hypotheses, the usual approach in the context of confirmatory clinical trials is to control the familywise error rate (FWER), which bounds the probability of making at least one false rejection. In many trial settings, these hypotheses will additionally have a hierarchical structure that reflects the relative importance and links between different clinical objectives. The graphical approach of Bretz et al (2009) is a flexible and easily communicable way of controlling the FWER while respecting complex trial objectives and multiple structured hypotheses. However, the FWER can be a very stringent criterion that leads to procedures with low power, and may not be appropriate in exploratory trial settings. This motivates controlling generalized error rates, particularly when the number of hypotheses tested is no longer small. We consider the generalized familywise error rate (k-FWER), which is the probability of making k or more false rejections, as well as the tail probability of the false discovery proportion (FDP), which is the probability that the proportion of false rejections is greater than some threshold. We also consider asymptotic control of the false discovery rate, which is the expectation of the FDP. In this presentation, we show how to control these generalized error rates when using the graphical approach and its extensions. We demonstrate the utility of the resulting graphical procedures on clinical trial case studies.
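For the k-FWER mentioned above, the simplest non-graphical benchmark is the generalized Bonferroni procedure of Lehmann and Romano (2005), which rejects H_i whenever p_i ≤ k·α/m. A small Monte Carlo check of its error control under the global null:

```python
import numpy as np

def kfwer_bonferroni(pvals, k=2, alpha=0.05):
    # generalized Bonferroni: controls P(k or more false rejections) <= alpha
    return pvals <= k * alpha / len(pvals)

# Monte Carlo check under the global null (all m hypotheses true)
rng = np.random.default_rng(0)
m, k, alpha, n_sim = 100, 2, 0.05, 5000
at_least_k = 0
for _ in range(n_sim):
    p = rng.uniform(size=m)
    at_least_k += np.sum(kfwer_bonferroni(p, k, alpha)) >= k
kfwer_hat = at_least_k / n_sim  # empirical k-FWER, should stay below alpha
```

The empirical k-FWER here is conservative (well below α), which is exactly the slack that the graphical extensions discussed in the talk aim to exploit.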

Statistical Inference for Diagnostic Test Accuracy Studies with Multiple Comparisons
Max Westphal1, Antonia Zapf2
1Fraunhofer Institute for Digital Medicine MEVIS, Bremen, Germany; 2Institute of Medical Biometry and Epidemiology, UKE Hamburg, Hamburg, Germany

Diagnostic accuracy studies are usually designed to assess the sensitivity and specificity of an index test in relation to a reference standard or established comparative test. This so-called co-primary endpoint analysis has recently been extended to the case that multiple index tests are investigated [1]. Such a design is relevant in modern applications where many different (machine-learned) classification rules based on high dimensional data are considered initially as the final model selection can (partially) be based on data from the diagnostic accuracy study.

In this talk, we motivate the corresponding hypothesis testing problem and propose different multiple test procedures for it. Besides classical parametric corrections (Bonferroni, maxT) we also consider bootstrap approaches and a Bayesian procedure. We will present early findings from a simulation study comparing the (family-wise) error rate and power of all procedures.

A general observation from the simulation study is the wide variability of rejection rates under different (realistic and least-favorable) parameter configurations. We discuss these findings and possible future extensions of our numerical experiments. All methods have been implemented in a new R package which will also be introduced briefly.


1. Westphal, Max, Antonia Zapf, and Werner Brannath. "A multiple testing framework for diagnostic accuracy studies with co-primary endpoints." arXiv preprint arXiv:1911.02982 (2019).

IBS-DR General Assembly (Mitgliederversammlung)

Chair: IBS-DR Executive Board

Agenda of the 2021 General Assembly

Item 1 Approval of the agenda (Brannath)
Item 2 Approval of the minutes of the General Assembly of 09.09.2020 (Scharpenberg)
Item 3 Report of the President (Brannath)
Item 4 Young researcher awards (Brannath)
Item 5 Reports from the international committees (Bretz, Ickstadt, Kieser, Kübler, Pigeot, Ziegler)
Item 6 Report of the Secretary (Scharpenberg)
Item 7 Report from the administrative office (Scharpenberg)
Item 8 Report of the Treasurer (Knapp)
Item 9 Report of the auditors (Dierig, Tuğ)
Item 10 Resolutions on reserves and membership fees 2022 (Knapp)
Item 11 Reports from the working groups (Asendorf)
Item 12 Summer schools, continuing education (Brannath)
Item 13 Future colloquia (Brannath)
Item 14 Biometrical Journal (Bathke, Schmid)
Item 15 Report of the election officer on the advisory board election (Gerß)
Item 16 Miscellaneous (Brannath)

Real World Evidence

Chairs: Sigrid Behr and Irene Schmidtmann

RCT versus RWE: Good versus evil or yin and yang?
Almut Winterstein
University of Florida, USA

Clinicians, researchers and policy makers have been raised in a paradigm that places randomized clinical trials at the top of a hierarchy of evidence, or that dichotomizes study designs into randomized, which is equated with valid, and non-randomized, which is equated with invalid or highly dubious. Major efforts to enhance drug safety research infrastructure have shifted our acceptance of observational designs, especially in instances where the adverse event is not anticipated and is unrelated to a drug's indication, resulting in limited confounding. Other instances where evidence from non-randomized studies is accepted include situations where randomization is not feasible. The most recent evolution of real-world evidence as a main source of evidence for approval of new molecular entities or indications further challenges our historic understanding of the hierarchy of evidence and the scientific method.

Through randomization and blinding, comparison groups are largely balanced on both measured and unmeasured factors if the trial has a sufficient sample size. Protocol-based outcome ascertainment ensures unbiased, structured assessments regardless of exposure status or baseline characteristics. Used jointly, these features allow RCTs to mitigate both selection and measurement biases and to support causal inferences. However, besides the escalating cost of RCTs and other feasibility issues, various problems arise that require supplemental methodological approaches to inform regulatory and clinical decision-making, including poor generalizability resulting in inductive fallacy; limited ability to explore effect modification; and significant delays in evidence generation.

Legislative action to address some of these shortcomings was formalized in the United States in the 21st Century Cures Act from 2016, which is designed to help accelerate medical product development. One central component is the concept of real-world evidence, i.e., evidence about the safety and effectiveness of medications derived from real-world data. Importantly, the Cures Act formalizes the concept that valid and actionable evidence can be derived from non-experimental settings using observational study designs and advanced analytic methods. In this presentation we aim to illustrate that dichotomous approaches that contrast RCTs and RWE are limited in their understanding of the full range of methodological challenges in making causal inferences and then generalizing such inferences for real-world decision-making. Those challenges are discussed across the spectrum of traditional RCTs, pragmatic RCTs that rely on RWD or hybrid designs, and observational studies that rely on RWD. The presentation will end with specific challenges for RWE research in the era of increasing data availability and artificial intelligence.

Diagnostic accuracy of claims data from 70 million people in the German statutory health insurance: Type 2 diabetes in men
Ralph Brinks1,2,3, Thaddaeus Toennies1, Annika Hoyer2
1Deutsches Diabetes-Zentrum, Germany; 2Department of Statistics, Ludwig-Maximilians-University Munich; 3Department of Rheumatology, University Hospital Duesseldorf

During estimation of excess mortality in people with type 2 diabetes in Germany based on aggregated claims data from about 70 million people in the statutory health insurance, we experienced and reported problems in the age groups below 60 years of age [1]. We hypothesized that diagnostic accuracy (sensitivity and specificity) might be the reason for those problems [1].

In the first part of this work, we ran a simulation study to assess the impact of the diagnostic accuracy on the estimation of excess mortality. It turns out that the specificity in the younger age groups has the greatest effect on the estimate in terms of bias of the excess mortality while the sensitivity has a much lower impact.

In the second part, we apply these findings to estimate the diagnostic accuracy of type 2 diabetes in men aged 20-90 based on the approach and data from [1]. We find that, irrespective of the sensitivity, the false positive ratio (FPR) increases linearly from 0.5 to 2 per mil from age 20 to 50. At ages 50 to 70, the FPR is likely to drop to 0.5 per mil, followed by a steep linear increase to 5 per mil at age 90.

Our examination demonstrates the crucial impact of diagnostic accuracy on estimates based on secondary data. While for other epidemiological measures sensitivity might be more important, estimation of excess mortality crucially depends on the specificity of the data. We use this fact to estimate the age-specific FPR of diagnoses of type 2 diabetes in aggregated claims data.


[1] Brinks R, Tönnies T, Hoyer A (2020) DOI 10.1186/s13104-020-05046-w

Coronary artery calcification in the middle-aged and elderly population of Denmark
Oke Gerke1,2, Jes Sanddal Lindholt1,3, Barzan Haj Abdo1, Axel Cosmus Pyndt Diederichsen1,4
1Dept. of Clinical Research, University of Southern Denmark, DK; 2Dept. of Nuclear Medicine, Odense University Hospital, DK; 3Dept. of Cardiothoracic and Vascular Surgery, Odense University Hospital, DK; 4Dept. of Cardiology, Odense University Hospital, DK

Aims: Coronary artery calcification (CAC) measured on cardiac CT is an important risk marker for cardiovascular disease (CVD) and has been included in the prevention guidelines. The aim of this study was to describe CAC score reference values and to develop a freely available CAC calculator for the middle-aged and elderly population. This work updates two previously published landmark studies on CAC score reference values, the American MESA study and the German HNR study [1,2]. Differences in curve derivation compared to a recently published pooled analysis are discussed [3].

Methods: 17,252 participants from two population-based cardiac CT screening cohorts (DanRisk and DANCAVAS) were included [4,5]. The CAC score was measured as part of a screening session. Positive CAC scores were log-transformed and nonparametrically regressed on age for each gender, and percentile curves were transposed according to the proportions of zero CAC scores.
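The percentile construction in the Methods can be sketched as follows: estimate, locally in age, the proportion of zero scores, then take the zero-adjusted quantile of the log-transformed positive scores. The Gaussian kernel smoothing and bandwidth below are illustrative choices; the study's exact nonparametric regression may differ.

```python
import numpy as np

def cac_percentile(age, cac, a0, q, bw=5.0):
    """q-th CAC percentile at age a0 for one gender: kernel-weighted share
    of zero scores, then a weighted quantile of log(CAC > 0) 'transposed'
    by that zero mass."""
    w = np.exp(-0.5 * ((age - a0) / bw) ** 2)  # Gaussian kernel weights
    p0 = np.sum(w * (cac == 0)) / np.sum(w)    # local proportion of CAC == 0
    if q <= p0:
        return 0.0                             # percentile lies in the zero mass
    q_pos = (q - p0) / (1.0 - p0)              # quantile among positive scores
    pos = cac > 0
    log_cac, w_pos = np.log(cac[pos]), w[pos]
    order = np.argsort(log_cac)
    cum = np.cumsum(w_pos[order]) / np.sum(w_pos)
    idx = int(np.searchsorted(cum, q_pos))
    return float(np.exp(log_cac[order][min(idx, len(order) - 1)]))
```

Evaluating this function on a grid of ages yields the percentile curves (e.g. 25th, 50th, 75th, 90th) that the online calculator displays.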

Results: Men had higher CAC scores than women, and the prevalence and extent of CAC increased steadily with age. An online CAC calculator was developed. After entering sex, age and CAC score, the CAC score percentile and the coronary age are depicted, including a figure with the specific CAC score and the 25th, 50th, 75th and 90th percentiles. The specific CAC score can be compared to the entire background population or only to those without prior CVD.

Conclusion: This study provides modern population-based reference values of CAC scores in men and women, and a freely accessible online CAC calculator. Physicians and patients are very familiar with blood pressure and lipids, but unfamiliar with CAC scores. The calculator makes it easy to see whether a CAC value is low, moderate or high when a physician communicates and discusses a CAC score with a patient.


[1] Schmermund A et al. Population-based assessment of subclinical coronary atherosclerosis using electron-beam computed tomography. Atherosclerosis 2006;185(1):177-182.

[2] McClelland RL et al. Distribution of coronary artery calcium by race, gender, and age: results from the Multi-Ethnic Study of Atherosclerosis (MESA). Circulation 2006;113(1):30-37.

[3] de Ronde MWJ et al. A pooled-analysis of age and sex based coronary artery calcium scores percentiles. J Cardiovasc Comput Tomogr. 2020;14(5):414-420.

[4] Diederichsen AC et al. Discrepancy between coronary artery calcium score and HeartScore in middle-aged Danes: the DanRisk study. Eur J Prev Cardiol 2012;19(3):558-564.

[5] Diederichsen AC et al. The Danish Cardiovascular Screening Trial (DANCAVAS): study protocol for a randomized controlled trial. Trials 2015;16:554.