Multiple Testing

Chairs: Arne Bathke and Robert Kwiecien

Analysis and sample size calculation for a conditional survival model with a binary surrogate endpoint
Samuel Kilian, Johannes Krisam, Meinhard Kieser
Institute of Medical Biometry and Informatics, University of Heidelberg, Heidelberg, Germany

The primary endpoint in oncology is usually overall survival, where differences between therapies may only be observable after many years. To avoid withholding a promising therapy, preliminary approval based on a surrogate endpoint is possible in certain situations (Wallach et al., 2018). The approval has to be confirmed later when overall survival can be assessed. When this is done within the same study, the correlation between the surrogate endpoint and overall survival has to be taken into account for sample size calculation and analysis. This relation can be modeled by means of a conditional survival model, which was proposed by Xia et al. (2014). They investigated the correlation and assessed the power of the logrank test but did not develop methods for statistical testing, parameter estimation, and sample size calculation.

In this talk, a new statistical testing procedure based on the conditional model and Maximum Likelihood (ML) estimators for its parameters will be presented. An asymptotic test for survival difference will be given and an approximate sample size formula will be derived. Furthermore, an exact test for survival difference and an algorithm for exact sample size determination will be provided. Type I error rate, power, and required sample size for both newly developed tests will be determined exactly. Sample sizes will be compared to those required for the logrank test.

It will be shown that for small sample sizes the asymptotic parametric test and the logrank test exceed the nominal significance level under the conditional model. For a given sample size, the power of the asymptotic and the exact parametric test is similar, whereas the power of the logrank test is considerably lower in many situations. Conversely, the sample size needed to attain a prespecified power is comparable for the asymptotic and the exact parametric test, but considerably higher for the logrank test in many situations.

We conclude that the presented exact test performs very well under the assumptions of the conditional model and is a better choice than either the asymptotic parametric test or the logrank test. Furthermore, the talk will give some insights into performing exact calculations for parametric survival time models. This provides a fast and powerful method to evaluate parametric tests for survival difference, thus facilitating the planning, conduct, and analysis of oncology trials with the option of accelerated approval.
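For reference, the logrank comparator against which the new parametric tests are benchmarked can be sketched in a few lines. The following is an illustrative stand-alone implementation of the standard two-sample logrank statistic, not the authors' conditional-model test; function and variable names are our own.

```python
import numpy as np

def logrank_statistic(time_a, event_a, time_b, event_b):
    """Two-sample logrank test statistic (chi-square distributed, 1 df).

    time_*: event/censoring times; event_*: 1 = event observed, 0 = censored.
    """
    times = np.concatenate([time_a, time_b])
    events = np.concatenate([event_a, event_b])
    group = np.concatenate([np.zeros(len(time_a)), np.ones(len(time_b))])

    o_minus_e = 0.0   # observed minus expected events in group A
    var = 0.0         # hypergeometric variance accumulated over event times
    for t in np.unique(times[events == 1]):
        at_risk = times >= t
        n = at_risk.sum()                       # total number at risk at t
        n_a = (at_risk & (group == 0)).sum()    # at risk in group A
        d = ((times == t) & (events == 1)).sum()              # events at t
        d_a = ((times == t) & (events == 1) & (group == 0)).sum()
        o_minus_e += d_a - d * n_a / n
        if n > 1:
            var += d * (n_a / n) * (1 - n_a / n) * (n - d) / (n - 1)
    return o_minus_e**2 / var  # compare to the chi2(1) critical value, 3.84
```

With two well-separated samples the statistic exceeds the 5% critical value, reproducing the reject decision that the exact and asymptotic parametric tests are compared against.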

The max-t Test in High-Dimensional Repeated Measures and Multivariate Designs
Frank Konietschke
Charité Berlin, Germany

Repeated measures (and multivariate) designs occur in a variety of different research areas. These designs might be high-dimensional, i.e. more (possibly dependent) replications than independent subjects are observed. In recent years, several global testing procedures (studentized quadratic forms) have been proposed for the analysis of such data. Testing global null hypotheses, however, usually does not answer the main question of practitioners, which is the specific localization of significant time points or group-by-time interactions. The use of max-t tests, on the contrary, can provide this important information. In this talk, we discuss their applicability in such designs. In particular, we approximate the distribution of the max-t test statistic using innovative resampling strategies. Extensive simulation studies show that the test is particularly suitable for the analysis of data sets with small sample sizes. A real data set illustrates the application of the method.
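The resampling idea can be illustrated with a minimal sketch. This is not the authors' specific procedure: a plain nonparametric bootstrap of row-centered data is shown, which approximates the null distribution of the max-t statistic while preserving the dependence across time points; all names and the centering scheme are illustrative assumptions.

```python
import numpy as np

def max_t(X):
    """Max absolute one-sample t-statistic across the d (dependent) coordinates."""
    n = X.shape[0]
    t = X.mean(axis=0) / (X.std(axis=0, ddof=1) / np.sqrt(n))
    return np.abs(t).max()

def max_t_test(X, B=2000, alpha=0.05, rng=None):
    """Bootstrap approximation of the max-t null distribution.

    Rows of X are independent subjects; columns are (possibly many)
    dependent repeated measures. H0: all coordinate means are zero.
    Resampling the row-centered data mimics the null while keeping
    the dependence structure across time points intact.
    """
    rng = np.random.default_rng() if rng is None else rng
    n = X.shape[0]
    Xc = X - X.mean(axis=0)                 # enforce H0 in the bootstrap world
    stats = np.empty(B)
    for b in range(B):
        idx = rng.integers(0, n, size=n)    # resample subjects with replacement
        stats[b] = max_t(Xc[idx])
    crit = np.quantile(stats, 1 - alpha)
    observed = max_t(X)
    # global rejection if observed > crit; localization comes for free:
    # every coordinate whose |t| exceeds crit is flagged individually
    return observed, crit, observed > crit
```

The localization property mentioned in the abstract is visible in the last comment: the same bootstrap critical value serves both the global test and the coordinate-wise (time-point-wise) decisions.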

Graphical approaches for the control of generalized error rates
Frank Bretz1, David Robertson2, James Wason3
1Novartis, Switzerland; 2University of Cambridge, UK; 3Newcastle University, UK

When simultaneously testing multiple hypotheses, the usual approach in the context of confirmatory clinical trials is to control the familywise error rate (FWER), which bounds the probability of making at least one false rejection. In many trial settings, these hypotheses will additionally have a hierarchical structure that reflects the relative importance and links between different clinical objectives. The graphical approach of Bretz et al. (2009) is a flexible and easily communicable way of controlling the FWER while respecting complex trial objectives and multiple structured hypotheses. However, the FWER can be a very stringent criterion that leads to procedures with low power, and may not be appropriate in exploratory trial settings. This motivates controlling generalized error rates, particularly when the number of hypotheses tested is no longer small. We consider the generalized familywise error rate (k-FWER), which is the probability of making k or more false rejections, as well as the tail probability of the false discovery proportion (FDP), which is the probability that the proportion of false rejections is greater than some threshold. We also consider asymptotic control of the false discovery rate, which is the expectation of the FDP. In this presentation, we show how to control these generalized error rates when using the graphical approach and its extensions. We demonstrate the utility of the resulting graphical procedures on clinical trial case studies.
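The two generalized error rates can be made concrete with a small Monte Carlo sketch. The configuration below is entirely hypothetical, and unadjusted one-sided tests are used only to illustrate the definitions of the k-FWER and the FDP tail probability; this is not one of the graphical procedures discussed in the talk.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)

m, m0 = 20, 15            # hypotheses in total, of which m0 are true nulls
k, gamma, alpha = 3, 0.2, 0.05
reps = 5000

kfwer = fdp_tail = 0
for _ in range(reps):
    z = rng.standard_normal(m)
    z[m0:] += 3.0                        # non-nulls receive a mean shift
    reject = z > norm.ppf(1 - alpha)     # unadjusted one-sided z-tests
    n_false = reject[:m0].sum()          # false rejections (first m0 are null)
    n_rej = reject.sum()
    kfwer += n_false >= k                                   # k-FWER event
    fdp_tail += n_rej > 0 and n_false / n_rej > gamma       # FDP tail event

print("k-FWER estimate:", kfwer / reps)
print("P(FDP >", gamma, ") estimate:", fdp_tail / reps)
```

Even without any multiplicity adjustment, the estimated k-FWER here sits well below the per-test level alpha, which is the basic reason k-FWER control admits more powerful procedures than classical FWER control.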

Statistical Inference for Diagnostic Test Accuracy Studies with Multiple Comparisons
Max Westphal1, Antonia Zapf2
1Fraunhofer Institute for Digital Medicine MEVIS, Bremen, Germany; 2Institute of Medical Biometry and Epidemiology, UKE Hamburg, Hamburg, Germany

Diagnostic accuracy studies are usually designed to assess the sensitivity and specificity of an index test in relation to a reference standard or established comparative test. This so-called co-primary endpoint analysis has recently been extended to the case where multiple index tests are investigated [1]. Such a design is relevant in modern applications where many different (machine-learned) classification rules based on high-dimensional data are initially considered, as the final model selection can be based (partially) on data from the diagnostic accuracy study.

In this talk, we motivate the corresponding hypothesis testing problem and propose different multiple test procedures for it. Besides classical parametric corrections (Bonferroni, maxT), we also consider bootstrap approaches and a Bayesian procedure. We will present early findings from a simulation study comparing the (family-wise) error rate and power of all procedures.

A general observation from the simulation study is the wide variability of rejection rates across different (realistic and least-favorable) parameter configurations. We discuss these findings and possible future extensions of our numerical experiments. All methods have been implemented in a new R package, which will also be introduced briefly.
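As a toy illustration of the multiplicity problem, not the authors' implementation or their R package, the following sketch applies the simplest of the corrections named above, Bonferroni, across several hypothetical index tests, with the co-primary requirement that both sensitivity and specificity exceed prespecified benchmarks. All model names, counts, and benchmarks are invented for the example.

```python
from scipy.stats import binomtest

sens0, spec0, alpha = 0.70, 0.80, 0.05
# Hypothetical counts per candidate index test, all evaluated on the
# same subjects: (true positives, diseased n), (true negatives, healthy n)
results = {
    "model_A": ((86, 100), (175, 200)),
    "model_B": ((83, 100), (168, 200)),
    "model_C": ((74, 100), (166, 200)),
}

adj_alpha = alpha / len(results)   # Bonferroni across the index tests
success = {}
for name, ((tp, n1), (tn, n0)) in results.items():
    p_sens = binomtest(tp, n1, sens0, alternative="greater").pvalue
    p_spec = binomtest(tn, n0, spec0, alternative="greater").pvalue
    # co-primary endpoint: BOTH sensitivity and specificity must clear
    # their benchmark, so the larger of the two p-values decides
    success[name] = max(p_sens, p_spec) < adj_alpha
    print(f"{name}: p_sens={p_sens:.4f}  p_spec={p_spec:.4f}  "
          f"success={success[name]}")
```

Because all index tests are scored on the same subjects, their test statistics are positively correlated; Bonferroni ignores this and is therefore conservative, which is precisely what motivates the maxT and bootstrap alternatives studied in the talk.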


[1] Westphal, M., Zapf, A., and Brannath, W. (2019). A multiple testing framework for diagnostic accuracy studies with co-primary endpoints. arXiv preprint arXiv:1911.02982.