Opportunities and limits of optimal group-sequential designs
Maximilian Pilz1, Carolin Herrmann2,3, Geraldine Rauch2,3, Meinhard Kieser1
1Institute of Medical Biometry and Informatics – University of Heidelberg, Germany; 2Institute of Biometry and Clinical Epidemiology – Charité University Medicine Berlin, Germany; 3Berlin Institute of Health
Multi-stage designs for clinical trials are becoming increasingly popular. There are two main reasons for this development. The first is the flexibility to modify the study design during the ongoing trial. This possibility is highly beneficial for avoiding the failure of trials whose planning assumptions were badly wrong. However, an unplanned mid-course design modification can also be performed in a clinical trial that was initially planned without adaptive elements, as long as the conditional error principle is applied. The second reason for the popularity of adaptive designs is the performance improvement that arises from applying a multi-stage design. For instance, an adaptive two-stage design can substantially reduce the expected sample size of a trial compared to a single-stage design.
With regard to this performance reason, a two-stage design can be entirely pre-specified before the trial starts. While this still leaves open the option to modify the design, full pre-specification is preferred by regulatory authorities. Recent work addresses optimal adaptive designs. While these show the best possible performance, it may be difficult to communicate them to a practitioner and to outline them in a study protocol.
To overcome this problem, simpler optimal group-sequential designs may be an option worth considering. These consist of only two sample sizes (stage one and stage two) and three critical values (early futility, early efficacy, final analysis). Thus, they can easily be described and communicated.
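To make the five design parameters concrete, the following minimal sketch computes the expected sample size of a two-stage group-sequential design. It assumes a two-arm z-test with known unit variance and per-group sample sizes; the class name, function names, and numerical values are purely illustrative and are not optimized designs from the talk.

```python
from dataclasses import dataclass
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal distribution


@dataclass(frozen=True)
class GroupSequentialDesign:
    """Hypothetical container for the five design parameters."""
    n1: float        # per-group sample size of stage one
    n2: float        # additional per-group sample size of stage two
    futility: float  # stop for futility if the stage-one statistic is <= futility
    efficacy: float  # stop for efficacy if the stage-one statistic is >= efficacy
    final: float     # critical value of the final analysis (not used below)


def continuation_probability(design, effect):
    """Probability of proceeding to stage two for a standardized effect."""
    # Assumption: the stage-one z-statistic is N(effect * sqrt(n1 / 2), 1).
    drift = effect * (design.n1 / 2) ** 0.5
    return (STD_NORMAL.cdf(design.efficacy - drift)
            - STD_NORMAL.cdf(design.futility - drift))


def expected_sample_size(design, effect):
    """Expected per-group sample size: stage one plus stage two if continued."""
    return design.n1 + design.n2 * continuation_probability(design, effect)


# Illustrative values only, not the result of an optimization.
design = GroupSequentialDesign(n1=50, n2=50, futility=0.0, efficacy=2.5, final=2.0)
print(expected_sample_size(design, effect=0.0))  # under the null hypothesis
print(expected_sample_size(design, effect=0.4))  # under an assumed alternative
```

An optimal design in this setting would minimize such an expected sample size over the five parameters subject to type I error and power constraints; that optimization step is not shown here.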
In this talk, we present a variety of examples to investigate whether optimal group-sequential designs are a valid approximation of optimal adaptive designs. We elaborate on design properties that can be fulfilled by optimal group-sequential designs without considerable performance deterioration and describe situations where an optimal adaptive design may be more appropriate. Furthermore, we give recommendations on how to specify an optimal two-stage design in the study protocol in order to motivate the application of such designs in clinical trials.
Group Sequential Methods for Nonparametric Relative Effects
Claus Peter Nowak1, Tobias Mütze2, Frank Konietschke1
1Charité – Universitätsmedizin Berlin, Germany; 2Novartis Pharma AG, Basel, Switzerland
Late-phase clinical trials are occasionally planned with one or more interim analyses to allow for early termination or adaptation of the study. While extensive theory and software have been developed for normal, binary and survival endpoints, there has been comparatively little discussion in the group sequential literature on nonparametric methods outside the time-to-event setting. Focussing on the comparison of two parallel treatment arms, we show that the Wilcoxon-Mann-Whitney test, the Brunner-Munzel test, as well as a test procedure based on the log win odds, a modification of the win ratio, asymptotically follow the canonical joint distribution. Consequently, standard group sequential theory can be applied to plan, analyse and adapt clinical trials based on nonparametric efficacy measures. In addition, simulation studies examining type I error rates confirm the adequacy of the proposed methods for a range of scenarios. Lastly, we apply our methodology to the FREEDOMS clinical trial (ClinicalTrials.gov Identifier: NCT00289978), analysing relapse in patients with relapsing-remitting multiple sclerosis.
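As a small illustration of the nonparametric effect measures mentioned above, the sketch below estimates the relative effect theta = P(X < Y) + 0.5 P(X = Y) from two samples and derives the log win odds from it. It assumes the win odds is defined as theta / (1 - theta); the function names are hypothetical, and the variance estimation and group sequential boundaries needed for an actual test are not shown.

```python
import math


def relative_effect(x, y):
    """Estimate theta = P(X < Y) + 0.5 * P(X = Y) by comparing all pairs."""
    score = sum((xi < yj) + 0.5 * (xi == yj) for xi in x for yj in y)
    return score / (len(x) * len(y))


def log_win_odds(x, y):
    """Log win odds, assuming win odds = theta / (1 - theta).

    Undefined when the samples are completely separated (theta is 0 or 1).
    """
    theta = relative_effect(x, y)
    if theta in (0.0, 1.0):
        raise ValueError("log win odds is undefined for theta equal to 0 or 1")
    return math.log(theta / (1 - theta))


control = [1, 2, 3]
treatment = [2, 3, 4]
print(relative_effect(control, treatment))
print(log_win_odds(control, treatment))
```

Ties contribute one half to the estimate, which is what allows the measure to be used for ordinal or heavily tied outcomes.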
Optimal futility stops in two-stage group-sequential gold-standard designs
Jan Meis, Maximilian Pilz, Meinhard Kieser
Institute of Medical Biometry and Informatics, University of Heidelberg, Heidelberg, Germany
A common critique of non-inferiority trials comparing an experimental treatment to an active control is that they may lack assay sensitivity. This denotes the ability of a trial to distinguish an effective treatment from an ineffective one. The 'gold-standard' non-inferiority trial design circumvents this concern by comparing three groups in a hierarchical testing procedure. First, the experimental treatment is compared to a placebo group in an effort to show superiority. Only if this succeeds is the experimental treatment tested for non-inferiority against an active control group. Ethical and practical considerations require sample sizes of clinical trials to be as large as necessary, but as small as possible. These considerations become especially pressing in the gold-standard design, as patients are exposed to placebo doses while the control treatment is already known to be effective.
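The hierarchical procedure described above can be sketched for a single-stage setting as follows. This is a simplified illustration that assumes both hypotheses are tested with one-sided z-statistics at the same level alpha; the function name and the returned labels are hypothetical, and the group-sequential boundaries discussed in this talk are not included.

```python
from statistics import NormalDist


def gold_standard_decision(z_superiority, z_noninferiority, alpha=0.025):
    """Hierarchical test: superiority to placebo first, then non-inferiority.

    Because non-inferiority is only tested after superiority has been shown,
    both tests can use the full one-sided level alpha.
    """
    critical_value = NormalDist().inv_cdf(1 - alpha)
    if z_superiority <= critical_value:
        # Step one failed: non-inferiority is not tested at all.
        return "no claim"
    if z_noninferiority <= critical_value:
        return "superior to placebo only"
    return "superior to placebo and non-inferior to control"


print(gold_standard_decision(z_superiority=3.1, z_noninferiority=2.4))
```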
Group sequential trial designs are known to reduce the expected sample size under the alternative hypothesis. In their pioneering work, Schlömer and Brannath (2013) show that the gold-standard design is no exception to this rule. In their paper, they calculate approximately optimal rejection boundaries for the gold-standard design given sample size allocation ratios of the optimal single-stage design. We extend their work by relaxing the constraints put on the group allocation ratios and allowing for futility stops at interim. The futility boundaries and the sample size allocation ratios will be considered as optimization parameters, together with the efficacy boundaries. This allows the investigation of the efficiency gain by including the option to stop for futility. Allowing discontinuation of a trial when faced with underwhelming results at an interim analysis has very practical implications in saving resources and sparing patients from being exposed to ineffective treatment. In the gold-standard design, these considerations are especially pronounced. There is a large incentive to prevent further patients from being exposed to placebo treatment when interim results suggest that a confirmatory result in the final analysis is unlikely.
Besides the extended design options, we analyse different choices of optimality criteria. The above considerations suggest that the null hypothesis also plays an important role in the judgement of the gold-standard design. Therefore, optimality criteria that incorporate the design performance under the alternative and the null hypothesis are introduced. The results of our numerical optimization procedure for this extended design will be discussed and compared to the findings of Schlömer and Brannath.
Blinded sample size re-estimation in a paired diagnostic study
Maria Stark, Antonia Zapf
University Medical Center Hamburg-Eppendorf, Germany
In a paired confirmatory diagnostic accuracy study, a new experimental test is compared within the same patients to an already existing comparator test. The gold standard defines the true disease status. Hence, each patient undergoes three diagnostic procedures. If feasible and ethically acceptable, regulatory agencies prefer this study design to an unpaired design (CHMP, 2009). The initial sample size calculation is based on assumptions about, among others, the prevalence of the disease and the proportion of discordant test results between the experimental and the comparator test (Miettinen, 1968).
To adjust these assumptions during the study period, an adaptive design for a paired confirmatory diagnostic accuracy study is introduced. This adaptive design is used to re-estimate the prevalence and the proportion of discordant test results to finally re-calculate the sample size. It is a blinded adaptive design as the sensitivity and the specificity of the experimental and comparator test are not re-estimated. Due to the blinding, the type I error rates are not inflated.
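A blinded re-estimation step of this kind might look as follows. The sketch assumes a McNemar-type approximation for the required number of diseased patients, n = (z_{1-alpha/2} sqrt(psi) + z_{1-beta} sqrt(psi - delta^2))^2 / delta^2, where psi is the proportion of discordant results and delta the assumed difference in sensitivities. This formula, the restriction to sensitivity, and all function names are assumptions made for illustration and are not necessarily the method used in the talk.

```python
import math
from statistics import NormalDist


def required_diseased(discordant, delta, alpha=0.05, power=0.8):
    """Diseased patients needed to compare two sensitivities in a paired design.

    Assumed McNemar-type approximation; delta stays at its planning value.
    """
    if not 0 < abs(delta) <= discordant <= 1:
        raise ValueError("need 0 < |delta| <= discordant <= 1")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    n = (z_alpha * math.sqrt(discordant)
         + z_beta * math.sqrt(discordant - delta ** 2)) ** 2 / delta ** 2
    return math.ceil(n)


def reestimated_total(n_interim, n_diseased, n_discordant, delta, **kwargs):
    """Re-calculate the total sample size from blinded interim estimates.

    Only the prevalence and the discordance proportion are re-estimated;
    sensitivity and specificity of the two tests are never computed.
    """
    prevalence = n_diseased / n_interim
    discordant = n_discordant / n_diseased
    return math.ceil(required_diseased(discordant, delta, **kwargs) / prevalence)


# Illustrative interim data: 100 patients, 25 diseased, 5 discordant pairs.
print(reestimated_total(n_interim=100, n_diseased=25, n_discordant=5, delta=0.1))
```

A lower observed prevalence or a higher observed discordance than assumed at the planning stage increases the re-calculated total sample size.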
An example and a simulation study illustrate the adaptive design. The type I error rate, the power and the sample size of the adaptive design are compared to those of a fixed design. Both designs hold the type I error rate. The adaptive design reaches the targeted power. The fixed design can be either over- or underpowered, depending on possibly wrong assumptions in the sample size calculation.
The adaptive design compensates for inefficiencies of the sample size calculation and therefore helps to reach the desired study aim.
Committee for Medicinal Products for Human Use, Guideline on clinical evaluation of diagnostic agents. Available at: http://www.ema.europa.eu. 2009, 1-19.
O. S. Miettinen, The matched pairs design in the case of all-or-none responses. 24(2), 1968, 339-352.
Sample size calculation and blinded re-estimation for diagnostic accuracy studies considering missing or inconclusive test results
Cordula Blohm1, Peter Schlattmann2, Antonia Zapf3
1Universität Heidelberg; 2Universitätsklinikum Jena; 3Universitätsklinikum Hamburg-Eppendorf
In diagnostic accuracy studies, the two independent co-primary endpoints, sensitivity and specificity, are relevant. Both parameters are calculated based on the disease status and the test result, which is evaluated as either positive or negative. Sometimes the test result is neither positive nor negative but inconclusive or even missing. There are four frequently used methods of handling such results: they are counted as missing, as positive, as negative, or as false positive and false negative. The first three approaches may lead to an overestimation of both parameters, or of either sensitivity or specificity, respectively. In the fourth approach, the intention-to-diagnose (ITD) principle, both parameters decrease and a more realistic picture of the clinical potential of diagnostic tests is provided (Schuetz et al. 2012).
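The four handling methods can be compared on a single set of counts, as in the minimal sketch below. The function name, the dictionary keys, and the example counts are hypothetical; the four rules follow the description above.

```python
def accuracy_by_method(tp, fn, inc_diseased, tn, fp, inc_healthy):
    """Sensitivity and specificity under four rules for inconclusive results.

    inc_diseased / inc_healthy: inconclusive results among diseased and
    non-diseased subjects, respectively.
    """
    n_dis = tp + fn + inc_diseased      # all diseased subjects
    n_non = tn + fp + inc_healthy       # all non-diseased subjects
    return {
        # Inconclusive results are excluded from the analysis.
        "missing": (tp / (tp + fn), tn / (tn + fp)),
        # Inconclusive results count as test-positive.
        "positive": ((tp + inc_diseased) / n_dis, tn / n_non),
        # Inconclusive results count as test-negative.
        "negative": (tp / n_dis, (tn + inc_healthy) / n_non),
        # Intention to diagnose: false negative if diseased, false positive if not.
        "itd": (tp / n_dis, tn / n_non),
    }


results = accuracy_by_method(tp=80, fn=10, inc_diseased=10,
                             tn=70, fp=20, inc_healthy=10)
for method, (sensitivity, specificity) in results.items():
    print(f"{method}: sensitivity={sensitivity:.3f}, specificity={specificity:.3f}")
```

With any such counts, the ITD estimates can never exceed those of the other three rules, since inconclusive results are always counted against the test.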
Sensitivity and specificity are also key parameters in sample size calculation of diagnostic accuracy studies and a realistic estimate of them is mandatory for the success of a trial. Therefore, the consideration of inconclusive results in the initial sample size calculation and, especially, in a blinded sample size re-calculation based on an interim analysis could improve trial design.
For sample size calculation, the minimum sensitivity and specificity, the type I error rate and the power are defined. In addition, the expected sensitivity and specificity of the experimental test, the prevalence, and the proportion of inconclusive results are assumed. For the simulation study different scenarios are chosen by varying these parameters. The optimal sample size is calculated according to Stark and Zapf (2020). The inconclusive results are generated independently of disease status and randomly distributed over diseased and non-diseased subjects. The sensitivity and specificity of the experimental test are estimated while considering the four different methods, mentioned above, to handle inconclusive results.
The sample size re-calculation is performed with a blinded one-time re-estimation of the proportion of inconclusive results. The power, the type I error rate, and the bias of estimated sensitivity and specificity are used as performance measures.
The simulation study aims to evaluate the influence of inconclusive results on the evaluation of diagnostic test accuracy in an adaptive study design. The performance difference of the four methods to handle inconclusive results will be discussed.