Assessment of methods to deal with delayed treatment effects in immunooncology trials with time-to-event endpoints
Rouven Behnisch, Johannes Krisam, Meinhard Kieser
Institute of Medical Biometry and Informatics, University of Heidelberg, Germany
In cancer drug research and development, immunotherapy plays an ever more important role. A common feature of immunotherapies is a delayed treatment effect, which is quite challenging when dealing with time-to-event endpoints. For time-to-event endpoints, regulatory authorities often require a log-rank test, which is the standard statistical method. The log-rank test is known to be most powerful under proportional-hazards alternatives but suffers a substantial loss in power if this assumption is violated. Hence, a rather long follow-up period is required to detect a significant effect in immunooncology trials. For that reason, the question arises whether methods exist that are more sensitive to delayed treatment effects and that can be applied early on to generate evidence anticipating the final decision of the log-rank test, so as to reduce the trial duration without inflating the type I error. Alternative methods include, for example, weighted log-rank statistics with weights that can either be fixed at the design stage of the trial or chosen based on the observed data, tests based on the restricted mean survival time, survival proportions, accelerated failure time (AFT) models, or additive hazard models.
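As a rough illustration of the weighted log-rank statistics mentioned above, the following Python sketch computes a two-sample test with Fleming-Harrington G(rho, gamma) weights w(t) = S(t-)^rho (1 - S(t-))^gamma, where S is the pooled Kaplan-Meier estimate. The function name and argument layout are illustrative assumptions and are not taken from the abstract. Setting rho = gamma = 0 gives the standard log-rank test, while rho = 0, gamma = 1 places more weight on late time points.

```python
import numpy as np
from math import erfc, sqrt


def fh_weighted_logrank(time, event, group, rho=0.0, gamma=0.0):
    """Two-sample Fleming-Harrington G(rho, gamma) weighted log-rank test.

    time  : observed times (event or censoring)
    event : 1 if the event was observed, 0 if censored
    group : 1 for the experimental arm, 0 for the control arm
    Returns (z, two-sided p-value); z > 0 means more events than
    expected in group 1.
    """
    time = np.asarray(time, dtype=float)
    event = np.asarray(event, dtype=bool)
    group = np.asarray(group, dtype=int)

    s_left = 1.0  # pooled Kaplan-Meier estimate just before the current time
    num, var = 0.0, 0.0
    for t in np.unique(time[event]):
        at_risk = time >= t
        n = at_risk.sum()                                  # at risk, both arms
        n1 = (at_risk & (group == 1)).sum()                # at risk, group 1
        d = ((time == t) & event).sum()                    # events at t
        d1 = ((time == t) & event & (group == 1)).sum()    # events in group 1
        w = s_left ** rho * (1.0 - s_left) ** gamma
        num += w * (d1 - d * n1 / n)                       # observed - expected
        if n > 1:                                          # hypergeometric variance
            var += w ** 2 * d * (n1 / n) * (1 - n1 / n) * (n - d) / (n - 1)
        s_left *= 1.0 - d / n
    z = num / sqrt(var)
    return z, erfc(abs(z) / sqrt(2.0))


# Hypothetical toy data: standard versus late-weighted test on the same sample.
t = [3, 5, 7, 8, 12, 14, 15, 20, 22, 25]
e = [1, 1, 1, 0, 1, 1, 0, 1, 1, 0]
g = [0, 1, 0, 1, 0, 0, 1, 0, 1, 1]
z_std, p_std = fh_weighted_logrank(t, e, g)
z_late, p_late = fh_weighted_logrank(t, e, g, rho=0.0, gamma=1.0)
```

The sketch uses the large-sample normal approximation and is not meant to replace a validated implementation.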
We evaluate and compare these different methods systematically with regard to type I error control and power in the presence of delayed treatment effects. Our simulation study includes aspects such as different censoring rates and types, different times of delay, and different failure time distributions. First results show that most methods achieve type I error rate control and that, by construction, the weighted log-rank tests which place more weight on late time points have greater power to detect differences when the treatment effect is delayed. We furthermore investigate whether and to what extent these methods can be applied at an early stage of the trial to predict the later decision of the log-rank test.
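One common way to generate failure times with a delayed treatment effect in such a simulation is a piecewise exponential model: the treatment arm shares the control hazard up to the delay and has a reduced hazard afterwards. The Python sketch below illustrates this under that assumption; the abstract does not state which data-generating models were used, and the function name and parameter values are hypothetical.

```python
import numpy as np


def simulate_delayed_effect(n, base_hazard, hazard_ratio, delay, rng):
    """Draw n failure times with hazard `base_hazard` before `delay`
    and `base_hazard * hazard_ratio` afterwards (inverse-transform sampling).
    """
    # A unit-exponential draw equals the cumulative hazard at the failure time.
    e = rng.exponential(1.0, size=n)
    h_delay = base_hazard * delay                  # cumulative hazard at the delay
    early = e / base_hazard                        # failure before the delay
    late = delay + (e - h_delay) / (base_hazard * hazard_ratio)
    return np.where(e <= h_delay, early, late)


# Hypothetical settings: median control survival of 12 months, 4-month delay.
rng = np.random.default_rng(2021)
lam = np.log(2) / 12
control = simulate_delayed_effect(500, lam, 1.0, 4.0, rng)
treated = simulate_delayed_effect(500, lam, 0.6, 4.0, rng)
```

Censoring (administrative or random) would be applied on top of these times; it is omitted here to keep the sketch short.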
 T. Chen (2013): Statistical issues and challenges in immuno-oncology. Journal for ImmunoTherapy of Cancer 1:18
 T.R. Fleming and D.P. Harrington (1991): Counting Processes and Survival Analysis. New York [u.a.]: Wiley-Interscience Publ.
 D. Magirr and C. Burman (2019): Modestly weighted logrank tests. Statistics in Medicine 38(20):3782-3790.
 P. Royston and M.K.B. Parmar (2013): Restricted mean survival time: an alternative to the hazard ratio for the design and analysis of randomized trials with a time-to-event outcome. BMC Medical Research Methodology 13(1):152.
Sampling designs for rare time-dependent exposures – A comparison of the nested exposure case-control design and exposure density sampling
Jan Feifel1, Maja von Cube2, Martin Wolkewitz2, Jan Beyersmann1, Martin Schumacher2
1Institute of Statistics, Ulm University, Germany; 2Institute of Medical Biometry and Statistics, Faculty of Medicine and Medical Center University of Freiburg, Germany
Hospital-acquired infections increase both morbidity and mortality of hospitalized patients. Researchers interested in the effect of these time-dependent infections on the length of hospital stay, as a measure of disease burden, face large cohorts with possibly rare exposures.
For large cohort studies with rare outcomes, nested case-control designs are favorable due to the efficient use of limited resources. Here, nested case-control designs apply but do not lead to reduced sample sizes, because it is the exposure, not the outcome, that is rare. Recently, exposure density sampling (EDS) and the nested exposure case-control design (NECC) have been proposed to sample for a rare time-dependent exposure in cohorts with a survival endpoint. The two designs differ in the time point of sampling.
Both designs enable efficient hazard ratio estimation by sampling all exposed individuals but only a small fraction of the unexposed ones. Moreover, they account for time-dependent exposure to avoid immortal time bias. We investigate and compare their performance using data from patients hospitalized in the neuro-intensive care unit at the Burdenko Neurosurgery Institute (NSI) in Moscow, Russia. The impact of different types of hospital-acquired infections with different prevalence on length of stay is considered. Additionally, inflation factors, a primary performance measure, are discussed. All presented methods will be compared to the gold-standard Cox model on the full cohort. We enhance both designs to allow for a competitive analysis of combined and competing endpoints. Additionally, these designs substantially reduce the amount of necessary information compared to the full cohort approach.
Both EDS and NECC are capable of analyzing time-to-event data while simultaneously accounting for rare time-dependent exposure, and both result in affordable sample sizes. EDS outperforms the NECC concerning efficiency and accuracy in most considered settings for combined endpoints. For competing risks, however, a tailored NECC shows more appealing results.
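The idea of keeping all exposed individuals while sampling only a few unexposed ones at the time of exposure can be sketched as follows. This is a deliberately simplified Python illustration in the spirit of exposure density sampling: whenever an individual becomes exposed, m individuals who are still under observation and not yet exposed are drawn at random. The function and variable names are hypothetical, and the published designs involve further details (for example the subsequent analysis of the sampled data) that are not reproduced here.

```python
import numpy as np


def exposure_density_sample(exposure_time, end_time, m, rng):
    """For each individual who becomes exposed, draw up to m individuals
    who are still at risk and unexposed at that time.

    exposure_time : time of exposure (np.inf if never exposed)
    end_time      : end of observation (event or censoring)
    Returns a list of (exposed_id, time, sampled_unexposed_ids).
    """
    exposure_time = np.asarray(exposure_time, dtype=float)
    end_time = np.asarray(end_time, dtype=float)
    ids = np.arange(len(end_time))

    # Individuals whose exposure occurs while they are under observation.
    exposed = ids[np.isfinite(exposure_time) & (exposure_time < end_time)]
    sampled = []
    for i in exposed[np.argsort(exposure_time[exposed])]:
        t = exposure_time[i]
        # Still under observation and not (yet) exposed at time t.
        eligible = ids[(end_time > t) & (exposure_time > t)]
        k = min(m, len(eligible))
        controls = rng.choice(eligible, size=k, replace=False) if k > 0 else []
        sampled.append((int(i), float(t), sorted(int(c) for c in controls)))
    return sampled


# Hypothetical toy cohort: individuals 0 and 4 acquire an infection.
rng = np.random.default_rng(7)
sets = exposure_density_sample(
    exposure_time=[2, np.inf, np.inf, np.inf, 5],
    end_time=[10, 1, 8, 9, 7],
    m=1,
    rng=rng,
)
```

Because eligibility is evaluated at the exposure time, an individual who is exposed later can still serve as an unexposed comparator beforehand, which is what protects against immortal time bias in this kind of sampling.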
 K. Ohneberg, J. Beyersmann and M. Schumacher (2019): Exposure density sampling: Dynamic matching with respect to a time-dependent exposure. Statistics in Medicine, 38(22):4390-4403.
 J. Feifel, M. Gebauer, M. Schumacher and J. Beyersmann (2020): Nested exposure case-control sampling: a sampling scheme to analyze rare time-dependent exposures. Lifetime Data Analysis, 26:21-44.
Tumour-growth models improve progression-free survival estimation in the presence of high censoring-rates
Gabriele Bleckert, Hannes Buchner
Staburo GmbH, Germany
In oncology, reliable estimates of progression-free survival (PFS) are of the highest importance because of the high failure rates of phase III trials (around 60%). However, PFS estimates from early readouts with fewer than 50% of events observed do not use all available information from tumour measurements over time.
We project the PFS-event of each censored patient by using a mixed model describing the tumour burden over time. RECIST-criteria are applied on estimated patient-specific non-linear tumour-trajectories to calculate the projected time-to-progression.
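To make the projection step concrete, the Python sketch below applies the RECIST 1.1 size criterion for progressive disease (an increase of at least 20% and at least 5 mm in the sum of diameters relative to the smallest sum observed so far) to a predicted tumour trajectory evaluated at scheduled visits. The function name, the visit schedule, and the example trajectory are hypothetical; the abstract does not specify the functional form of the mixed model, and progression through new lesions is ignored here.

```python
import math


def projected_time_to_progression(times, sizes, rel_increase=0.20, abs_increase=5.0):
    """Return the first visit time at which the tumour size (sum of
    diameters, in mm) meets the RECIST 1.1 size criterion for progression,
    or None if it is never met within the supplied visits.
    """
    nadir = sizes[0]  # smallest size so far, baseline included
    for t, y in zip(times[1:], sizes[1:]):
        if y >= (1.0 + rel_increase) * nadir and y - nadir >= abs_increase:
            return t
        nadir = min(nadir, y)
    return None


# Hypothetical patient-specific trajectory: shrinkage followed by regrowth.
visits = [0, 6, 12, 18, 24, 30, 36]  # weeks
trajectory = [60.0 * (math.exp(-0.08 * t) + math.exp(0.02 * t) - 1.0) for t in visits]
ttp = projected_time_to_progression(visits, trajectory)
```

For a censored patient, the projected time would be taken from visits after the last observed assessment, using the fitted patient-specific curve in place of the measurements.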
PFS is compared between test and reference by hazard ratios (HR). Several phase III and phase II simulations were performed, each with 1000 runs, with 2000 or 80 patients, respectively, 6 months accrual, and 2 months (scenario-1) or 6 months (scenario-2) follow-up. All simulations are based on a published optimal parameterisation of tumour-growth in non-small-cell lung cancer (NSCLC), which implies a time-dependent HR.
The classical PFS estimation resulted in an HR of 0.34 (95%-percentiles: 0.29-0.40) for scenario-1 and 0.52 (0.47-0.58) for scenario-2, compared to a predicted HR of 0.77 for both scenarios (0.69-0.85 and 0.69-0.84), while the overall true HR (over ten years) was 0.78 (0.69-0.85). For 6, 12 and 120 months, the time-varying HRs were 0.41 (0.36-0.47), 0.60 (0.54-0.66) and 0.77 (0.69-0.85). The classical PFS estimation for phase II showed HRs from 0.52 to 0.61, compared to predicted HRs between 0.71 and 0.77.
Tumour-growth models improve PFS estimations in the presence of high censoring-rates as they consistently provide far better estimates of the overall true HR in phase III and II trials.
 M. Reck, A. Mellemgaard, S. Novello, PE. Postmus, B. Gaschler-Markefski, R. Kaiser, H. Buchner: Change in non-small-cell lung cancer tumor size in patients treated with nintedanib plus docetaxel: analyses from the Phase III LUME-Lung 1 study, OncoTargets and Therapy 2018:11 4573–4582
 Laird NM, Ware JH. Random-effects models for longitudinal data. Biometrics. 1982 Dec;38(4):963-74. PMID: 7168798.
Evaluation of event rate differences using stratified Kaplan-Meier difference estimates with Mantel-Haenszel weights
Hannes Buchner1, Stephan Bischofberger1, Rainer-Georg Goeldner2
1Staburo, Germany; 2Boehringer Ingelheim, Germany
The assessment of differences in event rates is a common task in evaluating the efficacy of new treatments in clinical trials. We investigate the performance of different hypothesis tests for cumulative hospitalization or death rates in COVID-19 in order to reliably determine the efficacy of a novel treatment. The evaluation focuses on comparing event rates via Kaplan-Meier estimates at a pre-specified day and on reducing sampling error; hence we examine different stratum weights for a stratified Z-test for Kaplan-Meier differences. The simulated data are calibrated from recent research on neutralizing antibody treatment for COVID-19 with 2, 4, and 6 strata of different size and prevalence, and we investigate the effects of overall event rates ranging from 2% to 20%. We simulate 1000 patients and compare the results of 1000 simulation runs.
Our simulation study shows superior performance of Mantel-Haenszel-type weights [Greenland & Robins (1985), Biometrics 41, 55-68] over inverse variance weights, in particular for unequal stratum sizes and very low event rates in some strata, as is common in COVID-19 treatment studies. The advantage of this approach is a larger power of the test (e.g. 79% instead of 64% for an average event rate of 7%). The results are compared with those of a Cochran-Mantel-Haenszel (CMH) test, which yields lower power than the inverse variance weights for low event rates (under 62% for an average event rate of 7%) and consistently lower power than the Z-test with the Mantel-Haenszel stratum weights. Moreover, because the CMH test is not designed for time-to-event data, it breaks down (power reduction by 30%) in the presence of loss to follow-up in as few as 5% of the patients. The Z-test for Kaplan-Meier differences, on the other hand, is hardly affected (power reduction by 4%).
All investigated tests maintain the nominal significance level (type I error rate) in our simulation study.
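A stratified Z-test of this kind can be sketched in Python as follows. Within each stratum, the difference in Kaplan-Meier event rates at the pre-specified day is computed with Greenwood variances, and the stratum differences are combined with either Mantel-Haenszel-type weights n1*n0/(n1+n0) or inverse variance weights. The function names and data layout are hypothetical, and the exact form of the weights and variance estimator used in the study is an assumption here.

```python
import numpy as np
from math import erfc, sqrt


def km_at(time, event, t):
    """Kaplan-Meier survival estimate at time t with Greenwood variance."""
    time = np.asarray(time, dtype=float)
    event = np.asarray(event, dtype=bool)
    s, gw = 1.0, 0.0
    for u in np.unique(time[event & (time <= t)]):
        n = (time >= u).sum()               # at risk just before u
        d = ((time == u) & event).sum()     # events at u
        s *= 1.0 - d / n
        if n > d:                           # Greenwood term; skipped if S drops to 0
            gw += d / (n * (n - d))
    return s, s * s * gw


def stratified_km_z(strata, t, weights="mh"):
    """Stratified Z-test for the difference in event rates at time t.

    strata  : list of (time1, event1, time0, event0) per stratum, where
              index 1 is the treatment arm and 0 the control arm
    weights : "mh" for n1*n0/(n1+n0), "iv" for inverse variance
    Returns (z, two-sided p-value).
    """
    num, den = 0.0, 0.0
    for time1, event1, time0, event0 in strata:
        s1, v1 = km_at(time1, event1, t)
        s0, v0 = km_at(time0, event0, t)
        diff = (1.0 - s1) - (1.0 - s0)      # difference in event rates
        var = v1 + v0
        if weights == "mh":
            n1, n0 = len(time1), len(time0)
            w = n1 * n0 / (n1 + n0)
        else:
            # Undefined if the estimated variance is zero, e.g. when a
            # stratum has no events in either arm.
            w = 1.0 / var
        num += w * diff
        den += w * w * var
    z = num / sqrt(den)
    return z, erfc(abs(z) / sqrt(2.0))
```

The Mantel-Haenszel-type weights depend only on the stratum sizes, so they remain defined when a stratum has no events in either arm, whereas the inverse variance weights in this sketch do not.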