## Location: Zoom-Session

### Keynote: Estimands and Causality / Closing Session

**Estimands and Causality**

Daniel Scharfstein *Johns Hopkins Bloomberg School of Public Health, USA*

Closing: Andreas Faldum, Werner Brannath / Annette Kopp-Schneider

### Mathematical Methods in Medicine and Biology

**Future Prevalence of Type 2 Diabetes – A Comparative Analysis of Chronic Disease Projection Methods**

Dina Voeltz^{1}, Thaddäus Tönnies^{2}, Ralph Brinks^{1,2,3}, Annika Hoyer^{1}

^{1}Ludwig-Maximilians-Universität München, Germany; ^{2}Institute for Biometrics and Epidemiology, German Diabetes Center, Leibniz Institute for Diabetes Research at Heinrich-Heine-University Duesseldorf; ^{3}Hiller Research Unit for Rheumatology Duesseldorf

Background: Precise projections of future chronic disease cases needing pharmaco-intensive treatments are necessary for effective resource allocation and health care planning in response to increasing disease burden.

Aim: To compare different projection methods to estimate the number of people diagnosed with type 2 diabetes (T2D) in Germany in 2040.

Methods: We compare the results of three methods to project the number of people with T2D in Germany in 2040. In a relatively simple approach, method 1) combines the sex- and age-specific prevalence of T2D in 2015 with sex- and age-specific population distributions projected by the German Federal Statistical Office (FSO). Methods 2) and 3) additionally account for the incidence of T2D and mortality rates using mathematical relations as proposed by the illness-death model for chronic diseases [1]. They are therefore more comprehensive than method 1), which likely adds to the validity and accuracy of their results. For this purpose, method 2) first models the prevalence of T2D using a partial differential equation (PDE) that incorporates incidence and mortality [2]. This flexible yet simple PDE has been validated in the context of dementia, amongst others, and is recommended for chronic disease epidemiology. Subsequently, the estimated prevalence is multiplied by the population projection of the FSO [3]. Hence, method 2) uses the projected general mortality of the FSO and the mortality rate ratio of diseased vs. non-diseased people. By contrast, method 3) estimates future mortality of non-diseased and diseased people independently of the FSO projection. These estimated future mortality rates serve as input for two PDEs that directly project the absolute number of cases. The sex- and age-specific incidence rate for methods 2) and 3) stems from the risk structure compensation (Risikostrukturausgleich, MorbiRSA), which comprises data from about 70 million Germans in the public health insurance. The incidence rate is assumed to remain at its 2015 level throughout the projection horizon from 2015 to 2040.
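As a rough illustration of method 1), the projected number of cases can be read as a sum over sex and age strata of baseline prevalence times projected population. The sketch below assumes this reading; the strata, prevalences and population counts are invented for illustration and are not the MorbiRSA or FSO figures, and `project_cases` is a hypothetical helper name.

```python
def project_cases(prevalence, population):
    """Sum over strata of baseline prevalence times projected population size."""
    missing = set(population) - set(prevalence)
    if missing:
        raise KeyError(f"no prevalence available for strata: {sorted(missing)}")
    return sum(prevalence[stratum] * population[stratum] for stratum in population)


# Hypothetical sex- and age-specific prevalence in the baseline year.
prevalence_2015 = {
    ("f", "40-59"): 0.05,
    ("f", "60-79"): 0.20,
    ("m", "40-59"): 0.07,
    ("m", "60-79"): 0.25,
}

# Hypothetical projected population counts for the target year.
population_2040 = {
    ("f", "40-59"): 1_000_000,
    ("f", "60-79"): 800_000,
    ("m", "40-59"): 1_000_000,
    ("m", "60-79"): 700_000,
}

projected_cases = project_cases(prevalence_2015, population_2040)
print(f"projected cases: {projected_cases:,.0f}")
```

Because the prevalence is held fixed, this approach only reflects changes in the age and sex structure of the population, which is the limitation that methods 2) and 3) address.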

Results: Method 1) projects 8.3 million people with diagnosed T2D in Germany in 2040. Compared to 6.9 million people in 2015, this corresponds to an increase of 21%. Methods 2) and 3) project 11.5 million (+65% compared to 2015) and 12.5 million (+85%) T2D patients, respectively.

Conclusions: The methods’ results differ substantially. Method 1) accounts for the aging of the German population but is otherwise not very comprehensive. Methods 2) and 3) additionally consider underlying changes in the incidence and mortality rates affecting disease prevalence.

**Mixed-effects ANCOVA for estimating the difference in population mean parameters in case of nonlinearly related data**

Ricarda Graf *University of Göttingen, Germany*

Repeated measures data can be found in many fields. The two types of variation characteristic of this type of data, referred to as within-subject and between-subject variation, are accounted for by linear and nonlinear mixed-effects models. ANOVA-type models are sometimes applied for comparison of population means despite a nonlinear relationship in the data. Accurate parameter estimation through more appropriate nonlinear mixed-effects (NLME) models, such as for sigmoidal curves, might be hampered by insufficient data near the asymptotes, by the choice of starting values for the iterative optimization algorithms (which are needed because the likelihood has no closed-form expression), or by convergence problems of these algorithms.

The main objective of this thesis is to compare the performance of a one-way mixed-effects ANCOVA and an NLME three-parameter logistic regression model with respect to the accuracy in estimating the difference in population means. Data from a clinical trial^{1}, in which the difference in mean blood pressure (BP50) between two groups was estimated by repeated-measures ANOVA, served as a reference for data simulation. A third, simplifying method, used in toxicity studies^{2}, was additionally included. It considers the two measurements per subject lying immediately below and above half the mean maximal response (E_max). Population means are obtained by considering the intersections of the horizontal line at half E_max and the line connecting the two data points per subject and group. A simulation study with two scenarios was conducted to compare bias, coverage rates and empirical SE of the three methods when estimating the difference in BP50, in order to identify the disadvantages of using the simpler linear model instead of the nonlinear model. In the first scenario, the true individual blood pressure ranges were considered, while in the second scenario, measurements at characteristic points of the sigmoidal curves were considered, regardless of the true measurement ranges, in order to obtain a more distinct nonlinear relationship.
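The third, simplifying method amounts to a linear interpolation between two points per subject. The following is a minimal sketch under that reading; the function name `half_max_crossing` and the example values are hypothetical and not taken from the thesis.

```python
def half_max_crossing(x_first, y_first, x_second, y_second, half_emax):
    """Return the x value where the line through two points crosses y = half_emax.

    The two points are assumed to be the measurements whose responses lie
    immediately below and above half the maximal response.
    """
    low, high = min(y_first, y_second), max(y_first, y_second)
    if not low <= half_emax <= high:
        raise ValueError("half_emax must lie between the two responses")
    if y_first == y_second:
        raise ValueError("the two responses must differ")
    slope = (y_second - y_first) / (x_second - x_first)
    return x_first + (half_emax - y_first) / slope


# Hypothetical subject: response 0.3 at a pressure of 80 and 0.7 at 100,
# with half the maximal response equal to 0.5.
bp50_estimate = half_max_crossing(80.0, 0.3, 100.0, 0.7, 0.5)
print(round(bp50_estimate, 6))
```

Averaging such per-subject crossings within each group and taking the difference of the group means would give the method's estimate of the difference in BP50. Only two observations per subject enter the estimate, which may be one reason for its instability.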

The estimates of the mixed-effects ANCOVA model were more biased but also more precise than those of the NLME model. In the second scenario, the ANCOVA method could no longer detect the difference in BP50. The results of the third method did not seem reliable, since its estimates on average even reversed the direction of the true parameter.

NLME models should be preferred for data with a known nonlinear relationship if the available data allows it. Convergence problems can be overcome by using a Bayesian approach.

**Explained Variation in the Linear Mixed Model**

Nicholas Schreck *DKFZ Heidelberg, Germany*

The coefficient of determination is a standard characteristic in linear models with quantitative response variables. It is widely used to assess the proportion of variation explained, to determine the goodness-of-fit and to compare models with different covariates.

However, there is not yet agreement on a similar quantity for the class of linear mixed models.

We introduce a natural extension of the well-known adjusted coefficient of determination in linear models to the variance components form of the linear mixed model.

This extension is dimensionless, has an intuitive and simple definition in terms of variance explained, is additive for several random effects and reduces to the adjusted coefficient of determination in the linear model.

To this end, we prove a full decomposition of the sum of squares of the dependent variable into the explained and residual variance.

Based on the restricted maximum likelihood equations, we introduce a novel measure for the explained variation which we allocate specifically to the contribution of the fixed and the random covariates of the model.

We illustrate that this empirical explained variation can in particular be used as an improved estimator of the classical additive genetic variance of continuous complex traits.

**Modelling acute myeloid leukemia: Closing the gap between model parameters and individual clinical patient data**

Dennis Görlich *Institute of Biostatistics and Clinical Research, University Münster, Germany*

In this contribution, we will illustrate and discuss our approach to fit a mechanistic mathematical model of acute myeloid leukemia (AML) to individual patient data, leading to personalized model parameter estimates.

We use a previously published model (Banck and Görlich, 2019) that describes the healthy hematopoiesis and the leukemia dynamics. Here, we consider a situation where the healthy hematopoiesis is calibrated to a population average and personalized leukemia parameters (self-renewal, proliferation, and treatment intensity) need to be estimated.

To link the mathematical model to clinical data, model predictions need to be aligned with observable clinical outcome measures. In AML research, blast load, complete remission, and survival are typically considered. Based on the model’s properties, especially the capability to predict the considered outcomes, blast load turned out to be well suited for the model fitting process.

We formulated an optimization problem to estimate personalized model parameters based on the comparison between observed and predicted blast load (cf. Görlich, 2021).

A grid search was performed to evaluate the fitness landscape of the optimization problem. The grid search approach showed that, depending on the patient’s individual blast course, noisy fitness landscapes can occur. In these cases, a gradient-descent algorithm will usually perform poorly. This problem can be overcome by application of e.g. the differential evolution algorithm (Price et al., 2006). The estimated personalized leukemia parameters can be further correlated to observed clinical data. A preliminary analysis showed promising results.

Finally, the application of mechanistic mathematical models in combination with personalized model fitting seems to be a promising approach within clinical research.

References

Dennis Görlich (accepted). Fitting Personalized Mechanistic Mathematical Models of Acute Myeloid Leukaemia to Clinical Patient Data. Proceedings of the 14th International Joint Conference on Biomedical Engineering Systems and Technologies, Volume 3: BIOINFORMATICS 2021

Jan C. Banck and Dennis Görlich (2019). In-silico comparison of two induction regimens (7 + 3 vs 7 + 3 plus additional bone marrow evaluation) in acute myeloid leukemia treatment. BMC Systems Biology, 13(1):18.

Kenneth V. Price, Rainer M. Storn and Jouni A. Lampinen (2006). Differential Evolution – A Practical Approach to Global Optimization. Berlin Heidelberg: Springer-Verlag.

**Effect of missing values in multi-environmental trials on variance component estimates**

Jens Hartung, Hans-Peter Piepho *University of Hohenheim, Germany*

A common task in the analysis of multi-environmental trials (MET) by linear mixed models (LMM) is the estimation of variance components (VCs). Most often, MET data are imbalanced, e.g., due to selection. The imbalance mechanism can be missing completely at random (MCAR), missing at random (MAR) or missing not at random (MNAR). If the missing-data pattern in MET is not MNAR, likelihood-based methods are preferred for analysis as they can account for selection. Likelihood-based methods used to estimate VCs in LMM constrain all VC estimates to be non-negative, and thus the estimators are generally biased. Therefore, there are two potential causes of bias in MET analysis: an MNAR data pattern and the small-sample properties of likelihood-based estimators. The current study tries to distinguish between these two possible sources of bias. A simulation study with MET data typical for cultivar evaluation trials was conducted. The missing-data pattern and the size of the VCs were varied. The results showed that for the simulated MET, VC estimates from likelihood-based methods are mainly biased due to the small-sample properties of likelihood-based methods when the ratio of genotype variance to error variance is small.

### Evidence Based Medicine and Meta-Analysis II

**Investigating treatment-effect modification by a continuous covariate in IPD meta-analysis: an approach using fractional polynomials**

Willi Sauerbrei^{1}, Patrick Royston^{2}

^{1}Medical Center – University of Freiburg, Germany; ^{2}MRC Clinical Trials Unit at UCL, London, UK

Context: In clinical trials, there is considerable interest in investigating whether a treatment effect is similar in all patients, or whether some prognostic variable indicates a differential response to treatment. To examine this, a continuous predictor is usually categorised into groups according to one or more cutpoints. Several weaknesses of categorisation are well known.

Objectives: To avoid the disadvantages of cutpoints and to retain full information, it is preferable to keep continuous variables continuous in the analysis. The aim is to derive a statistical procedure to handle this situation when individual patient data (IPD) are available from several studies.

Methods: For continuous variables, the multivariable fractional polynomial interaction (MFPI) method provides a treatment effect function (TEF), that is, a measure of the treatment effect on the continuous scale of the covariate (Royston and Sauerbrei, Stat Med 2004, 2509‐25). MFPI is applicable to most of the popular regression models, including Cox and logistic regression. A meta‐analysis approach for averaging functions across several studies has been proposed (Sauerbrei and Royston, Stat Med 2011, 3341‐60). A first example combining these two techniques (called metaTEFs) was published (Kasenda et al, BMJ Open 2016; 6:e011148). Another approach called meta-stepp was proposed (Wang et al, Stat Med 2016, 3704-16). Using the data from Wang et al (8 RCTs in patients with breast cancer), we will illustrate various issues of our metaTEFs approach.

Results and Conclusions: We used metaTEFs to investigate a potential treatment effect modifier in a meta‐analysis of IPD from eight RCTs. In contrast to cutpoint‐based analyses, the approach avoids several critical issues and gives more detailed insight into how the treatment effect is related to a continuous biomarker. MetaTEFs retains the full information when performing IPD meta‐analyses of continuous effect modifiers in randomised trials. Early experience suggests it is a promising approach.

**Standardized Mean Differences from Mixed Model Repeated Measures Analyses**

Lars Beckmann, Ulrich Grouven, Guido Skipka *Institut für Qualität und Wirtschaftlichkeit im Gesundheitswesen (IQWiG), Germany*

In clinical trials, data on health-related quality of life and on symptoms are frequently collected from patients at consecutive time points. For the analysis of such longitudinal data, linear mixed models for repeated measures (mixed models for repeated measures, MMRM) are proposed in the literature. These endpoints are usually measured on scales without natural units.

For assessing clinical relevance or for conducting meta-analyses, it is natural to resort to standardized mean differences (SMDs) such as Cohen's d or Hedges’ g. However, it is unclear how the pooled standard deviation required for the SMD can be calculated from MMRM analyses. Several methods for estimating an SMD were investigated in a simulation study. The methods can be divided into approaches that rely on the standard errors of the mean difference (MD) estimated in the MMRM, and approaches that use the individual patient data (IPD).

Data from a randomized controlled trial were simulated. The longitudinal data were simulated using a first-order autoregressive (AR) model for the dependencies between the assessment time points. The simulation parameters were the SMD, the variance of the change from baseline, the correlation of the AR model, and the sample sizes in the treatment arms. The endpoint considered is the difference between the treatment arms in the mean change from baseline over the entire course of the study. The methods were compared with respect to coverage probability, bias, mean squared error (MSE), power and type I error, as well as the concordance of MD and SMD regarding statistical significance and coverage of the true effect.

The methods that calculate the pooled standard deviation from the standard errors of the MMRM show bias that leads to a marked overestimation of the true effect. Methods that estimate the pooled standard deviation from the observed changes from baseline show markedly lower bias and a lower MSE. However, their power is lower than that of the MD.

Estimating an SMD by means of the standard errors from the MMRM is not appropriate. This should be kept in mind particularly when assessing large SMDs. Appropriate estimation of an SMD requires methods in which the pooled standard deviation of the change from baseline can be estimated from IPD.
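The two families of approaches compared above can be sketched as follows. The abstract does not state the exact formulas that were studied, so the textbook definitions used here are assumptions: a pooled standard deviation from per-arm changes from baseline, the usual small-sample correction for Hedges' g, and a simple back-calculation of a standard deviation from the standard error of a two-sample mean difference. All function names and data are hypothetical.

```python
import math
from statistics import mean, variance


def pooled_sd(changes_a, changes_b):
    """IPD-based pooled SD of the change from baseline in two arms."""
    n_a, n_b = len(changes_a), len(changes_b)
    pooled_var = ((n_a - 1) * variance(changes_a)
                  + (n_b - 1) * variance(changes_b)) / (n_a + n_b - 2)
    return math.sqrt(pooled_var)


def sd_from_standard_error(se_md, n_a, n_b):
    """Back-calculate an SD from the SE of a mean difference (assumed formula)."""
    return se_md / math.sqrt(1 / n_a + 1 / n_b)


def hedges_g(md, sd, n_a, n_b):
    """Standardized mean difference with the usual small-sample correction."""
    correction = 1 - 3 / (4 * (n_a + n_b - 2) - 1)
    return correction * md / sd


# Hypothetical changes from baseline in two small treatment arms.
arm_a = [1.0, 2.0, 3.0, 4.0, 5.0]
arm_b = [3.0, 4.0, 5.0, 6.0, 7.0]
md = mean(arm_b) - mean(arm_a)

sd_ipd = pooled_sd(arm_a, arm_b)
g_ipd = hedges_g(md, sd_ipd, len(arm_a), len(arm_b))
print(round(sd_ipd, 4), round(g_ipd, 4))
```

In an unadjusted two-sample comparison the back-calculation recovers the pooled SD exactly. The standard error reported by an MMRM generally reflects the modelled covariance structure and covariates, so the back-calculated value need not match the SD of the observed changes, which is consistent with the bias reported above.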

**Robust Covariance Estimation in Multivariate Meta-Regression**

Thilo Welz *TU Dortmund University, Germany*

Univariate Meta-Regression (MR) is an important technique for medical and psychological research and has been studied extensively. Its multivariate counterpart, however, remains less explored. Multivariate MR holds the potential to incorporate the dependency structure of multiple effect measures as opposed to performing multiple univariate analyses. We explore the possibilities for robust estimation of the covariance of the coefficients in our multivariate MR model. More specifically, we extend heteroscedasticity-consistent (also called sandwich or HC-type) estimators from the univariate to the multivariate context. These, along with the Knapp-Hartung adjustment, proved useful in previous work (see Viechtbauer (2015) for an analysis of Knapp-Hartung and Welz & Pauly (2020) for HC-estimators in univariate MR). In our simulations we focus on the bivariate case, which is important for incorporating secondary outcomes as in Copas et al. (2018), but higher dimensions are also possible. The validity of the considered robust estimators is evaluated based on the type I error and power of statistical tests based on these estimators. We compare our robust estimation approach with a classical (non-robust) procedure. Finally, we highlight some of the numerical and statistical issues we encountered and provide pointers for others wishing to employ these methods in their analyses.
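For readers unfamiliar with HC-type estimators, the sandwich idea can be illustrated in the simplest univariate setting. The sketch below computes the HC3 covariance for an unweighted least-squares fit with an intercept and a single moderator. It is not the multivariate extension discussed in the talk, it ignores the inverse-variance weights normally used in meta-regression, and the function name and data are hypothetical.

```python
def hc3_simple_regression(x, y):
    """Least-squares fit of y on (1, x) with HC3 sandwich covariance."""
    n = len(x)
    sx, sxx = sum(x), sum(v * v for v in x)
    sy, sxy = sum(y), sum(u * v for u, v in zip(x, y))
    det = n * sxx - sx * sx
    # Inverse of X'X = [[n, sx], [sx, sxx]], the "bread" of the sandwich.
    a, b, d = sxx / det, -sx / det, n / det
    bread = [[a, b], [b, d]]
    intercept = a * sy + b * sxy
    slope = b * sy + d * sxy

    # "Meat": X' diag(e_i^2 / (1 - h_ii)^2) X with leverages h_ii.
    meat = [[0.0, 0.0], [0.0, 0.0]]
    for xi, yi in zip(x, y):
        leverage = a + 2 * b * xi + d * xi * xi
        weight = ((yi - intercept - slope * xi) / (1 - leverage)) ** 2
        meat[0][0] += weight
        meat[0][1] += weight * xi
        meat[1][1] += weight * xi * xi
    meat[1][0] = meat[0][1]

    half = [[sum(bread[i][k] * meat[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]
    cov = [[sum(half[i][k] * bread[k][j] for k in range(2)) for j in range(2)]
           for i in range(2)]
    return (intercept, slope), cov


# Hypothetical effect sizes y observed at moderator values x.
coefficients, covariance = hc3_simple_regression(
    [0.0, 1.0, 2.0, 3.0, 4.0], [0.1, 1.2, 1.9, 3.2, 3.9]
)
print(coefficients, covariance[1][1] ** 0.5)
```

The square roots of the diagonal entries are robust standard errors that remain valid when the residual variances differ across studies.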

**A Bayesian approach to combine rater assessments**

Lorenz Uhlmann^{1,2}, Christine Fink^{3}, Christian Stock^{2}, Marc Vandemeulebroecke^{1}, Meinhard Kieser^{2}

^{1}Novartis Pharma AG, Basel, Switzerland; ^{2}Institute of Medical Biometry and Informatics, University of Heidelberg, Heidelberg, Germany; ^{3}Department of Dermatology, University Medical Center, Ruprecht-Karls University, Heidelberg, Germany

Background: Ideally, endpoints in clinical studies are objectively measurable and easy to assess. However, sometimes this is infeasible and alternative approaches based on (more subjective) rater assessments need to be considered. A Bayesian approach to combine such rater assessments and to estimate relative treatment effects is proposed.

Methods: We focus on a setting where each subject is observed under the condition of every group and where one or multiple raters assign scores that constitute the endpoints. We further assume that the raters compare the arms in a pairwise way by simply scoring them on an individual subject level. This setting is similar in principle to network meta-analysis, where groups (or treatment arms) are ranked in a probabilistic fashion. Many ideas from this field, such as heterogeneity (within raters) or inconsistency (between raters), can be directly applied. We build on Bayesian methodology used in this field and derive models for normally distributed and ordered categorical scores which take into account an arbitrary number of raters and groups.

Results: A general framework is created which is at the same time easy to implement and allows for a straightforward interpretation of the results. The method is illustrated with a real clinical study example on a computer-aided hair detection and removal algorithm in dermatoscopy. Raters assessed the image quality of pictures generated by the algorithm compared to pictures of unshaved and shaved nevi.

Conclusion: A Bayesian approach to combine rater assessments based on an ordinal or continuous scoring system to compare groups in a pairwise fashion is proposed and illustrated using a real data example. The model allows all pairwise comparisons among multiple groups to be assessed. Since the approach is based on the well-established network meta-analysis methodology, many characteristics can be inferred from that methodology.

### Adaptive Designs III

**Adaptive group sequential survival comparisons based on log-rank and pointwise test statistics**

Jannik Feld, Andreas Faldum, Rene Schmidt *Institute of Biostatistics and Clinical Research, University of Münster*

Whereas the theory of confirmatory adaptive designs is well understood for uncensored data, implementation of adaptive designs in the context of survival trials remains challenging. Commonly used adaptive survival tests are based on the independent increments structure of the log-rank statistic. These designs suffer from the limitation that effectively only the interim log-rank statistic may be used for design modifications (such as data-dependent sample size recalculation). Alternative approaches based on the patient-wise separation principle have the common disadvantage that the test procedure may either neglect part of the observed survival data or tend to be conservative. Here, we instead propose an extension of the independent increments approach to adaptive survival tests. We present a confirmatory adaptive two-sample log-rank test of no difference in a survival analysis setting, where provision is made for interim decision making based on the interim log-rank statistic, on pointwise survival rates, or on both, while avoiding the aforementioned problems. The possibility to include pointwise survival rates eases the clinical interpretation of interim decision making and is a straightforward choice for seamless phase II/III designs. We will show by simulation studies that the performance does not suffer when the pointwise survival rates are used. As an example, we consider application of the methodology to a two-sample log-rank test with a binding futility criterion based on the observed short-term survival rates and sample size recalculation based on conditional power. The methodology is motivated by the LOGGIC Europe Trial from pediatric oncology. Distributional properties are derived using martingale techniques in the large sample limit. Small sample properties are studied by simulation.

**Single-stage, three-arm, adaptive test strategies for non-inferiority trials with an unstable reference**

Werner Brannath, Martin Scharpenberg, Sylvia Schmidt *University of Bremen, Germany*

For indications where only unstable reference treatments are available and use of placebo is ethically justified, three-arm 'gold standard' designs with an experimental, reference and placebo arm are recommended for non-inferiority trials. In such designs, the demonstration of efficacy of the reference or experimental treatment is a requirement. They have the disadvantage that only little can be concluded from the trial if the reference fails to be efficacious. To overcome this, we investigate novel single-stage, adaptive test strategies where non-inferiority is tested only if the reference shows sufficient efficacy and otherwise delta-superiority of the experimental treatment over placebo is tested. With a properly chosen superiority margin, delta-superiority indirectly shows non-inferiority. We optimize the sample size for several decision rules and find that the natural, data-driven test strategy, which tests non-inferiority if the reference’s efficacy test is significant, leads to the smallest overall and placebo sample sizes. Under specific constraints on the sample sizes, this procedure controls the family-wise error rate. All optimal sample sizes are found to meet this constraint. We finally show how to account for a relevant placebo drop-out rate in an efficient way and apply the new test strategy to a real-life data set.

**Sample size re-estimation based on the prevalence in a randomized test-treatment study**

Amra Hot, Antonia Zapf *Institute of Medical Biometry and Epidemiology, University Medical Center Hamburg-Eppendorf, Hamburg*

Patient benefit should be the primary criterion in evaluating diagnostic tests. If a new test has shown sufficient accuracy, its application in clinical practice should lead to a patient benefit. Randomized test-treatment studies are needed to assess the clinical utility of a diagnostic test as part of a broader management regimen in which test-treatment strategies are compared in terms of their impact on patient-relevant outcomes [1]. Due to their increased complexity compared to common intervention trials, the implementation of such studies poses practical challenges which might affect the validity of the study. One important aspect is the sample size determination. It is a special feature of these designs that they combine information on the disease prevalence and accuracy of the diagnostic tests, i.e. sensitivity and specificity of the investigated tests, with assumptions on the expected treatment effect. Due to the lack of empirical information or uncertainty regarding these parameters, sample size considerations will always be based on a rather weak foundation, which can lead to an over- or underpowered trial. Therefore, it is reasonable to consider adaptations in earlier phases of the trial based on a pre-specified interim analysis in order to solve this problem. A blinded sample size re-estimation based on the disease prevalence in a randomized test-treatment study was performed as part of a simulation study. The type I error, the empirical overall power as well as the bias of the estimated prevalence are assessed and presented.

References

[1] J. G. Lijmer, P.M. Bossuyt. Diagnostic testing and prognosis: the randomized controlled trial in test evaluation research. In: The evidence base of clinical diagnosis. Blackwell Oxford, 2009, 63-82.

**Performance evaluation of a new “diagnostic-efficacy-combination trial design” in the context of telemedical interventions**

Mareen Pigorsch^{1}, Martin Möckel^{2}, Jan C. Wiemer^{3}, Friedrich Köhler^{4}, Geraldine Rauch^{1}

^{1}Charité – Universitätsmedizin Berlin, Institute of Biometry and clinical Epidemiology; ^{2}Charité – Universitätsmedizin Berlin, Division of Emergency and Acute Medicine, Cardiovascular Process Research; ^{3}Clinical Diagnostics, Thermo Fisher Scientific; ^{4}Charité – Universitätsmedizin Berlin, Centre for Cardiovascular Telemedicine, Department of Cardiology and Angiology

Aims:

Telemedical interventions in heart failure patients are intended to avoid unfavourable, treatment-related events through early, individualized care that reacts to the patient's current needs. However, telemedical support is an expensive intervention and only patients with a high risk for unfavourable follow-up events will profit from telemedical care. Möckel et al. therefore adapted a “diagnostic-efficacy-combination design” which allows a biomarker to be validated and a biomarker-selected population to be investigated within the same study. For this, cut-off values for the biomarkers were determined based on the observed outcomes in the control group to define a high-risk subgroup. This defines the diagnostic design step. These cut-offs were subsequently applied to the intervention and the control group to identify the high-risk subgroup. The intervention effect is then evaluated by comparison of these subgroups. This defines the efficacy design step. So far, it has not been evaluated whether this double use of the control group for biomarker validation and efficacy comparison leads to a bias in treatment effect estimation. In this methodological research work, we therefore want to evaluate whether the “diagnostic-efficacy-combination design” leads to biased treatment effect estimates. If there is a bias, we further want to analyse its impact and the parameters influencing its size.

Methods:

We perform a systematic Monte-Carlo simulation study to investigate potential bias in various realistic trial scenarios that mimic and vary the true data of the published TIM‐HF2 Trial. In particular we vary the event rates, the sample sizes and the biomarker distributions.

Results:

The results show that the proposed design indeed leads to some bias in the effect estimators, indicating an overestimation of the effect. However, this bias is relatively small in most scenarios. The larger the sample size, the more the event rates differ between the control and the intervention group, and the better the biomarker can separate the high-risk from the low-risk patients, the smaller the resulting relative bias.

Conclusions:

The “diagnostic-efficacy-combination design” can be recommended for clinical applications. We recommend ensuring a sufficiently large sample size.

Reference:

Möckel M, Koehler K, Anker SD, Vollert J, Moeller V, Koehler M, Gehrig S, Wiemer JC, von Haehling S, Koehler F. Biomarker guidance allows a more personalized allocation of patients for remote patient management in heart failure: results from the TIM-HF2 trial. Eur J Heart Fail. 2019;21(11):1445-58.

**Valid sample size re-estimation at interim**

Nilufar Akbari *Charité – Institute of Biometry and Clinical Epidemiology, Germany*

Throughout this work, we consider the situation of a two-arm controlled clinical trial based on time-to-event data.

The aim of this thesis is to fit a meaningful survival model to the data observed at an interim analysis in a robust way, in order to carry out a valid sample size recalculation.

Adaptive designs provide an attractive possibility of changing study design parameters in an ongoing trial. There are still many open questions with respect to adaptive designs for time-to-event data. Among other things, this is because survival data, unlike continuous or binary data, involve a follow-up phase, so that the outcome is not directly observed after the patient’s treatment.

Evaluating survival data at interim analyses leads to a patient overrun since the recruitment is usually not stopped at the same time. Another problem is that the interim analysis must take place during the recruitment phase if patients are to be saved. Moreover, the timing of the interim analysis is crucial so that decisions can be built upon a reasonable level of information.

A general issue with time-to-event data is that at an interim analysis one can only update the required number of events. However, what is normally needed is the sample size required to achieve that number of events. For this, the underlying event-time distribution is needed, which may possibly be estimated from the interim data.
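The distinction between the required number of events and the required number of patients can be made concrete with a small sketch. It is not the method of this work: it assumes Schoenfeld's approximation for a two-sided log-rank test with 1:1 allocation, exponential event times in both arms, and the same fixed follow-up for every patient with no accrual period or drop-out. The function names and numerical inputs are hypothetical.

```python
import math
from statistics import NormalDist


def required_events(hazard_ratio, alpha=0.05, power=0.8):
    """Schoenfeld approximation for the number of events (1:1 allocation)."""
    z = NormalDist().inv_cdf
    return math.ceil(
        4 * (z(1 - alpha / 2) + z(power)) ** 2 / math.log(hazard_ratio) ** 2
    )


def required_patients(events, hazard_control, hazard_treatment, follow_up):
    """Patients needed so that the expected number of events equals `events`.

    Assumes exponential event times and identical follow-up for all patients.
    """
    p_event_control = 1 - math.exp(-hazard_control * follow_up)
    p_event_treatment = 1 - math.exp(-hazard_treatment * follow_up)
    p_event = 0.5 * (p_event_control + p_event_treatment)
    return math.ceil(events / p_event)


# Hypothetical planning values: hazard ratio 0.7, control hazard 0.2 per year,
# three years of follow-up per patient.
events_needed = required_events(0.7)
patients_needed = required_patients(events_needed, 0.2, 0.14, 3.0)
print(events_needed, patients_needed)
```

The first function depends only on the assumed effect, whereas the second depends on the hazards in each group. This is why an estimate of the underlying event-time distribution is needed before the sample size can be updated at interim.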

This, however, is a difficult task for the following reasons: the number of observed events at interim is limited, and the survival curve at interim is truncated by the interim time point.

The goal of this research work is to fit a reasonable survival model to the observed data in a robust way. The fitted curve has the following advantages: The underlying hazards per group can be estimated which allows updating the required number of patients for achieving the respective number of events. Finally, the impact of overrun can be directly assessed and quantified.

The following problems were additionally evaluated in detail. How much do the hazards deviate if the wrong event-time distribution was estimated? At which point in time is a sample size re-estimation useful, or rather how many events are required, for a valid sample size re-estimation at interim?

### Panel Discussion: Do we still need hazard ratios?

**Panel**

Jan Beyersmann (*Ulm University*), Oliver Kuß (*Düsseldorf*), Andreas Wienke (*Halle*)

**Do we still need hazard ratios?** **(I)**

Oliver Kuß*German Diabetes Center, Leibniz Institute for Diabetes Research at Heinrich Heine University Düsseldorf, Institute for Biometrics and Epidemiology*

It is one of the phenomena in biostatistics that regression models for continuous, binary, nominal, or ordinal outcomes almost completely rely on parametric modelling, whereas survival or time-to-event outcomes are mainly analyzed by the Proportional Hazards (PH) model of Cox, which is an essentially non-parametric method. The Cox model has become one of the most used statistical models in applied research and the original article from 1972 ranks among the top 100 papers (in terms of citation frequency) across all areas of science.

However, the Cox model and the hazard ratio (HR) have also been criticized recently. For example, researchers have been warned against using the magnitude of the HR to describe the magnitude of the relative risk, because the hazard ratio is a ratio of rates, and not one of risks. Hazard ratios, even in randomized trials, have a built-in “selection bias”, because they are conditional measures, conditioning at each time point on the set of observations still at risk. Finally, the hazard ratio has been criticized for being non-collapsible. That is, adjusting for a covariate that is associated with the event will in general change the HR, even if this covariate is not associated with the exposure, that is, is not a confounder.

In view of these disadvantages it is surprising that parametric survival models are not preferred over the Cox model. These existed long before the Cox model, are easier to comprehend, estimate, and communicate, and, above all, do not have any of the disadvantages mentioned.
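The distinction between a ratio of rates and a ratio of risks can be made concrete with a small calculation. The sketch below assumes exponential event times with a constant hazard ratio, which is an illustrative choice and not taken from the talk; the function name and numbers are hypothetical.

```python
import math


def risk_ratio(hazard_control, hazard_ratio, t):
    """Ratio of cumulative risks by time t under exponential event times."""
    risk_control = 1 - math.exp(-hazard_control * t)
    risk_treated = 1 - math.exp(-hazard_control * hazard_ratio * t)
    return risk_treated / risk_control


# The hazard ratio is fixed at 2, yet the risk ratio depends on the time horizon
for t in (0.1, 1.0, 5.0):
    print(t, round(risk_ratio(0.5, 2.0, t), 3))
```

Under these assumptions the risk ratio is close to the hazard ratio only for short follow-up and moves towards 1 as follow-up lengthens, which is why the HR should not be read as a relative risk.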

**Do we still need hazard ratios? (II)**

Jan Beyersmann*Ulm University, Germany*

The answer to the question whether we need hazard ratios depends a good deal on the answer to the question what we need hazards for. Censoring plays a key role. Censoring makes survival and event history analysis special. One important consequence is that less customized statistical techniques will be biased when applied to censored data. Another important consequence is that hazards remain identifiable under rather general censoring mechanisms. In this talk, I will demonstrate that there is a Babylonian confusion on “independent censoring” in the textbook literature, which is a worry in its own right. Event-driven trials in pharmaceutical research or competing risks are two examples where the textbook literature often goes haywire, censoring-wise. It is a small step from this mess to misinterpretations of hazards, a challenge not diminished when the aim is a causal interpretation. Causal reasoning, however, appears to be spearheading the current attack on hazards and their ratios.

In philosophy, causality has pretty much been destroyed by David Hume. This does not imply that statisticians should avoid causal reasoning, but it might suggest some modesty. In fact, statistical causality is mostly about interventions, and a causal survival analysis often aims at statements about the intervention “do(no censoring)”, which, however, is not what identifiability of hazards is about. The current debate about estimands (in time-to-event trials) is an example where these issues are hopelessly mixed up.

The aim of this talk is to mix it up a bit further or, perhaps, even shed some light. Time permitting, I will illustrate matters using g-computation in the form of a causal variant of the Aalen-Johansen-estimator.

### Open Topics

**Using Historical Data to Predict Health Outcomes – The Prediction Design**

Stella Erdmann, Manuel Feißt, Johannes Krisam, Meinhard Kieser*Institute of Medical Biometry and Informatics, University of Heidelberg, Germany*

The gold standard for the investigation of the efficacy of a new therapy is a randomized controlled trial (RCT). This is costly, time consuming and not always practicable (e.g. for lethal conditions with limited treatment possibilities) or even possible in a reasonable time frame (e.g. in rare diseases due to small sample sizes). At the same time, huge quantities of available control-condition data in analyzable format from former RCTs or real-world data (RWD), i.e., patient‐level data gathered outside the conventional clinical trial setting, are often neglected, if not completely ignored. To overcome these shortcomings, alternative study designs using data more efficiently would be desirable.

Assuming that the standard therapy and its mode of functioning is well known and large volumes of patient data exist, it is possible to set up a sound prediction model to determine the treatment effect of this standard therapy for future patients. When a new therapy is intended to be tested against the standard therapy, the vision would be to conduct a single-arm trial and to use the prediction model to determine the effect of the standard therapy on the outcome of interest of patients receiving the test treatment only, instead of setting up a two-arm trial for this comparison. While the advantages of using historical data to estimate the counterfactual are obvious (increased efficiency, lower cost, alleviating participants’ fear of being on placebo), bias could be caused by confounding (e.g. by indication, severity, or prognosis) or a number of other data issues that could jeopardize the validity of the non-randomized comparison.

The aim is to investigate if and how such a design – the prediction design – may be used to provide information on treatment effects by leveraging existing infrastructure and data sources (historical data of RCTs and/or RWD). Therefore, we investigate under what assumptions a linear prediction model could be used to predict the counterfactual of patients precisely enough to construct a test for evaluating the treatment effect for normally distributed endpoints. In particular, it is investigated what amount of data is necessary (for the historical data and for the single-arm trial to be conducted). Via simulation studies, it is examined how sensitive the design is to violations of the assumptions. The results are compared to reasonable (conventional) benchmark scenarios, e.g., the setting of a single-arm study with a pre-defined threshold or a setting where propensity score matching was performed.
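The basic idea of the prediction design can be sketched with simulated data. This is not the authors' test: it uses a single covariate, simulated historical and trial data, and a naive z-test that ignores the uncertainty of the fitted prediction model. All names and values are hypothetical.

```python
import math
import random
from statistics import NormalDist, mean, stdev


def fit_line(x, y):
    """Least squares fit of y = intercept + slope * x."""
    mx, my = mean(x), mean(y)
    slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
             / sum((a - mx) ** 2 for a in x))
    return my - slope * mx, slope


def prediction_design_test(hist_x, hist_y, trial_x, trial_y):
    """Compare observed single-arm outcomes with predicted control outcomes."""
    intercept, slope = fit_line(hist_x, hist_y)
    # Observed outcome under the new therapy minus predicted counterfactual
    diffs = [y - (intercept + slope * x) for x, y in zip(trial_x, trial_y)]
    z = mean(diffs) / (stdev(diffs) / math.sqrt(len(diffs)))
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return mean(diffs), p_value


rng = random.Random(1)
# Simulated historical controls: outcome depends linearly on one covariate
hist_x = [rng.gauss(0, 1) for _ in range(2000)]
hist_y = [1.0 + 0.5 * x + rng.gauss(0, 1) for x in hist_x]
# Simulated single-arm trial with an assumed true treatment effect of 0.8
trial_x = [rng.gauss(0, 1) for _ in range(100)]
trial_y = [1.0 + 0.5 * x + 0.8 + rng.gauss(0, 1) for x in trial_x]

effect, p_value = prediction_design_test(hist_x, hist_y, trial_x, trial_y)
print(round(effect, 2), p_value)
```

A test suitable for practice would also have to propagate the estimation error of the prediction model and address confounding, which are the questions the abstract raises.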

**Arguments for exhuming nonnegative garrote out of grave**

Edwin Kipruto, Willi Sauerbrei*Medical Center-University of Freiburg, Germany*

Background: The original nonnegative garrote (Breiman 1995) seems to have been forgotten despite some of its good conceptual properties. Its unpopularity is probably caused by its dependence on least squares estimates, which have no unique solution in high-dimensional data and perform very poorly under a high degree of multicollinearity. However, Yuan and Lin (2007) showed that the nonnegative garrote is a flexible approach that can be combined with initial estimators other than least squares, such as ridge, so that the aforementioned challenges can be circumvented; despite this proposal, it is hardly used in practice. Considerable attention has been given to prediction models compared to descriptive models, where the aim is to summarize the data structure in a compact manner (Shmueli, 2010). Here our main interest is in descriptive modeling, and as a byproduct we will present prediction results.

Objectives: To evaluate the performance of the nonnegative garrote and compare results with some popular approaches using three different real datasets with low to high degree of multicollinearity and in high-dimensional data.

Methods: We evaluated four penalized regression methods (Nonnegative garrote, lasso, adaptive lasso, relaxed lasso) and two classical variable selection methods (best subset, backward elimination) with and without post-estimation shrinkage.

Results: The nonnegative garrote can be used with other initial estimators besides least squares in highly correlated data and in high-dimensional datasets. Negligible differences in predictions were observed between methods, while considerable differences were observed in the number of variables selected.

Conclusion: To fit the nonnegative garrote in highly correlated data and in high-dimensional settings, the proposed initial estimates can be used as an alternative to least squares estimates.
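A nonnegative garrote with ridge initial estimates can be sketched in a few lines. The code below is a minimal illustration and not the authors' implementation: it assumes centered data without an intercept, solves both problems by simple coordinate descent, and uses simulated data with arbitrary tuning values (`tau`, `lam`). All names are hypothetical.

```python
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def coordinate_descent(columns, y, update, n_iter=100):
    """Cycle over coefficients, updating each one against its partial residual."""
    coef = [0.0] * len(columns)
    resid = list(y)
    for _ in range(n_iter):
        for j, col in enumerate(columns):
            norm = dot(col, col)
            # Inner product of column j with the residual that excludes its own fit
            rho = dot(col, resid) + coef[j] * norm
            new = update(rho, norm)
            delta = new - coef[j]
            if delta:
                resid = [r - delta * c for r, c in zip(resid, col)]
                coef[j] = new
    return coef


def ridge(columns, y, tau):
    """Ridge estimates: minimise 0.5 * RSS + 0.5 * tau * sum(beta_j ** 2)."""
    return coordinate_descent(columns, y, lambda rho, norm: rho / (norm + tau))


def nonnegative_garrote(columns, y, initial, lam):
    """Shrink initial estimates by factors c_j >= 0: 0.5 * RSS + lam * sum(c_j)."""
    scaled = [[b * v for v in col] for col, b in zip(columns, initial)]
    shrink = coordinate_descent(
        scaled, y,
        lambda rho, norm: max(0.0, rho - lam) / norm if norm > 0 else 0.0)
    return [c * b for c, b in zip(shrink, initial)]


# Simulated data: x2 is highly correlated with x1, x4 is pure noise
rng = random.Random(0)
n = 200
x1 = [rng.gauss(0, 1) for _ in range(n)]
x2 = [0.9 * a + 0.19 ** 0.5 * rng.gauss(0, 1) for a in x1]
x3 = [rng.gauss(0, 1) for _ in range(n)]
x4 = [rng.gauss(0, 1) for _ in range(n)]
columns = [x1, x2, x3, x4]
y = [2.0 * a - 1.5 * c + rng.gauss(0, 1) for a, c in zip(x1, x3)]

initial = ridge(columns, y, tau=1.0)
coef = nonnegative_garrote(columns, y, initial, lam=10.0)
print([round(b, 2) for b in initial])
print([round(b, 2) for b in coef])
```

Because each shrinkage factor is nonnegative, a garrote coefficient keeps the sign of its initial estimate or is set to zero, so the quality of the initial estimator matters for the final model.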

**On the assessment of methods to identify influential points in high-dimensional data**

Shuo Wang, Edwin Kipruto, Willi Sauerbrei

Medical Center – University of Freiburg, Germany

Extreme values and influential points in predictors often strongly affect the results of statistical analyses in low and high-dimensional settings. Many methods to detect such values have been proposed but there is no consensus on advantages and disadvantages as well as guidance for practice. We will present various classes of methods and illustrate their use in several high-dimensional datasets. First, we consider a simple pre-transformation which is combined with feature ranking lists to identify influential points, concentrating on univariable situations (Boulesteix and Sauerbrei, 2011, DOI: 10.1002/bimj.201000189). The procedure will be extended by checking for influential points in bivariate models and by adding some steps to the multivariable approach.

Second, to increase stability of feature ranking lists, we will use various aggregation approaches to explore extreme values in features and influential observations. The former changes the rank of a specific feature, while the latter causes a universal ranking change. For the detection of extreme values, we employ the simple pre-transformation on the data and detect the features whose ranks changed significantly after the transformation. For the detection of influential observations, we consider a combination of leave-one-out and rank comparison to detect the observations causing large rank changes. These methods are applied to several publicly available datasets.
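The combination of leave-one-out and rank comparison can be sketched as follows. The ranking criterion (absolute Pearson correlation with the outcome), the summary of rank changes (sum of absolute differences), and the toy data are assumptions made for illustration; the abstract does not specify them.

```python
from statistics import mean


def abs_correlation(x, y):
    """Absolute Pearson correlation, returning 0.0 for constant input."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return abs(sxy) / (sxx * syy) ** 0.5 if sxx > 0 and syy > 0 else 0.0


def feature_ranks(features, y):
    """Rank features by association with y (rank 1 = strongest)."""
    scores = [abs_correlation(f, y) for f in features]
    order = sorted(range(len(features)), key=lambda j: -scores[j])
    ranks = [0] * len(features)
    for position, j in enumerate(order, start=1):
        ranks[j] = position
    return ranks


def loo_rank_change(features, y):
    """Total absolute rank change when each observation is left out in turn."""
    full = feature_ranks(features, y)
    changes = []
    for i in range(len(y)):
        reduced = [f[:i] + f[i + 1:] for f in features]
        ranks = feature_ranks(reduced, y[:i] + y[i + 1:])
        changes.append(sum(abs(a - b) for a, b in zip(full, ranks)))
    return changes


# Toy data: the last observation has an extreme value in the second feature
y = [1, 2, 3, 4, 5, 6]
features = [
    [1, 2, 3, 4, 5, 7],
    [2, 1, 2, 1, 2, 9],
    [3, 1, 4, 2, 5, 3],
]
print(loo_rank_change(features, y))
```

In this toy example only the removal of the last observation alters the ranking, because the apparent association of the second feature with the outcome rests on that single point.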

**Acceleration of diagnostic research: Is there a potential for seamless designs?**

Werner Vach^{1}, Eric Bibiza-Freiwald^{2}, Oke Gerke^{3}, Tim Friede^{4}, Patrick Bossuyt^{5}, Antonia Zapf^{2}^{1}Basel Academy for Quality and Research in Medicine, Switzerland; ^{2}Institute of Medical Biometry and Epidemiology, University Medical Center Hamburg-Eppendorf; ^{3}Department of Nuclear Medicine, Odense University Hospital; ^{4}Department of Medical Statistics, University Medical Center Goettingen; ^{5}Department of Clinical Epidemiology, Biostatistics and Bioinformatics, Amsterdam University Medical Centers

Background: New diagnostic tests to identify a well-established disease state have to undergo a series of scientific studies from test construction until finally demonstrating a societal impact. Traditionally, these studies are performed with substantial time gaps in between. Seamless designs allow us to combine a sequence of studies in one protocol and may hence accelerate this process.

Aim: A systematic investigation of the potential of seamless designs in diagnostic research.

Methods: We summarized the major study types in diagnostic research and identified their basic characteristics with respect to applying seamless designs. This information was used to identify major hurdles and opportunities for seamless designs.

Results: 11 major study types were identified. The following basic characteristics were identified: type of recruitment (case-control vs population-based), application of a reference standard, inclusion of a comparator, paired or unpaired application of a comparator, assessment of patient relevant outcomes, possibility for blinding of test results.

Two basic hurdles could be identified: 1) Accuracy studies are hard to combine with post-accuracy studies, as the first are required to justify the latter and as application of a reference test in outcome studies is a threat to the study’s integrity. 2) Questions, which can be clarified by other study designs, should be clarified before performing a randomized diagnostic study.

However, there is a substantial potential for seamless designs since all steps from the construction until the comparison with the current standard can be combined in one protocol. This may include a switch from case-control to population-based recruitment as well as a switch from a single arm study to a comparative accuracy study. In addition, change in management studies can be combined with an outcome study in discordant pairs. Examples from the literature illustrate the feasibility of both approaches.

Conclusions: There is a potential for seamless designs in diagnostic research.

Reference: Vach W, Bibiza E, Gerke O, Bossuyt PM, Friede T, Zapf A (2021). A potential for seamless designs in diagnostic research could be identified. J Clin Epidemiol. 129:51-59. doi: 10.1016/j.jclinepi.2020.09.019.

**The augmented binary method for composite endpoints based on forced vital capacity (FVC) in systemic sclerosis-associated interstitial lung disease**

Carolyn Cook^{1}, Michael Kreuter^{2}, Susanne Stowasser^{3}, Christian Stock^{4}^{1}mainanalytics GmbH, Sulzbach, Germany; ^{2}Center for Interstitial and Rare Lung Diseases, Pneumology and Respiratory Care Medicine, Thoraxklinik, University of Heidelberg, Heidelberg, Germany and German Center for Lung Research, Heidelberg, Germany; ^{3}Boehringer Ingelheim International GmbH, Ingelheim am Rhein, Germany; ^{4}Boehringer Ingelheim Pharma GmbH & Co. KG, Ingelheim am Rhein, Germany

Background

The augmented binary method (Wason & Seaman. Stat Med, 2013; 32(26)) is a novel method for precisely estimating response rates and differences among response rates defined based on a composite endpoint that contains a dichotomized continuous variable and additional inherently binary components. The method is an alternative to traditional approaches such as logistic regression techniques. Due to the complexity and computational demands of the method, experience in clinical studies has been limited thus far and is mainly restricted to oncological studies. Operating characteristics and, thus, potential statistical benefits are unclear for other settings.

Objective

We aimed to perform a Monte Carlo simulation study to assess operating characteristics of the augmented binary method in settings relevant to randomized controlled trials and non-interventional studies in systemic sclerosis-associated interstitial lung disease (SSc-ILD), a rare, chronic autoimmune disease, where composite endpoints of the above described type are frequently applied.

Methods

An extensive simulation study was performed assessing type I error, power, coverage, and bias of the augmented binary method and a standard logistic model for the composite endpoint. Parameters were varied to resemble lung function decline (as measured through the forced vital capacity, FVC), hospitalization events and mortality in patients with SSc-ILD over a 1- and 2-year period. A relative treatment effect of 50% on FVC was assumed, while assumed effects on hospitalizations and mortality were derived from joint modeling analyses of existing trial data (as indirect effects of the treatment on FVC). Further, the methods were exemplarily applied to data from the SENSCIS trial, a phase III randomized, double-blind, placebo-controlled trial to investigate the efficacy and safety of nintedanib in patients with SSc-ILD.

Results

The simulation study is currently in progress and results will be available by the end of January. In preliminary results, modest gains in power and precision were observed, with acceptable compromises of type I error, if any. In scenarios with lower statistical power, these results were more likely to make a difference to inferences concerning the treatment effect. In the exemplary application of the augmented binary method to trial data, confidence intervals were narrower and p-values smaller on selected endpoints involving FVC decline, hospitalization and mortality.

Conclusion

Based on preliminary results from a simulation study, we identified areas where the augmented binary method conveys an appreciable advantage compared to standard methods.

### Evidence Based Medicine and Meta-Analysis I

**Network meta-analysis for components of complex interventions**

Nicky Welton*University of Bristol, UK*

Meta-analysis is used to combine results from studies identified in a systematic review comparing specific interventions for a given patient population. However, the validity of the pooled estimate from a meta-analysis relies on the study results being similar enough to pool (homogeneity). Heterogeneity in study results can arise for various reasons, including differences in intervention definitions between studies. Network meta-analysis (NMA) is an extension of meta-analysis that can combine results from studies to estimate relative effects between multiple (2 or more) interventions, where each study compares some (2 or more) of the interventions of interest. NMA can reduce heterogeneity by treating each intervention definition as a distinct intervention. However, if there are many distinct interventions then evidence networks may be sparse or disconnected so that relative effect estimates are imprecise or not possible to estimate at all. Interventions can sometimes be considered to be made up of component parts, such as some complex interventions or combination therapies.

Component network meta-analysis has been proposed for the synthesis of complex interventions that can be considered a sum of component parts. Component NMA is a form of network meta-regression that estimates the effect of the presence of particular components of an intervention. We discuss methods for categorisation of intervention components, before going on to introduce statistical models for the analysis of the relative efficacy of specific components or combinations of components. The methods respect the randomisation in the included trials and allow the analyst to explore whether the component effects are additive, or if there are interactions between them. The full interaction model corresponds to a standard NMA model.

We illustrate the methods with a range of examples including CBT for depression, electronic interventions for smoking cessation, school-based interventions for anxiety and depression, and psychological interventions for patients with coronary heart disease. We discuss the benefits of component NMA for increasing precision and connecting networks of evidence, the data requirements to fit the models, and make recommendations for the design and reporting of future randomised controlled trials of complex interventions that comprise component parts.
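The additive component model can be illustrated with a small calculation: under additivity, the relative effect between two interventions is the difference of the sums of their component effects, and interaction terms relax this assumption. The sketch below only evaluates such a model for given parameter values; it does not estimate them, and the component labels and effect sizes are hypothetical.

```python
def relative_effect(components_a, components_b, component_effects, interactions=None):
    """Relative effect of intervention A versus B from component effects.

    component_effects maps each component to its effect (e.g. a log odds ratio);
    interactions maps a frozenset of components to an extra effect that applies
    only when all of those components are present.
    """
    interactions = interactions or {}

    def total(components):
        present = set(components)
        effect = sum(component_effects[c] for c in present)
        effect += sum(v for combo, v in interactions.items() if combo <= present)
        return effect

    return total(components_a) - total(components_b)


effects = {"A": -0.5, "B": -0.25}            # hypothetical component effects
print(relative_effect({"A", "B"}, {"A"}, effects))
# With a hypothetical antagonistic interaction between A and B
print(relative_effect({"A", "B"}, {"A"}, effects, {frozenset({"A", "B"}): 0.125}))
```

Adding an interaction term for every combination observed in the network gives each distinct intervention its own parameter, which is why the full interaction model coincides with a standard NMA.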

**Model selection for component network meta-analysis in disconnected networks: a case study**

Maria Petropoulou, Guido Schwarzer, Gerta Rücker*Institute of Medical Biometry and Statistics, Faculty of Medicine and Medical Center – University of Freiburg, Germany*

Standard network meta-analysis (NMA) synthesizes direct and indirect evidence of randomized controlled trials (RCTs), estimating the effects of several competing interventions. Many healthcare interventions are complex, consisting of multiple, possibly interacting, components. In such cases, more general models, the component network meta-analysis (CNMA) models, allow estimating the effects of components of interventions.

Standard network meta-analysis requires a connected network. However, sometimes a disconnected network (two or more subnetworks) can occur when synthesizing evidence from RCTs. Bridging the gap between subnetworks is a challenging issue. CNMA models allow one to “reconnect” a network with multi-component interventions if there are common components in subnetworks. Forward model selection for CNMA models, which has recently been developed, starts with a sparse CNMA model and, by adding interaction terms, ends up with a rich CNMA model. By model selection, the best CNMA model is chosen based on a trade-off between goodness of fit (minimizing Cochran’s Q statistic) and connectivity.

Our aim is to check whether CNMA models for disconnected networks can validly re-estimate the results of a standard NMA for a connected network (benchmark). We applied the methods to a case study comparing 27 interventions for any adverse event of postoperative nausea and vomiting. Starting with the connected network, we artificially constructed disconnected networks in a systematic way without dropping interventions, such that the network keeps its size. We ended up with nine disconnected networks differing in network geometry, the number of included studies, and pairwise comparisons. The forward strategy for selecting appropriate CNMA models was implemented and the best CNMA model was identified for each disconnected network.

We compared the results of the best CNMA model for each disconnected network to the corresponding results for the connected network with respect to bias and standard error. We found that the results of the best CNMA models from each disconnected network are comparable with the benchmark. Based on our findings, we conclude that CNMA models, which are entirely based on RCT evidence, are a promising tool to deal with disconnected networks if some treatments have common components in different subnetworks. Additional analyses of simulated data under several scenarios are planned to assess how far the results generalize.

**Uncertainty in treatment hierarchy in network meta-analysis: making ranking relevant**

Theodoros Papakonstantinou^{1,2}, Georgia Salanti^{1}, Dimitris Mavridis^{3,4}, Gerta Rücker^{2}, Guido Schwarzer^{2}, Adriani Nikolakopoulou^{1,2}^{1}Institute of Social and Preventive Medicine, University of Bern, Switzerland; ^{2}Institute of Medical Biometry and Statistics, University of Freiburg, Germany; ^{3}Department of Primary Education, University of Ioannina, Ioannina, Greece; ^{4}Faculty of Medicine, Paris Descartes University, Paris, France

Network meta-analysis estimates all relative effects between competing treatments and can produce a treatment hierarchy from the least to the most desirable option. While about half of the published network meta-analyses report a ranking metric for the primary outcome, methodologists debate several issues underpinning the derivation of a treatment hierarchy. Criticisms include that ranking metrics are not accompanied by a measure of uncertainty or do not answer a clinically relevant question.

We will present a series of research questions related to network meta-analysis. For each of them, we will derive hierarchies that satisfy the set of constraints that constitute the research question and define the uncertainty of these hierarchies. We have developed an R package to calculate the treatment hierarchies.

Assuming a network of T treatments, we start by deriving the most probable hierarchies along with their probabilities. We derive the probabilities of each possible treatment hierarchy (T! permutations in total) by sampling from a multivariate normal distribution with relative treatment effects as means and corresponding variance-covariance matrix. Having the frequencies for each treatment hierarchy to arise, we define complex clinical questions: probability that (1) a specific hierarchy occurs, (2) a given order is retained in the network (e.g. A is better than B and B is better than C), (3) a specific triplet or quadruple of interventions is the most efficacious, (4) a treatment is at a specific hierarchy position and (5) a treatment is in a specific or higher position in the hierarchy. These criteria can also be combined so that any number of them simultaneously holds, either of them holds or exactly one of them holds. For each defined question, we derive the hierarchies that satisfy the set criteria along with their probability. The sum of probabilities of all hierarchies that fulfill the criterion gives the probability of the criterion to hold. We extend the procedure to compare relative treatment effects against a clinically important value instead of the null effect.

We exemplify the method and its implementation using a network of four treatments for chronic obstructive pulmonary disease where the outcome of interest is mortality and is measured using the odds ratio. The most probable hierarchy has a probability of 28%.

The developed methods extend the decision-making arsenal of evidence-based health care with tools that support clinicians, policy makers and patients to make better decisions about the best treatments for a given condition.
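The sampling step described above can be sketched in plain Python. This is not the authors' R package: the treatment names, effect estimates and covariance matrix are invented, the first treatment is taken as the reference with effect fixed at 0, and smaller values are assumed to be better (as for log odds ratios of mortality).

```python
import math
import random
from collections import Counter


def cholesky(matrix):
    """Lower-triangular Cholesky factor of a positive definite matrix."""
    n = len(matrix)
    lower = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(lower[i][k] * lower[j][k] for k in range(j))
            if i == j:
                lower[i][j] = math.sqrt(matrix[i][i] - s)
            else:
                lower[i][j] = (matrix[i][j] - s) / lower[j][j]
    return lower


def hierarchy_frequencies(names, means, cov, n_draws=10000, seed=0):
    """Relative frequency of each treatment hierarchy (best first).

    names[0] is the reference treatment; means and cov describe the effects
    of names[1:] relative to it.
    """
    rng = random.Random(seed)
    lower = cholesky(cov)
    counts = Counter()
    for _ in range(n_draws):
        z = [rng.gauss(0, 1) for _ in means]
        draw = [0.0] + [m + sum(lower[i][k] * z[k] for k in range(i + 1))
                        for i, m in enumerate(means)]
        counts[tuple(name for _, name in sorted(zip(draw, names)))] += 1
    return {h: c / n_draws for h, c in counts.items()}


def probability(frequencies, criterion):
    """Sum the frequencies of all hierarchies that satisfy a criterion."""
    return sum(p for h, p in frequencies.items() if criterion(h))


freq = hierarchy_frequencies(
    ["A", "B", "C"], means=[-0.5, -1.0], cov=[[0.04, 0.02], [0.02, 0.04]])
best = max(freq, key=freq.get)
print(best, freq[best])
print(probability(freq, lambda h: h.index("B") < h.index("A")))  # B better than A
```

Any of the clinical questions listed above can be expressed as a `criterion` function on a hierarchy, and criteria can be combined with ordinary Boolean operators.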

### Multiple Testing

**Analysis and sample size calculation for a conditional survival model with a binary surrogate endpoint**

Samuel Kilian, Johannes Krisam, Meinhard Kieser*Institute of Medical Biometry and Informatics; University Heidelberg; Heidelberg, Germany*

The primary endpoint in oncology is usually overall survival, where differences between therapies may only be observable after many years. To avoid withholding of a promising therapy, preliminary approval based on a surrogate endpoint is possible in certain situations (Wallach et al., 2018). The approval has to be confirmed later when overall survival can be assessed. When this is done within the same study, the correlation between surrogate endpoint and overall survival has to be taken into account for sample size calculation and analysis. This relation can be modeled by means of a conditional survival model which was proposed by Xia et al. (2014). They investigated the correlation and assessed power of the logrank test but did not develop methods for statistical testing, parameter estimation, and sample size calculation.

In this talk, a new statistical testing procedure based on the conditional model and Maximum Likelihood (ML) estimators for its parameters will be presented. An asymptotic test for survival difference will be given and an approximate sample size formula will be derived. Furthermore, an exact test for survival difference and an algorithm for exact sample size determination will be provided. Type I error rate, power, and required sample size for both newly developed tests will be determined exactly. Sample sizes will be compared to those required for the logrank test.

It will be shown that for small sample sizes the asymptotic parametric test and the logrank test exceed the nominal significance level under the conditional model. For a given sample size, the power of the asymptotic and the exact parametric test is similar, whereas the power of the logrank test is considerably lower in many situations. The other way round, the sample size needed to attain a prespecified power is comparable for the asymptotic and the exact parametric test, but considerably higher for the logrank test in many situations.

We conclude that the presented exact test performs very well under the assumptions of the conditional model and is a better choice than the asymptotic parametric test or the logrank test, respectively. Furthermore, the talk will give some insights into performing exact calculations for parametric survival time models. This provides a fast and powerful method to evaluate parametric tests for survival difference, thus facilitating the planning, conduct, and analysis of oncology trials with the option of accelerated approval.

**The max-t Test in High-Dimensional Repeated Measures and Multivariate Designs**

Frank Konietschke*Charite Berlin, Germany*

Repeated measures (and multivariate) designs occur in a variety of different research areas. Hereby, the designs might be high-dimensional, i.e. more (possibly) dependent than independent replications of the trial are observed. In recent years, several global testing procedures (studentized quadratic forms) have been proposed for the analysis of such data. Testing global null hypotheses, however, usually does not answer the main question of practitioners, which is the specific localization of significant time points or group*time interactions. The use of max-t tests, on the contrary, can provide this important information. In this talk, we discuss their applicability in such designs. In particular, we approximate the distribution of the max-t test statistic using innovative resampling strategies. Extensive simulation studies show that the test is particularly suitable for the analysis of data sets with small sample sizes. A real data set illustrates the application of the method.

**Graphical approaches for the control of generalized error rates**

Frank Bretz^{1}, David Robertson^{2}, James Wason^{3}^{1}Novartis, Switzerland; ^{2}University of Cambridge, UK; ^{3}Newcastle University, UK

When simultaneously testing multiple hypotheses, the usual approach in the context of confirmatory clinical trials is to control the familywise error rate (FWER), which bounds the probability of making at least one false rejection. In many trial settings, these hypotheses will additionally have a hierarchical structure that reflects the relative importance and links between different clinical objectives. The graphical approach of Bretz et al (2009) is a flexible and easily communicable way of controlling the FWER while respecting complex trial objectives and multiple structured hypotheses. However, the FWER can be a very stringent criterion that leads to procedures with low power, and may not be appropriate in exploratory trial settings. This motivates controlling generalized error rates, particularly when the number of hypotheses tested is no longer small. We consider the generalized familywise error rate (k-FWER), which is the probability of making k or more false rejections, as well as the tail probability of the false discovery proportion (FDP), which is the probability that the proportion of false rejections is greater than some threshold. We also consider asymptotic control of the false discovery rate, which is the expectation of the FDP. In this presentation, we show how to control these generalized error rates when using the graphical approach and its extensions. We demonstrate the utility of the resulting graphical procedures on clinical trial case studies.
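For orientation, the FWER-controlling graphical approach that the talk builds on can be sketched as a sequentially rejective weighted Bonferroni procedure with the standard weight and transition updates. The sketch covers only this baseline and not the generalized error rates (k-FWER, FDP) discussed in the talk; the function name and the example graph are hypothetical.

```python
def graphical_procedure(p_values, weights, transitions, alpha=0.025):
    """Sequentially rejective graphical procedure controlling the FWER.

    weights: initial local weights (summing to at most 1)
    transitions: matrix g with g[j][l] = fraction of H_j's weight passed to H_l
    Returns the indices of rejected hypotheses in order of rejection.
    """
    m = len(p_values)
    w = list(weights)
    g = [row[:] for row in transitions]
    active = set(range(m))
    rejected = []
    while True:
        candidates = [j for j in sorted(active) if p_values[j] <= w[j] * alpha]
        if not candidates:
            return rejected
        j = candidates[0]
        active.remove(j)
        rejected.append(j)
        new_g = [[0.0] * m for _ in range(m)]
        for l in active:
            # Pass the weight of the rejected hypothesis along its outgoing edges
            w[l] += w[j] * g[j][l]
            for k in active:
                denom = 1.0 - g[l][j] * g[j][l]
                if k != l and denom > 0:
                    # Reroute edges that previously went through H_j
                    new_g[l][k] = (g[l][k] + g[l][j] * g[j][k]) / denom
        w[j] = 0.0
        g = new_g


# Two hypotheses with equal weights that pass everything to each other (Holm)
print(graphical_procedure([0.01, 0.04], [0.5, 0.5], [[0, 1], [1, 0]], alpha=0.05))
```

Other choices of weights and transition matrix express other strategies within the same function; for example, weights `[1, 0]` with transitions `[[0, 1], [0, 0]]` give a fixed-sequence test.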

**Statistical Inference for Diagnostic Test Accuracy Studies with Multiple Comparisons**

Max Westphal^{1}, Antonia Zapf^{2}^{1}Fraunhofer Institute for Digital Medicine MEVIS, Bremen, Germany; ^{2}Institute of Medical Biometry and Epidemiology, UKE Hamburg, Hamburg, Germany

Diagnostic accuracy studies are usually designed to assess the sensitivity and specificity of an index test in relation to a reference standard or established comparative test. This so-called co-primary endpoint analysis has recently been extended to the case that multiple index tests are investigated [1]. Such a design is relevant in modern applications where many different (machine-learned) classification rules based on high dimensional data are considered initially as the final model selection can (partially) be based on data from the diagnostic accuracy study.

In this talk, we motivate the corresponding hypothesis problem and propose different multiple test procedures for this purpose. Besides classical parametric corrections (Bonferroni, maxT), we also consider bootstrap approaches and a Bayesian procedure. We will present early findings from a simulation study to compare the (family-wise) error rate and power of all procedures.

A general observation from the simulation study is the wide variability of rejection rates under different (realistic and least-favorable) parameter configurations. We discuss these findings and possible future extensions of our numerical experiments. All methods have been implemented in a new R package which will also be introduced briefly.

References:

1. Westphal, Max, Antonia Zapf, and Werner Brannath. “A multiple testing framework for diagnostic accuracy studies with co-primary endpoints.” arXiv preprint arXiv:1911.02982 (2019).

### Genetic Epidemiology

**Open questions to genetic epidemiologists**

Inke König

*Universität zu Lübeck, Germany*

Given the rapid pace with which genomics and other -omics disciplines are evolving, it is sometimes necessary to shift down a gear to consider more general scientific questions. In this vein, we can formulate a number of questions for genetic epidemiologists to ponder. These cover the areas of reproducibility, statistical significance, chance findings, precision medicine and overlaps with related fields such as bioinformatics and data science. Importantly, similar questions are being raised in other biostatistical fields. Answering them requires thinking outside the box and learning from other, related disciplines. On this basis, possible responses are outlined to foster further discussion of these topics.

**Pgainsim: A method to assess the mode of inheritance for quantitative trait loci in genome-wide association studies**

Nora Scherer^{1}, Peggy Sekula^{1}, Peter Pfaffelhuber^{2}, Anna Köttgen^{1}, Pascal Schlosser^{1}

^{1}Institute of Genetic Epidemiology, Faculty of Medicine and Medical Center – University of Freiburg, Germany; ^{2}Faculty of Mathematics and Physics, University of Freiburg, Germany

Background: In genome-wide association studies (GWAS), an additive genetic model is conventionally used to explore whether a SNP is associated with a quantitative trait, regardless of the actual mode of inheritance (MOI). Recessive and dominant genetic models can improve the statistical power to identify non-additive variants. Moreover, the actual MOI is of interest for experimental follow-up projects. Here, we extend the concept of the p-gain statistic [1] to decide whether one of the three models provides significantly more information than the others.

Methods: We define the p-gain statistic of a genetic model by comparing the association p-value of that model with the smaller of the two p-values of the other models. Considering the p-gain as a random variable that depends on a trait and a SNP in Hardy-Weinberg equilibrium, we show that, under the null hypothesis of no genetic association, the distribution of the p-gain statistic depends only on the allele frequency (AF).

To determine critical values at which the competing modes can be rejected, we developed the R package pgainsim (https://github.com/genepi-freiburg/pgainsim). First, the p-gain is simulated under the null hypothesis of no genetic association for a user-specified study size and AF. The critical values are then derived as quantiles of the empirical distribution of the simulated p-gain. For applications with extensive multiple testing, the R package extends the empirical critical values by a log-linear interpolation of the quantiles.
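The simulation idea described above can be sketched in a few lines. This is an illustrative Python sketch and not the pgainsim R package or its API. It assumes that the comparison is the ratio used in [1], i.e. the recessive p-gain is min(p_add, p_dom) / p_rec, and it approximates the regression t-test by a normal distribution, which is reasonable only for large samples. All function names, the sample size and the AF are hypothetical.

```python
import math
import random


def two_sided_p(x, y):
    """Two-sided p-value for the slope of y ~ x (normal approximation)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0.0 or syy == 0.0:
        return None  # genotype coding is constant, test undefined
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    r = max(-1 + 1e-12, min(1 - 1e-12, sxy / math.sqrt(sxx * syy)))
    t = r * math.sqrt((n - 2) / (1.0 - r * r))
    return math.erfc(abs(t) / math.sqrt(2.0))


def simulate_pgain_rec(n, af, n_sim, seed=1):
    """Simulate the recessive p-gain under the null of no association."""
    rng = random.Random(seed)
    gains = []
    for _ in range(n_sim):
        # Genotypes in Hardy-Weinberg equilibrium: two independent alleles.
        g = [(rng.random() < af) + (rng.random() < af) for _ in range(n)]
        y = [rng.gauss(0.0, 1.0) for _ in range(n)]  # trait unrelated to g
        p_add = two_sided_p(g, y)
        p_rec = two_sided_p([1 if v == 2 else 0 for v in g], y)
        p_dom = two_sided_p([1 if v >= 1 else 0 for v in g], y)
        if None in (p_add, p_rec, p_dom):
            continue  # skip replicates with a degenerate coding
        gains.append(min(p_add, p_dom) / p_rec)
    return gains


def empirical_quantile(values, q):
    """Quantile of the empirical distribution (order-statistic definition)."""
    ordered = sorted(values)
    idx = min(len(ordered) - 1, max(0, math.ceil(q * len(ordered)) - 1))
    return ordered[idx]


gains = simulate_pgain_rec(n=200, af=0.3, n_sim=300)
critical_95 = empirical_quantile(gains, 0.95)
critical_99 = empirical_quantile(gains, 0.99)
print(critical_95, critical_99)
```

An observed recessive p-gain above the chosen critical value would then speak against the additive and dominant models. Far more replicates than in this toy run are needed for extreme quantiles, which is where an interpolation of the quantiles becomes useful.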

Results: We tested our method in the German Chronic Kidney Disease study with urinary concentrations of 1,462 metabolites, with the goal of identifying non-additive metabolite QTLs (mQTLs). For each metabolite we conducted a GWAS under the three models and identified 119 independent mQTLs for which pval_rec or pval_dom < 4.6e-11 and pval_add > min(pval_rec, pval_dom). For 38 of these, the additive model was rejected based on the p-gain statistic after a Bonferroni adjustment for 1 million × 549 × 2 tests. These included the LCT locus with a known dominant MOI, as well as several novel associations. A simulation study for additive and recessive associations with varying effect sizes, evaluating the false positive and false negative rates of the approach, is ongoing.
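As a quick plausibility check of the numbers above, assuming an overall significance level of 0.05 (the level is not stated in the abstract), a Bonferroni adjustment for the quoted number of tests gives a per-test level of

$$
\frac{0.05}{10^{6} \times 549 \times 2} = \frac{0.05}{1.098 \times 10^{9}} \approx 4.6 \times 10^{-11},
$$

which matches the p-value threshold used to identify the mQTLs.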

Conclusion: This new extension of the p-gain statistic allows for differentiating MOIs for QTLs considering their AF and the study sample size, even in a setting with extensive multiple testing.

[1] Petersen, A. et al. (2012) On the hypothesis-free testing of metabolite ratios in genome-wide and metabolome-wide association studies. BMC Bioinformatics 13, 120.

**Genome-wide conditional independence testing with machine learning**

Marvin N. Wright^{1}, David S. Watson^{2,3}

^{1}Leibniz Institute for Prevention Research and Epidemiology – BIPS, Bremen, Germany; ^{2}Oxford Internet Institute, University of Oxford, Oxford, UK; ^{3}Queen Mary University of London, London, UK

In genetic epidemiology, we face extremely high-dimensional data and complex patterns such as gene-gene or gene-environment interactions. For this reason, it is promising to use machine learning instead of classical statistical methods to analyze such data. However, most methods for statistical inference with machine learning test against a marginal null hypothesis and therefore cannot handle correlated predictor variables.

Building on the knockoff framework of Candès et al. (2018), we propose the conditional predictive impact (CPI), a provably consistent and unbiased estimator of a variable's association with a given outcome, conditional on a reduced set of predictor variables. The method works in conjunction with any supervised learning algorithm and loss function. Simulations confirm that our inference procedures successfully control type I error and achieve nominal coverage probability with greater power than alternative variable importance measures and other nonparametric tests of conditional independence. We apply our method to a gene expression dataset on breast cancer. Further, we propose a modification which avoids the computation of the high-dimensional knockoff matrix and is computationally feasible on data from genome-wide association studies.
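The core idea can be illustrated with a toy sketch: replace one variable by a knockoff copy, measure the per-sample increase in loss on held-out data, and test whether the mean increase is positive. This is not the authors' implementation. It assumes two independent standard normal predictors, for which an independent draw from the same distribution serves as a knockoff, a plain least-squares learner, squared-error loss and a paired t-statistic. All names and numbers are hypothetical.

```python
import math
import random


def fit_ols_2d(x1, x2, y):
    """Least-squares fit of y ~ b1*x1 + b2*x2 (no intercept)."""
    s11 = sum(a * a for a in x1)
    s22 = sum(b * b for b in x2)
    s12 = sum(a * b for a, b in zip(x1, x2))
    s1y = sum(a * c for a, c in zip(x1, y))
    s2y = sum(b * c for b, c in zip(x2, y))
    det = s11 * s22 - s12 * s12
    return (s22 * s1y - s12 * s2y) / det, (s11 * s2y - s12 * s1y) / det


def cpi_t_statistic(loss_knockoff, loss_original):
    """Mean per-sample loss increase and its paired t-statistic."""
    delta = [a - b for a, b in zip(loss_knockoff, loss_original)]
    n = len(delta)
    mean = sum(delta) / n
    var = sum((d - mean) ** 2 for d in delta) / (n - 1)
    return mean, mean / math.sqrt(var / n)


rng = random.Random(0)


def draw(n):
    return [rng.gauss(0.0, 1.0) for _ in range(n)]


n_train, n_test = 500, 500

# Training data: only x1 affects the outcome.
x1, x2 = draw(n_train), draw(n_train)
y = [2.0 * a + e for a, e in zip(x1, draw(n_train))]
b1, b2 = fit_ols_2d(x1, x2, y)

# Held-out data and knockoff copies (independent draws, valid here only
# because the predictors are independent by construction).
t1, t2 = draw(n_test), draw(n_test)
ty = [2.0 * a + e for a, e in zip(t1, draw(n_test))]
k1, k2 = draw(n_test), draw(n_test)


def squared_loss(u1, u2):
    return [(c - b1 * a - b2 * b) ** 2 for a, b, c in zip(u1, u2, ty)]


base = squared_loss(t1, t2)
cpi_x1, t_x1 = cpi_t_statistic(squared_loss(k1, t2), base)
cpi_x2, t_x2 = cpi_t_statistic(squared_loss(t1, k2), base)
print(cpi_x1, t_x1, cpi_x2, t_x2)
```

In this toy setting, the relevant predictor x1 yields a large positive loss increase, while the irrelevant predictor x2 yields an increase close to zero. With correlated predictors, the knockoffs would have to be generated so that they preserve the dependence structure, which is the role of the knockoff framework.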

References:

Candès, E., Fan, Y., Janson, L. and Lv, J. (2018). Panning for gold: ‘model-X’ knockoffs for high dimensional controlled variable selection. J Royal Stat Soc Ser B Methodol 80:551–577

**The key distinction between Association and Causality exemplified by individual ancestry proportions and gallbladder cancer risk in Chileans**

Justo Lorenzo Bermejo, Linda Zollner

*Statistical Genetics Research Group, Institute of Medical Biometry and Informatics, University of Heidelberg, Germany*

Background: The translation of findings from observational studies into improved health policies requires further investigation of the type of relationship between the exposure of interest and particular disease outcomes. Observed associations can be due not only to underlying causal effects, but also to selection bias, reverse causation and confounding.

As an example, we consider the association between the proportion of Native American ancestry and the risk of gallbladder cancer (GBC) in genetically admixed Chileans. Worldwide, Chile shows the highest incidence of GBC, and the risk of this disease has been associated with the individual proportion of Native American – Mapuche ancestry. However, Chileans with large proportions of Mapuche ancestry live in the south of the country, have poorer access to the health system and could be exposed to distinct risk factors. We conducted a Mendelian Randomization (MR) study to investigate the causal relationship “Mapuche ancestry → GBC risk”.

Methods: To infer the potential causal effect of specific risk factors on health-related outcomes, MR takes advantage of the random inheritance of genetic variants and utilizes instrumental variables (IVs) that are:

1. associated with the exposure of interest

2. independent of possible confounders of the association between the exposure and the outcome

3. independent of the outcome given the exposure and the confounders

Provided that the selected IVs meet the above assumptions, various MR approaches can be used to test causality, for example the inverse variance weighted (IVW) method.
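For illustration, the IVW estimate can be computed from per-variant summary statistics as follows. This is a minimal sketch of the standard fixed-effect IVW estimator with first-order weights, not necessarily the exact analysis performed in this study. The function name and all numbers are hypothetical.

```python
import math


def ivw_estimate(beta_exposure, beta_outcome, se_outcome):
    """Fixed-effect IVW estimate of the causal effect and its standard error.

    Equivalent to a weighted average of the per-variant ratio estimates
    beta_outcome / beta_exposure with weights beta_exposure^2 / se_outcome^2.
    """
    num = sum(bx * by / se ** 2
              for bx, by, se in zip(beta_exposure, beta_outcome, se_outcome))
    den = sum(bx ** 2 / se ** 2
              for bx, se in zip(beta_exposure, se_outcome))
    return num / den, math.sqrt(1.0 / den)


# Hypothetical summary statistics for four instruments.
bx = [0.10, 0.20, 0.15, 0.30]    # variant-exposure associations
by = [0.05, 0.10, 0.075, 0.15]   # variant-outcome associations (log odds)
se = [0.02, 0.03, 0.025, 0.04]   # standard errors of the outcome associations

estimate, std_error = ivw_estimate(bx, by, se)
odds_ratio = math.exp(estimate)
print(estimate, std_error, odds_ratio)
```

When the outcome associations are log odds ratios, exponentiating the estimate gives an odds ratio per unit increase in the exposure.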

In our example, we took advantage of ancestry informative markers (AIMs) with distinct allele frequencies in Mapuche and other components of the Chilean genome, namely European, African and Aymara-Quechua ancestry. After checking that the AIMs fulfilled the required assumptions, we utilized them as IVs for the individual proportion of Mapuche ancestry in two-sample MR (sample 1: 1,800 Chileans from the whole country, sample 2: 250 Chilean case-control pairs).

Results: We found strong evidence for a causal effect of Mapuche ancestry on GBC risk: the IVW odds ratio (OR) per 1% increase in the Mapuche proportion was 1.02 (95% CI 1.01-1.03, p = 0.0001). To validate this finding, we performed several sensitivity analyses, including radial MR and different combinations of genetic principal components to rule out population stratification unrelated to Mapuche ancestry.

Conclusion: Causal inference is key to unravelling disease aetiology. In the present example, we demonstrate that Mapuche ancestry is causally linked to GBC risk. This result can now be used to refine GBC prevention programs in Chile.