## Location: https://wwu.zoom.us/j/93028999169

### Young talent awards IBS-DR

Chairs: Werner Brannath and Annette Kopp-Schneider

Internal validation for descriptive clustering of gene expression data
Anastasiia Holovchak (Bernd-Streitberg Laureate)
LMU Munich, Germany

Clustering algorithms are often used to analyse gene expression data, partitioning the genes into homogeneous groups based on their expression levels across patients. In practice, one is confronted with a large variety of clustering algorithms, and it is often unclear which should be selected. A common procedure consists of testing different algorithms with several input parameters and evaluating them with appropriate internal cluster validation indices. However, it is again unclear which of these indices should be selected.

In this work, I conduct a study that investigates the stability of four internal cluster validation indices (the Calinski-Harabasz index, the Davies-Bouldin index, the Dunn index, and the Average Silhouette Width criterion), in particular their ability to identify clusterings that replicate on independent test data. For the purpose of this study, an example gene expression data set is repeatedly split into a training and a test data set. Several commonly used clustering algorithms, such as K-means, agglomerative algorithms (Single Linkage, Complete Linkage, and Average Linkage), and a spectral clustering algorithm, are applied to the training data. The resulting clusterings are assessed using the four internal validation indices under consideration. The clustering methods are then applied to the test data, and the similarity between the index values for the clusterings on the training and on the test data is assessed. I analyse whether the cluster algorithms and input parameters that are indicated as the best choices by the internal validation indices on the training data are also the best choices on the test data. Moreover, the internal validation indices are used to choose the best clustering on the training data, and the stability of this selection process is investigated by applying the selected algorithm/parameter setting to the test data (as measured through the adjusted Rand index).
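A minimal sketch of such a train/test stability check, using synthetic blob data in place of gene expression data and scikit-learn (both assumptions of this illustration; the Dunn index is omitted because scikit-learn provides no implementation of it):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import (calinski_harabasz_score, davies_bouldin_score,
                             silhouette_score)
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a gene expression matrix (samples x features).
X, _ = make_blobs(n_samples=300, centers=4, random_state=0)
X_train, X_test = train_test_split(X, test_size=0.5, random_state=0)

# Evaluate K-means for several candidate numbers of clusters on both halves.
scores = {}
for k in (2, 3, 4, 5):
    labels_tr = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_train)
    labels_te = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_test)
    scores[k] = {
        "ASW_train": silhouette_score(X_train, labels_tr),
        "ASW_test": silhouette_score(X_test, labels_te),
        "CH_train": calinski_harabasz_score(X_train, labels_tr),
        "DB_train": davies_bouldin_score(X_train, labels_tr),
    }

# A stable index should pick the same setting on training and on test data.
best_k_train = max(scores, key=lambda k: scores[k]["ASW_train"])
best_k_test = max(scores, key=lambda k: scores[k]["ASW_test"])
```

Repeating this over many random splits, and comparing the selections of the different indices, gives the kind of stability assessment described above.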

The results may guide the selection of appropriate indices in the considered context of gene expression data. For example, in this study the Dunn index yields very unstable results in terms of the selection of the best input parameter, which can be seen as a drawback. In conclusion, the investigated internal cluster validation indices show very different behaviours, and one should not put much confidence in a single validation index unless there is evidence – from the literature or from one's own investigations such as the one presented in this thesis – that it yields meaningful, replicable results in the considered context.

Model selection characteristics when using MCP-Mod for dose-response gene expression data
Julia Christin Duda (Bernd-Streitberg Laureate)
TU Dortmund University, Germany

Classical approaches in clinical dose-finding trials rely on pairwise comparisons between doses and placebo. A methodological improvement is the MCP-Mod (Multiple Comparison Procedure and Modeling) approach, originally developed for Phase II trials. MCP-Mod combines multiple comparisons with modeling approaches in a multistage procedure. First, for a set of pre-specified candidate models, it is tested whether any dose-response signal is present. Second, considering the models with a detected signal, either the best model is selected to fit the dose-response curve or model averaging is performed.

We extend the scope of application of MCP-Mod to in-vitro gene expression data and assess its model selection characteristics for concentration gene expression curves. Specifically, we apply MCP-Mod to single genes of a high-dimensional gene expression data set in which human embryonic stem cells were exposed to eight concentration levels of the compound valproic acid (VPA). As candidate models we consider the sigmoid Emax (four-parameter log-logistic), linear, quadratic, Emax, exponential and beta models. Through simulations, we investigate the impact of omitting one or more models from the candidate set to uncover possibly superfluous models, as well as the precision and recall rates of the selected models. Measured by the AIC, each of the models performs best for a considerable number of genes. For less noisy cases the popular sigmoid Emax model is frequently selected, whereas for more noisy data simpler models such as the linear model are often selected, mostly without a relevant performance advantage over the second-best model. Also, the commonly used Emax model shows an unexpectedly low performance.
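The AIC-based selection step can be sketched as follows; the simulated data, the noise level and the restriction to just two candidate models (linear and Emax) are illustrative simplifications, not the authors' actual setup, and the multiple-contrast test of the first MCP-Mod stage is omitted:

```python
import numpy as np
from scipy.optimize import curve_fit

rng = np.random.default_rng(1)
# Hypothetical concentration levels (triplicates) and an Emax-shaped response.
conc = np.repeat([0, 25, 150, 350, 450, 550, 800, 1000], 3).astype(float)
resp = 2 + 5 * conc / (200 + conc) + rng.normal(0, 0.5, conc.size)

def linear(x, e0, delta):
    return e0 + delta * x

def emax(x, e0, emax_, ed50):
    return e0 + emax_ * x / (ed50 + x)

def aic(y, yhat, n_par):
    # Gaussian AIC up to an additive constant.
    n = y.size
    return n * np.log(np.sum((y - yhat) ** 2) / n) + 2 * n_par

fits = {}
p_lin, _ = curve_fit(linear, conc, resp)
fits["linear"] = aic(resp, linear(conc, *p_lin), 2)
p_emax, _ = curve_fit(emax, conc, resp, p0=[1.0, 5.0, 100.0], maxfev=10000)
fits["emax"] = aic(resp, emax(conc, *p_emax), 3)

best_model = min(fits, key=fits.get)  # AIC-based model selection
```

With a clear Emax-shaped signal the Emax model wins; with noisier responses the AIC penalty increasingly favours the simpler linear model, mirroring the pattern reported above.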

Temporal Dynamics in Generative Models
Maren Hackenberg (Bernd-Streitberg Laureate), Harald Binder
Institute of Medical Biometry and Statistics, Faculty of Medicine and Medical Center, University of Freiburg, Germany

Uncovering underlying development patterns in longitudinal biomedical data is a first step towards understanding disease processes, but it is complicated by the sparse time grids and individual-specific development patterns that often characterize such data. In epidemiological cohort studies and clinical registries, we face the question of what can be learned from the data in an early phase of the study, when only a baseline characterisation and one follow-up measurement are available. Specifically, we considered a data scenario where an extensive characterisation is available at a baseline time point for each individual, but only a smaller subset of variables is measured again at an individually differing second time point, resulting in a very sparse (only two time points) and irregular time grid.

Inspired by recent advances that allow deep learning to be combined with dynamic modeling, we employed a generative deep learning model to capture individual dynamics in a low-dimensional latent representation as solutions of ordinary differential equations (ODEs). Here, the variables measured only at baseline are used to infer individual-specific ODE parameters.

Additionally, we enriched the information of each individual by linking groups of individuals with similar underlying trajectories, which then serve as proxy information on the common temporal dynamics. Irregular spacing in time can thus be used to gain more information on individual dynamics by leveraging individuals’ similarity. Using simulated data, we showed that the model can recover individual trajectories from linear and non-linear ODE systems with two and four unknown parameters and infer groups of individuals with similar trajectories. The results illustrate that dynamic deep learning approaches can be adapted to such small data settings to provide an individual-level understanding of the dynamics governing individuals’ developments.

Discrete Subdistribution Hazard Models
Department of Medical Biometry, Informatics and Epidemiology, Rheinische Friedrich-Wilhelms-Universität Bonn, Germany

In many clinical and epidemiological studies the interest is in the analysis of the time T until the occurrence of an event of interest j that may occur along with one or more competing events. This requires suitable techniques for competing risks regression. The key quantity to describe competing risks data is the cumulative incidence function, which is defined in terms of the probability of experiencing j at or before time t.

A popular modeling approach for the cumulative incidence function is the proportional subdistribution hazard model by Fine and Gray (1999), which is a direct modeling approach for the cumulative incidence function of one specific event of interest. A limitation of the subdistribution hazard model is that it assumes continuously measured event times. In practice, however, the exact (continuous) event times are often not recorded. Instead, it may only be known that the events occurred between pairs of consecutive points in time (i.e., within pre-specified follow-up intervals). In these cases, time is measured on a discrete scale.

To address this issue, a technique for modeling subdistribution hazards with right-censored data in discrete time is proposed. The method is based on a weighted maximum likelihood estimation scheme for binary regression and yields consistent and asymptotically normal estimators of the model parameters. In addition, a set of tools to assess the calibration of discrete subdistribution hazard models is developed. These consist of a calibration plot for graphical assessment as well as a recalibration model including tests on calibration-in-the-large and refinement.
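The core idea of fitting a binary regression to person-period data on a discrete time scale can be sketched as below. This is an ordinary (unweighted) discrete-time hazard model on simulated data; the subdistribution version described above additionally keeps subjects who experienced competing events in the risk sets with estimated weights, which is omitted here for brevity.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
n = 200
x = rng.normal(size=n)              # one covariate
time = rng.integers(1, 6, size=n)   # discrete event/censoring interval (1..5)
event = rng.integers(0, 2, size=n)  # 1 = event observed in the final interval

# Person-period expansion: subject i contributes time[i] binary records;
# the outcome is 1 only in the last interval and only if an event occurred.
rows, y = [], []
for i in range(n):
    for t in range(1, time[i] + 1):
        rows.append([float(t), x[i]])
        y.append(1 if (t == time[i] and event[i] == 1) else 0)
rows, y = np.array(rows), np.array(y)

# Binary regression on (interval, covariate) estimates the discrete hazard
# lambda(t | x) = P(T = t | T >= t, x).
model = LogisticRegression().fit(rows, y)
hazard = model.predict_proba(rows)[:, 1]
```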

The modeling approach is illustrated by an analysis of nosocomial pneumonia in intensive care patients measured on a daily basis.

Netboost: Network Analysis Improves High-Dimensional Omics Analysis Through Local Dimensionality Reduction
Pascal Schlosser1,2 (Gustav-Adolf-Lienert Laureate), Jochen Knaus2, Maximilian Schmutz3, Konstanze Döhner4, Christoph Plass5, Lars Bullinger6, Rainer Claus3, Harald Binder2, Michael Lübbert7,8, Martin Schumacher2
1Institute of Genetic Epidemiology, Faculty of Medicine and Medical Center, University of Freiburg, Germany; 2Institute of Medical Biometry and Statistics, Faculty of Medicine and Medical Center, University of Freiburg, Germany; 3Department of Hematology and Oncology, Augsburg University Medical Center, Augsburg, Germany; 4Department of Internal Medicine III, University Hospital of Ulm, Germany; 5Division of Cancer Epigenomics, German Cancer Research Center, Heidelberg, Germany; 6Hematology, Oncology and Tumor Immunology, Campus Virchow Hospital, Charite University Medicine, Berlin, Germany; 7Department of Hematology-Oncology, Medical Center, Faculty of Medicine, University of Freiburg, Germany; 8German Consortium for Translational Cancer Research (DKTK), Freiburg, Germany

State-of-the-art selection methods fail to identify the weak but cumulative effects of features found in many high-dimensional omics datasets. Nevertheless, these features play an important role in certain diseases. We present Netboost, a three-step dimension reduction technique. First, a boosting- or Spearman-correlation-based filter is combined with the topological overlap measure to identify the essential edges of the network. Second, sparse hierarchical clustering is applied to the selected edges to identify modules; finally, module information is aggregated by the first principal components. We demonstrate the application of the newly developed Netboost in combination with CoxBoost for survival prediction from DNA methylation and gene expression data of 180 acute myeloid leukemia (AML) patients and show, based on cross-validated prediction error curve estimates, its superior prediction performance over variable selection on the full dataset as well as over an alternative clustering approach. The identified signature, related to chromatin-modifying enzymes, was replicated in an independent dataset, the phase II AMLSG 12-09 study. In a second application we combine Netboost with Random Forest classification and improve the disease classification error in RNA-sequencing data of Huntington's disease mice. Netboost is a freely available Bioconductor R package for dimension reduction and hypothesis generation in high-dimensional omics applications.
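A simplified sketch of the local dimensionality reduction idea on simulated data: plain hierarchical clustering on correlation distance stands in for the boosting-based edge filter and sparse clustering of Netboost, and each resulting module is summarized by its first principal component (a module "eigengene"):

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

rng = np.random.default_rng(8)
n = 100
# Two latent modules of ten correlated features each, plus five noise features.
z1, z2 = rng.normal(size=(2, n))
X = np.column_stack([
    z1[:, None] + 0.5 * rng.normal(size=(n, 10)),
    z2[:, None] + 0.5 * rng.normal(size=(n, 10)),
    rng.normal(size=(n, 5)),
])

# Group features by hierarchical clustering on correlation distance.
corr = np.corrcoef(X, rowvar=False)
condensed = (1 - np.abs(corr))[np.triu_indices_from(corr, k=1)]
labels = fcluster(linkage(condensed, "average"), t=3, criterion="maxclust")

# Aggregate each module by its first principal component ("eigengene").
eigengenes = {}
for m in np.unique(labels):
    Xm = X[:, labels == m]
    Xm = Xm - Xm.mean(axis=0)
    u, s, _ = np.linalg.svd(Xm, full_matrices=False)
    eigengenes[m] = u[:, 0] * s[0]
```

The module eigengenes, rather than the individual features, then enter a downstream model such as CoxBoost or a Random Forest.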

### Keynote: Estimands and Causality / Closing Session

Chairs: Werner Brannath and Annette Kopp-Schneider

Semiparametric Sensitivity Analysis: Unmeasured Confounding in Observational Studies
Daniel Scharfstein
Department of Population Health Sciences, University of Utah School of Medicine, USA

Establishing cause-effect relationships from observational data often relies on untestable assumptions. It is crucial to know whether, and to what extent, the conclusions drawn from non-experimental studies are robust to potential unmeasured confounding. In this paper, we focus on the average causal effect (ACE) as our target of inference. We build on the work of Franks et al. (2019) and Robins et al. (2000) by specifying non-identified sensitivity parameters that govern a contrast between the conditional (on measured covariates) distributions of the outcome under treatment (control) between treated and untreated individuals. We use semi-parametric theory to derive the non-parametric efficient influence function of the ACE, for fixed sensitivity parameters. We utilize this influence function to construct a one-step, split-sample, bias-corrected estimator of the ACE. Our estimator depends on semi-parametric models for the distribution of the observed data; importantly, these models do not impose any restrictions on the values of the sensitivity analysis parameters. We establish that our estimator has $\sqrt{n}$ asymptotics. We apply our methodology to evaluate the causal effect of smoking during pregnancy on birth weight, and we evaluate the performance of the estimation procedure in a simulation study. This is joint work with Razieh Nabi, Edward Kennedy, Ming-Yueh Huang, Matteo Bonvini and Marcela Smid.
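For orientation, the sketch below shows the standard one-step (AIPW) estimator of the ACE built from the efficient influence function under no unmeasured confounding; the sensitivity-analysis estimator described above generalizes this construction with non-identified sensitivity parameters. Data and nuisance models are simulated and purely illustrative.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

rng = np.random.default_rng(3)
n = 2000
x = rng.normal(size=(n, 2))                        # measured covariates
p = 1 / (1 + np.exp(-(x[:, 0] - 0.5 * x[:, 1])))   # true propensity score
a = rng.binomial(1, p)                             # treatment indicator
y = 1.0 * a + x[:, 0] + rng.normal(size=n)         # outcome; true ACE = 1.0

# Nuisance models: propensity score and per-arm outcome regressions.
e_hat = LogisticRegression().fit(x, a).predict_proba(x)[:, 1]
mu1 = LinearRegression().fit(x[a == 1], y[a == 1]).predict(x)
mu0 = LinearRegression().fit(x[a == 0], y[a == 0]).predict(x)

# One-step estimator: plug-in plus the influence-function correction term.
ace_hat = np.mean(mu1 - mu0
                  + a * (y - mu1) / e_hat
                  - (1 - a) * (y - mu0) / (1 - e_hat))
```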

Closing: Andreas Faldum, Werner Brannath / Annette Kopp-Schneider

### Panel Discussion: Do we still need hazard ratios?

Chair: Andreas Wienke

Panel
Jan Beyersmann (Ulm University), Oliver Kuß (Düsseldorf), Andreas Wienke (Halle)

Do we still need hazard ratios? (I)
Oliver Kuß
German Diabetes Center, Leibniz Institute for Diabetes Research at Heinrich Heine University Düsseldorf, Institute for Biometrics and Epidemiology

It is one of the phenomena of biostatistics that regression models for continuous, binary, nominal, or ordinal outcomes almost completely rely on parametric modelling, whereas survival or time-to-event outcomes are mainly analyzed by the Proportional Hazards (PH) model of Cox, which is an essentially non-parametric method. The Cox model has become one of the most frequently used statistical models in applied research, and the original article from 1972 ranks among the top 100 papers (in terms of citation frequency) across all areas of science.

However, the Cox model and the hazard ratio (HR) have also been criticized recently. For example, researchers have been warned not to use the magnitude of the HR to describe the magnitude of the relative risk, because the hazard ratio is a ratio of rates, not of risks. Hazard ratios, even in randomized trials, have a built-in “selection bias”, because they are conditional measures, conditioning at each time point on the set of observations still at risk. Finally, the hazard ratio has been criticized for being non-collapsible: adjusting for a covariate that is associated with the event will in general change the HR, even if this covariate is not associated with the exposure, that is, is not a confounder.
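Non-collapsibility is easy to see numerically. In the sketch below the hazard ratio is exactly 2 within each level of a prognostic covariate Z that is independent of treatment, yet the marginal hazard ratio is attenuated and drifts over time; all hazard values are hypothetical.

```python
import math

h0 = {0: 0.1, 1: 0.5}  # baseline hazard by covariate Z (Z is prognostic)
hr_cond = 2.0          # conditional hazard ratio, identical in both strata

def marginal_hazard(t, treated):
    # Mixture hazard: survival-weighted average of the stratum hazards.
    num = den = 0.0
    for z in (0, 1):                      # P(Z = z) = 0.5, independent of arm
        h = h0[z] * (hr_cond if treated else 1.0)
        s = math.exp(-h * t)              # stratum survival at time t
        num += 0.5 * s * h
        den += 0.5 * s
    return num / den

hr_marginal_early = marginal_hazard(0.1, True) / marginal_hazard(0.1, False)
hr_marginal_late = marginal_hazard(5.0, True) / marginal_hazard(5.0, False)
# hr_marginal_early sits just below 2; hr_marginal_late is markedly smaller,
# because high-hazard individuals are depleted faster in the treated arm.
```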

In view of these disadvantages, it is surprising that parametric survival models are not preferred over the Cox model. They existed long before the Cox model; they are easier to comprehend, estimate, and communicate; and, above all, they have none of the disadvantages mentioned.

Do we still need hazard ratios? (II)
Jan Beyersmann
Ulm University, Germany

The answer to the question whether we need hazard ratios depends to a good deal on the answer to the question what we need hazards for. Censoring plays a key role. Censoring makes survival and event history analysis special. One important consequence is that less customized statistical techniques will be biased when applied to censored data. Another important consequence is that hazards remain identifiable under rather general censoring mechanisms. In this talk, I will demonstrate that there is a Babylonian confusion about “independent censoring” in the textbook literature, which is a worry in its own right. Event-driven trials in pharmaceutical research and competing risks are two examples where the textbook literature often goes haywire, censoring-wise. It is a small step from this mess to misinterpretations of hazards, a challenge not diminished when the aim is a causal interpretation. Causal reasoning, however, appears to be spearheading the current attack on hazards and their ratios.

In philosophy, causality has pretty much been destroyed by David Hume. This does not imply that statisticians should avoid causal reasoning, but it might suggest some modesty. In fact, statistical causality is mostly about interventions, and a causal survival analysis often aims at statements about the intervention “do(no censoring)”, which, however, is not what identifiability of hazards is about. The current debate about estimands (in time-to-event trials) is an example where these issues are hopelessly mixed up.

The aim of this talk is to mix it up a bit further or, perhaps, even shed some light. Time permitting, I will illustrate matters using g-computation in the form of a causal variant of the Aalen-Johansen-estimator.

### Genetic Epidemiology

Chairs: Miriam Kesselmeier and Silke Szymczak

Open questions to genetic epidemiologists
Inke König
Universität zu Lübeck, Germany

Given the rapid pace with which genomics and other omics disciplines are evolving, it is sometimes necessary to shift down a gear and consider more general scientific questions. Along these lines, we can formulate a number of questions for genetic epidemiologists to ponder. These cover the areas of reproducibility, statistical significance, chance findings, precision medicine and overlaps with related fields such as bioinformatics and data science. Importantly, similar questions are being raised in other biostatistical fields. Answering them requires thinking outside the box and learning from other, related disciplines. On this basis, possible hints at answers are presented to foster further discussion of these topics.

Pgainsim: A method to assess the mode of inheritance for quantitative trait loci in genome-wide association studies
Nora Scherer1, Peggy Sekula1, Peter Pfaffelhuber2, Anna Köttgen1, Pascal Schlosser1
1Institute of Genetic Epidemiology, Faculty of Medicine and Medical Center – University of Freiburg, Germany; 2Faculty of Mathematics and Physics, University of Freiburg, Germany

Background: When performing genome-wide association studies (GWAS), an additive genetic model is conventionally used to explore whether a SNP is associated with a quantitative trait, regardless of the actual mode of inheritance (MOI). Recessive and dominant genetic models can improve the statistical power to identify non-additive variants. Moreover, the actual MOI is of interest for experimental follow-up projects. Here, we extend the concept of the p-gain statistic [1] to decide whether one of the three models provides significantly more information than the others.

Methods: We define the p-gain statistic of a genetic model as the comparison of the association p-value of that model with the smaller of the two p-values of the other models. Considering the p-gain as a random variable depending on a trait and a SNP in Hardy-Weinberg equilibrium, we show that under the null hypothesis of no genetic association the distribution of the p-gain statistic depends only on the allele frequency (AF).

To determine critical values where the opposing modes can be rejected, we developed the R-package pgainsim (https://github.com/genepi-freiburg/pgainsim). First, the p-gain is simulated under the null hypothesis of no genetic association for a user-specified study size and AF. Then the critical values are derived as the observed quantiles of the empirical density of the p-gain. For applications with extensive multiple testing, the R-package provides an extension of the empirical critical values by a log-linear interpolation of the quantiles.
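A rough re-implementation sketch of that simulation idea (the authors' actual implementation is the pgainsim R package; the study size, allele frequency and number of simulations below are illustrative, and far smaller than needed in practice):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
n, af, n_sim = 500, 0.3, 200  # illustrative study size, AF, simulation count

def slope_pvalue(g, y):
    # Two-sided p-value of the slope in a simple linear regression of y on g.
    return stats.linregress(g, y).pvalue

pgains = []
for _ in range(n_sim):
    geno = rng.binomial(2, af, size=n)   # additive coding under HWE
    y = rng.normal(size=n)               # trait under the null of no association
    p_add = slope_pvalue(geno.astype(float), y)
    p_rec = slope_pvalue((geno == 2).astype(float), y)
    p_dom = slope_pvalue((geno >= 1).astype(float), y)
    pgains.append(min(p_rec, p_dom) / p_add)  # p-gain of the additive model

# Empirical critical value: a high quantile of the null p-gain distribution.
crit_95 = float(np.quantile(pgains, 0.95))
```

An observed p-gain exceeding the critical value then indicates that the corresponding model fits significantly better than the competing codings.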

Results: We tested our method in the German Chronic Kidney Disease study with urinary concentrations of 1,462 metabolites, with the goal of identifying non-additive metabolite QTLs. For each metabolite we conducted a GWAS under the three models and identified 119 independent mQTLs for which pval_rec or pval_dom < 4.6e-11 and pval_add > min(pval_rec, pval_dom). For 38 of these, the additive modelling was rejected based on the p-gain statistics after a Bonferroni adjustment for 1 million × 549 × 2 tests. These included the LCT locus with a known dominant MOI, as well as several novel associations. A simulation study for additive and recessive associations with varying effect sizes, evaluating the false positive and false negative rates of the approach, is ongoing.

Conclusion: This new extension of the p-gain statistic allows for differentiating MOIs for QTLs considering their AF and the study sample size, even in a setting with extensive multiple testing.

[1] Petersen, A. et al. (2012) On the hypothesis-free testing of metabolite ratios in genome-wide and metabolome-wide association studies. BMC Bioinformatics 13, 120.

Genome-wide conditional independence testing with machine learning
Marvin N. Wright1, David S. Watson2,3
1Leibniz Institute for Prevention Research and Epidemiology – BIPS, Bremen, Germany; 2Oxford Internet Institute, University of Oxford, Oxford, UK; 3Queen Mary University of London, London, UK

In genetic epidemiology, we face extremely high-dimensional data and complex patterns such as gene-gene or gene-environment interactions. For this reason, it is promising to use machine learning instead of classical statistical methods to analyze such data. However, most methods for statistical inference with machine learning test against a marginal null hypothesis and therefore cannot handle correlated predictor variables.

Building on the knockoff framework of Candès et al. (2018), we propose the conditional predictive impact (CPI), a provably consistent and unbiased estimator of a variable's association with a given outcome, conditional on a reduced set of predictor variables. The method works in conjunction with any supervised learning algorithm and loss function. Simulations confirm that our inference procedures successfully control the type I error and achieve nominal coverage probability, with greater power than alternative variable importance measures and other nonparametric tests of conditional independence. We apply our method to a gene expression dataset on breast cancer. Further, we propose a modification which avoids the computation of the high-dimensional knockoff matrix and is computationally feasible on data from genome-wide association studies.
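A toy sketch of the CPI logic on simulated data: replace one predictor by a knockoff-style substitute that preserves its dependence on the other predictor, and test whether the per-observation loss of a fitted learner increases. The crude residual-resampling knockoff used here is an illustrative stand-in for the model-X knockoffs of Candès et al. (2018).

```python
import numpy as np
from scipy import stats
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(5)
n = 1000
x1 = rng.normal(size=n)
x2 = 0.7 * x1 + rng.normal(scale=0.5, size=n)   # correlated with x1
y = 2.0 * x1 + rng.normal(size=n)               # only x1 matters conditionally

X = np.column_stack([x1, x2])
X_tr, X_te, y_tr, y_te = X[:500], X[500:], y[:500], y[500:]
learner = LinearRegression().fit(X_tr, y_tr)

def cpi_pvalue(j):
    # Crude knockoff of column j: fitted values from the other column plus
    # permuted residuals, preserving its dependence on the other predictor.
    other = 1 - j
    fit = LinearRegression().fit(X_te[:, [other]], X_te[:, j])
    resid = X_te[:, j] - fit.predict(X_te[:, [other]])
    X_ko = X_te.copy()
    X_ko[:, j] = fit.predict(X_te[:, [other]]) + rng.permutation(resid)
    loss_orig = (y_te - learner.predict(X_te)) ** 2
    loss_ko = (y_te - learner.predict(X_ko)) ** 2
    t, p = stats.ttest_rel(loss_ko, loss_orig)  # paired test on the loss change
    return p / 2 if t > 0 else 1 - p / 2        # one-sided: did loss get worse?

p_x1 = cpi_pvalue(0)  # conditionally important -> small p expected
p_x2 = cpi_pvalue(1)  # conditionally unimportant -> no evidence expected
```

Despite x2 being strongly correlated with the outcome marginally, the conditional test correctly attributes the predictive value to x1.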

References:

Candès, E., Fan, Y., Janson, L. and Lv, J. (2018). Panning for gold: ‘model-X’ knockoffs for high dimensional controlled variable selection. J Royal Stat Soc Ser B Methodol 80:551–577

The key distinction between Association and Causality exemplified by individual ancestry proportions and gallbladder cancer risk in Chileans
Justo Lorenzo Bermejo, Linda Zollner
Statistical Genetics Research Group, Institute of Medical Biometry and Informatics, University of Heidelberg, Germany

Background: The translation of findings from observational studies into improved health policies requires further investigation of the type of relationship between the exposure of interest and particular disease outcomes. Observed associations can be due not only to underlying causal effects, but also to selection bias, reverse causation and confounding.

As an example, we consider the association between the proportion of Native American ancestry and the risk of gallbladder cancer (GBC) in genetically admixed Chileans. Worldwide, Chile shows the highest incidence of GBC, and the risk of this disease has been associated with the individual proportion of Native American – Mapuche ancestry. However, Chileans with large proportions of Mapuche ancestry live in the south of the country, have poorer access to the health system and could be exposed to distinct risk factors. We conducted a Mendelian Randomization (MR) study to investigate the causal relationship “Mapuche ancestry → GBC risk”.

Methods: To infer the potential causal effect of specific risk factors on health-related outcomes, MR takes advantage of the random inheritance of genetic variants and utilizes instrumental variables (IVs):

1. associated with the exposure of interest

2. independent of possible confounders of the association between the exposure and the outcome

3. independent of the outcome given the exposure and the confounders

Provided the selected IVs meet the above assumptions, various MR approaches can be used to test causality, for example the inverse variance weighted (IVW) method.
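A minimal numerical sketch of the IVW estimator from summary statistics (all effect sizes below are made up for illustration): each instrument contributes a Wald ratio, and the ratios are combined with first-order inverse-variance weights.

```python
import numpy as np

# Hypothetical per-instrument summary statistics.
beta_exp = np.array([0.12, 0.08, 0.15, 0.10])      # IV -> exposure effects
beta_out = np.array([0.024, 0.018, 0.028, 0.021])  # IV -> outcome effects
se_out = np.array([0.005, 0.006, 0.005, 0.004])    # SEs of the outcome effects

ratio = beta_out / beta_exp             # per-instrument Wald ratios
weights = (beta_exp / se_out) ** 2      # first-order inverse-variance weights
ivw_estimate = np.sum(weights * ratio) / np.sum(weights)
ivw_se = 1 / np.sqrt(np.sum(weights))
```

This is algebraically the same as a weighted regression of the outcome effects on the exposure effects through the origin.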

In our example, we took advantage of ancestry informative markers (AIMs) with distinct allele frequencies in Mapuche and other components of the Chilean genome, namely European, African and Aymara-Quechua ancestry. After checking that the AIMs fulfilled the required assumptions, we utilized them as IVs for the individual proportion of Mapuche ancestry in two-sample MR (sample 1: 1,800 Chileans from the whole country, sample 2: 250 Chilean case-control pairs).

Results: We found strong evidence for a causal effect of Mapuche ancestry on GBC risk: IVW OR per 1% increase in the Mapuche proportion 1.02 (95% CI 1.01–1.03, P = 0.0001). To validate this finding, we performed several sensitivity analyses, including radial MR and different combinations of genetic principal components, to rule out population stratification unrelated to Mapuche ancestry.

Conclusion: Causal inference is key to unravel disease aetiology. In the present example, we demonstrate that Mapuche ancestry is causally linked to GBC risk. This result can now be used to refine GBC prevention programs in Chile.

### Statistics in Practice II

Chairs: Theresa Keller and Thomas Schmelter

Education for Statistics in Practice: Development and evaluation of prediction models: pitfalls and solutions
Ben Van Calster1, Maarten van Smeden2
1Department of Development and Regeneration, University of Leuven, Leuven, Belgium; 2Department of Clinical Epidemiology, Leiden University Medical Center, Leiden, Netherlands

With fast developments in medical statistics, machine learning and artificial intelligence, the current opportunities for making accurate predictions about the future seem nearly endless. In this lecture we will share some experiences from a medical prediction perspective, where prediction modelling has a long history and models have been implemented in patient care with varying success. We will focus on best practices for the development, evaluation and presentation of prediction models, highlight some common pitfalls, present solutions to circumvent bad prediction modelling and discuss some methodological challenges for the future.

EXTENDED ABSTRACT

Prediction models are developed throughout science. In this session the focus will be on applications in the medical domain, where prediction models have a long history commonly serving either a diagnostic or prognostic purpose. The ultimate goal of such models is to assist in medical decision making by providing accurate predictions for future individuals.

As we anticipate that participants in this session are already well versed in fitting statistical models to data, the focus will be on the common pitfalls of developing statistical (learning) and machine learning models with a prediction aim. Our goal is that participants gain knowledge about the pitfalls of prediction modeling and become more familiar with methods that address these pitfalls.

The sessions will be arranged in sections of 20 to 30 minutes. The following topics will be covered.

State of the medical prediction modeling art

This section begins with a small introduction into the history of prediction modeling in medical research. Positive examples will be highlighted and we will draw from the extensive systematic review literature on clinical prediction models. Recent experiences with a living systematic review on COVID-19 related prediction modeling will be discussed.

Just another prediction model

For most health conditions, prediction models already exist. How does one prevent a prediction modeling project from ending up on the large pile of failed and unused models? Using the PROGRESS framework, we discuss various prediction modeling goals. Some good modeling practices and the harm of commonly applied modeling methods are illustrated. Finally, we will highlight some recent developments in formalizing prediction goals (predictimands).

Methods against overfitting

Overfitting is arguably the biggest enemy of prediction modeling. There is a large literature on shrinkage estimators that aim at preventing overfitting. In this section we will reflect on the history of shrinkage methods (e.g. Stein’s estimator & Le Cessie van Houwelingen heuristic shrinkage) and more recent developments (e.g. lasso and ridge regression variants). The advantages and limitations will be discussed.
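A small simulated illustration of the point: with many predictors relative to the sample size, a ridge-penalized model out-predicts unpenalized least squares on new data. The dimensions and penalty below are arbitrary choices for the sketch.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(6)
n_train, n_test, p = 60, 1000, 40
beta = np.zeros(p)
beta[:5] = 1.0                       # only five predictors carry signal

X_tr = rng.normal(size=(n_train, p))
y_tr = X_tr @ beta + rng.normal(size=n_train)
X_te = rng.normal(size=(n_test, p))
y_te = X_te @ beta + rng.normal(size=n_test)

mse = {}
for name, model in [("ols", LinearRegression()), ("ridge", Ridge(alpha=10.0))]:
    model.fit(X_tr, y_tr)
    mse[name] = float(np.mean((y_te - model.predict(X_te)) ** 2))
# With p close to n, the shrunken model predicts new data better:
# the bias introduced by the penalty is outweighed by the variance reduction.
```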

Methods for deciding on appropriate sample size

Rules of thumb have dominated the discussions on sample size for prediction models for decades (e.g. the need for at least 10 events for every predictor considered). The history and limitations of these rules of thumb will be shown. Recently developed sample size criteria for prediction model development and validation will be presented.

Model performance and validation

Validation of prediction models goes beyond the evaluation of model coefficients and goodness-of-fit tests. Prediction models should give higher risk estimates for events than for non-events (discrimination). Predictions may be used to support clinical decisions; therefore, the risk estimates should be accurate (calibration). We will describe the various levels at which a model can be calibrated. Further, the ability of the model to classify patients as low vs. high risk in support of decision making can be evaluated. We discuss decision curve analysis, the most well-known tool for utility validation. The link between calibration and utility is explained.
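Discrimination and calibration-in-the-large can be computed with a few lines of NumPy; the predicted risks below are simulated to be perfectly calibrated, so both measures should look good by construction.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 5000
p_hat = rng.uniform(0.01, 0.9, size=n)  # a model's predicted risks
y = rng.binomial(1, p_hat)              # outcomes drawn from those risks

# Discrimination: c-statistic = P(random event outranks random non-event),
# i.e. the ROC AUC, computed here by direct pairwise comparison.
ev, ne = p_hat[y == 1], p_hat[y == 0]
c_stat = ((ev[:, None] > ne[None, :]).sum()
          + 0.5 * (ev[:, None] == ne[None, :]).sum()) / (ev.size * ne.size)

# Calibration-in-the-large: mean predicted risk vs observed event rate.
citl_gap = y.mean() - p_hat.mean()
```

A full calibration assessment would additionally examine the calibration slope and a flexible calibration curve, as discussed in the session.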

Heterogeneity over time and place: there is no such thing as a validated model

We discuss the different levels of validation (apparent, internal, and external) and what they can tell us. However, it is increasingly recognized that one should expect performance to be heterogeneous between different settings/hospitals. This can be taken into account at many levels: we may focus on having clustered datasets (e.g. multicenter data, IPD) for model development and validation, internal-external cross-validation can be used during model development, and cluster-specific performance can be meta-analyzed at validation. If the data allow, meta-regression can be used to gain insight into performance heterogeneity. Model updating can be used to adapt a model to a new setting. In addition, populations tend to change over time, which calls for continuous updating strategies.

Applied example

We will describe the development and validation of the ADNEX model to diagnose ovarian cancer, covering development, validation, target population, meta-regression, validation studies, model updating, and implementation in ultrasound machines.

Future perspective: machine learning and AI

Flexible machine learning algorithms have been around for a while. Recently, however, we have observed a strong increase in their use. We discuss challenges for these methods, such as data hungriness, the risk of automation, increasing complexity of model building, the no free lunch idea, and the winner’s curse.

### Statistics in Practice I

Chairs: Theresa Keller and Willi Sauerbrei

Education for Statistics in Practice: Development and evaluation of prediction models: pitfalls and solutions
Ben Van Calster1, Maarten van Smeden2
1Department of Development and Regeneration, University of Leuven, Leuven, Belgium; 2Department of Clinical Epidemiology, Leiden University Medical Center, Leiden, Netherlands

With fast developments in medical statistics, machine learning and artificial intelligence, the current opportunities for making accurate predictions about the future seem nearly endless. In this lecture we will share some experiences from a medical prediction perspective, where prediction modelling has a long history and models have been implemented in patient care with varying success. We will focus on best practices for the development, evaluation and presentation of prediction models, highlight some common pitfalls, present solutions to circumvent bad prediction modelling and discuss some methodological challenges for the future.

EXTENDED ABSTRACT

Prediction models are developed throughout science. In this session the focus will be on applications in the medical domain, where prediction models have a long history commonly serving either a diagnostic or prognostic purpose. The ultimate goal of such models is to assist in medical decision making by providing accurate predictions for future individuals.

As we anticipate that participants in this session are already well versed in fitting statistical models to data, the focus will be on the common pitfalls of developing statistical (learning) and machine learning models with a prediction aim. Our goal is that participants gain knowledge about the pitfalls of prediction modeling and become more familiar with methods that address them.

The session will be arranged in sections of 20 to 30 minutes, covering the following topics.

State of the medical prediction modeling art

This section begins with a brief introduction to the history of prediction modeling in medical research. Positive examples will be highlighted and we will draw from the extensive systematic review literature on clinical prediction models. Recent experiences with a living systematic review on COVID-19-related prediction modeling will be discussed.

Just another prediction model

For most health conditions, prediction models already exist. How does one prevent that a prediction modeling project ends up on the large pile of failed and unused models? Using the PROGRESS framework, we discuss various prediction modeling goals. Good modeling practices and the harms of commonly applied modeling methods are illustrated. Finally, we will highlight some recent developments in formalizing prediction goals (predictimands).

Methods against overfitting

Overfitting is arguably the biggest enemy of prediction modeling. There is a large literature on shrinkage estimators that aim at preventing overfitting. In this section we will reflect on the history of shrinkage methods (e.g. Stein’s estimator & Le Cessie van Houwelingen heuristic shrinkage) and more recent developments (e.g. lasso and ridge regression variants). The advantages and limitations will be discussed.
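
As a toy illustration of the shrinkage idea (not part of the lecture material, using made-up numbers), the following sketch shows a Le Cessie–van Houwelingen style heuristic shrinkage factor and the closed-form ridge estimate for a single centred predictor; both pull estimates toward zero to counter overfitting:

```python
def heuristic_shrinkage(model_chi2, n_predictors):
    """Heuristic shrinkage factor in the spirit of
    Van Houwelingen & Le Cessie: (chi2_model - p) / chi2_model."""
    return (model_chi2 - n_predictors) / model_chi2

def ridge_slope_1d(x, y, lam):
    """Closed-form ridge estimate for one centred predictor:
    beta = sum(x*y) / (sum(x^2) + lambda)."""
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    sxx = sum(xi * xi for xi in x)
    return sxy / (sxx + lam)

x = [-2.0, -1.0, 0.0, 1.0, 2.0]
y = [-3.9, -2.1, 0.1, 2.0, 3.9]

ols = ridge_slope_1d(x, y, lam=0.0)  # ordinary least squares
rdg = ridge_slope_1d(x, y, lam=5.0)  # penalized: shrunk toward zero
print(ols, rdg)  # the ridge slope is smaller in magnitude than OLS
```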

Methods for deciding on appropriate sample size

Rules of thumb have dominated the discussions on sample size for prediction models for decades (e.g. the need for at least 10 events for every predictor considered). The history and limitations of these rules of thumb will be shown. Recently developed sample size criteria for prediction model development and validation will be presented.
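
As a hedged illustration of the rule of thumb itself (not of the recently developed criteria, which replace it), the required sample size under an events-per-variable rule can be sketched as:

```python
import math

def n_required_epv(n_predictors, event_fraction, epv=10):
    """Minimum sample size under the events-per-variable rule of thumb:
    we need `epv` events per candidate predictor, and the expected
    number of events is n * event_fraction."""
    return math.ceil(epv * n_predictors / event_fraction)

# 8 candidate predictors, 25% event rate -> 80 events -> 320 participants
print(n_required_epv(8, 0.25))
```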

Model performance and validation

Validation of prediction models goes beyond evaluation of model coefficients and goodness-of-fit tests. Prediction models should give higher risk estimates for events than for non-events (discrimination). Predictions may be used to support clinical decisions; therefore, the estimated risks should be accurate (calibration). We will describe various levels at which a model can be calibrated. Further, the model's performance in classifying patients into low- vs. high-risk groups to support decision making can be evaluated. We discuss decision curve analysis, the best-known tool for utility validation. The link between calibration and utility is explained.
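
The two notions of performance above can be made concrete with a minimal sketch (toy data, not from the lecture): a pairwise-concordance c-statistic for discrimination, and calibration-in-the-large for calibration:

```python
def c_statistic(risks, events):
    """Probability that a randomly chosen event has a higher predicted
    risk than a randomly chosen non-event (ties count 1/2)."""
    ev = [r for r, y in zip(risks, events) if y == 1]
    ne = [r for r, y in zip(risks, events) if y == 0]
    pairs = conc = 0.0
    for a in ev:
        for b in ne:
            pairs += 1
            conc += 1.0 if a > b else 0.5 if a == b else 0.0
    return conc / pairs

def calibration_in_the_large(risks, events):
    """Mean observed outcome minus mean predicted risk (0 = perfect)."""
    return sum(events) / len(events) - sum(risks) / len(risks)

risks = [0.9, 0.8, 0.3, 0.2, 0.1]
events = [1, 1, 0, 1, 0]
print(c_statistic(risks, events), calibration_in_the_large(risks, events))
```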

Heterogeneity over time and place: there is no such thing as a validated model

We discuss the different levels of validation (apparent, internal, and external), and what they can tell us. However, it is increasingly recognized that one should expect performance to be heterogeneous between different settings/hospitals. This can be taken into account on many levels: we may focus on having clustered (e.g. multicenter, IPD) datasets for model development and validation, internal-external cross-validation can be used during model development, and cluster-specific performance can be meta-analyzed at validation. If the data allow, meta-regression can be used to gain insight into performance heterogeneity. Model updating can be used to adapt a model to a new setting. In addition, populations tend to change over time. This calls for continuous updating strategies.
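
The internal-external cross-validation loop can be sketched as follows (a toy illustration, not from the lecture: the "model" is just a training mean and the "metric" a mean squared error, standing in for a real prediction model and performance measure):

```python
# Internal-external cross-validation (IECV) over clusters (e.g. hospitals):
# hold out one cluster, develop on the rest, validate on the held-out one.

def fit(train):                      # placeholder model: the training mean
    return sum(train) / len(train)

def mse(model, test):                # placeholder performance metric
    return sum((y - model) ** 2 for y in test) / len(test)

clusters = {
    "hospA": [1.0, 1.2, 0.8],
    "hospB": [2.0, 2.2],
    "hospC": [1.5, 1.4, 1.6],
}

performance = {}
for held_out, test in clusters.items():
    train = [y for name, ys in clusters.items() if name != held_out for y in ys]
    performance[held_out] = mse(fit(train), test)

print(performance)  # cluster-specific results, ready to be meta-analyzed
```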

Applied example

We will describe the development and validation of the ADNEX model to diagnose ovarian cancer, covering development, validation, the target population, meta-regression, validation studies, model updating, and implementation in ultrasound machines.

Future perspective: machine learning and AI

Flexible machine learning algorithms have been around for a while. Recently, however, we have observed a strong increase in their use. We discuss challenges for these methods, such as data hungriness, the risk of automation, the increasing complexity of model building, the no-free-lunch idea, and the winner's curse.

### Panel Discussion: Drug Development beyond Traditional Paths

Chairs: Cornelia-Ursula Kunz and Kaspar Rufibach

Lisa Hampson1, Frank Fleischer2
1Advanced Methodology & Data Science, Novartis Pharma AG, Switzerland; 2Biostatistics & Data Sciences, Boehringer Ingelheim Pharma, Germany

Methodological collaborations between academia and the pharmaceutical industry can have several benefits for both parties. In addition to the development and application of new statistical methods, there is also the education and recruitment of the next generation of biostatisticians and data scientists. In this presentation, we begin by reflecting on the key components (and maybe some pitfalls) of an academia-industry collaboration. We consider the different models that these collaborations can follow, ranging from co-supervision of student projects to collaborations between institutions. We will use several examples to illustrate the various models and their direct impact on statistical methodology and the business. Topics covered are diverse and range from data science to innovative clinical trial design. We conclude by looking to the future and provide an overview of emerging methodological questions in the pharmaceutical industry which we think are ripe for future academia-industry partnerships.

### Statistical Machine Learning II

Chairs: Harald Binder and Marvin Wright

Variable relation analysis utilizing surrogate variables in random forests
Stephan Seifert1, Sven Gundlach2, Silke Szymczak3
1University of Hamburg; 2Kiel University; 3University of Lübeck

The machine learning approach random forests [1] can be successfully applied to omics data, such as gene expression data, for classification or regression. However, the interpretation of the trained prediction models is currently mainly limited to selecting relevant variables based on so-called importance measures of each individual variable. Thus, relationships between the predictor variables are not considered. We developed a new RF-based variable selection method called Surrogate Minimal Depth (SMD) that incorporates variable relations into the selection of important variables [2]. This is achieved by exploiting surrogate variables, which were originally introduced to deal with missing predictor variables [3]. In addition to improving variable selection, surrogate variables and their relationship to the primary split variables, measured by the parameter mean adjusted agreement, can also be utilized as a proxy for the relations between the different variables. This relation analysis goes beyond the investigation of ordinary correlation coefficients because it takes the association with the outcome into account. I will present the basic concept of surrogate variables and mean adjusted agreement, the relation analysis of simulated data as a proof of concept, and the investigation of experimental breast cancer gene expression datasets to show the practical applicability of this new approach.

References

[1] L. Breiman, Mach. Learn. 2001, 45, 5-32.

[2] S. Seifert, S. Gundlach, S. Szymczak, Bioinformatics 2019, 35, 3663-3671.

[3] L. Breiman, J. Friedman, C. J. Stone, R. A. Olshen, Classification and Regression Trees, Taylor & Francis, 1984.

Variable Importance in Random Forests in the Presence of Confounding
Robert Miltenberger, Christoph Wies, Gunter Grieser, Antje Jahn
University of Applied Sciences Darmstadt, Germany

Patients with a need for kidney transplantation suffer from a lack of available organ donors. Still, patients commonly reject an allocated kidney when they consider its quality insufficient [1]. Rejection is of major concern, as the prolonged ischemic time can reduce the organ's quality and thus its usability for further patients. To better understand the association between organ quality and patient prognosis after transplantation, random survival forests will be applied to data on more than 50,000 kidney transplantations from the US organ transplantation registry. However, the US allocation process allocates kidneys of high quality to patients with good prognosis. Thus, confounding is of major concern and needs to be addressed.

In this talk, we investigate methods to address confounding in random forest analyses by using residuals from a generalized propensity score analysis. We show that, by considering the residuals instead of the original variables, the permutation variable importance measures refer to semipartial correlations between outcome and variable instead of correlations that are distorted by confounder effects. This facilitates the interpretation of the variable importance measure. As our findings rely on linear models, we further investigate the approach for non-linear and non-additive models by means of simulations.
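
A minimal sketch of the residualization step under a linear model (hypothetical data; the random forest itself is omitted): regress the variable of interest on the confounder and keep the residual, which is orthogonal to the confounder by construction:

```python
def simple_ols(x, y):
    """Slope and intercept of y ~ x by least squares."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx
    return b, my - b * mx

def residualize(var, confounder):
    """Replace a variable by its residual from a regression on the
    confounder; the residual carries the confounder-free part."""
    b, a = simple_ols(confounder, var)
    return [v - (a + b * c) for v, c in zip(var, confounder)]

conf = [0.0, 1.0, 2.0, 3.0]
quality = [0.1, 1.2, 1.9, 3.1]       # strongly driven by the confounder
res = residualize(quality, conf)

# the residual is (numerically) uncorrelated with the confounder
cross = sum((c - 1.5) * r for c, r in zip(conf, res))
print(cross)
```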

The proposed method is used to analyse the impact of kidney quality on failure-free survival after transplantation based on the US registry data. Results are compared to other methods that have been proposed for a better understanding and explainability of random forest analyses [2].

[1] Husain SA et al.: Association Between Declined Offers of Deceased Donor Kidney Allograft and Outcomes in Kidney Transplant Candidates. JAMA Netw Open. 2019; doi:10.1001/jamanetworkopen.2019.10312

[2] Paluszynska A, Biecek P and Jiang Y (2020). randomForestExplainer: Explaining and Visualizing Random Forests in Terms of Variable Importance. R package version 0.10.1. https://CRAN.R-project.org/package=randomForestExplainer

Identification of representative trees in random forests based on a new tree-based distance measure
Björn-Hergen Laabs, Inke R. König
Institut für Medizinische Biometrie und Statistik, Universität zu Lübeck, Universitätsklinikum Schleswig-Holstein, Campus Lübeck, Germany

In the life sciences, random forests are often used to train predictive models, but it is rather difficult to gain explanatory insight into the mechanisms leading to a specific outcome, which impedes the implementation of random forests in clinical practice. Typically, variable importance measures are used, but they can neither explain how a variable influences the outcome nor find interactions between variables; furthermore, they entirely ignore the tree structure of the forest. A different approach is to select a single tree, or a set of a few trees, from the ensemble which best represent the forest. The hope is that by simplifying a complex ensemble of decision trees to a few representative trees, it becomes possible to observe common tree structures, the importance of specific features, and variable interactions. Thus, representative trees could also help to understand interactions between genetic variants.

Intuitively, representative trees are those with the minimal distance to all other trees, which requires a proper definition of the distance between two trees. The currently proposed tree-based distance metrics [1] compare trees with regard to either the prediction, the clustering in the terminal nodes, or the variables that were used for splitting. Therefore, they either need an additional data set for calculating the distances or capture only a few aspects of the tree architecture. Thus, we developed a new tree-based distance measure that does not use an additional data set and incorporates more of the tree structure by evaluating not only whether a certain variable was used for splitting in a tree, but also where in the tree it was used. We compared our new method with the existing metrics in an extensive simulation study and show that our new distance metric is superior in depicting differences in tree structure. Furthermore, we found that the most representative tree selected by our method has the best prediction performance on independent validation data compared to the trees selected by other metrics.

[1] Banerjee et al. (2012), Identifying representative trees from ensembles, Statistics in Medicine 31(15), 1601-16
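
A hedged sketch (not the authors' actual measure) of how a tree distance can account for both *whether* and *where* a variable is used for splitting: represent each tree by its split variables together with the depth of their shallowest use, down-weight deeper splits, and compare the resulting profiles:

```python
def usage_profile(splits):
    """splits: dict variable -> depth of its shallowest split in the tree.
    Returns dict variable -> weight 2^(-depth), so root splits count most."""
    return {v: 2.0 ** (-d) for v, d in splits.items()}

def tree_distance(splits_a, splits_b):
    """L1 distance between depth-weighted split-variable profiles;
    a variable absent from a tree contributes weight 0."""
    pa, pb = usage_profile(splits_a), usage_profile(splits_b)
    variables = set(pa) | set(pb)
    return sum(abs(pa.get(v, 0.0) - pb.get(v, 0.0)) for v in variables)

t1 = {"x1": 0, "x2": 1}   # x1 at the root, x2 one level down
t2 = {"x1": 0, "x3": 1}   # same root, different second split
t3 = {"x2": 0, "x1": 2}   # same variables as t1, but at other depths
print(tree_distance(t1, t2), tree_distance(t1, t3))
```

Note that trees using the same variables at different depths (t1 vs. t3) still have a positive distance, which a purely split-variable-based metric would miss.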

Interaction forests: Identifying and exploiting influential quantitative and qualitative interaction effects
Roman Hornung
University of Munich, Germany

Even though interaction effects are omnipresent in biomedical data and play a particularly prominent role in genetics, they are given little attention in analysis, in particular in prediction modelling. Identifying influential interaction effects is valuable, both, because they allow important insights into the interplay between the covariates and because these effects can be used to improve the prediction performance of automatic prediction rules.

Random forest is one of the most popular machine learning methods and is known for its ability to capture complex non-linear dependencies between the covariates and the outcome. A key feature of random forest is that it allows ranking the considered covariates with respect to their contribution to prediction using various variable importance measures.

We developed 'interaction forest', a variation of random forest for categorical, metric, and survival outcomes that explicitly considers several types of interaction effects in the splitting performed by the trees constituting the forest. The new 'effect importance measure' (EIM) associated with interaction forest allows ranking the interaction effects between covariate pairs with respect to their importance for prediction, in addition to ranking the univariable effects of the covariates. Using EIM, separate importance value lists for univariable effects, quantitative interaction effects, and qualitative interaction effects are provided. In a real data study using 220 publicly available data sets, the prediction performance of interaction forest is statistically significantly better than that of random forest and of competing random forest variants that, like interaction forest, use multivariable splitting. Moreover, a simulation study suggests that EIM consistently identifies the relevant quantitative and qualitative interaction effects in datasets. Here, the rankings obtained from the EIM value lists for quantitative interaction effects on the one hand and qualitative interaction effects on the other are confirmed to be specific to each of these two types of interaction effects. These results indicate that interaction forest is a suitable tool for identifying and making use of relevant interaction effects in prediction modelling.

A Machine Learning Approach to Empirical Dynamic Modeling for Biochemical Systems
Kevin Siswandi
University of Freiburg, Germany

BACKGROUND

In the biosciences, dynamic modeling plays a very important role for understanding and predicting the temporal behaviour of biochemical systems, with wide-ranging applications from bioengineering to precision medicine. Traditionally, dynamic modeling (e.g. in systems biology) is commonly done with Ordinary Differential Equations (ODEs) to predict system dynamics. Such models are typically constructed based on first-principles equations (e.g. Michaelis-Menten kinetics) that are further iteratively modified to be consistent with experiments. Consequently, it could well take several years before a model is quantitatively predictive. Moreover, such ODE models do not scale with increasing amounts of data. At the same time, the demand for high accuracy predictions is increasing in the biotechnology and synthetic biology industry. Here, we investigate a data-driven approach based on machine learning for empirical dynamic modeling that can allow for faster development relative to traditional first-principles modeling, with a particular focus on biochemical systems.

METHODS

We present a numerical framework for a machine learning approach to discover dynamics from time-series data. The main workflow consists of data augmentation, model training and validation, numerical integration, and model explanation. In contrast to other works, our method does not assume any prior (biological) knowledge or governing equations.

Specifically, by posing dynamics reconstruction as a supervised learning problem, the governing dynamics can be recovered from time-series measurements by solving the resulting optimisation problem. This is done by embedding the learning problem within the classical framework of a numerical method (e.g. a linear multistep method, LMM). We evaluate this approach on canonical systems and complex biochemical systems with nonlinear dynamics.
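
A minimal sketch of this supervised-learning view, under simplifying assumptions not taken from the abstract (one state variable, a forward-difference scheme standing in for a general LMM, and a linear model class): approximate derivatives from the time series, regress them on the state, then integrate the learned model forward.

```python
import math

h = 0.01
# synthetic time series from the (unknown) true dynamics dx/dt = -0.5 x
series = [math.exp(-0.5 * t * h) for t in range(500)]

# supervised data set: state -> finite-difference derivative estimate
X = series[:-1]
dX = [(b - a) / h for a, b in zip(series[:-1], series[1:])]

# "learn" f(x) = k*x by least squares through the origin
k = sum(x * d for x, d in zip(X, dX)) / sum(x * x for x in X)

# integrate the learned model forward (explicit Euler) from x(0) = 1
x = 1.0
for _ in range(500):
    x += h * k * x
print(k, x)  # k is close to the true rate -0.5
```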

RESULTS

We show that this method can discover the dynamics of our test systems given enough data. We further find that it could discover bifurcations, is robust to noise, and capable of leveraging additional data to improve its prediction accuracy at scale. Finally, we employ various explainability studies to extract mechanistic insights from the biochemical systems.

CONCLUSION

By avoiding assumptions about specific mechanisms, we are able to propose a general machine learning workflow. Thus, it can be applied to any new system (e.g. pathways or hosts) and could be used to capture complex dynamic relationships that are still unknown in the literature. We believe that it has the potential to accelerate the development of predictive dynamic models due to its data-driven approach.

### Statistical Machine Learning I

Chairs: Matthias Schmid and Thomas Welchowski

Interpretable Machine Learning
Bernd Bischl
Ludwig-Maximilians-Universität München

Adapting Variational Autoencoders for Realistic Synthetic Data with Skewed and Bimodal Distributions
Faculty of Medicine and Medical Center – University of Freiburg, Germany

Background: Passing synthetic data instead of original data to other researchers is an option when data protection restrictions exist. Such data should preserve the statistical relationships between the variables while protecting privacy. In recent years, deep generative models have enabled significant progress in the field of synthetic data generation. In particular, variational autoencoders (VAEs) are a popular class of deep generative models. Standard VAEs are typically built around a latent space with a Gaussian distribution, which poses a key challenge when they encounter more complex data distributions such as bimodal or skewed data.

Methods: In this work, we propose a novel method for synthetic data generation that also handles bimodal and skewed data while keeping the overall VAE framework. Moreover, this method can generate synthetic data for datasets consisting of both continuous and binary variables. We apply two transformations to convert the data into a form that is more compliant with VAEs. First, we use Box-Cox transformations to bring the skewed distribution closer to a symmetric one. Then, to deal with potentially bimodal data, we employ a power function sgn(x)|x|^p that transforms the data so that its peaks are closer together and its tails lighter. For the evaluation, we use a simulation design based on a large breast cancer study, and the International Stroke Trial (IST) dataset as a real data example.
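
A minimal sketch of the two pre-transformations (illustrative values only; parameter choices are hypothetical):

```python
import math

def boxcox(x, lam):
    """Box-Cox transform for x > 0 (lam = 0 gives the log transform)."""
    return math.log(x) if lam == 0 else (x ** lam - 1) / lam

def power_transform(x, p):
    """sgn(x)*|x|^p: for 0 < p < 1 this pulls the two modes of a
    (centred) bimodal variable closer together and lightens the tails."""
    return math.copysign(abs(x) ** p, x)

# two modes at -8 and +8 move to roughly -2 and +2 with p = 1/3
print(power_transform(-8.0, 1 / 3), power_transform(8.0, 1 / 3))
print(boxcox(math.e, 0.0))
```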

Results: We show that the pre-transformations considerably improve the utility of synthetic data for skewed and bimodal distributions. We investigate this in comparison with standard VAEs, a VAE with an autoregressive implicit quantile network approach (AIQN), and generative adversarial networks (GANs). Our method is the only one that can generate bimodality; the other methods typically generate unimodal distributions. For skewed data, these methods decrease the skewness of the synthetic data and push it toward a symmetric distribution, while our method reproduces the skewness of the original data and better honors its value range.

Conclusion: We developed a simple method that adapts VAEs via transformations to handle skewed and bimodal data. Due to its simplicity, it can be combined with many extensions of VAEs. Thus, it becomes feasible to generate high-quality synthetic clinical data for research under data protection constraints.

Statistical power for cell identity detection in deep generative models
Martin Treppner1,2, Harald Binder1,2
1Institute of Medical Biometry and Statistics, Faculty of Medicine and Medical Center – University of Freiburg, Germany; 2Freiburg Center of Data Analysis and Modelling, Mathematical Institute – Faculty of Mathematics and Physics, University of Freiburg, Germany

One of the most common applications of single-cell RNA-sequencing experiments is to discover groups of cells with similar expression profiles in an attempt to define cell identities. The similarity of these expression profiles is typically examined in a low-dimensional latent space, which can be learned by deep generative models such as variational autoencoders (VAEs). However, the quality of the representations in VAEs varies greatly with the number of cells under study, which is also reflected in the assignment to specific cell identities. We propose a strategy to determine how many cells are needed so that a pre-specified percentage of the cells is well represented in the latent space.

We train VAEs on varying numbers of cells and evaluate the quality of the learned representations using the estimated log-likelihood lower bound of each cell. The distribution of these log-likelihood values is then compared to a permutation-based distribution of log-likelihoods. We generate the permutation-based distribution by randomly drawing a small subset of cells before training the VAE and permuting each gene's expression values among these randomly drawn cells. By doing so, we ensure that the overall structure of the latent representation is preserved while obtaining a null distribution for the log-likelihoods. We then compare log-likelihood distributions for different numbers of cells. We also harness the generative properties of VAEs to artificially increase the number of samples in small datasets, generating synthetic data and combining them with the original pilot datasets.
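
The permutation scheme can be sketched as follows (toy expression matrix; the VAE training itself is omitted): within a small random subset of cells, each gene's expression values are shuffled across those cells, leaving all other cells untouched.

```python
import random

def permute_within_subset(matrix, subset, rng):
    """matrix: list of cells, each a list of gene expression values.
    Returns a copy in which each gene is permuted among `subset` cells."""
    out = [row[:] for row in matrix]
    n_genes = len(matrix[0])
    for g in range(n_genes):
        values = [matrix[c][g] for c in subset]
        rng.shuffle(values)
        for c, v in zip(subset, values):
            out[c][g] = v
    return out

rng = random.Random(0)
cells = [[1, 10], [2, 20], [3, 30], [4, 40]]  # 4 cells x 2 genes
perm = permute_within_subset(cells, subset=[0, 1, 2], rng=rng)
print(perm)  # cell 3 is unchanged; genes are shuffled among cells 0-2
```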

We demonstrate performance on varying sizes of subsamples of the Tabula Muris scRNA-seq dataset from the brain of seven mice processed with the SMART-Seq2 protocol. We show that our approach can be used to plan cell numbers for single-cell RNA-seq experiments, which might improve the reliability of downstream analyses such as cell identity detection and inference of developmental trajectories.

Individualizing deep dynamic models for psychological resilience data
Göran Köber1,2, Shakoor Pooseh2,3, Haakon Engen4, Andrea Chmitorz5,6,7, Miriam Kampa5,8,9, Anita Schick4,10, Alexandra Sebastian6, Oliver Tüscher5,6, Michèle Wessa5,11, Kenneth S.L. Yuen4,5, Henrik Walter12,13, Raffael Kalisch4,5, Jens Timmer2,3,14, Harald Binder1,2
1Institute of Medical Biometry and Statistics, Faculty of Medicine and Medical Center, University of Freiburg, Germany; 2Freiburg Center of Data Analysis and Modelling (FDM), University of Freiburg, Freiburg, 79104, Germany; 3Institute of Physics, University of Freiburg, 79104, Germany; 4Neuroimaging Center (NIC), Focus Program Translational Neuroscience (FTN), Johannes Gutenberg University Medical Center, Mainz, 55131, Germany; 5Leibniz Institute for Resilience Research (LIR), Mainz, 55122, Germany; 6Department of Psychiatry and Psychotherapy, Johannes Gutenberg University Medical Center, Mainz, 55131, Germany; 7Faculty of Social Work, Health and Nursing, University of Applied Sciences Esslingen, Esslingen, 73728, Germany; 8Department of Clinical Psychology, University of Siegen, 57076, Germany; 9Bender Institute of Neuroimaging (BION), Department of Psychology, Justus Liebig University, Gießen, 35394, Germany; 10Department of Public Mental Health, Central Institute of Mental Health, Medical Faculty Mannheim, Heidelberg University, Germany; 11Department of Clinical Psychology and Neuropsychology, Institute of Psychology, Johannes Gutenberg University, Mainz, 55131, Germany; 12Research Division of Mind and Brain, Charité–Universitätsmedizin Berlin, Corporate Member of Freie Universität Berlin, Humboldt-Universität zu Berlin, and Berlin Institute of Health, Germany; 13Berlin School of Mind and Brain, Humboldt-Universität zu Berlin, Germany; 14CIBSS—Centre for Integrative Biological Signaling Studies, University of Freiburg, 79104, Germany

Deep learning approaches can uncover complex patterns in data. In particular, variational autoencoders (VAEs) achieve this by a non-linear mapping of data into a low-dimensional latent space. Motivated by an application to psychological resilience in the Mainz Resilience Project (MARP), which features intermittent longitudinal measurements of stressors and mental health, we propose an approach for individualized, dynamic modeling in this latent space. Specifically, we utilize ordinary differential equations (ODEs) and develop a novel technique for obtaining person-specific ODE parameters even in settings with a rather small number of individuals and observations, incomplete data, and a differing number of observations per individual. This technique allows us to subsequently investigate individual reactions to stimuli, such as the mental health impact of stressors. A potentially large number of baseline characteristics can then be linked to this individual response by regularized regression, e.g., for identifying resilience factors. Thus, our new method provides a way of connecting different kinds of complex longitudinal and baseline measures via individualized, dynamic models. The promising results obtained in the exemplary resilience application indicate that our proposal for dynamic deep learning might also be more generally useful for other application domains.

### Statistical Software Development

Chairs: Fabian Scheipl and Gernot Wassmer

A web application to determine statistically optimal designs for dose-response trials, especially with interactions
Tim Holland-Letz, Annette Kopp-Schneider
German Cancer Research Center DKFZ, Germany

Statistical optimal design theory is well developed but almost never used in practical applications in fields such as toxicology. For the area of dose-response trials, we therefore present an R-Shiny-based web application which calculates D-optimal designs for the most commonly fitted dose-response functions, namely the log-logistic and the Weibull function. In this context, the application also generates a graphical representation of the design space (a “design heatmap”). Furthermore, the application allows checking the efficiencies of user-specified designs. In addition, uncertainty regarding the assumptions about the true parameters can be accounted for in the form of average optimal designs. Thus, the user can find a design which is a compromise between rigid optimality and more practical designs that also incorporate specific preferences and technical requirements.

Finally, the app can also be used to compute designs for substance-interaction trials between two substances combined in a ray design setup, including an a priori estimate for the parameters of the combination to be expected under the (Loewe) additivity assumption.

Distributed Computation of the AUROC-GLM Confidence Intervals Using DataSHIELD
Daniel Schalk1, Stefan Buchka2, Ulrich Mansmann2, Verena Hoffmann2
1Department of Statistics, LMU Munich; 2The Institute for Medical Information Processing, Biometry, and Epidemiology, LMU Munich

Distributed calculation protects data privacy without ruling out complex statistical analyses. Individual data stay in local databases, invisible to the analyst, who only receives aggregated results. A distributed algorithm that calculates a ROC curve and its AUC estimate with a confidence interval is presented to evaluate a therapeutic decision rule. It will be embedded in the DataSHIELD framework [1].

The starting point is the ROC-GLM approach by Pepe et al. [2]. The additivity of the Fisher information matrix, of the score vector, and of the CI proposed by DeLong [3] for aggregating intermediate results allows designing a distributed algorithm that calculates estimates of the ROC-GLM, its AUC, and the CI.
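
A hedged illustration of the additivity principle only (not of the ROC-GLM algorithm itself, and with invented toy data): each site ships only aggregated summaries, and summing those summaries reproduces the pooled, non-distributed estimate while individual data never leave the sites.

```python
def local_summary(labels):
    """Per-site aggregate: (number of events, number of patients).
    Only these two numbers leave the site."""
    return sum(labels), len(labels)

def aggregate(summaries):
    """Central aggregator: sums the additive site summaries."""
    events = sum(e for e, _ in summaries)
    total = sum(n for _, n in summaries)
    return events / total

site_a = [1, 0, 0, 1]
site_b = [1, 1, 0]
site_c = [0, 0]

distributed = aggregate([local_summary(s) for s in (site_a, site_b, site_c)])
pooled = sum(site_a + site_b + site_c) / len(site_a + site_b + site_c)
print(distributed, pooled)  # identical results
```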

We simulate scores and labels (responses) to create AUC values within the range [0.5, 1]. The size of individual studies is uniformly distributed on [100, 2500], while the percentage of treatment response covers [0.2, 0.8]. Per scenario, 10000 studies are produced. Per study, the AUC is calculated in a non-distributed empirical as well as a distributed setting. The difference in AUC between both approaches is independent of the number of distributed components and lies within the range [-0.019, 0.013]. The boundaries of bootstrapped CIs in the non-distributed empirical setting are close to those of the distributed approach with the DeLong CI: range of differences in the lower boundary [-0.015, 0.03]; range of upper boundary deviations [-0.012, 0.026].

The distributed algorithm allows anonymous multicentric validation of the discrimination of a classification rule. A specific application is the audit use case within the MII consortium DIFUTURE (difuture.de). The multicentric prospective ProVAL-MS study (DRKS: 00014034) on patients with newly diagnosed relapsing-remitting multiple sclerosis provides the data for a privacy-protected validation of a treatment decision score (also developed by DIFUTURE) regarding discrimination between good and insufficient treatment response. The simulation results demonstrate that our algorithm is suitable for the planned validation. The algorithm is implemented in R for use within DataSHIELD. It will be made publicly available.

[1] Gaye, A. et al. (2014). DataSHIELD: taking the analysis to the data, not the data to the analysis. International Journal of Epidemiology.

[2] Pepe, M. S. (2003). The Statistical Evaluation of Medical Tests for Classification and Prediction. Oxford University Press.

[3] DeLong, E. R., DeLong, D. M., and Clarke-Pearson, D. L. (1988). Comparing the areas under two or more correlated receiver operating characteristic curves: a nonparametric approach. Biometrics, pages 837–845.

Interactive review of safety data during a data monitoring committee using R-Shiny
Tobias Mütze1, Bo Wang2, Douglas Robinson2
1Statistical Methodology, Novartis Pharma AG, Switzerland; 2Scientific Computing and Consulting, Novartis Pharma AG, Switzerland

In clinical trials, it is common that the safety of patients is monitored by a data monitoring committee (DMC) that operates independently of the clinical trial teams. After each review of the accumulating trial data, it is within the DMC's responsibility to decide whether to continue or stop the trial. The data are generally presented to DMCs in a static report through tables, listings, and sometimes figures. In this presentation, we share our experiences with supplementing the safety data review with an interactive R-Shiny app. We will first present the layout and content of the app. Then, we outline the advantages of reviewing (safety) data by means of an interactive app compared to the standard review of a DMC report, namely the extensive use of graphical illustrations in addition to tables, and the ability to quickly change the level of detail and to switch between study-level and subject-level data. We argue that this leads to a robust collaborative discussion and a more complete understanding of the data. Finally, we discuss the qualification process of the R-Shiny app itself and outline how the learnings may be applied to enhance standard DMC reports.

References

[1] Wang, W., Revis, R., Nilsson, M. and Crowe, B., 2020. Clinical Trial Drug Safety Assessment with Interactive Visual Analytics. Statistics in Biopharmaceutical Research, pp.1-12.

[2] Fleming, T.R., Ellenberg, S.S. and DeMets, D.L., 2018. Data monitoring committees: current issues. Clinical Trials, 15(4), pp.321-328.

[3] Mütze, T. and Friede, T., 2020. Data monitoring committees for clinical trials evaluating treatments of COVID-19. Contemporary Clinical Trials, 98, 106154.

[4] Buhr, K.A., Downs, M., Rhorer, J., Bechhofer, R. and Wittes, J., 2018. Reports to independent data monitoring committees: an appeal for clarity, completeness, and comprehensibility. Therapeutic innovation & regulatory science, 52(4), pp.459-468.

An R package for an integrated evaluation of statistical approaches to cancer incidence projection
Maximilian Knoll1,2,3,4, Jennifer Furkel1,2,3,4, Jürgen Debus1,3,4, Amir Abdollahi1,3,4, André Karch5, Christian Stock6,7
1Department of Radiation Oncology, Heidelberg University Hospital, Heidelberg, Germany; 2Faculty of Biosciences, Heidelberg University, Heidelberg, Germany; 3Clinical Cooperation Unit Radiation Oncology, German Cancer Research Center (DKFZ), Heidelberg, Germany; 4German Cancer Consortium (DKTK) Core Center Heidelberg, Heidelberg, Germany; 5Institute of Epidemiology and Social Medicine, University of Muenster, Muenster, Germany.; 6Institute of Medical Biometry and Informatics (IMBI), University of Heidelberg, Heidelberg, Germany; 7Division of Clinical Epidemiology and Aging Research, German Cancer Research Center (DKFZ), Heidelberg, Germany

Background: Projection of future cancer incidence is an important task in cancer epidemiology. The results are also of interest for biomedical research and public health policy. Age-Period-Cohort (APC) models, usually based on long-term cancer registry data (>20 years), are established for such projections. In many countries (including Germany), however, nationwide long-term data are not yet available. It is unclear which statistical approach should be recommended for projections using rather short-term data.

Methods: To enable a comparative analysis of the performance of statistical approaches to cancer incidence projection, we developed an R package (incAnalysis), with particular support for Bayesian models fitted via integrated nested Laplace approximation (INLA). Its use is demonstrated by an extensive empirical evaluation of operating characteristics (bias, coverage and precision) of potentially applicable models differing in complexity. Observed long-term data from three cancer registries (SEER-9, NORDCAN, Saarland) were used for benchmarking.

Results: Overall, coverage was high (mostly >90%) for Bayesian APC models (BAPC), whereas less complex models showed differences in coverage depending on the projection period. Intercept-only models yielded coverage values below 20%. Bias increased and precision decreased for longer projection periods (>15 years) for all except intercept-only models. Precision was lowest for complex models such as BAPC models, generalized additive models with multivariate smoothers, and generalized linear models with age × period interaction effects.

Conclusion: The incAnalysis R package allows a straightforward comparison of cancer incidence rate projection approaches. Further detailed and targeted investigations into model performance in addition to the presented empirical results are recommended to derive guidance on appropriate statistical projection methods in a given setting.
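The kind of comparison that incAnalysis automates can be sketched in a few lines. The following is a deliberately simplified, hypothetical setup (not the package's actual API): fit an intercept-only and a log-linear trend model to short-term log-rates, project forward, and measure the mean bias against held-out years.

```python
import math

# Hypothetical yearly incidence rates per 100,000 (synthetic, exactly
# log-linear here): 10 training years, 5 held-out projection years.
years = list(range(2000, 2015))
rates = [30.0 * math.exp(0.02 * (y - 2000)) for y in years]
train_y, train_r = years[:10], rates[:10]
test_y, test_r = years[10:], rates[10:]

def fit_intercept_only(ys, rs):
    """Project the mean log-rate forward (no trend)."""
    mean_log = sum(math.log(r) for r in rs) / len(rs)
    return lambda y: math.exp(mean_log)

def fit_loglinear(ys, rs):
    """Ordinary least squares on log-rates: log r = a + b * year."""
    n = len(ys)
    my = sum(ys) / n
    ml = sum(math.log(r) for r in rs) / n
    b = (sum((y - my) * (math.log(r) - ml) for y, r in zip(ys, rs))
         / sum((y - my) ** 2 for y in ys))
    a = ml - b * my
    return lambda y: math.exp(a + b * y)

for name, fit in [("intercept-only", fit_intercept_only),
                  ("log-linear", fit_loglinear)]:
    model = fit(train_y, train_r)
    bias = sum(model(y) - r for y, r in zip(test_y, test_r)) / len(test_y)
    print(f"{name}: mean projection bias = {bias:+.2f}")
```

On rising rates, the intercept-only model is systematically biased downward, mirroring the poor coverage of intercept-only models reported above; the benchmarked Bayesian APC and GAM approaches add smoothing and uncertainty quantification on top of this basic idea.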

Using Differentiable Programming for Flexible Statistical Modeling
Maren Hackenberg1, Marlon Grodd1, Clemens Kreutz1, Martina Fischer2, Janina Esins2, Linus Grabenhenrich2, Christian Karagiannidis3, Harald Binder1
1Institute of Medical Biometry and Statistics, Faculty of Medicine and Medical Center, University of Freiburg, Germany; 2Robert Koch Institute, Berlin, Germany; 3Department of Pneumology and Critical Care Medicine, Cologne-Merheim Hospital, ARDS and ECMO Center, Kliniken der Stadt Köln, Witten/Herdecke University Hospital, Cologne, Germany

Differentiable programming has recently received much interest as a paradigm that facilitates taking gradients of computer programs. While the corresponding flexible gradient-based optimization approaches so far have been used predominantly for deep learning or enriching the latter with modeling components, we want to demonstrate that they can also be useful for statistical modeling per se, e.g., for quick prototyping when classical maximum likelihood approaches are challenging or not feasible.

In an application from a COVID-19 setting, we utilize differentiable programming to quickly build and optimize a flexible prediction model adapted to the data quality challenges at hand. Specifically, we develop a regression model, inspired by delay differential equations, that can bridge temporal gaps of observations in the central German registry of COVID-19 intensive care cases for predicting future demand. With this exemplary modeling challenge, we illustrate how differentiable programming can enable simple gradient-based optimization of the model by automatic differentiation. This allowed us to quickly prototype a model under time pressure that outperforms simpler benchmark models.

We thus exemplify the potential of differentiable programming beyond deep learning applications, providing more options for flexible applied statistical modeling.
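The core mechanism of the paradigm, obtaining exact gradients of a model program by automatic differentiation and feeding them to a generic optimizer, can be illustrated with a toy example. The sketch below hand-rolls forward-mode automatic differentiation via dual numbers and fits a one-parameter exponential-decay regression by gradient descent; it is only an illustration of the principle, not the delay-differential-equation-inspired COVID-19 model described above, which was built with a full differentiable programming framework.

```python
# Toy forward-mode automatic differentiation via dual numbers, used to
# fit the one-parameter model y = exp(-k * t) by gradient descent.
import math

class Dual:
    """Number carrying its derivative with respect to one parameter."""
    def __init__(self, val, der=0.0):
        self.val, self.der = val, der
    def __add__(self, o):
        o = o if isinstance(o, Dual) else Dual(o)
        return Dual(self.val + o.val, self.der + o.der)
    def __sub__(self, o):
        o = o if isinstance(o, Dual) else Dual(o)
        return Dual(self.val - o.val, self.der - o.der)
    def __mul__(self, o):
        o = o if isinstance(o, Dual) else Dual(o)
        return Dual(self.val * o.val,
                    self.der * o.val + self.val * o.der)
    __radd__, __rmul__ = __add__, __mul__

def dexp(x):
    """exp for dual numbers (chain rule applied to the derivative part)."""
    return Dual(math.exp(x.val), math.exp(x.val) * x.der)

# Synthetic observations from y = exp(-0.3 * t)
ts = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [math.exp(-0.3 * t) for t in ts]

def loss(k):
    """Sum of squared errors of exp(-k * t); k is a Dual."""
    err = Dual(0.0)
    for t, y in zip(ts, ys):
        r = dexp(k * (-t)) - y
        err = err + r * r
    return err

k = 0.0
for _ in range(200):            # plain gradient descent
    g = loss(Dual(k, 1.0)).der  # exact gradient via the dual part
    k -= 0.1 * g
print(round(k, 3))              # recovers the true decay rate 0.3
```

The same pattern, differentiate the whole model program and optimize with gradients, scales to far richer models, which is what makes the approach attractive for rapid statistical prototyping when closed-form likelihood machinery is unavailable.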