Data Sharing and Reproducible Research

The Statistical Assessment of Replication Success
Leonhard Held
Epidemiology, Biostatistics and Prevention Institute (EBPI) and Center for Reproducible Science (CRS), University of Zurich

Replicability of research findings is crucial to the credibility of all empirical domains of science. However, there is no established standard how to assess replication success and in practice many different approaches are used. Statistical significance of both the original and replication study is known as the two-trials rule in drug regulation but does not take the corresponding effect sizes into account.

We compare the two-trials rule with the sceptical p-value (Held, 2020), an attractive compromise between hypothesis testing and estimation. This approach penalizes shrinkage of the replication effect estimate compared to the original one, while ensuring that both are also statistically significant to some extent. We describe a recalibration of the procedure as proposed in Held et al (2020), the golden level. The golden level guarantees that borderline significant original studies can only be replicated successfully if the replication effect estimate is larger than the original one. The recalibrated sceptical p-value offers uniform gains in project power compared to the two-trials rule and controls the Type-I error rate except for very small replication sample sizes. An application to data from four large replication projects shows that the new approach leads to more appropriate inferences, as it penalizes shrinkage of the replication estimate compared to the original one, while ensuring that both effect estimates are sufficiently convincing on their own. Finally we describe how the approach can also be used to design the replication study based on specification of the minimum relative effect size to achieve replication success.

Held, Leonhard (2020) A new standard for the analysis and design of replication studies (with discussion). Journal of the Royal Statistical Society, Series A, 183:431–469.

Held, Leonhard and Micheloud, Charlotte and Pawel, Samuel (2020). The assessment of replication success based on relative effect size. https://arxiv.org/abs/2009.07782


Multivariate regression modelling with global and cohort-specific effects in a federated setting with data protection constraints
Max Behrens, Daniela Zöller
University of Freiburg, Germany

Multi-cohort studies are an important tool to study effects on a large sample size and to identify cohort-specific effects. Thus, researchers would like to share information between cohorts and research institutes. However, data protection constraints forbid the exchange of individual-level data between different research institutes. To circumvent this problem, only non-disclosive aggregated data is exchanged, which is often done manually and requires explicit permission before transfer. The framework DataSHIELD enables automatic exchange in iterative calls, but methods for performing more complex tasks such as federated optimisation and boosting techniques are missing.

We propose an iterative optimization of multivariate regression models which condenses global (cohort-unspecific) and local (cohort-specific) predictors. This approach will be solely based on non-disclosive aggregated data from different institutions. The approach should be applicable in a setting with high-dimensional data with complex correlation structures. Nonetheless, the amount of transferred data should be limited to enable manual confirmation of data protection compliance.

Our approach implements an iterative optimization between local and global model estimates. Herein, the linear predictor of the global model will act as a covariate in the local model estimation. Subsequently, the linear predictor of the updated local model is included in the global model estimation. The procedure is repeated until no further model improvement is observed for the local model estimates. In case of an unknown variable structure, our approach can be extended with an iterative boosting procedure performing variable selection for both the global and local model.

In a simulation study, we aim to show that our approach improves both global and local model estimates while preserving the globally found effect structure. Furthermore, we want to demonstrate the approach to grant protected access to a multi-cohort data pool concerning gender sensitive studies. Specifically, we aim to apply the approach to improve upon cohort-specific model estimates by incorporating a global model based on multiple cohorts. We will apply the method to real data obtained in the GESA project, where we combined data from the three large German population-based cohorts GHS, SHIP, and KORA to identify potential predictors for mental health protectories.

In general, all gradient-based methods can be adapted easily to a federated setting under data protection constraints. The here presented method can be used in this setting to perform iterative optimisation and can thus aid in the process of understanding cohort-specific estimates. We provide an implementation in the DataSHIELD framework.


A replication crisis in methodological statistical research?
Anne-Laure Boulesteix1, Stefan Buchka1, Alethea Charlton1, Sabine Hoffmann1, Heidi Seibold2, Rory Wilson2
1LMU Munich, Germany; 2Helmholtz Zentrum Munich, Germany

Statisticians are often keen to analyze the statistical aspects of the so-called “replication crisis”. They condemn fishing expeditions and publication bias across empirical scientific fields applying statistical methods. But what about good practice issues in their own – methodological – research, i.e. research considering statistical methods as research objects? When developing and evaluating new statistical methods and data analysis tools, do statisticians adhere to the good practice principles they promote in fields which apply statistics? I argue that statisticians should make substantial efforts to address what may be called the replication crisis in the context of methodological research in statistics and data science. In the first part of my talk, I will discuss topics such as publication bias, the design and necessity of neutral comparison studies and the importance of appropriate reporting and research synthesis in the context of methodological research.

In the second part of my talk I will empirically illustrate a specific problem which affects research articles presenting new data analysis methods. Most of these articles claim that “the new method performs better than existing methods”, but the veracity of such statements is questionable. An optimistic bias may arise during the evaluation of novel data analysis methods resulting from, for example, selection of datasets or competing methods; better ability to fix bugs in a preferred method; and selective reporting of method variants. This bias is quantitatively investigated using a topical example from epigenetic analysis: normalization methods for data generated by the Illumina HumanMethylation450K BeadChip microarray.


Reproducible bioinformatics workflows: A case study with software containers and interactive notebooks
Anja Eggert, Pal O Westermark
Leibniz Institute for Farm Animal Biology, Deutschland

We foster transparent and reproducible workflows in bioinformatics, which is challenging given their complexity. We developed a new statistical method in the field of circadian rhythmicity, which allows to rigorously determine whether measured quantities such as gene expressions are not rhythmic. Knowledge of no or at most weak rhythmicity may significantly simplify studies, aid detection of abolished rhythmicity, and facilitate selection of non-rhythmic reference genes or compounds, among other applications. We present our solution to this problem in the form of a precisely formulated mathematical statistic accompanied by a software called SON (Statistics Of Non-rhythmicity). The statistical method itself is implemented in the R package “HarmonicRegression”, available on the CRAN repository. However, the bioinformatics workflow is much larger than the statistical test. For instance, to ensure the applicability and validity of the statistical method, we simulated data sets of 20,000 gene expressions over two days, with a large range of parameter combinations (e.g. sampling interval, fraction of rhythmicity, amount of outliers, detection limit of rhythmicity, etc.). Here we describe and demonstrate the use of a Jupyter notebook to document, specify, and distribute our new statistical method and its application to both simulated and experimental data sets. Jupyter notebooks combine text documentation with dynamically editable and executable code and are an implementation of the concept of literate programming. Thus, parameters and code can be modified, allowing both verification of results, as well as instant experimentation by peer reviewers and other users of the science community. Our notebook runs inside a Docker software container, which mirrors the original software environment. This approach avoids the need to install any software and ensures complete long-term reproducibility of the workflow. This bioinformatics workflow allows full reproducibility of our computational work.