Causal Inference and Statistical Methods for Epidemiology

Causal inference methods for small non-randomized studies: Methods and an application in COVID-19
Sarah Friedrich, Tim Friede
University Medical Center Göttingen, Germany

The usual development cycles are too slow for the development of vaccines, diagnostics and treatments in pandemics such as the ongoing SARS-CoV-2 pandemic. Given the pressure in such a situation, there is a risk that findings of early clinical trials are overinterpreted despite their limitations in terms of size and design. Motivated by a non-randomized open-label study investigating the efficacy of hydroxychloroquine in patients with COVID-19, we describe in a unified fashion various alternative approaches to the analysis of non-randomized studies. Propensity score (PS) methods and g-computation are widely used tools to reduce the impact of treatment-selection bias. Conditioning on the propensity score allows one to replicate the design of a randomized controlled trial, conditional on observed covariates. Moreover, doubly robust estimators provide additional advantages. Here, we investigate in a simulation study the properties of propensity-score-based methods, including three variations of doubly robust estimators, in small sample settings typical for early trials.
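As an illustration of the doubly robust idea, the following minimal Python sketch computes an augmented inverse probability weighted (AIPW) estimate of the average treatment effect. It is not the authors' implementation, and the abstract does not state which three doubly robust variants were studied. The function name `aipw_ate`, the `(x, a, y)` data layout and the restriction to a single discrete confounder are assumptions made here for brevity; with one discrete confounder, both the propensity score and the outcome model reduce to stratum-specific proportions and means.

```python
from collections import defaultdict


def aipw_ate(records):
    """AIPW estimate of the average treatment effect.

    records: iterable of (x, a, y) with a discrete confounder x,
    a binary treatment a (0 or 1) and a numeric outcome y.
    Both nuisance models are saturated (assumption of this sketch):
    the propensity score is the treated fraction within each stratum of x,
    the outcome model is the mean of y within each (x, a) cell.
    """
    records = list(records)
    n_stratum = defaultdict(int)
    n_treated = defaultdict(int)
    y_sum = defaultdict(float)
    y_count = defaultdict(int)
    for x, a, y in records:
        n_stratum[x] += 1
        n_treated[x] += a
        y_sum[(x, a)] += y
        y_count[(x, a)] += 1

    total = 0.0
    for x, a, y in records:
        e = n_treated[x] / n_stratum[x]  # estimated propensity score
        if e <= 0.0 or e >= 1.0:
            # Positivity: each stratum needs treated and untreated subjects.
            raise ValueError(f"positivity violated in stratum {x!r}")
        m1 = y_sum[(x, 1)] / y_count[(x, 1)]  # outcome model under treatment
        m0 = y_sum[(x, 0)] / y_count[(x, 0)]  # outcome model under control
        # Outcome-model prediction plus inverse-probability-weighted residual.
        total += (m1 - m0) + a * (y - m1) / e - (1 - a) * (y - m0) / (1.0 - e)
    return total / len(records)
```

With parametric nuisance models in place of the stratum means, the same estimating equation remains consistent if either the propensity model or the outcome model is correctly specified, which is the property that motivates doubly robust estimators.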

Testing Instrument Validity in Multivariable Mendelian Randomisation
Maximilian Michael Mandl1, Anne-Laure Boulesteix1, Stephen Burgess2, Verena Zuber3
1Ludwig-Maximilians-Universität München; 2University of Cambridge; 3Imperial College London

Identification of causal effects in biomedical sciences is a challenging task. Most causal inference methods rely on specific assumptions which in practice may be unrealistic and too restrictive. Mendelian Randomisation (MR) is an instrumental variable approach that makes use of genetic variants to infer a causal effect of a risk factor on an outcome. Because genetic variants are randomised during meiosis, they are natural candidates for instrumental variables that have the potential to meet the restrictive methodological requirements. Thus, causal effects can be consistently inferred even if unobserved confounders are present. This setting still requires the genetic variants to be independent of the outcome conditional on the risk factor and unobserved confounders, which is known as the exclusion-restriction assumption (ERA). Violations of this assumption, i.e. effects of the instrumental variables on the outcome through a path other than the risk factor included in the model, can be caused by pleiotropy, which is a common phenomenon in human genetics. As an extension to the standard MR approach, multivariable MR includes multiple potential risk factors in one joint model, accounting for measured pleiotropy. Genetic variants that violate the ERA appear as outliers to the MR model fit and can be detected by general heterogeneity statistics proposed in the literature. In MR analysis these statistics are often inflated due to heterogeneity in how genetic variants exert their downstream effects on the exposures of interest, which impedes detection of outlying instruments using the traditional methods.

Removing valid instruments or keeping invalid instruments in the MR model may lead to a bias of the causal effect estimates and false positive findings. As different heterogeneity measures lead to a variety of conclusions with regard to outlying instruments, researchers face a typical decision problem, also known as researcher degrees of freedom. These free choices in the selection of valid instruments can lead to serious problems like fishing for significance.

Firstly, we demonstrate the impact of outliers and show how arbitrary choices in the selection of instrumental variables can induce false positive findings, both in realistic simulation studies and in the analysis of real data investigating the effect of blood lipids on coronary heart disease and Alzheimer’s disease. Secondly, we propose a method that corrects for overdispersion of the heterogeneity statistics in MR analysis by making use of the estimated inflation factor to correctly remove outlying instruments, thereby accounting for pleiotropic effects.
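The general idea of rescaling heterogeneity statistics before flagging outlying instruments can be sketched as follows. This is a simplified, hypothetical illustration and not the proposed method: it uses a single risk factor rather than multivariable MR, an inverse-variance weighted (IVW) estimate, per-variant contributions to Cochran's Q, and an inflation factor estimated in the style of genomic control (median contribution divided by the median of a chi-square distribution with 1 degree of freedom). The abstract does not specify how the inflation factor is estimated, so that choice, the Bonferroni threshold and all function names are assumptions.

```python
import math
from statistics import median

CHI2_1_MEDIAN = 0.4549364  # median of the chi-square distribution with 1 df


def ivw_estimate(beta_x, beta_y, se_y):
    """Inverse-variance weighted causal estimate for a single risk factor."""
    weights = [1.0 / s ** 2 for s in se_y]
    num = sum(w * bx * by for w, bx, by in zip(weights, beta_x, beta_y))
    den = sum(w * bx * bx for w, bx in zip(weights, beta_x))
    return num / den


def q_contributions(beta_x, beta_y, se_y, theta):
    """Per-variant contributions to Cochran's Q heterogeneity statistic."""
    return [((by - theta * bx) / s) ** 2
            for bx, by, s in zip(beta_x, beta_y, se_y)]


def flag_outliers(beta_x, beta_y, se_y, alpha=0.05):
    """Flag variants whose rescaled Q contribution is unexpectedly large."""
    theta = ivw_estimate(beta_x, beta_y, se_y)
    q = q_contributions(beta_x, beta_y, se_y, theta)
    # Assumed estimator of the inflation factor (genomic-control style);
    # values below 1 are truncated so the statistics are never scaled up.
    inflation = max(1.0, median(q) / CHI2_1_MEDIAN)
    threshold = alpha / len(q)  # Bonferroni correction (assumption)
    flagged = []
    for j, qj in enumerate(q):
        # Upper tail of chi-square(1): P(X > t) = erfc(sqrt(t / 2)).
        p_value = math.erfc(math.sqrt(qj / inflation / 2.0))
        if p_value < threshold:
            flagged.append(j)
    return theta, inflation, flagged
```

In practice one would typically re-estimate the causal effect after removing flagged variants, since a strong outlier also shifts the initial IVW estimate.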

Causal Discovery with Incomplete Cohort Data
Janine Witte1,2, Ronja Foraita1, Ryan M. Andrews1, Vanessa Didelez1,2
1Leibniz Institute for Prevention Research and Epidemiology – BIPS, Bremen, Germany; 2University of Bremen, Germany


Cohort studies in health research often involve the collection of large numbers of variables over a period of time. They thus form the ideal basis for exploring the relationships among many variables simultaneously, e.g. by methods such as constraint-based causal discovery. These methods aim at inferring a causal graph, combining causal assumptions with statistical tests for conditional independence. Missing values are a typical problem in practice. Simple methods for dealing with incomplete data, such as list-wise deletion and mean imputation, can lead to inefficient and biased inference.


We consider test-wise deletion and multiple imputation for causal discovery. The former applies each conditional independence test to the records containing complete information on the variables used for the test. For multiple imputation, the missing values are imputed M>1 times, the conditional independence test is run on each of the M data sets, and the test results are combined using an appropriate pooling method. We implemented multiple imputation and pooling procedures for causal discovery with continuous, discrete and mixed data. We then compared the performance of test-wise deletion and multiple imputation in scenarios with different missing data patterns typical for cohort data.


Both test-wise deletion and multiple imputation rely on untestable assumptions about the missingness mechanism. Test-wise deletion is computationally simple and can in principle be combined with any conditional independence test. However, it ignores possibly valuable information in partially observed records, hence the power can be low. Multiple imputation has the potential to exploit more information, and outperformed test-wise deletion in several of our simulation scenarios. The simulations also showed, however, that conditional independence testing after multiple imputation is impaired by small sample sizes and large numbers of conditioning variables, especially when the variables are categorical or mixed. Care needs to be taken when choosing the imputation models, as multiple imputation may break down when the number of variables is large, as is typical for cohort studies. Preliminary results suggest that drop-out is best dealt with using test-wise deletion.


Both test-wise deletion and multiple imputation are promising strategies for dealing with missing values in causal discovery, each with their own advantages. Multiple imputation can potentially exploit more information than test-wise deletion, but requires some care when choosing the imputation models. R code for combining test-wise deletion and multiple imputation with different conditional independence tests is available.

What Difference Does Multiple Imputation Make In Longitudinal Modeling of EQ-5D-5L Data: Empirical Analyses of Two Datasets
Lina Maria Serna Higuita1, Inka Roesel1, Fatima Al Sayah2, Maresa Buchholz3, Ines Buchholz3, Thomas Kohlmann3, Peter Martus1, You-Shan Feng1
1Institute for Clinical Epidemiology and Applied Biostatistics, Medical University of Tübingen, Tübingen, Germany; 2Alberta PROMs and EQ-5D Research and Support Unit (APERSU), School of Public Health, University of Alberta, Alberta, Canada; 3Institute for Community Medicine, Medical University Greifswald, Greifswald, Germany

Background: Although multiple imputation (MI) is the state-of-the-art method for managing missing data, it is not clear how missing values in multi-item instruments should be handled, e.g. MI at item or at score level. In addition, longitudinal data analysis techniques such as mixed models (MM) may be equally valid. We therefore explored the differences in modeling the scores of a health-related quality of life questionnaire (EQ-5D-5L) using MM with and without MI at item and score level, in two real data sets.

Methods: We conducted 1) an agreement analysis using the observed missing data patterns of EQ-5D-5L responses in a Canadian study, which included patients with type-II diabetes at three time points (Alberta’s Caring for Diabetes (ABCD); n=2,040); and 2) a validation analysis using simulated missing patterns for complete cases of a German multi-center study of rehabilitation patients pre- and post-treatment (German Rehabilitation (GR); n=691). Two missingness mechanisms (MCAR and MAR) at 8 percentages of missing data (5%-65%) were applied to the GR data. The approaches to handling missing EQ-5D-5L scores in both datasets were: approach-1) MM using only respondents with complete data, approach-2) MM using all available data, approach-3) MM after MI of the EQ-5D-5L scores, and approach-4) MM after MI of the EQ-5D-5L items. Agreement was assessed by comparing predicted values and regression coefficients. Validation was examined using mean squared errors (MSE) and standard errors (SE) compared to the original dataset.
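The difference between the two simulated missingness mechanisms can be made concrete with a small Python sketch. It is an illustration under assumptions, not the study's simulation code: the abstract does not say which variables drove the MAR mechanism, so the fully observed `covariate`, the logistic form, the `slope` parameter and the function names are all hypothetical. Under MCAR every score has the same deletion probability; under MAR the deletion probability depends on an observed covariate, with the intercept calibrated so that the expected proportion of missing scores matches the target.

```python
import math
import random


def _logistic(t):
    return 1.0 / (1.0 + math.exp(-t))


def impose_mcar(scores, proportion, rng):
    """MCAR: each score is deleted with the same probability."""
    return [None if rng.random() < proportion else s for s in scores]


def impose_mar(scores, covariate, proportion, rng, slope=1.0):
    """MAR: deletion probability rises with a fully observed covariate.

    A logistic model on the standardized covariate is assumed; the
    intercept is found by bisection so that the mean deletion
    probability equals the requested proportion.
    """
    n = len(covariate)
    mean = sum(covariate) / n
    sd = math.sqrt(sum((c - mean) ** 2 for c in covariate) / n)
    z = [(c - mean) / sd for c in covariate]
    lo, hi = -20.0, 20.0
    for _ in range(60):
        mid = (lo + hi) / 2.0
        if sum(_logistic(mid + slope * zi) for zi in z) / n < proportion:
            lo = mid
        else:
            hi = mid
    intercept = (lo + hi) / 2.0
    return [None if rng.random() < _logistic(intercept + slope * zi) else s
            for s, zi in zip(scores, z)]
```

Passing a seeded `random.Random` instance as `rng` keeps the simulated patterns reproducible across the different analysis approaches.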


Agreement: The ABCD respondents with missing EQ-5D-5L (40.3%) had significantly poorer self-rated health and lower academic achievement. All 4 approaches estimated similar baseline scores (ABCD≈0.798). At follow-up, approach-1 resulted in the highest mean scores (ABCD=0.792) while approach-4 produced the lowest scores (ABCD=0.765). The largest slope of change was observed for approach-4 (visit1–visit3: -0.027), while the smallest slope was observed for approach-2 (visit3–visit1: -0.011).

Validation: SE and MSE increased with increasing percentages of simulated missing GR data. All approaches showed similar SE and MSE (SE: 0.006-0.011; MSE: 0.032-0.033); however, approach-4 yielded the least accurate predictions, underestimating the score.

Discussion: In these data, complete case analyses overestimated the scores and MM after MI by items yielded the lowest scores. As there was no loss of accuracy, MM without MI, when baseline covariates are complete, might be the most parsimonious choice to deal with missing data. However, MI may be needed when baseline covariates are missing and/or more than two timepoints are considered.

Exploring missing patterns and missingness mechanisms in longitudinal patient-reported outcomes using data from a non-randomized controlled trial study
Pimrapat Gebert1,2,3, Daniel Schindel1, Johann Frick1, Liane Schenk1, Ulrike Grittner2,3
1Charité – Universitätsmedizin Berlin, corporate member of Freie Universität Berlin, Humboldt-Universität zu Berlin, and Berlin Institute of Health, Institute of Medical Sociology and Rehabilitation Science; 2Charité – Universitätsmedizin Berlin, corporate member of Freie Universität Berlin, Humboldt-Universität zu Berlin, and Berlin Institute of Health, Institute of Biometry and Clinical Epidemiology; 3Berlin Institute of Health (BIH), Anna-Louisa-Karsch-Str. 2, 10178 Berlin, Germany

The missing data mechanism plays an important role in handling and analyzing data with missing values. Longitudinal patient-reported outcome measures (PROMs) are usually far from complete, especially in seriously ill patients. To choose an appropriate strategy for handling missing data, most statistical approaches require knowledge about missingness patterns and assumptions about the type of missing data mechanism. We demonstrate how to explore the missingness patterns and mechanisms using PROMs data from the Oncological Social Care Project (OSCAR) study, including global health status/QoL (GH/QoL) in the EORTC QLQ-C30, the Patient Reaction Assessment (PRA-D), the Revised Illness Perception Questionnaire (IPQ-R), the German modified version of the Autonomy Preference Index (API-DM), the Decision Conflict Scale (DCS), and the European Health Literacy Survey (HLS-EU-Q6). Linear random-effects pattern-mixture models were fitted to identify missing not at random (MNAR) mechanisms for each pattern. We found that the missing data on the GH/QoL in the EORTC QLQ-C30 could be assumed to be MNAR, with missingness due to massive worsening of health status and death. However, there was no evidence of MNAR in any of the other PROMs. Although determining the true missing data mechanism is impossible, a pattern-mixture model can be useful in evaluating the effects of informative missingness in longitudinal PROMs.