Internal validation for descriptive clustering of gene expression data
Anastasiia Holovchak (Bernd-Streitberg-Preis)
LMU Munich, Germany
Cluster algorithms are often used to analyse gene expression data to partition the genes into homogenous groups based on their expression levels across patients. In practice, one is confronted with a large variety of clustering algorithms and it is often unclear which should be selected. A common procedure consists of testing different algorithms with several input parameters and evaluating them with appropriate internal cluster validation indices. However, it is again unclear which of these indices should be selected.
In this work, I conduct a study that investigates the stability of four internal cluster validation indices (Calinski-Harabasz index, Davies-Bouldin index, Dunn Index, and Average Silhouette Width criterion), in particular their ability to identify clusterings that replicate on independent test data. For the purpose of this study, an example gene expression data set is repeatedly split into a training and a test data set. Several commonly used clustering algorithms such as K-means, agglomerative clustering algorithms (Single Linkage, Complete Linkage, and Average Linkage), and spectral clustering algorithm are applied to the training data. The resulting clusterings are assessed using the four internal validation indices under consideration. The cluster methods are then applied to the test data and the similarity between the index values for the clusterings on the training and on the test data is assessed. I analyse whether the cluster algorithms and input parameters that are indicated as the best choices by the internal validation indices on the training data are also the best choices on the test data. Moreover, the internal validation indices are used to choose the best clustering on the training data and the stability of this selection process is investigated by applying the selected algorithm/parameter setting on the test data (as measured through the adjusted Rand index).
The results may guide the selection of appropriate indices in the considered context of gene expression data. For example, in this study the Dunn index yields very unstable results in terms of the selection of the best input parameter, which can be seen as an inconvenience. In conclusion, the investigated internal cluster validation indices show very different behaviours and one should not put much confidence in a single validation index unless there is evidence – from the literature or from own investigations such as the one presented in this thesis – that it yields meaningful replicable results in the considered context.
Model selection characteristics when using MCP-Mod for dose-response gene expression data
Julia Christin Duda (Bernd-Streitberg-Preis)
TU Dortmund University, Germany
Classical approaches in clinical dose-finding trials rely on pairwise comparisons between doses and placebo. A methodological improvement is the MCP-Mod (Multiple Comparison Procedure and Modeling) approach, originally developed for Phase II trials. MCP-Mod combines multiple comparisons with modeling approaches in a multistage procedure. First, for a set of pre-specified candidate models, it is tested if any dose-response signal is present. Second, considering models with detected signal, either the best model is selected to fit the dose-response curve or model averaging is performed.
We extend the scope of application for MCP-Mod to in-vitro gene expression data and assess its characteristics regarding model selection for concentration gene expression curves. Precisely, we apply MCP-Mod on single genes of a high-dimensional gene expression data set, where human embryonic stem cells were exposed to eight concentration levels of the compound valproic acid (VPA). As candidate models we consider the sigmoid Emax (four-parameter log-logistic), linear, quadratic, Emax, exponential and beta model. Through simulations, we investigate the impact of omitting one or more models from the candidate model set to uncover possibly superfluous models and the precision and recall rates of selected models. Measured by the AIC, all models perform best for a considerable number of genes. For less noisy cases the popular sigmoid Emax model is frequently selected. For more noisy data, often simpler models like the linear model are selected, but mostly without relevant performance advantage compared to the second-best model. Also, the commonly used Emax model has an unexpected low performance.
Temporal Dynamics in Generative Models
Maren Hackenberg (Bernd-Streitberg-Preis)
Institute of Medical Biometry and Statistics, Faculty of Medicine and Medical Center, University of Freiburg, Germany
Uncovering underlying development patterns in longitudinal biomedical data is a first step towards understanding disease processes, but is complicated by the sparse time grid and individual-specific development patterns that often characterize such data. In epidemiological cohort studies and clinical registries, we are facing the question of what can be learned from the data in an early phase of the study, when only a baseline characterization and one follow-up measurement are available. Specifically, we considered a data scenario where an extensive characterisation is available at a baseline time point for each individual, but only a smaller subset of variables is measured again at an individually differing second time point, resulting in a very sparse (only two time points) and irregular time grid.
Inspired by recent advances that allow to combine deep learning with dynamic modeling, we employed a generative deep learning model to capture individual dynamics in a low-dimensional latent representation as solutions of ordinary differential equations (ODEs). Here, the variables measured only at baseline are used to infer individual-specific ODE parameters.
Additionally, we enriched the information of each individual by linking groups of individuals with similar underlying trajectories, which then serve as proxy information on the common temporal dynamics. Irregular spacing in time can thus be used to gain more information on individual dynamics by leveraging individuals’ similarity. Using simulated data, we showed that the model can recover individual trajectories from linear and non-linear ODE systems with two and four unknown parameters and infer groups of individuals with similar trajectories. The results illustrate that dynamic deep learning approaches can be adapted to such small data settings to provide an individual-level understanding of the dynamics governing individuals’ developments.
Discrete Subdistribution Hazard Models
Moritz Berger (Gustav-Adolf-Lienert-Preis)
Rheinische Friedrich-Wilhelms-Universität Bonn, Germany
In many clinical and epidemiological studies the interest is in the analysis of the time T until the occurrence of an event of interest j that may occur along with one or more competing events. This requires suitable techniques for competing risks regression. The key quantity to describe competing risks data is the cumulative incidence function, which is defined in terms of the probability of experiencing j at or before time t.
A popular modeling approach for the cumulative incidence function is the proportional subdistribution hazard model by Fine and Gray (1999), which is a direct modeling approach for the cumulative incidence function of one specific event of interest. A limitation of the subdistribution hazard model is that it assumes continuously measured event times. In practice, however, the exact (continuous) event times are often not recorded. Instead, it may only be known that the events occurred between pairs of consecutive points in time (i.e., within pre-specified follow-up intervals). In these cases, time is measured on a discrete scale.
To address this issue, a technique for modeling subdistribution hazards with right-censored data in discrete time is proposed. The method is based on a weighted maximum likelihood estimation scheme for binary regression and results in consistent and asymptotically normal estimators of the model parameters. In addition, a set of tools to assess the calibration of discrete subdistribution hazard models is developed. They consist of a calibration plot for graphical assessments as well as a recalibration model including tests on calibration-in-the-large and refinement.
The modeling approach is illustrated by an analysis of nosocomial pneumonia in intensive care patients measured on a daily basis.
Netboost: Network Analysis Improves High-Dimensional Omics Analysis Through Local Dimensionality Reduction
Pascal Schlosser (Gustav-Adolf-Lienert-Preis)
Institute of Genetic Epidemiology, Faculty of Medicine and Medical Center – University of Freiburg, Germany
State-of-the art selection methods fail to identify weak but cumulative effects of features found in many high-dimensional omics datasets. Nevertheless, these features play an important role in certain diseases. We present Netboost, a three-step dimension reduction technique. First, a boosting- or Spearman-correlation-based filter is combined with the topological overlap measure to identify the essential edges of the network. Second, sparse hierarchical clustering is applied on the selected edges to identify modules and finally module information is aggregated by the first principal components. We demonstrate the application of the newly developed Netboost in combination with CoxBoost for survival prediction of DNA methylation and gene expression data from 180 acute myeloid leukemia (AML) patients and show, based on cross-validated prediction error curve estimates, its prediction superiority over variable selection on the full dataset as well as over an alternative clustering approach. The identified signature related to chromatin modifying enzymes was replicated in an independent dataset, the phase II AMLSG 12-09 study. In a second application we combine Netboost with Random Forest classification and improve the disease classification error in RNA-sequencing data of Huntington’s disease mice. Netboost is a freely available Bioconductor R package for dimension
Estimands and Causality
Johns Hopkins Bloomberg School of Public Health, USA
Closing: Andreas Faldum, Werner Brannath / Annette Kopp-Schneider
Jan Beyersmann (Ulm University), Oliver Kuß (Düsseldorf), Andreas Wienke (Halle)
Do we still need hazard ratios? (I)
German Diabetes Center, Leibniz Institute for Diabetes Research at Heinrich Heine University Düsseldorf, Institute for Biometrics and Epidemiology
It is one of the phenomenons in biostatistics that regression models for continuous, binary, nominal, or ordinal outcomes almost completely rely on parametric modelling, whereas survival or time-to-event outcomes are mainly analyzed by the Proportional Hazards (PH) model of Cox, which is an essentially non-parametric method. The Cox model has become one of the most used statistical models in applied research and the original article from 1972 ranks below the top 100 papers (in terms of citation frequency) across all areas of science.
However, the Cox model and the hazard ratio have also been criticized recently. For example, researchers have been warned to use the magnitude of the HR to describe the magnitude of the relative risk, because the hazard ratio is a ratio of rates, and not one of risks. Hazard ratios, even in randomized trials, have a built-in “selection bias”, because they are conditional measures, conditioning at each time point on the set of observations which is still under risk. Finally, the hazard ratio has been criticized for being non-collapsible. That is, adjusting for a covariate that is associated with the event will in general change the HR, even if this covariate is not associated with the exposure, that is, is no confounder.
In view of these disadvantages it is surprising that parametric survival models are not preferred over the Cox model. These existed long before the Cox model, are easier to comprehend, estimate, and communicate, and, above all, do not have any of the disadvantages mentioned.
Do we still need hazard ratios? (II)
Ulm University, Germany
The answer to the question whether we need hazard ratios depends to a good deal on the answer to the question what we need hazards for. Censoring plays a key role. Censoring makes survival and event history analysis special. One important consequence is that less customized statistical techniques will be biased when applied to censored data. Another important consequence is that hazards remain identifiable under rather general censoring mechanisms. In this talk, I will demonstrate that there is a Babylonian confusion on „independent censoring“ in the textbook literature, which is a worry in its own right. Event-driven trials in pharmaceutical research or competing risks are two examples where the textbook literature often goes haywire, censoring-wise. It is a small step from this mess to misinterpretations of hazards, a challenge not diminished when the aim is a causal interpretation. Causal reasoning, however, appears to be spearheading the current attack on hazards and their ratios.
In philosophy, causality has pretty much been destroyed by David Hume. This does not imply that statisticians should avoid causal reasoning, but it might suggest some modesty. In fact, statistical causality is mostly about interventions, and a causal survival analysis often aims at statements about the intervention „do(no censoring)“, which, however, is not what identifiability of hazards is about. The current debate about estimands (in time-to-event trials) is an example where these issues are hopelessly mixed up.
The aim of this talk is to mix it up a bit further or, perhaps, even shed some light. Time permitting, I will illustrate matters using g-computation in the form of a causal variant of the Aalen-Johansen-estimator.
Open questions to genetic epidemiologists
Universität zu Lübeck, Germany
Given the rapid pace with which genomics and other ‐ omics disciplines are evolving, it is sometimes necessary to shift down a gear to consider more general scientific questions. In this line, we can formulate a number of questions for genetic epidemiologists to ponder on. These cover the areas of reproducibility, statistical significance, chance findings, precision medicine and overlaps with related fields such as bioinformatics and data science. Importantly, similar questions are being raised in other biostatistical fields. Answering these requires to think outside the box and to learn from other, related, disciplines. From that, possible hints at responses are presented to foster the further discussion of these topics.
Pgainsim: A method to assess the mode of inheritance for quantitative trait loci in genome-wide association studies
Nora Scherer1, Peggy Sekula1, Peter Pfaffelhuber2, Anna Köttgen1, Pascal Schlosser1
1Institute of Genetic Epidemiology, Faculty of Medicine and Medical Center – University of Freiburg, Germany; 2Faculty of Mathematics and Physics, University of Freiburg, Germany
Background: When performing genome-wide association studies (GWAS) conventionally an additive genetic model is used to explore whether a SNP is associated with a quantitative trait regardless of the actual mode of inheritance (MOI). Recessive and dominant genetic models are able to improve statistical power to identify non-additive variants. Moreover, the actual MOI is of interest for experimental follow-up projects. Here, we extend the concept of the p-gain statistic  to decide whether one of the three models provides significantly more information than the others.
Methods: We define the p-gain statistic of a genetic model by the comparison of the association p-value of the model with the smaller of the two p-values of the other models. Considering the p-gain as a random variable depending on a trait and a SNP in Hardy-Weinberg equilibrium under the null hypothesis of no genetic association we show that the distribution of the p-gain statistic depends only on the allele frequency (AF).
To determine critical values where the opposing modes can be rejected, we developed the R-package pgainsim (https://github.com/genepi-freiburg/pgainsim). First, the p-gain is simulated under the null hypothesis of no genetic association for a user-specified study size and AF. Then the critical values are derived as the observed quantiles of the empirical density of the p-gain. For applications with extensive multiple testing, the R-package provides an extension of the empirical critical values by a log-linear interpolation of the quantiles.
Results: We tested our method in the German Chronic Kidney Disease study with urinary concentrations of 1,462 metabolites with the goal to identify non-additive metabolite QTLs. For each metabolite we conducted a GWAS under the three models and identified 119 independent mQTLs for which pval_rec or pval_dom<4.6e-11 and pval_add>min(pval_rec,pval_dom). For 38 of these, the additive modelling was rejected based on the p-gain statistics after a Bonferroni adjustment for 1 Mio*549*2 tests. These included the LCT locus with a known dominant MOI, as well as several novel associations. A simulation study for additive and recessive associations with varying effect sizes evaluating false positive and false negative rates of the approach is ongoing.
Conclusion: This new extension of the p-gain statistic allows for differentiating MOIs for QTLs considering their AF and the study sample size, even in a setting with extensive multiple testing.
 Petersen, A. et al. (2012) On the hypothesis-free testing of metabolite ratios in genome-wide and metabolome-wide association studies. BMC Bioinformatics 13, 120.
Genome-wide conditional independence testing with machine learning
Marvin N. Wright1, David S. Watson2,3
1Leibniz Institute for Prevention Research and Epidemiology – BIPS, Bremen, Germany; 2Oxford Internet Institute, University of Oxford, Oxford, UK; 3Queen Mary University of London, London, UK
In genetic epidemiology, we are facing extremely high dimensional data and complex patterns such as gene-gene or gene-environment interactions. For this reason, it is promising to use machine learning instead of classical statistical methods to analyze such data. However, most methods for statistical inference with machine learning test against a marginal null hypothesis and by that cannot handle correlated predictor variables.
Building on the knockoff framework of Candès et al. (2018), we propose the conditional predictive impact (CPI), a provably consistent and unbiased estimator of a variables‘ association with a given outcome, conditional on a reduced set of predictor variables. The method works in conjunction with any supervised learning algorithm and loss function. Simulations confirm that our inference procedures successfully control type I error and achieve nominal coverage probability with greater power than alternative variable importance measures and other nonparametric tests of conditional independence. We apply our method to a gene expression dataset on breast cancer. Further, we propose a modification which avoids the computation of the high-dimensional knockoff matrix and is computationally feasible on data from genome-wide association studies.
Candès, E., Fan, Y., Janson, L. and Lv, J. (2018). Panning for gold: ‘model-X’ knockoffs for high dimensional controlled variable selection. J Royal Stat Soc Ser B Methodol 80:551–577
The key distinction between Association and Causality exemplified by individual ancestry proportions and gallbladder cancer risk in Chileans
Justo Lorenzo Bermejo, Linda Zollner
Statistical Genetics Research Group, Institute of Medical Biometry and Informatics, University of Heidelberg, Germany
Background: The translation of findings from observational studies into improved health policies requires further investigation of the type of relationship between the exposure of interest and particular disease outcomes. Observed associations can be due not only to underlying causal effects, but also to selection bias, reverse causation and confounding.
As an example, we consider the association between the proportion of Native American ancestry and the risk of gallbladder cancer (GBC) in genetically admixed Chileans. Worldwide, Chile shows the highest incidence of GBC, and the risk of this disease has been associated with the individual proportion of Native American – Mapuche ancestry. However, Chileans with large proportions of Mapuche ancestry live in the south of the country, have poorer access to the health system and could be exposed to distinct risk factors. We conducted a Mendelian Randomization (MR) study to investigate the causal relationship “Mapuche ancestry → GBC risk”.
Methods: To infer the potential causal effect of specific risk factors on health-related outcomes, MR takes advantage of the random inheritance of genetic variants and utilizes instrumental variables (IVs):
1. associated with the exposure of interest
2. independent of possible confounders of the association between the exposure and the outcome
3. independent of the outcome given the exposure and the confounders
Given the selected IVs meet the above assumptions, various MR approaches can be used to test causality, for example the inverse variance weighted (IVW) method.
In our example, we took advantage of ancestry informative markers (AIMs) with distinct allele frequencies in Mapuche and other components of the Chilean genome, namely European, African and Aymara-Quechua ancestry. After checking that the AIMs fulfilled the required assumptions, we utilized them as IVs for the individual proportion of Mapuche ancestry in two-sample MR (sample 1: 1,800 Chileans from the whole country, sample 2: 250 Chilean case-control pairs).
Results: We found strong evidence for a causal effect of Mapuche ancestry on GBC risk: IVW OR per 1% increase in the Mapuche proportion 1.02, 95%CI (1.01-1.03), Pval = 0.0001. To validate this finding, we performed several sensitivity analyses including radial MR and different combinations of genetic principal components to rule out population stratification unrelated to Mapuche ancestry.
Conclusion: Causal inference is key to unravel disease aetiology. In the present example, we demonstrate that Mapuche ancestry is causally linked to GBC risk. This result can now be used to refine GBC prevention programs in Chile.