Bioinformatics

AI Models for Multi-Modal Data Integration
Holger Fröhlich
Fraunhofer Gesellschaft e.V.

Precision medicine aims to deliver the right treatment to the right patient. One important goal is therefore to identify molecular sub-types of diseases, which opens the opportunity for better targeted therapies in the future. In this context, high-throughput omics data have been used extensively. However, analysis of one omics data type alone provides only a very limited view of a complex and systemic disease such as cancer. Consequently, parallel analysis of multiple omics data types is needed and is employed more and more routinely. However, leveraging the full potential of multi-omics data requires statistical data fusion, which brings a number of unique challenges, including differences in data types (e.g. numerical vs. discrete), scale, data quality and dimension (e.g. hundreds of thousands of SNPs vs. a few hundred miRNAs).

In the first part of my talk I will focus on Pathway-based Multi-modal AutoEncoders (PathME) as one possible approach for multi-omics data integration. PathME relies on a multi-modal sparse denoising autoencoder architecture to embed multiple omics types that can be mapped to the same biological pathway. We show that sparse non-negative matrix factorization applied to such embeddings results in well-discriminated disease subtypes in several cancer types, which show clinically distinct features. Moreover, each of these subtypes can be associated with subtype-specific pathways, and for each of these pathways it is possible to disentangle the influence of individual omics features, hence providing a rich interpretation.
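The two-stage idea, a nonlinear embedding followed by non-negative matrix factorization for subtyping, can be illustrated with a minimal sketch. This is not PathME itself: it assumes non-negative, pathway-level embeddings are already available (here simulated), uses plain multiplicative-update NMF without the sparsity penalty, and the names `nmf` and `subtype` are hypothetical.

```python
import numpy as np

def nmf(X, k, n_iter=200, eps=1e-9, seed=0):
    """Multiplicative-update NMF: X ~ W @ H with all factors non-negative."""
    rng = np.random.default_rng(seed)
    n, m = X.shape
    W = rng.random((n, k)) + eps
    H = rng.random((k, m)) + eps
    for _ in range(n_iter):
        H *= (W.T @ X) / (W.T @ W @ H + eps)   # update H holding W fixed
        W *= (X @ H.T) / (W @ H @ H.T + eps)   # update W holding H fixed
    return W, H

# hypothetical pathway-level embeddings (patients x pathway scores),
# simulated as three patient blocks with different magnitudes
rng = np.random.default_rng(1)
E = np.vstack([rng.random((20, 8)) + off for off in (0.0, 2.0, 4.0)])

W, H = nmf(E, k=3)
subtype = W.argmax(axis=1)   # subtype label per patient from the factor loadings
```

In PathME the factorization is additionally sparsity-regularized, which is what ties each subtype to a small set of pathways; the plain NMF above only conveys the clustering step.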

Going one step further, in the second part of my talk I will focus on Variational Autoencoder Modular Bayesian Networks (VAMBN), a merger of Bayesian networks and variational autoencoders to model multiple data modalities (including clinical assessment scores), also in a longitudinal manner. I will specifically demonstrate the application of VAMBN for modeling entire clinical studies in Parkinson’s Disease (PD). Since VAMBN is generative, the model can be used to simulate synthetic patients, also under counterfactual scenarios (e.g. an age shift by 20 years, or a modification of disease severity at baseline), which could facilitate the design of clinical studies, enable sharing of data under probabilistic privacy guarantees, and eventually allow finding “patients-like-me” within a broader, virtually merged meta-cohort.
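As a highly simplified illustration of the simulation idea (not the VAMBN model itself), the sketch below fits a single linear-Gaussian dependency of a severity score on age in a simulated cohort and then draws synthetic patients under a counterfactual age shift. All variable names, coefficients and distributional choices are assumptions made for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

# hypothetical observed cohort: age (years) and a baseline severity score
age = rng.normal(65, 8, 500)
severity = 0.4 * age - 10 + rng.normal(0, 2, 500)

# minimal linear-Gaussian "network" with one edge: age -> severity
slope, intercept = np.polyfit(age, severity, 1)
resid_sd = np.std(severity - (slope * age + intercept))

def simulate(n, age_shift=0.0):
    """Draw synthetic patients; age_shift implements a counterfactual scenario."""
    a = rng.normal(65, 8, n) + age_shift
    s = slope * a + intercept + rng.normal(0, resid_sd, n)
    return a, s

a0, s0 = simulate(1000)                    # synthetic cohort as observed
a1, s1 = simulate(1000, age_shift=20)      # counterfactual: cohort 20 years older
```

In VAMBN each node of the Bayesian network is itself a variational-autoencoder module over a group of variables, so the same sample-under-intervention logic applies to whole modalities rather than single scalar variables.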


Evaluation of augmentation techniques for high-dimensional gene expression data for the purpose of fitting artificial neural networks
Magdalena Kircher, Jessica Krepel, Babak Saremi, Klaus Jung
University of Veterinary Medicine Hannover, Foundation, Germany

Background:

High-throughput transcriptome expression data from DNA microarrays or RNA-seq are regularly checked for their ability to classify samples. However, with the further densification of transcriptomic data and a low number of observations – due to a lack of available biosamples, prohibitive costs and ethical reasons – the ratio between the number of variables and the number of available observations is usually very large. As a consequence, classifier performance estimated from training data often tends to be overestimated and lacks robustness. It has been demonstrated in many applications that data augmentation can improve the robustness of artificial neural networks. Data augmentation for high-dimensional gene expression data has, however, received very little study so far.

Methods:

We investigate the applicability and capacity of two data augmentation approaches, including generative adversarial networks (GANs), which have been widely used for augmenting image datasets. The augmentation methods are compared on public example data sets from infection research. Besides neural networks, we evaluate the effect of the augmentation techniques on the performance of linear discriminant analysis and support vector machines.

Results and Outlook:

First results of a 10-fold cross-validation show increased accuracy, sensitivity, specificity and predictive values when using augmented data sets compared to classifier models based on the original data only. A simple augmentation approach based on mixed observations shows similar performance to the computationally more expensive approach with GANs. Further evaluations are currently running to better understand the detailed performance of the augmentation techniques.
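Augmentation by mixed observations can be sketched as convex combinations of same-class expression profiles, in the spirit of mixup. The function below is a hypothetical illustration of this idea, not the authors' implementation, and the mixing weight drawn from a Beta distribution is an assumed design choice.

```python
import numpy as np

def mix_augment(X, y, n_new, alpha=0.3, seed=0):
    """Create n_new samples as convex combinations of two same-class profiles."""
    rng = np.random.default_rng(seed)
    X_new, y_new = [], []
    for _ in range(n_new):
        c = rng.choice(np.unique(y))                          # pick a class
        i, j = rng.choice(np.where(y == c)[0], 2, replace=False)
        lam = rng.beta(alpha, alpha)                          # mixing weight in (0, 1)
        X_new.append(lam * X[i] + (1 - lam) * X[j])
        y_new.append(c)
    return np.asarray(X_new), np.asarray(y_new)

# toy expression matrix: 40 samples x 1000 genes, two balanced classes
rng = np.random.default_rng(1)
X = rng.normal(size=(40, 1000))
y = np.repeat([0, 1], 20)

X_aug, y_aug = mix_augment(X, y, n_new=60)
```

Because each synthetic profile lies on the line segment between two real profiles of the same class, the augmented data stay within the observed expression range, which is one plausible reason this cheap scheme can compete with GAN-based augmentation.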


NetCoMi: Network Construction and Comparison for Microbiome Data in R
Stefanie Peschel1, Christian L. Müller2,3,4, Erika von Mutius1,5,6, Anne-Laure Boulesteix7, Martin Depner1
1Institute of Asthma and Allergy Prevention, Helmholtz Zentrum München, German Research Center for Environmental Health, Neuherberg, Germany; 2Department of Statistics, Ludwig-Maximilians-Universität München, Munich, Germany; 3Institute of Computational Biology, Helmholtz Zentrum München, German Research Center for Environmental Health, Neuherberg, Germany; 4Center for Computational Mathematics, Flatiron Institute, New York, USA; 5Dr von Hauner Children’s Hospital, Ludwig-Maximilians-Universität München, Munich, Germany; 6Comprehensive Pneumology Center Munich (CPC-M), Member of the German Center for Lung Research, Munich, Germany; 7Institute for Medical Information Processing, Biometry and Epidemiology, Ludwig-Maximilians-Universität München, Munich, Germany

Background:

Network analysis methods are suitable for investigating the microbial interplay within a habitat. Since microbial associations may change between conditions, e.g. between health and disease state, comparing microbial association networks between groups might be useful. For this purpose, the two networks are constructed separately, and either the resulting associations themselves or the network’s properties are compared between the two groups.

Estimating associations for sequencing data is challenging due to their special characteristics – that is, sparsity with a high number of zeros, high dimensionality, and compositionality. Several association measures taking these features into account have been published during the last decade. Furthermore, several network analysis tools, methods for comparing network properties between two or more groups, as well as approaches for constructing differential networks are available in the literature. However, no unifying tool covering the whole process of constructing, analyzing and comparing microbial association networks between groups has been available so far.

Methods:

We provide the R package "NetCoMi", which implements this whole workflow: starting from a read count matrix originating from a sequencing process, through network construction, up to a statement of whether single associations, local network characteristics, the determined clusters, or even the overall network structure differ between the groups. For each of the aforementioned steps, a selection of existing methods suitable for application to microbiome data is included. In particular, the function for network construction offers many different approaches, including methods for treating zeros in the data, normalization, computing microbial associations, and sparsifying the resulting association matrix. NetCoMi can be used either for constructing, analyzing and visualizing a single network, or for comparing two networks in a graphical as well as a quantitative manner, including statistical tests.
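NetCoMi itself is an R package; as a language-neutral sketch of the core construction steps it implements (zero treatment via a pseudo count, a transform addressing compositionality, association estimation, and sparsification), consider the following Python illustration. The centered log-ratio transform, correlation measure and hard threshold used here are illustrative choices, not necessarily NetCoMi's defaults.

```python
import numpy as np

def clr(counts, pseudo=1.0):
    """Centered log-ratio transform; the pseudo count handles zeros."""
    logc = np.log(counts + pseudo)
    return logc - logc.mean(axis=1, keepdims=True)   # center within each sample

def association_network(counts, threshold=0.3):
    """Correlation on clr-transformed counts, sparsified by hard thresholding."""
    Z = clr(counts)
    A = np.corrcoef(Z, rowvar=False)       # taxa x taxa association matrix
    A[np.abs(A) < threshold] = 0.0         # sparsification: drop weak associations
    np.fill_diagonal(A, 0.0)               # no self-edges
    return A

# toy read count matrix: 100 samples x 30 taxa
rng = np.random.default_rng(0)
counts = rng.poisson(5, size=(100, 30))

A = association_network(counts)            # adjacency matrix of the network
```

For a two-group comparison, the same construction would be run on each group's count matrix separately, after which single edges, centralities, clusterings or global properties of the two resulting matrices are compared.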

Results:

We illustrate the application of our package using a real data set from the GABRIELA study [1] to compare microbial associations in settled dust from children’s rooms between samples from two study centers. The examples demonstrate how our proposed graphical methods uncover genera with different characteristics (e.g. a different centrality) between the groups, similarities and differences between the clusterings, as well as differences among the associations themselves. These descriptive findings are confirmed by quantitative output, including a statement of whether the results are statistically significant.

[1] Jon Genuneit, Gisela Büchele, Marco Waser, Katalin Kovacs, Anna Debinska, Andrzej Boznanski, Christine Strunz-Lehner, Elisabeth Horak, Paul Cullinan, Dick Heederik, et al. The GABRIEL advanced surveys: study design, participation and evaluation of bias. Paediatric and Perinatal Epidemiology, 25(5):436–447, 2011.


Comparison of merging strategies for building machine learning models on multiple independent gene expression data sets
Jessica Krepel, Magdalena Kircher, Moritz Kohls, Klaus Jung
University of Veterinary Medicine Hannover, Foundation, Germany

Background:

Microarray and RNA-seq experiments allow the simultaneous collection of gene expression data for several thousand genes, which can be used to address a wide range of biological questions. Nowadays, gene expression data are available in public databases for many biological and medical research questions. Oftentimes, several independent studies are performed on the same or a similar research question. Combining these studies offers several benefits compared to individual analyses. Several approaches for combining independent gene expression data sets have already been proposed in the context of differential gene expression analysis and gene set enrichment analysis. Here, we compare different strategies for combining independent data sets for the purpose of classification analysis.

Methods:

We considered only the two-group design, e.g. with class labels diseased and healthy. The information of the individual studies can be aggregated at different stages of the analysis. We examined three different merging pipelines with regard to the stage of the analysis at which merging is conducted, namely the direct merging of the data sets (strategy A), the merging of the trained classification models (strategy B), and the merging of the classification results (strategy C). We combined the merging pipelines with different classification methods: linear discriminant analysis (LDA), support vector machines (SVM), and artificial neural networks (ANN). Within each merging strategy, we performed a differential gene expression analysis for dimension reduction to select a set of genes that we then used as the feature subset in the classification. We trained and evaluated the classification models on several data subsets in the form of a 10-fold cross-validation. We first performed a simulation study with purely artificial data, and secondly a study based on a real-world data set from the public repository ArrayExpress that we artificially split into two studies.
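Strategies A (data merging) and C (results merging) can be sketched as follows, using a simple nearest-centroid rule as a stand-in classifier (the study itself evaluated LDA, SVM and ANN). All function names and the simulated study data are illustrative assumptions.

```python
import numpy as np

def nc_fit(X, y):
    """Nearest-centroid 'classifier': one centroid per class."""
    classes = np.unique(y)
    return classes, np.array([X[y == c].mean(axis=0) for c in classes])

def nc_predict(model, X):
    classes, centroids = model
    d = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return classes[d.argmin(axis=1)]

def strategy_A(studies, X_test):
    """Strategy A: merge the data sets first, then train a single classifier."""
    X = np.vstack([s[0] for s in studies])
    y = np.concatenate([s[1] for s in studies])
    return nc_predict(nc_fit(X, y), X_test)

def strategy_C(studies, X_test):
    """Strategy C: train one classifier per study, then merge the predictions."""
    preds = np.array([nc_predict(nc_fit(X, y), X_test) for X, y in studies])
    return (preds.mean(axis=0) > 0.5).astype(int)   # majority vote; labels 0/1

# simulated two-group studies: 30 samples x 50 genes, class shift of 1.0
rng = np.random.default_rng(0)
def make_study(n=30, p=50, shift=1.0):
    y = np.repeat([0, 1], n // 2)
    X = rng.normal(size=(n, p)) + shift * y[:, None]
    return X, y

studies = [make_study() for _ in range(3)]
X_test, y_test = make_study()
```

Strategy B (merging the trained models themselves, e.g. by averaging their parameters) is omitted here because it is specific to the classifier family and does not reduce to a few lines.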

Results:

With respect to classification accuracy, we found that the strategy of data merging outperformed the strategy of results merging in most of our simulation scenarios with artificial data. Especially when the number of studies is high and the differentiability between the groups is low, strategy A appears to perform best. Strategy A outperformed the other two merging approaches particularly clearly when four independent studies were aggregated, compared to scenarios with only two independent studies.


Evaluating the quality of synthetic SNP data from deep generative models under sample size constraints
Jens Nußberger, Frederic Boesel, Stefan Lenz, Harald Binder, Moritz Hess
Universitätsklinikum Freiburg, Germany

Synthetic data, such as those generated by deep generative models, are increasingly considered for exchanging biomedical data, such as single nucleotide polymorphism (SNP) data, under privacy constraints. This requires that the employed model has learned the joint distribution of the data sufficiently well. A major limiting factor here is the number of empirical observations available for training. Until now, there has been little evidence of how well the predominant generative approaches, namely variational autoencoders (VAEs), deep Boltzmann machines (DBMs) and generative adversarial networks (GANs), learn the joint distribution of the target data under sample size constraints. Using simulated SNP data and data from the 1000 Genomes Project, we here provide results from an in-depth evaluation of VAEs, DBMs and GANs. Specifically, we investigate how well pair-wise co-occurrences of variables in the investigated SNP data, quantified as odds ratios (ORs), are recovered in the synthetic data generated by these approaches. For simulated as well as 1000 Genomes SNP data, we observe that DBMs can generally recover structure for up to 300 SNPs. However, we also observe a tendency to over-estimate ORs when the DBMs are not carefully tuned. VAEs generally get the direction and relative strength of pairwise ORs right but tend to under-estimate their magnitude. GANs perform well only when larger sample sizes are employed and when there are strong pairwise associations in the data. In conclusion, DBMs are well suited for generating synthetic observations for binary omics data, such as SNP data, under sample size constraints. VAEs perform better at smaller sample sizes but are limited with respect to learning the absolute magnitude of pairwise associations between variables. GANs require large amounts of training data and likely a careful selection of hyperparameters.
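The evaluation criterion, pairwise odds ratios between binary variables, can be sketched as follows. The toy data and the Haldane-style pseudo-count correction (to avoid division by zero for empty cells) are illustrative assumptions; a faithful generative model should reproduce in its synthetic samples roughly the same OR as computed on the real data.

```python
import numpy as np

def pairwise_odds_ratio(a, b, pseudo=0.5):
    """Odds ratio of co-occurrence for two binary SNP vectors (2x2 table)."""
    n11 = np.sum((a == 1) & (b == 1)) + pseudo
    n00 = np.sum((a == 0) & (b == 0)) + pseudo
    n10 = np.sum((a == 1) & (b == 0)) + pseudo
    n01 = np.sum((a == 0) & (b == 1)) + pseudo
    return (n11 * n00) / (n10 * n01)

# toy "real" data: two binary SNPs where y mostly copies x (strong association)
rng = np.random.default_rng(0)
x = rng.integers(0, 2, 2000)
y = np.where(rng.random(2000) < 0.8, x, rng.integers(0, 2, 2000))
z = rng.integers(0, 2, 2000)               # an independent SNP for contrast

or_assoc = pairwise_odds_ratio(x, y)       # far above 1 for associated SNPs
or_indep = pairwise_odds_ratio(x, z)       # close to 1 for independent SNPs
```

Comparing the full matrix of pairwise ORs between the training data and the model's synthetic output makes the reported failure modes concrete: over-estimation (poorly tuned DBMs) shows up as ORs systematically further from 1, under-estimation (VAEs) as ORs shrunk toward 1.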