Statistical Machine Learning I

Chairs: Matthias Schmid and Thomas Welchowski

Interpretable Machine Learning
Bernd Bischl
Ludwig-Maximilians-Universität München

Adapting Variational Autoencoders for Realistic Synthetic Data with Skewed and Bimodal Distributions
Kiana Farhadyar, Harald Binder
Faculty of Medicine and Medical Center – University of Freiburg, Germany

Background: Sharing synthetic data instead of the original data with other researchers is an option when data protection restrictions exist. Such data should preserve the statistical relationships between the variables while protecting privacy. In recent years, deep generative models have enabled significant progress in synthetic data generation, and variational autoencoders (VAEs) are a popular class of such models. Standard VAEs are typically built around a latent space with a Gaussian distribution, which becomes a key challenge when they encounter more complex data distributions, such as bimodal or skewed data.

Methods: In this work, we propose a novel method for synthetic data generation that also handles bimodal and skewed data while keeping the overall VAE framework. Moreover, this method can generate synthetic data for datasets consisting of both continuous and binary variables. We apply two transformations to convert the data into a form that is more compatible with VAEs. First, we use Box-Cox transformations to bring skewed distributions closer to symmetry. Then, to deal with potentially bimodal data, we employ a power function sgn(x)|x|^p that transforms the data so that the peaks lie closer together and the tails are lighter. For the evaluation, we use a simulation design based on a large breast cancer study, and the International Stroke Trial (IST) dataset as a real data example.
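The two pre-transformations named above can be sketched in a few lines of Python. This is only a minimal illustration, not the authors' implementation: the function names, the centering step before the power function, and the way lam and p would be chosen are all assumptions, since the abstract does not specify them. The sketch shows that both steps are invertible, so values generated by a VAE on the transformed scale can be mapped back to the original scale.

```python
import math


def box_cox(x, lam):
    """Box-Cox transform of a strictly positive value x with parameter lam."""
    if x <= 0:
        raise ValueError("Box-Cox requires strictly positive input")
    if lam == 0:
        return math.log(x)
    return (x ** lam - 1.0) / lam


def inverse_box_cox(y, lam):
    """Inverse of box_cox for the same lam."""
    if lam == 0:
        return math.exp(y)
    return (lam * y + 1.0) ** (1.0 / lam)


def signed_power(x, p):
    """The power function sgn(x)|x|^p from the abstract."""
    return math.copysign(abs(x) ** p, x)


def inverse_signed_power(y, p):
    """Inverse of signed_power for the same p (p must be nonzero)."""
    return math.copysign(abs(y) ** (1.0 / p), y)


def forward(x, lam, center, p):
    """Hypothetical pipeline: Box-Cox, then centering (an assumption), then the power function."""
    return signed_power(box_cox(x, lam) - center, p)


def backward(y, lam, center, p):
    """Undo forward(), mapping a transformed value back to the original scale."""
    return inverse_box_cox(inverse_signed_power(y, p) + center, lam)


if __name__ == "__main__":
    # Illustrative parameter values only; they are not taken from the abstract.
    lam, center, p = 0.5, 1.0, 0.5
    for x in [0.5, 1.0, 3.0, 10.0]:
        y = forward(x, lam, center, p)
        print(x, y, backward(y, lam, center, p))
```

In practice lam would typically be estimated per variable, and binary variables would bypass both steps; how the authors handle either point is not stated here.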

Results: We show that the pre-transformations can considerably improve the utility of synthetic data for skewed and bimodal distributions. We compare our method with standard VAEs, a VAE with an autoregressive implicit quantile network (AIQN) approach, and generative adversarial networks (GANs). Our method is the only one of these that reproduces bimodality; the other methods typically generate unimodal distributions. For skewed data, the other methods reduce the skewness of the synthetic data, making it closer to a symmetric distribution, while our method produces skewness similar to that of the original data and respects the value range of the original data better.

Conclusion: We developed a simple method that adapts VAEs through transformations to handle skewed and bimodal data. Due to its simplicity, it can be combined with many extensions of VAEs. Thus, it becomes feasible to generate high-quality synthetic clinical data for research under data protection constraints.

Statistical power for cell identity detection in deep generative models
Martin Treppner1,2, Harald Binder1,2
1Institute of Medical Biometry and Statistics, Faculty of Medicine and Medical Center – University of Freiburg, Germany; 2Freiburg Center of Data Analysis and Modelling, Mathematical Institute – Faculty of Mathematics and Physics, University of Freiburg, Germany

One of the most common applications of single-cell RNA-sequencing experiments is to discover groups of cells with a similar expression profile in an attempt to define cell identities. The similarity of these expression profiles is typically examined in a low-dimensional latent space, which can be learned by deep generative models such as variational autoencoders (VAEs). However, the quality of representations in VAEs varies greatly depending on the number of cells under study, which is also reflected in the assignment to specific cell identities. We propose a strategy to determine how many cells are needed so that a pre-specified percentage of the cells in the latent space is well represented.

We train VAEs on a varying number of cells and evaluate the quality of the learned representations using the estimated log-likelihood lower bound of each cell. The distribution of these log-likelihood values is then compared to a permutation-based distribution of log-likelihoods. We generate the permutation-based distribution by randomly drawing a small subset of cells before training the VAE and permuting each gene’s expression values among these randomly drawn cells. By doing so, we ensure that the latent representation’s overall structure is preserved, and at the same time, we obtain a null distribution for the log-likelihoods. We then compare log-likelihood distributions for different numbers of cells. We also harness the generative properties of VAEs to artificially increase the number of samples in small datasets by generating synthetic data and combining them with the original pilot datasets.
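The permutation step described above can be illustrated with a short Python sketch. It is a minimal reading of the text rather than the authors' code: the function name, the list-of-lists data layout, and the seed handling are assumptions. A small random subset of cells is drawn, and each gene's values are shuffled independently among those cells, which keeps every gene's marginal distribution within the subset but breaks the dependence between genes. All other cells stay untouched.

```python
import random


def permute_genes_in_subset(expr, n_subset, seed=0):
    """Shuffle each gene's values among a random subset of cells.

    expr is a list of cells, each a list of per-gene expression values
    (hypothetical layout). Returns the modified copy and the indices of
    the permuted cells.
    """
    rng = random.Random(seed)
    n_cells = len(expr)
    n_genes = len(expr[0])
    chosen = rng.sample(range(n_cells), n_subset)

    permuted = [row[:] for row in expr]  # copy so the input stays unchanged
    for g in range(n_genes):
        values = [expr[i][g] for i in chosen]
        rng.shuffle(values)  # independent shuffle per gene
        for i, v in zip(chosen, values):
            permuted[i][g] = v
    return permuted, chosen


if __name__ == "__main__":
    # Toy matrix: 6 cells, 3 genes.
    toy = [[i, 10 * i, 100 * i] for i in range(6)]
    permuted, chosen = permute_genes_in_subset(toy, n_subset=3, seed=1)
    print("permuted cells:", sorted(chosen))
    for row in permuted:
        print(row)
```

The per-cell log-likelihood lower bounds of the permuted cells, obtained after training the VAE on the modified data, would then serve as the null distribution; that training step is outside the scope of this sketch.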

We demonstrate performance on varying sizes of subsamples of the Tabula Muris scRNA-seq dataset from the brain of seven mice processed with the SMART-Seq2 protocol. We show that our approach can be used to plan cell numbers for single-cell RNA-seq experiments, which might improve the reliability of downstream analyses such as cell identity detection and inference of developmental trajectories.

Individualizing deep dynamic models for psychological resilience data
Göran Köber1,2, Shakoor Pooseh2,3, Haakon Engen4, Andrea Chmitorz5,6,7, Miriam Kampa5,8,9, Anita Schick4,10, Alexandra Sebastian6, Oliver Tüscher5,6, Michèle Wessa5,11, Kenneth S.L. Yuen4,5, Henrik Walter12,13, Raffael Kalisch4,5, Jens Timmer2,3,14, Harald Binder1,2
1Institute of Medical Biometry and Statistics, Faculty of Medicine and Medical Center, University of Freiburg, Germany; 2Freiburg Center of Data Analysis and Modelling (FDM), University of Freiburg, Freiburg, 79104, Germany; 3Institute of Physics, University of Freiburg, 79104, Germany; 4Neuroimaging Center (NIC), Focus Program Translational Neuroscience (FTN), Johannes Gutenberg University Medical Center, Mainz, 55131, Germany; 5Leibniz Institute for Resilience Research (LIR), Mainz, 55122, Germany; 6Department of Psychiatry and Psychotherapy, Johannes Gutenberg University Medical Center, Mainz, 55131, Germany; 7Faculty of Social Work, Health and Nursing, University of Applied Sciences Esslingen, Esslingen, 73728, Germany; 8Department of Clinical Psychology, University of Siegen, 57076, Germany; 9Bender Institute of Neuroimaging (BION), Department of Psychology, Justus Liebig University, Gießen, 35394, Germany; 10Department of Public Mental Health, Central Institute of Mental Health, Medical Faculty Mannheim, Heidelberg University, Germany; 11Department of Clinical Psychology and Neuropsychology, Institute of Psychology, Johannes Gutenberg University, Mainz, 55131, Germany; 12Research Division of Mind and Brain, Charité–Universitätsmedizin Berlin, Corporate Member of Freie Universität Berlin, Humboldt-Universität zu Berlin, and Berlin Institute of Health, Germany; 13Berlin School of Mind and Brain, Humboldt-Universität zu Berlin, Germany; 14CIBSS—Centre for Integrative Biological Signaling Studies, University of Freiburg, 79104, Germany

Deep learning approaches can uncover complex patterns in data. In particular, variational autoencoders (VAEs) achieve this by a non-linear mapping of data into a low-dimensional latent space. Motivated by an application to psychological resilience in the Mainz Resilience Project (MARP), which features intermittent longitudinal measurements of stressors and mental health, we propose an approach for individualized, dynamic modeling in this latent space. Specifically, we utilize ordinary differential equations (ODEs) and develop a novel technique for obtaining person-specific ODE parameters even in settings with a rather small number of individuals and observations, incomplete data, and a differing number of observations per individual. This technique allows us to subsequently investigate individual reactions to stimuli, such as the mental health impact of stressors. A potentially large number of baseline characteristics can then be linked to this individual response by regularized regression, e.g., for identifying resilience factors. Thus, our new method provides a way of connecting different kinds of complex longitudinal and baseline measures via individualized, dynamic models. The promising results obtained in the exemplary resilience application indicate that our proposal for dynamic deep learning might also be more generally useful for other application domains.
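As a rough illustration of what person-specific ODE parameters in a latent space could look like, the following Python sketch integrates a linear system dz/dt = A z + c with a simple Euler scheme and evaluates it at each individual's own, irregular observation times. Everything here is an assumption made for illustration: the linear form of the ODE, the Euler solver, the function name, and the parameter values are not taken from the abstract, which does not describe the model at this level of detail.

```python
def euler_trajectory(z0, A, c, times, dt=0.01):
    """Integrate dz/dt = A z + c from t = 0 with Euler steps.

    Returns the latent state at each requested time, in ascending
    order of time. A is a list of rows, c and z0 are lists.
    """
    z = list(z0)
    dim = len(z)
    t = 0.0
    states = []
    for target in sorted(times):
        while t < target - 1e-12:
            h = min(dt, target - t)  # shorten the last step to hit the target
            dz = [sum(A[i][j] * z[j] for j in range(dim)) + c[i] for i in range(dim)]
            z = [z[i] + h * dz[i] for i in range(dim)]
            t += h
        states.append(list(z))
    return states


if __name__ == "__main__":
    # Two hypothetical individuals with their own parameters and their own
    # (differing numbers of) observation times.
    persons = {
        "person_a": {"A": [[-0.5, 0.2], [0.0, -0.3]], "c": [0.1, 0.0], "times": [0.5, 2.0, 3.5]},
        "person_b": {"A": [[-1.0, 0.0], [0.4, -0.8]], "c": [0.0, 0.2], "times": [1.0, 4.0]},
    }
    for name, spec in persons.items():
        traj = euler_trajectory([1.0, 0.0], spec["A"], spec["c"], spec["times"])
        print(name, traj)
```

Because the trajectory is defined in continuous time, individuals with different numbers of observations and different measurement times can be handled by the same function.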