Nonparametric Statistics and Multivariate Analysis

Chairs: Frank Konietschke and Markus Pauly


Marginalized Frailty-Based Illness-Death Model: Application to the UK-Biobank Survival Data
Malka Gorfine
Tel Aviv University, Israel

The UK Biobank is a large-scale health resource comprising genetic, environmental, and medical information on approximately 500,000 volunteer participants in the United Kingdom, recruited at ages 40–69 during the years 2006–2010. The project monitors the health and well-being of its participants. This work demonstrates how these data can be used to yield the building blocks for an interpretable, semiparametric risk-prediction model based on known genetic and environmental risk factors of various chronic diseases, such as colorectal cancer. An illness-death model is adopted, which is inherently a semi-competing risks model, since death can censor the disease, but not vice versa. Using a shared-frailty approach to account for the dependence between time to disease diagnosis and time to death, we provide a new illness-death model that assumes Cox models for the marginal hazard functions. The recruitment procedure used in this study raises three challenges. First, it introduces delayed entry into the data. Second, information from both prevalent and incident cases must be combined. Third, no deaths are observed before the minimal recruitment age of 40. In this work, we provide an estimation procedure for our new illness-death model that overcomes all of these challenges.


Distribution-free estimation of the partial AUC in diagnostic studies
Maximilian Wechsung
Charité – Universitätsmedizin Berlin, Germany

The problem of partial area under the curve (pAUC) estimation arises in diagnostic studies in which only part of the receiver operating characteristic (ROC) curve of a diagnostic test with continuous outcome can be evaluated. Typically, the investigator is bound by economic as well as ethical considerations to analyze only that part of the ROC curve which includes true positive rates above and false positive rates below certain thresholds. The pAUC is the area under this partial ROC curve. It can be used to evaluate the performance of a diagnostic test with continuous outcome. In our talk, we consider a distribution-free estimator of the pAUC and establish its asymptotic distribution. The results can be used to construct statistical tests to compare the performance of different diagnostic tests.
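
To make the quantity concrete, the following minimal sketch computes a plug-in pAUC from the empirical ROC curve. It is an illustration under stated assumptions, not the estimator presented in the talk: the function name `empirical_pauc`, the parameters `fpr_max` and `tpr_min`, and the choice to measure the area between the ROC curve and the horizontal line at `tpr_min` over false positive rates up to `fpr_max` are all hypothetical, and ties between the two groups are not given special treatment.

```python
import numpy as np

def empirical_pauc(nondiseased, diseased, fpr_max=1.0, tpr_min=0.0):
    """Plug-in partial AUC from the empirical ROC step function.

    Assumed definition (illustrative only): the area between the empirical
    ROC curve and the line TPR = tpr_min, restricted to FPR <= fpr_max.
    Larger test values are assumed to indicate disease.
    """
    # Non-diseased scores in descending order act as decision thresholds.
    x = np.sort(np.asarray(nondiseased, dtype=float))[::-1]
    y = np.asarray(diseased, dtype=float)
    m = x.size

    # Empirical TPR at each threshold: share of diseased scores above it.
    tpr = np.array([(y > c).mean() for c in x])

    # The k-th threshold covers the FPR interval [(k-1)/m, k/m).
    left = np.arange(m) / m
    right = np.arange(1, m + 1) / m

    # Keep only the part of each interval below fpr_max.
    width = np.clip(np.minimum(right, fpr_max) - left, 0.0, None)
    # Keep only the part of the ROC height above tpr_min.
    height = np.clip(tpr - tpr_min, 0.0, None)
    return float(np.sum(width * height))

# Hypothetical toy data for illustration only.
controls = [1.0, 2.0, 3.0, 4.0]
cases = [2.5, 3.5, 5.0]
full_auc = empirical_pauc(controls, cases)
partial = empirical_pauc(controls, cases, fpr_max=0.5)
```

With `fpr_max=1.0` and `tpr_min=0.0` and no ties, the sketch reduces to the Mann-Whitney estimate of the full AUC, which is a convenient sanity check.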


Ranking Procedures for the Factorial Repeated Measures Design with Missing Data – Estimation, Testing and Asymptotic Theory
Kerstin Rubarth, Frank Konietschke
Charité Berlin, Germany

A commonly used design in health, medical and biomedical research is the repeated measures design. Often, a parametric model is used for the analysis of such data. However, if the sample size is rather small, or if the data are skewed or measured on an ordinal scale, a nonparametric approach fits the data better than a classic parametric approach such as a linear mixed model. Another issue that naturally arises when dealing with clinical or pre-clinical data is the occurrence of missing data. Unless an imputation technique is applied, most methods can only use complete cases. The newly developed ranking procedure is a flexible method for general non-normal, ordinal, ordered categorical and even binary data. In the presence of missing data, it uses all available information instead of only the information obtained from complete cases. The hypotheses are defined in terms of the nonparametric relative effect and can be tested by using quadratic test procedures as well as the multiple contrast test procedure. Additionally, the framework allows for the incorporation of clustered data within the repeated measurements. An example of clustered data is an animal study in which several animals share the same cage and are therefore clustered within that cage. Simulation studies indicate good performance in terms of the type-I error rate and the power under different alternatives with a missing rate of up to 30%, even for non-normal data. A real data example illustrates the application of the proposed methodology.
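
As a rough illustration of the nonparametric relative effect with missing values, the sketch below estimates, for each time point, the probability that an observation at that time point exceeds an observation drawn from the unweighted mean of all time-point distributions, counting ties as one half. It is a simplified sketch and not the authors' procedure: the function name `relative_effects`, the coding of missing values as NaN, and the pairwise all-available-observations estimator are assumptions made for illustration, and no test statistic or variance estimator is included.

```python
import numpy as np

def relative_effects(data):
    """Estimate unweighted relative effects for repeated measures.

    data: 2-D array, rows = subjects, columns = time points,
          missing values coded as NaN (an assumption of this sketch).
    Returns one estimate per time point; values above 0.5 indicate a
    tendency toward larger observations at that time point.
    """
    data = np.asarray(data, dtype=float)
    d = data.shape[1]

    # Use every available observation per time point, not only complete cases.
    samples = [data[~np.isnan(data[:, j]), j] for j in range(d)]

    effects = np.empty(d)
    for j, xj in enumerate(samples):
        pairwise = []
        for xi in samples:
            # Estimate P(X_i < X_j) + 0.5 * P(X_i = X_j) over all pairs.
            diff = xj[:, None] - xi[None, :]
            pairwise.append(np.mean((diff > 0) + 0.5 * (diff == 0)))
        # Average over reference time points (unweighted mean distribution).
        effects[j] = np.mean(pairwise)
    return effects

# Hypothetical toy data: 4 subjects, 3 time points, one missing value each.
toy = [[1.0, 2.0, 3.0],
       [2.0, np.nan, 4.0],
       [3.0, 4.0, np.nan],
       [np.nan, 5.0, 6.0]]
effects = relative_effects(toy)
```

By construction the estimates average to 0.5 across time points, so deviations from 0.5 can be read as shifts relative to the mean distribution.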


A cautionary tale on using imputation methods for inference in a matched pairs design
Burim Ramosaj, Lubna Amro, Markus Pauly
TU Dortmund University, Germany

Imputation procedures have become standard statistical practice in biomedical fields, since further analyses can then be conducted as if no values had been missing. In particular, non-parametric imputation schemes such as the random forest, or its combination with stochastic gradient boosting, have shown favorable imputation performance compared to the more traditionally used MICE procedure. However, their effect on valid statistical inference has not been analyzed so far. This gap is closed by investigating their validity for inferring mean differences in incompletely observed pairs and contrasting them with a recent approach that works only with the observations at hand. Our findings indicate that machine learning schemes for (multiply) imputing missing values heavily inflate the type-I error rate in small to moderate samples of matched pairs, even after modifying the test statistics using Rubin’s multiple imputation rule. In addition to an extensive simulation study, an illustrative data example from a breast cancer gene study is considered.