Zuqi Li1, Sonja Katz2,3,4*, Gennady V. Roshchupkin2,5
1 Laboratory for Systems Medicine, Department of Human Genetics, KU Leuven, Leuven, Belgium
2 Department of Radiology and Nuclear Medicine, Erasmus MC, Rotterdam, The Netherlands
3 Laboratory of Systems and Synthetic Biology, Wageningen University & Research, Wageningen, The Netherlands
4 LifeGlimmer GmbH, Berlin, Germany
5 Department of Epidemiology, Erasmus MC, Rotterdam, The Netherlands
sonja.katz [at] wur.nl
Abstract
Unsupervised learning, particularly clustering, is crucial for disease subtyping and patient stratification. With the availability of large-scale multi-omics data, clustering algorithms can be empowered by deep learning models, e.g. variational autoencoders (VAEs), to exploit between-individual heterogeneity. However, the impact of confounders (external factors unrelated to the condition, e.g. batch effects and age) on clustering is often overlooked, which introduces bias and can lead to spurious biological conclusions.
We proposed four VAE-based deconfounding approaches utilizing multi-omics data: i) removal of latent features correlated with confounders, ii) a conditional variational autoencoder (cXVAE), iii) adversarial training, and iv) adding a regularization term to the loss function. Based on real-life multi-omics data from The Cancer Genome Atlas, we simulated various confounding effects (linear, non-linear, categorical, and a combination of these) and evaluated each model’s performance across 50 repetitions based on reconstruction error, clustering stability, and deconfounding effectiveness, measured via the adjusted Rand index (ARI).
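As an illustration of approach i), the following is a minimal sketch of filtering latent dimensions by their correlation with a confounder. It is not the implementation used in the study: the function names, the toy data, and the cut-off of 0.3 on the absolute Pearson correlation are all assumptions made for this example, and a Pearson-based filter of this kind would only detect linear association with a continuous confounder.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def drop_confounded_features(latent, confounder, threshold=0.3):
    """Return (filtered latent matrix, indices of the kept columns).

    latent:     list of rows (samples x latent dimensions)
    confounder: one numeric value per sample
    threshold:  hypothetical cut-off on |r|, not taken from the study
    """
    n_dims = len(latent[0])
    keep = [
        j for j in range(n_dims)
        if abs(pearson([row[j] for row in latent], confounder)) < threshold
    ]
    return [[row[j] for j in keep] for row in latent], keep


# Toy data (hypothetical): latent dimension 0 is a linear function of age,
# latent dimension 1 alternates in sign and is nearly uncorrelated with age.
age = [float(a) for a in range(30, 80)]
latent = [[0.05 * a, (-1.0) ** i] for i, a in enumerate(age)]

filtered, kept = drop_confounded_features(latent, age)
print(kept)  # indices of the latent dimensions retained for clustering
```

In this toy setting only the second latent dimension survives the filter; clustering would then be run on `filtered` instead of the full latent space.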
We demonstrated a substantial impact of the artificially introduced confounder effect on patient clustering (ARI: 0.33±0.12), yet several of the proposed models effectively mitigated this effect, with cXVAE clearly outperforming the other frameworks (ARI: 0.66±0.07). cXVAE not only accurately recovers true patient labels but also reveals meaningful pathological associations among cancer types, reinforcing the validity of the deconfounded representation. Conversely, our study showed that some of the proposed strategies, such as adversarial training, are incapable of sufficiently removing confounders.
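For readers less familiar with the metric, the ARI compares two partitions of the same samples (for instance, cluster assignments against reference labels), equals 1 for identical partitions up to relabeling, and is close to 0 for random agreement. The sketch below computes it from the standard pair-counting formula using only the Python standard library; the function name and the toy label vectors are illustrative and are not data from the study.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index between two labelings of the same samples."""
    n = len(labels_a)
    if n != len(labels_b) or n < 2:
        raise ValueError("need two labelings of the same >= 2 samples")

    # Pairs of samples that share a cluster in both, in A only, and in B only.
    sum_ij = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    sum_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_b).values())

    expected = sum_a * sum_b / comb(n, 2)  # agreement expected by chance
    max_index = (sum_a + sum_b) / 2        # upper bound on the agreement
    if max_index == expected:              # degenerate case, e.g. one cluster each
        return 1.0
    return (sum_ij - expected) / (max_index - expected)


# Toy labels (hypothetical): the same grouping under different cluster names,
# and a grouping that splits every reference cluster in half.
reference = [0, 0, 1, 1]
relabeled = [1, 1, 0, 0]
crossed = [0, 1, 0, 1]

ari_same = adjusted_rand_index(reference, relabeled)
ari_crossed = adjusted_rand_index(reference, crossed)
```

Because the index is corrected for chance, it can become negative when two partitions agree less than expected at random, as is the case for `crossed` here.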
Our study contributes to delivering accurate patient subgrouping by (i) proposing novel frameworks for simultaneous multi-omics data integration, dimensionality reduction, and deconfounding of clustering, and (ii) benchmarking these frameworks on open-access data to aid fellow researchers in selecting an appropriate framework that they can readily apply in health-related settings.
Keywords: deep learning, autoencoder, multi-omics, confounders, clustering
Acknowledgement: We thank the supporters of this study, namely the European Union’s Horizon 2020 Marie Skłodowska-Curie grant agreement (860895). We would also like to extend our gratitude to the co-authors who made this work possible: Edoardo Saccenti (Wageningen University & Research; The Netherlands), David W. Fardo (University of Kentucky; The United States), Peter Claes (KU Leuven, University Hospitals Leuven; Belgium), Vitor A.P. Martins dos Santos (Wageningen University & Research; The Netherlands, LifeGlimmer GmbH; Germany), and Kristel Van Steen (KU Leuven, University of Liege; Belgium, University of Kentucky; The United States).