Machine learning methods for metabolite biomarker detection

Miličić Lucija1*, Kovačević Jovana1,2 and Kovačević Vladimir2

1 Faculty of Mathematics, University of Belgrade, Belgrade, Serbia

2 Institute for Artificial Intelligence Research and Development of Serbia, Novi Sad, Serbia

lucija.milicic [at] matf.bg.ac.rs

Abstract

Metabolites provide a unique view of the state of the entire organism. These small molecules, produced by cellular processes, can serve as indicators of significant changes in the body. Recent technological advances have enabled the measurement of up to a thousand metabolites from a blood sample, which has paved the way for their use as biomarkers or indicators of therapeutic activity. The resulting metabolomic data require specialized statistical and machine learning techniques suited to datasets with a large number of features.

In this study we propose a methodology for processing metabolomics datasets whose samples originate from groups with different phenotypes (e.g. a disease group and a control group) and for detecting metabolites that are candidates for potential biomarkers. We used preeclampsia datasets as a case study to test the methodology. The research focus was to determine whether any of the measured metabolites could indicate the onset of preeclampsia and, if so, to identify the most significant ones.

Our approach was to develop an XGBoost classifier that predicts whether a patient has preeclampsia based on measured metabolite concentrations. To address the high dimensionality of the dataset, the mRMR (minimum Redundancy Maximum Relevance) feature selector was applied. The resulting model, which achieved an accuracy of 0.74 and a ROC-AUC score of 0.8 on the test data, was used to identify the most important features, which represent potential biomarker candidates. Statistical tests (the t-test and the Mann-Whitney U test) additionally confirmed a significant difference in the distribution of the concentrations of these metabolites between patients with and without preeclampsia. We detected increased concentrations of specific fatty acids, along with cortisol, the stress hormone, in patients with preeclampsia. Further research will focus on understanding the mechanisms underlying these changes and their clinical relevance.
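To illustrate the feature-selection step only, the following is a minimal sketch of greedy mRMR in its difference form, not the implementation used in the study. It makes several assumptions: absolute Pearson correlation is swapped in for mutual information as the measure of both relevance and redundancy, the function names (pearson, mrmr_select) are hypothetical, and the metabolite names and values are invented toy data.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def mrmr_select(features, labels, k):
    """Greedily pick up to k feature names, maximising relevance minus redundancy.

    features: dict mapping a feature name to its list of values (one per sample).
    labels:   list of 0/1 class labels (e.g. control / disease), one per sample.
    Assumption: |Pearson r| stands in for mutual information.
    """
    # Relevance: association between each feature and the class label.
    relevance = {name: abs(pearson(vals, labels)) for name, vals in features.items()}
    selected = []
    remaining = set(features)

    while remaining and len(selected) < k:

        def score(name):
            if not selected:
                return relevance[name]
            # Redundancy: mean association with the features already selected.
            redundancy = sum(
                abs(pearson(features[name], features[s])) for s in selected
            ) / len(selected)
            return relevance[name] - redundancy

        # Sorting makes the choice deterministic if two scores tie.
        best = max(sorted(remaining), key=score)
        selected.append(best)
        remaining.remove(best)

    return selected


# Invented toy data: 4 control samples (0) followed by 4 disease samples (1).
labels = [0, 0, 0, 0, 1, 1, 1, 1]
toy = {
    "met_a": [1.0, 1.2, 0.9, 1.1, 2.0, 2.2, 1.9, 2.1],       # strongly relevant
    "met_a_like": [1.1, 1.2, 1.0, 1.1, 1.9, 2.3, 1.8, 2.2],  # near-duplicate of met_a
    "met_b": [1.6, 1.2, 0.8, 0.4, 2.2, 1.8, 1.4, 1.0],       # weaker, less redundant
    "noise": [1.0, 2.0, 1.0, 2.0, 1.0, 2.0, 1.0, 2.0],       # unrelated to the label
}

print(mrmr_select(toy, labels, 2))
```

On this toy data the selector picks met_a first and then prefers met_b over met_a_like: the near-duplicate is individually more relevant, but it adds little beyond what met_a already carries. In a pipeline of the kind described above, the selected columns would then be passed to the classifier.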

Keywords: bioinformatics, data mining, machine learning, metabolomics, preeclampsia

Acknowledgement: The authors thank The Chinese University of Hong Kong for collecting and publishing the metabolomics data used in this research.