Feature Selection For Multi-Source SCT Data

Saša Malkov1*, Nenad Mitić1

1 Faculty of Mathematics, University of Belgrade, Belgrade, Serbia

sasa.malkov [at] matf.bg.ac.rs

Abstract

Bioinformatics experiments often produce large data sets with many samples and many attributes, and hence data of high dimensionality. Even dimensions that have little significance for a specific analysis can prove useful in complex data processing, and advanced data mining techniques and AI algorithms typically benefit from high-dimensional data. However, with too many dimensions we run into the curse of dimensionality: an abundance of dimensions adds complexity and cost to data handling and processing, and can lead to overfitting of the model to less important dimensions. To make data processing more efficient and to improve the quality of the resulting models, we often need dimensionality reduction.

Single-cell transcriptomics (SCT) is one of the most important sequencing technologies today. It enables simultaneous measurement of the activity of thousands of genes in individual cells, resulting in RNA profiles of the cells. Such profiles allow researchers to analyze the physiological activity of cells under different circumstances, including different biochemical conditions as well as different health conditions of the subjects. By processing large numbers of cells, SCT yields large numbers of samples. Each detectable RNA represents one dimension of the data, which ultimately gives us thousands of dimensions. Moreover, initial cell conditions, complex cell preparation techniques, and RNA measurement methods can vary considerably, resulting in significant differences between data coming from different sources.

Here we discuss the feature selection problem using the example of more than 120,000 instances of peripheral blood mononuclear cell (PBMC) SCT data from four different sources, in which 30,698 genes (dimensions) are detected. We pay special attention to the imbalanced nature of the data and consider feature selection methods that yield an unbiased set of significant features. We show that statistical correlation-based feature selection, with some support from mutual information-based techniques, can yield a method of reasonable complexity for selecting a high-quality feature set.

Keywords: bioinformatics, feature selection, statistical correlation, mutual information, transcriptomics data.
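
To make the idea in the abstract concrete, the following is a minimal illustrative sketch, not the authors' actual method. It assumes a hypothetical binary label per cell and toy expression values, weights samples inversely to class frequency to counter imbalance, and ranks genes by a weighted correlation and by a weighted mutual information computed on gene detection (expression > 0). All function names, the weighting scheme, and the rule for combining the two rankings are assumptions made for illustration only.

```python
import math
import random
from collections import Counter


def balanced_weights(labels):
    """Weight each sample inversely to its class frequency (weights sum to 1)."""
    counts = Counter(labels)
    return [1.0 / (len(counts) * counts[y]) for y in labels]


def weighted_abs_correlation(x, y, w):
    """Absolute weighted Pearson correlation between a feature and a numeric label."""
    mx = sum(wi * xi for wi, xi in zip(w, x))
    my = sum(wi * yi for wi, yi in zip(w, y))
    cov = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
    vx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
    vy = sum(wi * (yi - my) ** 2 for wi, yi in zip(w, y))
    if vx == 0 or vy == 0:
        return 0.0  # a constant feature carries no linear signal
    return abs(cov) / math.sqrt(vx * vy)


def weighted_mutual_information(x, y, w):
    """Weighted mutual information (nats) between 'gene detected' (x > 0) and the label."""
    joint, px, py = Counter(), Counter(), Counter()
    for xi, yi, wi in zip(x, y, w):
        joint[(xi > 0, yi)] += wi
    for (a, b), p in joint.items():
        px[a] += p
        py[b] += p
    return sum(p * math.log(p / (px[a] * py[b]))
               for (a, b), p in joint.items() if p > 0)


def select_features(columns, labels, k):
    """Union of the top-k features by correlation and the top-k by mutual information."""
    w = balanced_weights(labels)
    scores = [(j,
               weighted_abs_correlation(col, labels, w),
               weighted_mutual_information(col, labels, w))
              for j, col in enumerate(columns)]
    top_r = sorted(scores, key=lambda s: -s[1])[:k]
    top_mi = sorted(scores, key=lambda s: -s[2])[:k]
    return sorted({s[0] for s in top_r} | {s[0] for s in top_mi})


# Toy, imbalanced data (hypothetical): 90 cells of class 0, 10 cells of class 1.
rng = random.Random(0)
labels = [0] * 90 + [1] * 10
columns = [
    [2.0 if y == 1 else 0.0 for y in labels],  # gene 0: expressed only in class 1
    [rng.random() for _ in labels],            # gene 1: noise
    [rng.random() for _ in labels],            # gene 2: noise
]
selected = select_features(columns, labels, k=1)
```

The class-balanced weights are one simple way to keep the majority class from dominating the scores; other choices, such as resampling, would serve the same purpose.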