Mouse Tissue of Origin Single Cell Classification System

Sen Lin1, Vladimir Brusić2 and Tianyi Qiu3

1 School of Computer Science, University of Nottingham Ningbo, Ningbo, China

2 School of Economics, University of Nottingham Ningbo, Ningbo, China

3 Institute of Clinical Science, Zhongshan Hospital, Fudan University, Shanghai, China

SEN.LIN2 [at] nottingham.edu.cn and vladimir.brusic [at] nottingham.edu.cn

Abstract

Single cell transcriptomics (scRNA-seq) technology can concurrently measure gene expression in hundreds of thousands of individual cells. The aim of this project is to build a system that classifies the tissue and organ of origin of single cell types and subtypes in mouse samples. The classification system maps the cells using scRNA-seq data and supervised machine learning methods.

The first step was to develop a hierarchical mouse cell type map to capture the biology and heterogeneity of different cell types. The second step was to build mouse scRNA-seq reference datasets containing high-quality, well-annotated scRNA-seq data representing different mouse strains, animal ages, sexes, and biological conditions. Third, this work established computational workflows that integrated standardization, quality control, clustering, annotation, and classification model building for scRNA-seq data. The standardization work involved a data processing protocol that mapped different gene versions, names, quantities, and data formats to standardized formats. Quality control involved data visualization, filtering out errors and uninformative measurements based on cell distribution, detecting outliers, and using standard gene markers to filter out unrelated cells. The final step was to build a classification system based on supervised machine learning. This step used a feed-forward multilayer artificial neural network with logistic regression learning algorithms. The resulting model, built using approximately 117 thousand cells from 15 different tissues, achieved a classification accuracy of 93.6%. Most misclassified cells were immune cells, which are known to migrate across tissues and organs. The accuracy of classifying the tissue of origin of immune cells varied by tissue: it was high (>95%) for aorta, heart, islets, liver, pancreas, and thymus, and lower for blood, colon, lung, pancreatic lymph nodes, small intestine, and spleen.
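The overall and per-tissue accuracies reported above can be illustrated with a minimal sketch. It is not the authors' evaluation code: the function name `accuracy_by_tissue` and the toy labels are hypothetical, and per-tissue accuracy is assumed here to mean the fraction of cells from a given tissue that the classifier assigns back to that tissue.

```python
from collections import Counter


def accuracy_by_tissue(true_labels, predicted_labels):
    """Return (overall_accuracy, {tissue: accuracy}) for paired label lists.

    Per-tissue accuracy is the fraction of cells truly from a tissue
    that were predicted as that tissue (an assumed definition).
    """
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    if not true_labels:
        raise ValueError("at least one cell is required")

    # Number of cells per true tissue of origin.
    totals = Counter(true_labels)
    # Number of correctly classified cells per true tissue of origin.
    correct = Counter(
        t for t, p in zip(true_labels, predicted_labels) if t == p
    )

    per_tissue = {tissue: correct[tissue] / n for tissue, n in totals.items()}
    overall = sum(correct.values()) / len(true_labels)
    return overall, per_tissue


# Hypothetical toy labels for six immune cells; not data from this study.
true_tissue = ["liver", "liver", "spleen", "spleen", "blood", "thymus"]
pred_tissue = ["liver", "liver", "blood", "spleen", "spleen", "thymus"]

overall, per_tissue = accuracy_by_tissue(true_tissue, pred_tissue)
for tissue, acc in sorted(per_tissue.items()):
    print(f"{tissue}: {acc:.2f}")
print(f"overall: {overall:.3f}")
```

Reporting accuracy per tissue alongside the overall figure makes it visible when errors concentrate in a few tissues, as they do for migrating immune cells.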

This research demonstrated that supervised machine learning methods could achieve high accuracy in classifying mouse single cell types from most tissues and organs.

Keywords: bioinformatics, data mining, computer science, single cell transcriptome, machine learning

Acknowledgement: This work was supported by the University of Nottingham Ningbo High-Flyer PhD Scholarship.