Improvement of PBMC Cell Types Classification in Healthy Samples

Minjie Lyu1, Lin Xin1, Lou T. Chitkushev2, Guanglan Zhang2, Derin B. Keskin3 and Vladimir Brusić1*

1 University of Nottingham, Ningbo, China

2 Boston University, Boston, United States

3 Dana-Farber Cancer Institute, Boston, United States

vladimir.brusic [at] nottingham.edu.cn

Abstract

Peripheral blood mononuclear cells (PBMC) are used in the study of the immune system, infectious diseases, and vaccine development. PBMC are composed of six major cell types: B cells, T cells, natural killer cells, monocytes, classical dendritic cells, and plasmacytoid dendritic cells. Single-cell transcriptomics (SCT) is an emerging technology that concurrently measures gene expression in tens or even hundreds of thousands of individual cells. It provides a higher resolution of gene expression measurement than traditional bulk sequencing. 10x Genomics is an SCT technology that can capture more than 100,000 cells in a single study. Labelling cell types is a crucial first step in most SCT studies, but it is impractical to label more than ten thousand cells manually. Supervised machine learning methods such as artificial neural networks (ANN) are suitable for cell type classification. However, the previously reported state-of-the-art accuracy of PBMC classification was below 80%, which made it unsuitable for labelling PBMC from new datasets, whereas the latest methods reach an accuracy of 95% or more. Most classification errors are caused by the use of datasets in which cell types are incorrectly labelled.

We collected datasets from the 10x Genomics demonstration databases, including 28 PBMC datasets generated from healthy donors. Datasets were standardized to a common gene list containing 30,698 genes. Quality control (QC) was performed to eliminate dead or low-quality cells, and more than 250,000 cells passed our QC metrics. Four prediction methods (ANN classification, ANN super-class classification, profile-based prediction, and protein-marker-based prediction) were combined to label the cell types. To evaluate the results, we trained ANN models on data from one dataset and tested them on the remaining 27 datasets. The average classification accuracy was 98.46%. Datasets with high-accuracy cell type labels can then serve as references for high-accuracy cell type classification of healthy PBMC.
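
The train-on-one, test-on-the-rest evaluation described above can be sketched as follows. This is a minimal illustration and not the authors' implementation: the function names (train_one_test_rest, fit_centroids, predict_centroids) are hypothetical, a nearest-centroid classifier stands in for the ANN, the toy data are invented, and it is an assumption that every dataset takes a turn as the training set.

```python
from statistics import mean


def train_one_test_rest(datasets, fit, predict):
    """Train on each dataset alone and test on every other dataset.

    datasets: dict mapping name -> (X, y), with X a list of feature
              vectors and y a list of cell type labels.
    fit(X, y) -> model and predict(model, X) -> labels are supplied by
    the caller (hypothetical stand-ins for the ANN).
    Returns a dict mapping each training dataset name to its mean
    accuracy over the remaining datasets.
    """
    results = {}
    for train_name, (X_train, y_train) in datasets.items():
        model = fit(X_train, y_train)
        accuracies = []
        for test_name, (X_test, y_test) in datasets.items():
            if test_name == train_name:
                continue  # never test on the training dataset
            y_pred = predict(model, X_test)
            correct = sum(p == t for p, t in zip(y_pred, y_test))
            accuracies.append(correct / len(y_test))
        results[train_name] = mean(accuracies)
    return results


def fit_centroids(X, y):
    """Nearest-centroid stand-in: average the feature vectors per label."""
    sums, counts = {}, {}
    for row, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(row))
        for i, value in enumerate(row):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}


def predict_centroids(model, X):
    """Assign each row the label of the closest centroid (squared Euclidean)."""
    def dist2(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b))
    return [min(model, key=lambda label: dist2(row, model[label])) for row in X]


# Invented toy data: two "genes", two cell types, three small "datasets".
toy = {
    "d1": ([[5, 0], [4, 1], [0, 5], [1, 4]], ["B", "B", "T", "T"]),
    "d2": ([[6, 1], [0, 6]], ["B", "T"]),
    "d3": ([[5, 1], [1, 5], [4, 0]], ["B", "T", "B"]),
}
scores = train_one_test_rest(toy, fit_centroids, predict_centroids)
for name, accuracy in scores.items():
    print(f"trained on {name}: mean accuracy on the rest = {accuracy:.4f}")
```

Testing only on datasets that were not used for training keeps the accuracy estimate from being inflated by dataset-specific effects, which is presumably why the evaluation holds out the other 27 datasets.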

Keywords: cell type prediction, peripheral blood mononuclear cells, scRNA-seq, supervised machine learning, reference datasets