Data mining for long non-coding RNAs deregulated in colon cancer through analysis of the Gene Expression Omnibus database

Iva Pruner1* and Aleksandra Nikolic1

1Institute of Molecular Genetics and Genetic Engineering, University of Belgrade, Vojvode Stepe 444a, 11042 Belgrade, Serbia

iva [at] imgge.bg.ac.rs

Abstract

Colorectal cancer (CRC) is one of the most commonly diagnosed cancers worldwide. The lack of specific symptoms is a challenge for clinicians: CRC symptoms overlap with those of non-cancerous diseases, and as a result 20-25% of newly diagnosed CRC patients already have liver metastases. Discovering reliable biomarkers of early disease is therefore of high importance. Non-coding RNAs (ncRNAs) have been shown to be involved in CRC development and progression. Long non-coding RNAs (lncRNAs) can interact with RNA, DNA and proteins, forming complexes that regulate gene expression through multiple mechanisms. They affect every stage of colon carcinogenesis, which makes them top candidates for novel biomarker discovery.

The aim of our study was to mine the Gene Expression Omnibus (GEO) database using the keywords “colon cancer” and “ncRNA”, and to identify differentially expressed lncRNAs that appear in multiple GEO datasets.

The GEO database, which collects submitted high-throughput gene expression data, was queried for all datasets that studied colon cancer and ncRNA. Over 60 datasets were manually inspected to identify those that analyzed colon cancer tissue and normal tissue originating from the same patient. Each selected dataset was analyzed with GEO2R to discover differentially expressed lncRNAs. LncRNAs were considered significant if they appeared in more than one GEO dataset. The partial lncRNA sequences available in the GEO2R results were run through BLAST to identify the full-length lncRNAs.
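The selection rule described above (keep only sequences that appear in more than one dataset) can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function name `recurrent_hits`, the dataset labels and the sequence identifiers are all hypothetical placeholders, and the input is assumed to be a list of differentially expressed identifiers already extracted from each dataset's results.

```python
from collections import defaultdict


def recurrent_hits(hits_by_dataset, min_datasets=2):
    """Return identifiers found in at least `min_datasets` datasets.

    `hits_by_dataset` maps a dataset label to the identifiers reported
    as differentially expressed in that dataset (hypothetical input).
    """
    seen_in = defaultdict(set)
    for dataset, identifiers in hits_by_dataset.items():
        # Use a set so a repeated identifier counts once per dataset.
        for identifier in set(identifiers):
            seen_in[identifier].add(dataset)
    return {
        identifier: sorted(datasets)
        for identifier, datasets in seen_in.items()
        if len(datasets) >= min_datasets
    }


# Hypothetical example data; these are not real GEO accessions or probes.
example = {
    "dataset_A": ["seq_1", "seq_2", "seq_3"],
    "dataset_B": ["seq_2", "seq_4"],
    "dataset_C": ["seq_2", "seq_3", "seq_3"],
}
recurrent = recurrent_hits(example)
```

With the example data, only `seq_2` and `seq_3` are retained, because each is reported in at least two datasets.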

Five GEO datasets matched our criteria. We discovered 12 sequences that appeared in more than one dataset and identified them through BLAST analysis. Six sequences originated from lncRNAs (RYR3 divergent transcript, long intergenic non-protein coding RNA 595, TOX divergent transcript, FLVCR2 antisense RNA 1, LHRI_LNC744.1 lncRNA gene, and ELFN1 antisense RNA 1), while the other six were partial sequences of various mRNAs. Four lncRNAs were down-regulated in colon cancer, one was up-regulated, and one showed different expression patterns in different GEO datasets.
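The grouping of recurrent lncRNAs by expression pattern can be illustrated with a short sketch. It assumes that each lncRNA has one log2 fold change per dataset, oriented so that positive values mean higher expression in tumor than in normal tissue; that orientation, the function name `classify_direction` and the numeric values are assumptions made for illustration and are not taken from the study.

```python
def classify_direction(log_fold_changes):
    """Summarize the direction of change across datasets.

    Assumes positive values mean higher expression in tumor tissue
    (an assumption of this sketch, not a statement about the study).
    """
    directions = {
        "up" if value > 0 else "down"
        for value in log_fold_changes
        if value != 0
    }
    if directions == {"up"}:
        return "up-regulated"
    if directions == {"down"}:
        return "down-regulated"
    if not directions:
        return "unchanged"
    # Both signs were observed across the datasets.
    return "inconsistent"


# Hypothetical fold changes for three illustrative transcripts.
patterns = {
    "transcript_x": classify_direction([-1.2, -0.8]),
    "transcript_y": classify_direction([0.9, 1.5]),
    "transcript_z": classify_direction([-1.0, 0.7]),
}
```

A transcript whose sign differs between datasets falls into the "inconsistent" group, which corresponds to the lncRNA reported above as showing different expression patterns in different datasets.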

In this study, we identified six lncRNAs of potential significance for colorectal cancer etiology; they will be the subject of further in silico and in vitro studies.

Keywords: long non-coding RNA, colorectal cancer, data mining, GEO database
