Identification of non-coding RNAs with a new composite feature in the Hybrid Random Forest Ensemble algorithm
Metadata
Document Title
Identification of non-coding RNAs with a new composite feature in the Hybrid Random Forest Ensemble algorithm
Author
Lertampaiporn S, Thammarongtham C, Nukoolkit C, Kaewkamnerdpong B, Ruengjitchatchawalya M
Affiliations
King Mongkut's University of Technology Thonburi; King Mongkut's University of Technology Thonburi; National Science & Technology Development Agency - Thailand; National Center Genetic Engineering & Biotechnology (BIOTEC); King Mongkut's University of Technology Thonburi; King Mongkut's University of Technology Thonburi; King Mongkut's University of Technology Thonburi
Type
Article
Source Title
NUCLEIC ACIDS RESEARCH
ISSN
0305-1048
Year
2014
Volume
42
Issue
2
Open Access
Green Published, Green Submitted, gold
Publisher
OXFORD UNIV PRESS
DOI
10.1093/nar/gku325
Format
Abstract
To identify non-coding RNA (ncRNA) signals within genomic regions, a classification tool was developed based on a hybrid random forest (RF) with a logistic regression model to efficiently discriminate short ncRNA sequences as well as long complex ncRNA sequences. This RF-based classifier was trained on a well-balanced dataset with a discriminative set of features and achieved an accuracy, sensitivity and specificity of 92.11%, 90.7% and 93.5%, respectively. The selected feature set includes a new proposed feature, SCORE. This feature is generated based on a logistic regression function that combines five significant features (structure, sequence, modularity, structural robustness and coding potential) to enable improved characterization of long ncRNA (lncRNA) elements. The use of SCORE improved the performance of the RF-based classifier in the identification of Rfam lncRNA families. A genome-wide ncRNA classification framework was applied to a wide variety of organisms, with an emphasis on those of economic, social, public health, environmental and agricultural significance, such as various bacterial genomes, the Arthrospira (Spirulina) genome, and rice and human genomic regions. Our framework was able to identify known ncRNAs with sensitivities of greater than 90% and 77.7% for prokaryotic and eukaryotic sequences, respectively. Our classifier is available at http://ncrna-pred.com/HLRF.htm.
Funding Sponsor
National Research University Project of Thailand's Office of the Higher Education Commission [54000318]; King Mongkut's University of Technology Thonburi
License
CC BY-NC
Rights
Authors
Publication Source
WOS