NcPred for accurate nuclear protein prediction using n-mer statistics with various classification algorithms
Md. Saiful Islam, Alaol Kabir, Kazi Sakib, and Md. Alamgir Hossain

Abstract Prediction of nuclear proteins is one of the major challenges in genome annotation. This paper describes NcPred, a method that predicts nuclear proteins with higher accuracy by exploiting n-mer statistics with different classification algorithms, namely Alternating Decision (AD) Tree, Best First (BF) Tree, Random Tree, and Adaptive (Ada) Boost. On the BaCello dataset [1], NcPred improves accuracy by about 20% with Random Tree and sensitivity by about 10% with Ada Boost for animal proteins, compared to existing techniques.

The majority of nuclear proteins are synthesized in the cytoplasm, from where they are transported into the nucleus. A small number of nucleus-resident proteins are also synthesized inside the nucleus. Proteins that are imported into the nucleus contain a leader sequence at the N-terminus that carries the information needed for localization [5]. This is not always true, however, as in many cases the leader sequence is altogether absent.

In the past, a number of methods were developed to predict protein localization, though not exclusively for nuclear proteins [18]. Similarity search-based techniques fall under the first category, in which the query sequence is searched against experimentally annotated proteins. Although the similarity-based method is very informative and considered to be the best, it becomes severely handicapped when no apparent homology is found [6]. Other methods are based on predicting signal sequences, using the sorting signals present on the protein. This category includes TargetP [7] and SignalP [8]. Although these methods are quite popular, not all proteins have signals; for example, only around 25% of yeast nuclear proteins have matrix-targeting signals, particularly at the N-terminus [9]. Still other methods attempt to predict subcellular localization on the basis of sequence composition, such as ESLpred (Subcellular Localization of Eukaryotic Proteins Prediction) [10], HSLpred [11], NNPSL [6], and LOCSVMPSI [12]. Although their overall performance is good, their prediction accuracy for nuclear proteins is much lower than for proteins in other locations. This shows that nuclear protein localization is much more complex and hence warrants special attention.

This paper proposes a new technique called NcPred to improve the prediction accuracy of nuclear proteins with four powerful machine learning algorithms, namely AD Tree, BF Tree, Random Tree, and Ada Boost. Rather than relying on signals or subcellular localization features, NcPred exploits the n-mer statistics present in sequence databases. Experimental evaluation shows that NcPred compares favorably with contemporary nuclear protein classification research.

2 Proposed Nuclear Protein Prediction (NcPred) Method

2.1 Modeling the Problem
The classification of nuclear proteins is a binary classification problem, and the model developed here is a supervised learner. Formally, a set of protein sequences $S = \{s_1, s_2, \ldots, s_N\}$ and their labels $Y = \{y_1, y_2, \ldots, y_N\}$ are given, where $y_i \in \{\text{Nuclear}, \text{Non-nuclear}\}$. We wish to determine the label $y_{new}$ of a newly arrived sequence $s_{new}$, that is, to learn the mapping $s_{new} \rightarrow y_{new}$.
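The supervised setup can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the toy sequences are made up, the helper names (`nmer_counts`, `nmer_frequencies`, `similarity`, `predict`) are hypothetical, the choice n = 2 is arbitrary, and the nearest-neighbour rule is only a stand-in for a model M. It is not one of the classifiers NcPred uses (AD Tree, BF Tree, Random Tree, Ada Boost), and the paper's actual n-mer features may differ.

```python
from collections import Counter


def nmer_counts(sequence, n=2):
    """Count overlapping n-mers (length-n substrings) in a sequence."""
    return Counter(sequence[i:i + n] for i in range(len(sequence) - n + 1))


def nmer_frequencies(sequence, n=2):
    """Normalize n-mer counts to relative frequencies (empty dict if none)."""
    counts = nmer_counts(sequence, n)
    total = sum(counts.values())
    return {nmer: c / total for nmer, c in counts.items()} if total else {}


def similarity(seq_a, seq_b, n=2):
    """Shared n-mer frequency mass: sum of the smaller frequency per shared n-mer."""
    fa, fb = nmer_frequencies(seq_a, n), nmer_frequencies(seq_b, n)
    return sum(min(fa[k], fb[k]) for k in fa.keys() & fb.keys())


def predict(s_new, training, n=2):
    """Toy stand-in for model M: return the label of the most similar training pair."""
    _, label = max(training, key=lambda pair: similarity(s_new, pair[0], n))
    return label


# Made-up labeled pairs (s_i, y_i); real data would come from a sequence database.
training = [
    ("MKRKRSAA", "Nuclear"),
    ("MLLAVLAG", "Non-nuclear"),
]

if __name__ == "__main__":
    print(predict("MKRKRSAG", training))
```

The point of the sketch is the shape of the problem: each training item is a pair (s_i, y_i), each sequence is reduced to n-mer statistics, and the model maps an unseen sequence to one of the two labels.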


Any model M performing this classification should be supervised since the labels of the given sequences are known. That is, each sequence in the database appears as a pair (si , yi ). To learn the model, the study exploits n-mer distribution statis-

