NcPred for accurate nuclear protein prediction using n-mer statistics with various classification algorithms
Md. Saiful Islam, Alaol Kabir, Kazi Sakib, and Md. Alamgir Hossain

Abstract Prediction of nuclear proteins is one of the major challenges in genome annotation. A method, NcPred, is described for predicting nuclear proteins with higher accuracy by exploiting n-mer statistics with different classification algorithms, namely Alternating Decision (AD) Tree, Best First (BF) Tree, Random Tree, and Adaptive (Ada) Boost. On the BaCello dataset [1], NcPred improves accuracy by about 20% with Random Tree and sensitivity by about 10% with Ada Boost for animal proteins compared to existing techniques. It ...

e-mail: {k.muheymin-us-sakib, M.A.Hossain1}




A majority of nuclear proteins are synthesized in the cytoplasm, from where they are transported into the nucleus. A small number of nucleus-resident proteins, however, are synthesized inside the nucleus. Proteins that are imported into the nucleus contain a leader sequence at the N-terminus that carries the information needed for localization [5]. This is not always true, however: in many cases the leader sequence is altogether absent.

In the past, a number of methods were developed to predict the localization of proteins, though not exclusively for nuclear proteins [18]. The similarity search-based techniques fall under the first category, in which the query sequence is searched against experimentally annotated proteins. Although the similarity-based method is very informative and considered to be the best, it becomes severely handicapped when no apparent homology is found [6]. A second category of methods predicts signal sequences, using the sorting signals present on the protein; it includes TargetP [7] and SignalP [8]. Although these methods are quite popular, not all proteins have signals; for example, only around 25% of yeast nuclear proteins have matrix-targeting signals, particularly at the N-terminus [9]. A third category predicts subcellular localization on the basis of sequence composition, such as ESLpred (Subcellular Localization of Eukaryotic Proteins Prediction) [10], HSLpred [11], NNPSL [6], and LOCSVMPSI [12]. Although their overall performance is good, their prediction accuracy for nuclear proteins is much lower than for proteins in other locations. This shows that nuclear protein localization is much more complex and hence warrants special attention.

This paper proposes a new technique called NcPred to improve the prediction accuracy of nuclear proteins with four machine learning algorithms, namely AD Tree, BF Tree, Random Tree, and Ada Boost. Rather than signals and subcellular localizations, NcPred exploits the n-mer statistics present in the sequence databases. Experimental evaluation shows the suitability of NcPred over contemporary nuclear protein classification research.

2 Proposed Nuclear Protein Prediction (NcPred) Method

2.1 Modeling the Problem
The classification of nuclear proteins is a binary classification problem, and the model developed here is a supervised learner. Formally, a set of protein sequences S = {s1, s2, ..., sN} and their labels Y = {y1, y2, ..., yN} are given, with yi ∈ {Nuclear, Non-nuclear}. We wish to determine the label ynew of a newly arrived sequence snew:

snew → ynew
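The problem statement above can be sketched in a few lines of Python. This is only an illustration of the setup (labeled pairs and a model that maps a new sequence to a label), not the NcPred method itself. The type names, the toy sequences, and the majority-class baseline are all assumptions introduced here for illustration.

```python
from collections import Counter
from typing import Callable, List, Tuple

Label = str                      # "Nuclear" or "Non-nuclear"
Example = Tuple[str, Label]      # one labeled pair (s_i, y_i)
Model = Callable[[str], Label]   # maps a new sequence to a predicted label


def fit_majority_baseline(examples: List[Example]) -> Model:
    """Return a placeholder model that always predicts the most frequent label.

    This baseline is an assumption for illustration; it ignores the sequence
    and is not one of the classifiers used by NcPred.
    """
    label_counts = Counter(label for _, label in examples)
    majority_label = label_counts.most_common(1)[0][0]
    return lambda sequence: majority_label


# Toy sequences, invented for illustration only.
training_set: List[Example] = [
    ("MKRPAATKKAGQAKKKK", "Nuclear"),
    ("MSPKKKRKVEDP", "Nuclear"),
    ("MALWMRLLPLLALLALW", "Non-nuclear"),
]

model = fit_majority_baseline(training_set)
predicted_label = model("MPKKKRKV")  # label assigned to a new sequence
```

Any real classifier, such as the tree-based learners named in the paper, would replace the baseline while keeping the same shape: fit on the labeled pairs, then map a new sequence to one of the two labels.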


Any model M performing this classification should be supervised, since the labels of the given sequences are known. That is, each sequence in the database appears as a pair (si, yi). To learn the model, the study exploits n-mer distribution statistics ...
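As a rough illustration of what n-mer statistics can look like, the sketch below counts the substrings of length n in a protein sequence and converts the counts to relative frequencies. The excerpt does not say how NcPred defines or normalizes its n-mers, so the overlapping sliding window, the normalization by total count, and the function names are assumptions made here.

```python
from collections import Counter
from typing import Dict


def nmer_counts(sequence: str, n: int) -> Counter:
    """Count every overlapping substring of length n (assumed windowing)."""
    return Counter(sequence[i:i + n] for i in range(len(sequence) - n + 1))


def nmer_frequencies(sequence: str, n: int) -> Dict[str, float]:
    """Relative frequency of each n-mer; empty if the sequence is shorter than n."""
    counts = nmer_counts(sequence, n)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {nmer: count / total for nmer, count in counts.items()}


# Toy sequence, invented for illustration only.
example_counts = nmer_counts("MKKKRK", 2)
example_frequencies = nmer_frequencies("MKKKRK", 2)
```

For the toy sequence, the 2-mers are MK, KK, KK, KR, and RK, so KK has a count of 2 and a relative frequency of 0.4. Frequency vectors of this kind could serve as input features for a supervised classifier.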
