Comparative analysis of machine learning based methods for the prediction of NLR protein


  • Nadia Department of Biotechnology and Bioinformatics, Jaypee University of Information Technology, Waknaghat- 173234, Solan (HP), India
  • Ekta Gandotra Department of Computer Science Engineering & Information Technology, Jaypee University of Information Technology, Waknaghat- 173234, Solan (HP), India
  • Narendra Kumar Department of Biotechnology and Bioinformatics, Jaypee University of Information Technology, Waknaghat- 173234, Solan (HP), India


NLR, machine learning, SVM, SMO, random forest, cross-validation


Nucleotide-binding domain leucine-rich repeat-containing (NLR) proteins play a fundamental role in intestinal tissue repair and innate immunity. The NLR protein family is a recent addition to the effector molecules of innate immunity. It also plays an important role in regulating the intestinal microbiota and has recently emerged as a crucial factor in the development of colitis-associated cancer (CAC) and ulcerative colitis (UC). We have developed a machine learning based method for the prediction of NLR proteins. This paper presents a comparative analysis of three supervised machine learning approaches for this task: Sequential Minimal Optimization (SMO), Library for Support Vector Machines (LIBSVM), and Random Forest (RF). The dataset used in this work was created by extracting features with the ProtR package. The models are trained on compositional features such as amino acid composition and dipeptide composition. The training dataset consists of 390 proteins: a positive set of 103 sequences from the NLR family and a negative set of 287 sequences, comprising random protein sequences and several transporter family protein sequences retrieved from NCBI and UniProt.
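To make the compositional features concrete, the following is a minimal Python sketch of amino acid composition (20 values) and dipeptide composition (400 values) for a single sequence. It is an illustration only, not the authors' pipeline, which used the ProtR package; the function names, the handling of non-standard residues, and the example sequence are assumptions made here.

```python
from itertools import product

# The 20 standard amino acids, in one-letter code.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_composition(seq):
    """Return the fraction of each standard amino acid in seq (20 features).

    Assumption: non-standard letters (e.g. X) count toward the sequence
    length but get no feature of their own.
    """
    seq = seq.upper()
    if not seq:
        raise ValueError("sequence must not be empty")
    return {aa: seq.count(aa) / len(seq) for aa in AMINO_ACIDS}


def dipeptide_composition(seq):
    """Return the fraction of each adjacent residue pair in seq (400 features).

    Pairs are counted over overlapping windows of length 2, so a sequence
    of length n contributes n - 1 pairs.
    """
    seq = seq.upper()
    total = len(seq) - 1
    if total < 1:
        raise ValueError("sequence must contain at least two residues")
    counts = {a + b: 0 for a, b in product(AMINO_ACIDS, repeat=2)}
    for i in range(total):
        pair = seq[i:i + 2]
        if pair in counts:  # skip pairs containing non-standard residues
            counts[pair] += 1
    return {pair: n / total for pair, n in counts.items()}


if __name__ == "__main__":
    example = "MKTAYIAKQR"  # hypothetical sequence, not taken from the dataset
    features = list(amino_acid_composition(example).values())
    features += list(dipeptide_composition(example).values())
    print(len(features))  # 20 + 400 values per protein
```

Concatenating both dictionaries gives one fixed-length feature vector per protein, which is the kind of input a classifier such as SMO, LIBSVM, or RF can be trained on.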


