Performance Analysis of Nepali Text Classification using Back Propagation and Naive Bayes Algorithm

Maharjan, Jamuna

Files

Full Thesis.pdf (1.51 MB)

Date

2014

Authors

Maharjan, Jamuna

Publisher

Department of Computer Science & Information Technology

Abstract

Automated document classification is the task of assigning a given document to a class of interest. Text classification is a subset of document classification, since a document can be text, image, music, etc. Document classification has many applications in library science, information science, computer science, and other fields. It can be used for intellectual categorization of documents, indexing of documents, spam filtering, email routing, language identification, genre classification, etc. The problem of automated document classification can be solved in a supervised, unsupervised, or semi-supervised way. Most learning and classification algorithms use document attributes and human inference to learn and classify given documents. In this dissertation, several Natural Language Processing (NLP) techniques are used for document processing and attribute selection, and two learning-based classification techniques are used: Artificial Neural Network (ANN) and Naive Bayes Classifier. ANN is a biologically inspired model of a learning system, and the Naive Bayes Classifier is a probability-based classification technique. For the evaluation of the system, we created Nepali text datasets for five classes of documents: Business, Crime, Education, Health, and Sports. There are two separate datasets, one for training and one for testing. The training set contains 1253 documents in total: 243 for Business, 147 for Crime, 250 for Education, 270 for Health, and 343 for Sports. Similarly, the testing set contains 89 documents in total: 19 for Business, 20 for Crime, 12 for Education, 19 for Health, and 19 for Sports. Training and testing are done by splitting the training set into two sets while keeping the testing set distinct. Experimental results show that the feed-forward multilayer perceptron (MLP) based neural network classifier has a lower classification error rate than the Naive Bayes based classifier.
The MLP classification system has an average system accuracy rate of 87.55%, system error rate of 12.44%, precision rate of 80.29%, recall rate of 93.41%, and f-score rate of 86.55%. Similarly, the Naive Bayes classification system has an average system accuracy rate of 87.09%, system error rate of 12.90%, precision rate of 79.37%, recall rate of 93.87%, and f-score rate of 86.05%.

Keywords: Automated Document Categorization, Text Classification, Natural language processing, Nepali language, Preprocessing, Feature extraction, Artificial Neural Networks, Multilayer Perceptron, Naive Bayes Classifier
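
The abstract reports average accuracy, error, precision, recall, and f-score rates without stating how they were computed. The following is a minimal Python sketch of one common way to derive such figures from a multi-class confusion matrix, using per-class values followed by an unweighted (macro) average. The function names, the macro-averaging choice, and the toy three-class matrix are assumptions made for illustration; they are not taken from the thesis, and the counts are not its results.

```python
from typing import Dict, List


def per_class_metrics(confusion: List[List[int]],
                      labels: List[str]) -> Dict[str, Dict[str, float]]:
    """Per-class precision, recall and f-score.

    confusion[i][j] is the number of documents whose true class is
    labels[i] and whose predicted class is labels[j].
    """
    n = len(labels)
    results: Dict[str, Dict[str, float]] = {}
    for k in range(n):
        tp = confusion[k][k]
        fp = sum(confusion[i][k] for i in range(n)) - tp  # predicted k, but wrong
        fn = sum(confusion[k]) - tp                       # true k, but missed
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f_score = 2 * precision * recall / denom if denom else 0.0
        results[labels[k]] = {
            "precision": precision,
            "recall": recall,
            "f_score": f_score,
        }
    return results


def macro_average(metrics: Dict[str, Dict[str, float]], key: str) -> float:
    """Unweighted mean of one metric over all classes (assumed averaging scheme)."""
    return sum(m[key] for m in metrics.values()) / len(metrics)


def accuracy(confusion: List[List[int]]) -> float:
    """Fraction of all documents that fall on the diagonal."""
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    return correct / total if total else 0.0


# Hypothetical toy data, NOT the thesis results.
toy_labels = ["Business", "Crime", "Sports"]
toy_confusion = [
    [8, 1, 1],
    [2, 7, 1],
    [0, 1, 9],
]

toy_metrics = per_class_metrics(toy_confusion, toy_labels)
toy_accuracy = accuracy(toy_confusion)
toy_error = 1.0 - toy_accuracy  # error rate as the complement of accuracy
toy_macro_f = macro_average(toy_metrics, "f_score")
```

Under macro averaging, the mean f-score is the average of the per-class f-scores, so it need not equal the harmonic mean of the averaged precision and averaged recall.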