Performance Analysis of Nepali Text Classification using Back Propagation and Naive Bayes Algorithm
Publisher
Department of Computer Science & Information Technology
Abstract
Automated document classification is the task of assigning a given document to a class
of interest. Text classification is a subset of document classification, since a document can be
text, image, music, etc. Document classification has many applications in library science,
information science, computer science, and other fields. It can be used for intellectual categorization
of documents, document indexing, spam filtering, email routing, language identification,
genre classification, etc.
The problem of automated document classification can be solved in a supervised, unsupervised,
or semi-supervised way. Most learning and classification algorithms use document attributes
and human inference to learn and classify given documents. In this dissertation work,
many Natural Language Processing (NLP) techniques are used for document processing and
attribute selection, and two learning-based classification techniques are used: the Artificial
Neural Network (ANN) and the Naive Bayes classifier. An ANN is a biologically inspired model of a
learning system, and the Naive Bayes classifier is a probability-based classification technique.
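To make the probability-based approach concrete, the following is a minimal sketch of a multinomial Naive Bayes text classifier with add-one (Laplace) smoothing. It is an illustration only, not the dissertation's implementation: the function names `train_naive_bayes` and `classify`, the smoothing choice, and the English placeholder tokens (standing in for preprocessed Nepali tokens) are all assumptions.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(docs):
    """Fit a multinomial Naive Bayes model.

    docs: list of (tokens, label) pairs, where tokens is a list of strings.
    Returns (log_priors, word_counts, vocab). All names here are hypothetical.
    """
    class_docs = Counter()              # number of documents per class
    word_counts = defaultdict(Counter)  # per-class token frequencies
    vocab = set()
    for tokens, label in docs:
        class_docs[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    total = sum(class_docs.values())
    log_priors = {c: math.log(n / total) for c, n in class_docs.items()}
    return log_priors, word_counts, vocab


def classify(tokens, log_priors, word_counts, vocab):
    """Return the class with the highest log-posterior score."""
    best_label, best_score = None, -math.inf
    for label, log_prior in log_priors.items():
        counts = word_counts[label]
        # Add-one smoothing: every vocabulary word gets one extra count.
        denom = sum(counts.values()) + len(vocab)
        score = log_prior
        for token in tokens:
            if token in vocab:  # tokens never seen in training are ignored
                score += math.log((counts[token] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy usage with placeholder tokens (assumed data, not from the datasets).
training = [
    (["bank", "loan", "market"], "Business"),
    (["market", "share", "bank"], "Business"),
    (["match", "goal", "team"], "Sports"),
    (["team", "win", "match"], "Sports"),
]
model = train_naive_bayes(training)
print(classify(["bank", "market"], *model))
```

Scores are summed in log space so that products of many small probabilities do not underflow on long documents.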
For the evaluation of the system, we created Nepali text datasets for five classes of documents:
Business, Crime, Education, Health, and Sports. There are two separate datasets for
training and testing of the system. The training set contains a total of 1253 documents: 243 for
Business, 147 for Crime, 250 for Education, 270 for Health, and 343 for Sports. Similarly,
the testing dataset contains a total of 89 documents: 19 for Business, 20 for Crime, 12 for Education,
19 for Health, and 19 for Sports. Training and testing are done by splitting the training set
into two sets while keeping the testing set separate. Experimental results show that the feed-forward
multilayer perceptron (MLP) neural network classifier has a lower classification error rate than
the Naive Bayes classifier. The MLP classification system has an average system accuracy rate
of 87.55%, system error rate of 12.44%, precision rate of 80.29%, recall rate of 93.41%, and
f-score rate of 86.55%. Similarly, the Naive Bayes classification system has an average system
accuracy rate of 87.09%, system error rate of 12.90%, precision rate of 79.37%, recall rate of
93.87%, and f-score rate of 86.05%.
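As a reference for how such figures can be computed, the sketch below derives accuracy, error, precision, recall, and f-score per class from one-vs-rest counts and then averages them over the classes. The abstract does not state which averaging scheme was used, so macro-averaging is an assumption here, and the function names and toy labels are hypothetical.

```python
def per_class_metrics(true_labels, predicted_labels, positive):
    """One-vs-rest metrics, treating `positive` as the class of interest."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(true_labels, predicted_labels):
        if pred == positive and truth == positive:
            tp += 1
        elif pred == positive:
            fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # The f-score is the harmonic mean of precision and recall.
    f_score = (2 * precision * recall / (precision + recall)
               if precision + recall else 0.0)
    return {"accuracy": accuracy, "error": 1.0 - accuracy,
            "precision": precision, "recall": recall, "f_score": f_score}


def macro_average(true_labels, predicted_labels):
    """Average each per-class metric over all classes seen in the truth."""
    classes = sorted(set(true_labels))
    per_class = [per_class_metrics(true_labels, predicted_labels, c)
                 for c in classes]
    return {key: sum(m[key] for m in per_class) / len(per_class)
            for key in per_class[0]}


# Toy usage with made-up labels (not results from the dissertation).
truth = ["Sports", "Sports", "Health", "Health"]
predicted = ["Sports", "Health", "Health", "Health"]
for name, value in macro_average(truth, predicted).items():
    print(f"{name}: {value:.4f}")
```

Under this scheme the accuracy and error rates always sum to 1, while the averaged f-score is the mean of the per-class f-scores and need not equal the harmonic mean of the averaged precision and recall.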
Keywords:
Automated Document Categorization, Text Classification, Natural Language Processing, Nepali
Language, Preprocessing, Feature Extraction, Artificial Neural Networks, Multilayer Perceptron,
Naive Bayes Classifier