Word Embedding Based Feature Extraction for Nepali News Classification
Department of Computer Science and Information Technology
Abstract
A major challenge in topic classification (TC) is the high dimensionality of the feature
space. Feature extraction (FE) therefore plays a vital role in topic classification in
particular and text mining in general. FE based on a cosine similarity score is commonly
used to reduce the dimensionality of datasets with tens or hundreds of thousands of
features, which can otherwise be impossible to process further. In this study, TF-IDF
(Term Frequency-Inverse Document Frequency) term weighting is used to extract features.
Selecting relevant features and determining how to encode them for a machine learning
method have a vast impact on that method's ability to learn a good model.
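A minimal sketch of the TF-IDF weighting and cosine-similarity scoring described above, assuming scikit-learn as the implementation library and a toy corpus in place of the actual Nepali news data (both are illustrative assumptions, not details from the study):

```python
# Sketch only: scikit-learn and the toy corpus are assumptions,
# not the thesis's actual pipeline or dataset.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy documents standing in for preprocessed Nepali news articles.
docs = [
    "nepal cricket team wins match",
    "government announces new budget policy",
    "cricket world cup schedule announced",
]

# TF-IDF term weighting: each document becomes a sparse vector of
# term-frequency x inverse-document-frequency weights.
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)

# Cosine similarity between document vectors, the score commonly used
# for similarity-based feature reduction.
print(cosine_similarity(X).round(2))
```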
Count-based feature extraction methods are compared with word-to-vector (Word2Vec)
feature extraction techniques for Nepali news classification. The results show good
classification performance when using the word-to-vector feature extraction techniques
for a small number of classes, but the performance decreases drastically for large sample
sizes. On the other hand, the count-based technique shows nearly consistent classification
performance for any number of classes. The overall performance of TF-IDF is far better
than that of both word-to-vector techniques.
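A hedged sketch contrasting the two feature families compared in the study: count-based TF-IDF vectors versus word-to-vector document embeddings built by averaging word vectors. The libraries (gensim, scikit-learn), the averaging strategy, and all parameters are illustrative assumptions rather than the study's reported configuration:

```python
# Sketch under stated assumptions: gensim/scikit-learn, mean-pooled
# Word2Vec document vectors, and toy data are all illustrative choices.
import numpy as np
from gensim.models import Word2Vec
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "nepal cricket team wins match",
    "government announces new budget policy",
    "cricket world cup schedule announced",
]
tokenized = [d.split() for d in docs]

# Count-based features: a documents-by-vocabulary TF-IDF matrix,
# whose dimensionality grows with the vocabulary.
tfidf = TfidfVectorizer().fit_transform(docs)

# Word-to-vector features: train Word2Vec on the corpus, then represent
# each document as the mean of its word vectors (a common simple choice).
w2v = Word2Vec(sentences=tokenized, vector_size=50, min_count=1, epochs=20)
doc_vecs = np.array(
    [np.mean([w2v.wv[w] for w in toks], axis=0) for toks in tokenized]
)

print("TF-IDF shape:", tfidf.shape)       # (n_docs, vocabulary size)
print("Word2Vec shape:", doc_vecs.shape)  # (n_docs, fixed embedding dim)
```

Either feature matrix can then be fed to a standard classifier; the study's comparison concerns how classification performance varies with the number of classes under each representation.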