Comparative Study of Clustering Algorithms for Nepali News
Publisher
Department of Computer Science and Information Technology
Abstract
Clustering is an important technique for separating data into categories based on feature
similarity. It belongs to the unsupervised family of machine learning algorithms. Among the
many clustering algorithms, three representative ones, namely K-Means, X-Means and
Expectation Maximization (EM), are evaluated on the Nepali news clustering problem in this
research work. News clustering is the task of categorizing news into groups that share similar
interests. The algorithms are compared on cluster evaluation metrics and on execution time.
The evaluation metrics used are the Dunn index, the Davies-Bouldin (DB) index and the
Calinski-Harabasz (CH) index. Execution time includes clustering time and training time.
TF-IDF is used as the news embedding representation. The algorithms are also evaluated with
reduced feature dimensions obtained by applying PCA.
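To make two of the evaluation metrics concrete, the following is a minimal sketch of the Dunn index and the DB (Davies-Bouldin) index in plain Python. It is not the implementation used in this work. The function names, the use of Euclidean distance, and the particular Dunn variant (smallest distance between points of different clusters divided by the largest within-cluster diameter) are assumptions made for illustration; several variants of the Dunn index exist.

```python
from itertools import combinations
from math import dist  # Euclidean distance, Python 3.8+


def centroid(points):
    """Coordinate-wise mean of a non-empty list of equal-length tuples."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def dunn_index(clusters):
    """Higher is better: well-separated, compact clusters.

    Assumes at least two clusters and at least one cluster
    containing two or more distinct points.
    """
    # Smallest distance between two points from different clusters.
    min_between = min(
        dist(p, q)
        for a, b in combinations(clusters, 2)
        for p in a
        for q in b
    )
    # Largest distance between two points of the same cluster.
    max_diameter = max(
        dist(p, q)
        for c in clusters
        for p, q in combinations(c, 2)
    )
    return min_between / max_diameter


def davies_bouldin_index(clusters):
    """Lower is better: mean over clusters of the worst similarity ratio.

    Assumes at least two clusters with distinct centroids.
    """
    centers = [centroid(c) for c in clusters]
    # Scatter: mean distance of each cluster's points to its centroid.
    scatter = [
        sum(dist(p, m) for p in c) / len(c)
        for c, m in zip(clusters, centers)
    ]
    k = len(clusters)
    worst = [
        max(
            (scatter[i] + scatter[j]) / dist(centers[i], centers[j])
            for j in range(k)
            if j != i
        )
        for i in range(k)
    ]
    return sum(worst) / k
```

Both functions take a list of clusters, each cluster being a list of points given as tuples of numbers, for example rows of a TF-IDF or PCA-reduced matrix grouped by cluster label. The pairwise loops are quadratic in the number of points, so this sketch only suits small samples.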
To select the winning algorithm and settings, the values of the DB index, training time and
clustering time must be lower, and the values of the CH index and Dunn index must be higher.
Based on the evaluation results, we identify the winning algorithm and strategy in the
following cases. When the feature dimension is high (>= 10000), K-Means performs better than
the others. When PCA is applied to reduce the feature space, the EM algorithm performs better
than the others. With the reduced feature space, K-Means still performs better than the
X-Means clustering algorithm.
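The selection criteria stated above (lower DB index, training time and clustering time; higher CH index and Dunn index) can be sketched as a simple tally of wins per criterion. The abstract does not say how the criteria were combined, so the equal-weight tally, the function name `count_wins` and the dictionary keys below are assumptions for illustration only.

```python
# Direction of each criterion, as stated in the abstract.
LOWER_IS_BETTER = {"db_index", "training_time", "clustering_time"}
HIGHER_IS_BETTER = {"ch_index", "dunn_index"}


def count_wins(scores):
    """Return {candidate: number of criteria on which it is best}.

    `scores` maps a candidate name to a dict holding one value per
    criterion. On a tie, the candidate listed first in `scores` wins.
    """
    wins = {name: 0 for name in scores}
    for criterion in sorted(LOWER_IS_BETTER | HIGHER_IS_BETTER):
        pick = min if criterion in LOWER_IS_BETTER else max
        best = pick(scores, key=lambda name: scores[name][criterion])
        wins[best] += 1
    return wins
```

A caller would pass one entry per algorithm and configuration, filled with measured values, and treat the candidate with the most wins as the preferred one for that configuration.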