Nepali Document Clustering using K-Means, Mini-Batch K-Means, and DBSCAN
Publisher
Department of Computer Science and Information Technology
Abstract
Automated document clustering is the process of grouping documents into a small set of meaningful
and coherent collections. This research evaluates the K-Means, Mini-Batch K-Means, and
Density-Based Spatial Clustering of Applications with Noise (DBSCAN) algorithms using four
performance measures: Homogeneity, Completeness, V-Measure, and Silhouette Coefficient, in
the context of Nepali documents. Feature extraction is done using Term Frequency-Inverse
Document Frequency (TFIDF) alone and using TFIDF combined with Latent Semantic Indexing (LSI). The
empirical results show that Mini-Batch K-Means performs better when using TFIDF only, and
K-Means performs better when using TFIDF + LSI. In time-constrained environments,
the clustering time of Mini-Batch K-Means is lower than that of the other two algorithms.
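
Three of the four measures named above are closely related: V-Measure is the harmonic mean of Homogeneity and Completeness, and both of those are defined through conditional entropy between the true classes and the assigned clusters. The following is a minimal sketch of those standard definitions, not the authors' implementation. The function names and the toy labels are hypothetical and chosen only for illustration; the Silhouette Coefficient and the clustering algorithms themselves are not covered here.

```python
import math
from collections import Counter


def _entropy(labels):
    """Shannon entropy (natural log) of a sequence of labels."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def _conditional_entropy(target, given):
    """H(target | given), computed from the joint label counts."""
    n = len(target)
    joint = Counter(zip(target, given))
    given_counts = Counter(given)
    return -sum(
        (c / n) * math.log(c / given_counts[g]) for (_, g), c in joint.items()
    )


def homogeneity_completeness_v(classes, clusters):
    """Return (homogeneity, completeness, v_measure) for two labelings.

    classes:  ground-truth category of each document (hypothetical input)
    clusters: cluster id assigned to each document by some algorithm
    """
    h_classes = _entropy(classes)
    h_clusters = _entropy(clusters)
    # Homogeneity: each cluster contains documents of a single class.
    homogeneity = (
        1.0 if h_classes == 0
        else 1.0 - _conditional_entropy(classes, clusters) / h_classes
    )
    # Completeness: all documents of a class land in the same cluster.
    completeness = (
        1.0 if h_clusters == 0
        else 1.0 - _conditional_entropy(clusters, classes) / h_clusters
    )
    total = homogeneity + completeness
    # V-Measure: harmonic mean of the two (equal weighting assumed).
    v_measure = 0.0 if total == 0 else 2 * homogeneity * completeness / total
    return homogeneity, completeness, v_measure


# Toy example: four documents, two true categories.
true_classes = ["sports", "sports", "politics", "politics"]
perfect = homogeneity_completeness_v(true_classes, [0, 0, 1, 1])
one_cluster = homogeneity_completeness_v(true_classes, [0, 0, 0, 0])
```

In this toy case, a clustering that matches the categories exactly scores 1 on all three measures, while putting every document in one cluster is fully complete but not homogeneous, so its V-Measure drops to 0. For real experiments, scikit-learn provides `sklearn.metrics.homogeneity_completeness_v_measure`, which computes the same three quantities.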