Nepali Document Clustering using K-Means, Mini-Batch K-Means, and DBSCAN

Author: Maharjan, Aman
Dates: 2022-04-28; 2022-04-28; 2018
URI: https://hdl.handle.net/20.500.14540/10020

Abstract: Automated document clustering is the process of grouping documents into a small set of meaningful and coherent collections. This research evaluates the K-Means, Mini-Batch K-Means, and Density-Based Spatial Clustering of Applications with Noise (DBSCAN) algorithms on Nepali documents using four performance measures: Homogeneity, Completeness, V-Measure, and Silhouette Coefficient. Feature extraction is done using Term Frequency-Inverse Document Frequency (TF-IDF) alone and TF-IDF combined with Latent Semantic Indexing (LSI). The empirical results show that Mini-Batch K-Means performs better when using TF-IDF only, and K-Means performs better when using TF-IDF + LSI. In time-constrained environments, Mini-Batch K-Means is also the better choice, since its clustering time is lower than that of the other two algorithms.

Language: en-US
Keywords: Nepali document clustering; Mini-Batch K-Means; DBSCAN; Machine learning
Type: Thesis
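
Three of the four measures named in the abstract (Homogeneity, Completeness, and V-Measure) compare cluster assignments against known class labels. The record does not say how the thesis computed them, so the following is only a minimal sketch based on the standard entropy-based definitions, where V-Measure is the harmonic mean of Homogeneity and Completeness. The function names and the sample labels are hypothetical and are not taken from the thesis.

```python
import math
from collections import Counter


def _entropy(labels):
    """Shannon entropy H(labels), using natural logarithms."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def _conditional_entropy(a, b):
    """Conditional entropy H(a | b) for two equal-length label sequences."""
    n = len(a)
    joint = Counter(zip(a, b))   # counts of each (a, b) pair
    b_counts = Counter(b)        # counts of each b value
    return -sum(
        (n_ab / n) * math.log(n_ab / b_counts[b_val])
        for (_, b_val), n_ab in joint.items()
    )


def homogeneity_completeness_v(true_labels, cluster_labels):
    """Return (homogeneity, completeness, v_measure) for one clustering."""
    h_class = _entropy(true_labels)
    h_cluster = _entropy(cluster_labels)
    # Homogeneity: each cluster should contain members of a single class.
    homogeneity = (
        1.0 if h_class == 0
        else 1.0 - _conditional_entropy(true_labels, cluster_labels) / h_class
    )
    # Completeness: all members of a class should land in the same cluster.
    completeness = (
        1.0 if h_cluster == 0
        else 1.0 - _conditional_entropy(cluster_labels, true_labels) / h_cluster
    )
    total = homogeneity + completeness
    v_measure = 0.0 if total == 0 else 2 * homogeneity * completeness / total
    return homogeneity, completeness, v_measure


# Hypothetical example: six documents, three topics, one misplaced document.
true_topics = ["sports", "sports", "politics", "politics", "economy", "economy"]
clusters = [0, 0, 1, 1, 1, 2]
print(homogeneity_completeness_v(true_topics, clusters))
```

All three scores lie between 0 and 1, and cluster identifiers may be permuted freely without changing them. The Silhouette Coefficient is different in kind: it is computed from distances between document vectors and needs no class labels, so it is not covered by this sketch.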