Nepali Document Clustering using K-Means, Mini-Batch K-Means, and DBSCAN

Maharjan, Aman

Files

thesis.pdf (608.61 KB)

Date

2018

Authors

Maharjan, Aman

Publisher

Department of Computer Science and Information Technology

Abstract

Automated document clustering is the process of grouping documents into a small set of meaningful and coherent collections. This research evaluates the K-Means, Mini-Batch K-Means, and Density-Based Spatial Clustering of Applications with Noise (DBSCAN) algorithms on Nepali documents using four performance measures: Homogeneity, Completeness, V-Measure, and Silhouette Coefficient. Feature extraction is done using Term Frequency-Inverse Document Frequency (TFIDF) alone and using TFIDF combined with Latent Semantic Indexing (TFIDF + LSI). The empirical results show that Mini-Batch K-Means performs better when using TFIDF only, and K-Means performs better when using TFIDF + LSI. In time-constrained environments, Mini-Batch K-Means is also preferable: its clustering time is shorter than that of the other two algorithms.
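To make three of the four performance measures concrete, the following is a minimal sketch of Homogeneity, Completeness, and V-Measure computed from their standard entropy-based definitions. It is not taken from the thesis: the function names and the toy labels are illustrative assumptions, and the Silhouette Coefficient is left out because it needs the document feature vectors rather than labels alone.

```python
from collections import Counter
from math import log


def entropy(labels):
    """Shannon entropy H(labels), using natural logarithms."""
    n = len(labels)
    return -sum((cnt / n) * log(cnt / n) for cnt in Counter(labels).values())


def conditional_entropy(targets, given):
    """Conditional entropy H(targets | given) from co-occurrence counts."""
    n = len(targets)
    joint = Counter(zip(targets, given))
    given_counts = Counter(given)
    return -sum(
        (cnt / n) * log(cnt / given_counts[g])
        for (_, g), cnt in joint.items()
    )


def homogeneity_completeness_v(classes, clusters):
    """Return (homogeneity, completeness, v_measure) for two labelings.

    classes  -- true category of each document (hypothetical labels here)
    clusters -- cluster id assigned to each document by some algorithm
    """
    h_c = entropy(classes)
    h_k = entropy(clusters)
    # Homogeneity: each cluster should contain documents of a single class.
    homogeneity = 1.0 if h_c == 0 else 1.0 - conditional_entropy(classes, clusters) / h_c
    # Completeness: all documents of a class should land in the same cluster.
    completeness = 1.0 if h_k == 0 else 1.0 - conditional_entropy(clusters, classes) / h_k
    total = homogeneity + completeness
    # V-Measure: harmonic mean of homogeneity and completeness.
    v_measure = 0.0 if total == 0 else 2 * homogeneity * completeness / total
    return homogeneity, completeness, v_measure


# Toy example (made-up labels): four documents, two true categories,
# and a clustering that splits the "sports" documents across two clusters.
true_classes = ["news", "news", "sports", "sports"]
cluster_ids = [0, 0, 1, 2]
h, c, v = homogeneity_completeness_v(true_classes, cluster_ids)
print(round(h, 3), round(c, 3), round(v, 3))  # prints: 1.0 0.667 0.8
```

In the toy example every cluster is pure, so homogeneity is 1.0, but one category is split over two clusters, which lowers completeness and therefore the V-Measure. All three measures need ground-truth categories, whereas the Silhouette Coefficient can be computed without them.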