News clustering system based on text mining
Publisher
Department of Computer Science and Information Technology
Abstract
Data mining is the process of analyzing data from different perspectives and summarizing it into
useful information. This dissertation, entitled "News Clustering System based on Text Mining",
is an application of data mining in which similar English-language articles from different
newspapers are grouped together.
In this work, documents are retrieved from different newspapers' sites using a crawler, i.e.
Information Extraction (IE), and document preprocessing is then applied. A parser splits the data into
article headings and their corresponding links; the headings are then split into individual terms, and a
list of distinct terms is maintained. The Porter stemming algorithm is then applied to the
collection of distinct terms. Stemming reduces the vocabulary size (i.e. the number of terms is
minimized). The TF-IDF weight of each heading is then calculated. This process represents each
heading and its content as a vector in an n-dimensional space (n is the number of distinct terms in the
collection). Finally, the K-means algorithm is applied to group the news.
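The steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the dissertation's implementation: the function names (tokenize, crude_stem, tfidf_vectors, kmeans) and the sample headings are hypothetical, crude_stem is a simple suffix stripper standing in for the full Porter algorithm, and the TF-IDF variant used (relative term frequency times log(N/df)) is one common choice among several.

```python
import math
import random
import re
from collections import Counter


def tokenize(heading):
    # Lowercase the heading and keep runs of letters only.
    return re.findall(r"[a-z]+", heading.lower())


def crude_stem(term):
    # ASSUMPTION: stand-in for the Porter stemmer; strips a few suffixes only.
    for suffix in ("ing", "es", "s"):
        if term.endswith(suffix) and len(term) - len(suffix) >= 3:
            return term[: -len(suffix)]
    return term


def tfidf_vectors(headings):
    # One list of stemmed terms per heading.
    docs = [[crude_stem(t) for t in tokenize(h)] for h in headings]
    # The sorted list of distinct terms defines the n dimensions.
    vocab = sorted({t for d in docs for t in d})
    # Document frequency: number of headings that contain each term.
    df = Counter(t for d in docs for t in set(d))
    vectors = []
    for d in docs:
        tf = Counter(d)
        if not d:
            vectors.append([0.0] * len(vocab))
            continue
        vectors.append(
            [(tf[t] / len(d)) * math.log(len(docs) / df[t]) for t in vocab]
        )
    return vocab, vectors


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(vectors, k, max_iterations=100, seed=0):
    # Pick k distinct vectors as the initial cluster seeds.
    rng = random.Random(seed)
    centers = [list(v) for v in rng.sample(vectors, k)]
    assignments = None
    for iteration in range(1, max_iterations + 1):
        # Assignment step: nearest center for every vector.
        new_assignments = [
            min(range(k), key=lambda c: squared_distance(v, centers[c]))
            for v in vectors
        ]
        if new_assignments == assignments:
            break  # assignments are stable, so the centers are constant
        assignments = new_assignments
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [v for v, a in zip(vectors, assignments) if a == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignments, centers, iteration


# Hypothetical sample headings, for illustration only.
headings = [
    "Government announces new budget",
    "Budget debate in parliament continues",
    "National team wins football match",
    "Football league match postponed",
]
vocab, vectors = tfidf_vectors(headings)
assignments, centers, iterations = kmeans(vectors, k=2)
print(assignments, iterations)
```

The third value returned by kmeans is the iteration at which the assignments stopped changing, which is the quantity the analysis below relates to running time.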
The efficiency of the K-means clustering algorithm was analyzed for different values of the initial
number of cluster seeds (K) and different numbers of iterations (I). The analysis uses seven days of news
data. The experiment shows that the result is efficient with 12 initial cluster seeds (K=12).
The number of iterations needed for the cluster centers to become constant
depends on the size of the data set, and the running time is directly proportional to the
number of iterations and the number of initial cluster seeds.
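The stated proportionality agrees with the standard cost estimate for K-means, assuming each iteration compares every vector with every cluster center. The symbol N (the number of headings being clustered) is introduced here for illustration; K, I and n are as defined above.

```latex
T_{\text{K-means}} = O(I \cdot K \cdot N \cdot n)
```

With N and n fixed, this estimate grows linearly in both I and K.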
Keywords: Data Mining, Information Extraction, Document Preprocessing, Porter Stemming
Algorithm, TF-IDF, K-means Clustering Algorithm