News clustering system based on text mining

Shahi, Deni

Please use this identifier to cite or link to this item: https://elibrary.tucl.edu.np/handle/123456789/17274

Full metadata record

DC Field	Value	Language
dc.contributor.author	Shahi, Deni	-
dc.date.accessioned	2023-05-23T07:03:24Z	-
dc.date.available	2023-05-23T07:03:24Z	-
dc.date.issued	2016	-
dc.identifier.uri	https://elibrary.tucl.edu.np/handle/123456789/17274	-
dc.description.abstract	Data mining is the process of analyzing data from different perspectives and summarizing it into useful information. This dissertation, entitled “News Clustering System based on Text Mining”, is an application of data mining in which similar English-language articles from different newspapers are grouped together. In this work, documents are retrieved from different newspapers’ sites using a crawler (Information Extraction, IE), and document preprocessing is then applied. A parser splits the data into article headings and their corresponding links; the headings are then split into individual terms, and a list of distinct terms is maintained. The Porter stemming algorithm is then applied to the collection of distinct terms. Stemming reduces the vocabulary size (i.e., the number of terms). The TF-IDF of each heading is calculated. This process represents each article’s content and heading in an n-dimensional vector space (n is the number of distinct terms in the article). Finally, the K-means algorithm is used to group the news. The efficiency of the K-means clustering algorithm is analyzed for different numbers of initial cluster seeds (K) and different numbers of iterations (I). The analysis uses seven days of news data. The experimental results show that clustering is efficient with 12 initial cluster seeds (K=12). The number of iterations needed for the cluster centers to stabilize in K-means clustering depends on the size of the data set, and running time is directly proportional to the number of iterations and the number of initial cluster seeds. Keywords: Data Mining, Information Extraction, Document Preprocessing, Porter Stemming Algorithm, TF-IDF, K-means Clustering Algorithm	en_US
dc.language.iso	en_US	en_US
dc.publisher	Department of Computer Science and Information Technology	en_US
dc.subject	Data Mining	en_US
dc.subject	Information extraction	en_US
dc.subject	Porter stemming algorithm	en_US
dc.subject	TF-IDF, K-means	en_US
dc.subject	Clustering algorithm	en_US
dc.title	News clustering system based on text mining	en_US
dc.type	Thesis	en_US
local.institute.title	Central Department of Computer Science and Information Technology	en_US
local.academic.level	Masters	en_US
Appears in Collections:	Computer Science & Information Technology
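
The pipeline outlined in the abstract (split headings into terms, stem them, compute TF-IDF vectors, group with K-means) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the thesis implementation: the function names, the sample headings, and k=2 are hypothetical, and `crude_stem` is a simplified suffix stripper standing in for the full Porter stemming algorithm.

```python
import math
import random
from collections import Counter


def crude_stem(term):
    # Stand-in for the Porter stemmer (assumption): strips a few common
    # suffixes only, keeping at least three characters of the term.
    for suffix in ("ing", "ed", "es", "s"):
        if term.endswith(suffix) and len(term) - len(suffix) >= 3:
            return term[: -len(suffix)]
    return term


def tokenize(heading):
    # Lowercase, split on whitespace, keep alphabetic terms, then stem.
    return [crude_stem(t) for t in heading.lower().split() if t.isalpha()]


def tfidf_vectors(headings):
    # Represent each heading as a vector over the distinct stemmed terms.
    docs = [tokenize(h) for h in headings]
    vocab = sorted({t for d in docs for t in d})
    n_docs = len(docs)
    df = Counter(t for d in docs for t in set(d))  # document frequency
    vectors = []
    for d in docs:
        tf = Counter(d)
        vectors.append([
            (tf[t] / len(d)) * math.log(n_docs / df[t]) if d else 0.0
            for t in vocab
        ])
    return vocab, vectors


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(vectors, k, max_iterations=100, seed=0):
    # Plain K-means: pick k seed vectors, then alternate assignment and
    # center updates until the assignments stop changing.
    rng = random.Random(seed)
    centers = [list(v) for v in rng.sample(vectors, k)]
    labels = [None] * len(vectors)
    iteration = 0
    for iteration in range(1, max_iterations + 1):
        new_labels = [
            min(range(k), key=lambda c: squared_distance(v, centers[c]))
            for v in vectors
        ]
        if new_labels == labels:
            break
        labels = new_labels
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:  # keep the old center if a cluster is empty
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, iteration


# Hypothetical sample headings, for illustration only.
headings = [
    "Government announces new budget",
    "Budget debate continues in parliament",
    "Football team wins national cup",
    "National football cup final tonight",
]
vocab, vectors = tfidf_vectors(headings)
labels, iterations = kmeans(vectors, k=2)
print(labels, iterations)
```

The number of seeds `k` and the iteration cap correspond to the K and I parameters whose effect the abstract says was analyzed; the values used here are arbitrary.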

Files in This Item:

File	Description	Size	Format
Full Thesis.pdf		1.32 MB	Adobe PDF
