Browsing by Subject "Data Mining"
Now showing 1 - 4 of 4
Item: Comparative Analysis of Random Forest and Logistic Regression for Diagnosis of Diabetes Mellitus (Department of Computer Science and Information Technology, 2019-06) Pandey, Madhu
In our daily life there is a great deal of data in many different fields, and wherever there is data there are patterns, information, and meaning to be found. The process of extracting, or "mining", knowledge from large amounts of data is called data mining, and is also known as knowledge discovery from data (KDD). Data mining applications have received considerable attention because of the significance of classification algorithms. Diabetes Mellitus (DM) is a result of poor metabolism; if not controlled, it causes several complications and affects other parts of the body. This study surveys two different classifiers on a data set of Diabetes Mellitus patients, implementing and comparing the Random Forest and Logistic Regression classification techniques to help standardize the diagnosis and treatment of Diabetes Mellitus. The result analysis showed that Logistic Regression classified 81.17% of the data correctly, outperforming Random Forest on the evaluation metrics (Accuracy, Precision, Recall and F-Measure). In short, the experimental results showed that Logistic Regression achieved about 2% better accuracy than Random Forest for the diagnosis of diabetes mellitus.

Item: Comparative Analysis of Decision Tree Methods for the Prediction of Paddy Productivity (Department of Computer Science, 2019-11) Bhatt, Chaturbhuj
Data mining applications have received considerable attention because of the significance of classification algorithms. Agricultural data is difficult to study; the challenge from a research perspective is to identify the key attributes that determine paddy performance across different farming situations such as geographic location, soil type, and seasonal conditions.
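Both of the comparative studies above evaluate classifiers with Accuracy, Precision, Recall and F-Measure. As a minimal sketch of how these metrics are computed from confusion-matrix counts (the counts below are hypothetical and not taken from either study):

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute Accuracy, Precision, Recall and F-Measure from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    # F-Measure is the harmonic mean of precision and recall.
    f_measure = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f_measure

# Hypothetical counts for illustration only:
acc, prec, rec, f1 = classification_metrics(tp=60, fp=15, fn=10, tn=85)
```

Comparing two classifiers then reduces to computing this tuple for each and ranking them metric by metric, as both studies do.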
This study surveys two different decision tree algorithms on a primary data set collected in Kanchanpur district, implementing and comparing the J48 and SimpleCart decision tree methods to predict paddy production. The result analysis showed that SimpleCart classified 80.198% of the data correctly, outperforming J48 on the evaluation metrics (Accuracy, Precision, Recall and F-Measure). In short, the experimental results showed that J48 produces a smaller tree than SimpleCart, but SimpleCart achieved 1.9802% better accuracy than J48 for the prediction of paddy productivity.

Item: News Clustering System Based on Text Mining (Department of Computer Science and Information Technology, 2016) Shahi, Deni
Data mining is the process of analyzing data from different perspectives and summarizing it into useful information. This dissertation, entitled "News Clustering System Based on Text Mining", is an implementation of data mining in which similar English-language articles from different newspapers are grouped together. In this work, documents are retrieved from different newspapers' sites by a crawler, i.e. Information Extraction (IE), and document preprocessing is then applied. A parser splits the data into article headings and their corresponding links; the headings are split into individual terms, and a list of distinct terms is maintained. The Porter stemming algorithm is then applied to the distinct-term collection; stemming minimizes the vocabulary size (i.e. the number of terms). The TF-IDF of each heading is calculated, which represents each heading and its content as an n-dimensional vector, where n is the number of distinct terms in the articles. Finally, the K-means algorithm is applied to group the news.
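The pipeline described above (distinct terms, TF-IDF vectors, K-means grouping) can be sketched in miniature. This is a simplified illustration, not the dissertation's implementation: the headlines are invented, stemming is omitted, and the initial centers are simply the first K vectors rather than random seeds:

```python
import math

def tfidf_vectors(headlines):
    # Tokenize each heading into lower-case terms and build the distinct-term vocabulary.
    docs = [h.lower().split() for h in headlines]
    vocab = sorted({t for doc in docs for t in doc})
    n = len(docs)
    df = {t: sum(1 for doc in docs if t in doc) for t in vocab}
    # Each heading becomes an n-dimensional TF-IDF vector over the vocabulary.
    return [[(doc.count(t) / len(doc)) * math.log(n / df[t]) for t in vocab]
            for doc in docs]

def kmeans(vectors, k, iters=20):
    # Simplified deterministic init: the first k vectors serve as initial centers.
    centers = [list(v) for v in vectors[:k]]
    dist2 = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b))
    for _ in range(iters):
        # Assign every vector to its nearest center, then recompute each center
        # as the mean of its assigned vectors.
        labels = [min(range(k), key=lambda c: dist2(v, centers[c])) for v in vectors]
        for c in range(k):
            members = [v for v, l in zip(vectors, labels) if l == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return [min(range(k), key=lambda c: dist2(v, centers[c])) for v in vectors]

headlines = [
    "nepal wins cricket match",
    "parliament passes budget bill",
    "cricket team wins series",
    "budget bill passes vote",
]
labels = kmeans(tfidf_vectors(headlines), k=2)
# The two cricket headings land in one cluster, the two budget headings in the other.
```

The IDF factor down-weights terms shared by many headings, so clustering is driven by the distinctive vocabulary of each topic.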
The efficiency of the K-means clustering algorithm has been analyzed for different values of the initial number of cluster seeds (K) and different numbers of iterations (I). The result analysis covers seven days of news data. The experiment shows that clustering is efficient with 12 initial cluster seeds (K=12); the number of iterations needed for the cluster centers to become constant in K-means depends on the number of data sets, and the running time is directly proportional to the number of iterations and the number of initial cluster seeds.
Keywords: Data Mining, Information Extraction, Document Preprocessing, Porter Stemming Algorithm, TF-IDF, K-means Clustering Algorithm

Item: Performance Analysis of Attribute Selection Methods in Decision Tree Induction (Department of Computer Science & Information Technology, 2018) Yogi, Ganesh
Decision tree learning algorithms have been used successfully to capture knowledge in expert systems. The main task performed in these systems is applying inductive methods to the given attribute values of an unknown object to determine its classification according to decision tree rules. The decision tree is one of the most effective ways to represent and evaluate the performance of algorithms, owing to several attractive features: simplicity, comprehensibility, the absence of parameters, and the ability to handle mixed-type data. Many decision tree algorithms are available, including ID3, C4.5, CART, CHAID, QUEST, GUIDE, CRUISE, and CTREE. In this paper, I have used the attribute selection methods ID3, C4.5 and CART with meteorological data collected between 2004 and 2008 from the city of Kathmandu, Nepal. A data model for the meteorological data was developed and used to train decision trees with each of these attribute selection methods, and their performance was compared using standard performance metrics. Cross-fold validation was performed to test the built model, i.e. the decision tree.
Specifically, 10-fold cross-validation partitions the dataset into 10 partitions and uses 90% of the data for training and 10% for testing; this testing is repeated ten times. The experimental results show that the CART decision tree is slightly more accurate on a large dataset than the other algorithms, ID3 and C4.5. In terms of speed, C4.5 is better than the other two algorithms. The CART decision tree has an average system accuracy rate of 80.9315%, a system error rate of 19.0685%, a precision rate of 83.1%, and a recall rate of 83.1%. Similarly, the C4.5 decision tree has an average system accuracy rate of 80.6849%, a system error rate of 19.3151%, a precision rate of 82%, and a recall rate of 84.4%. The ID3 decision tree has an average system accuracy rate of 28.08%, a system error rate of 4.08%, a precision rate of 89.4%, and a recall rate of 91.3%. From a running-time perspective, C4.5 completes in 0.05 seconds and ID3 in 0.32 seconds, whereas CART takes 251.82 seconds.
Keywords: Data Mining, Classification, Classifier, ID3, C4.5, CART, Supervised Learning, Unsupervised Learning, Decision Tree, Information Gain, Gain Ratio, Gini Index.
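The 10-fold protocol described in this last abstract (10 partitions, 90% training / 10% testing, ten repetitions, averaged accuracy) can be sketched generically. This is a minimal illustration, not the study's WEKA setup: the data are invented, and a hypothetical majority-class baseline stands in for the actual ID3/C4.5/CART learners:

```python
import random
from collections import Counter

def ten_fold_cv(data, train_fn, predict_fn, k=10, seed=0):
    # Shuffle indices, then split them into k near-equal partitions (folds).
    rng = random.Random(seed)
    idx = list(range(len(data)))
    rng.shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    accuracies = []
    for i in range(k):
        # Fold i is the 10% test split; the remaining folds form the 90% training split.
        test = [data[j] for j in folds[i]]
        train = [data[j] for f in folds[:i] + folds[i + 1:] for j in f]
        model = train_fn(train)
        correct = sum(1 for x, y in test if predict_fn(model, x) == y)
        accuracies.append(correct / len(test))
    # Average accuracy over the ten repetitions.
    return sum(accuracies) / k

# Placeholder learner: always predicts the majority class of the training split.
def train_majority(train):
    return Counter(y for _, y in train).most_common(1)[0][0]

def predict_majority(model, x):
    return model

# Hypothetical toy data: 70 "rain" days and 30 "dry" days.
data = [(i, "rain") for i in range(70)] + [(i, "dry") for i in range(30)]
mean_acc = ten_fold_cv(data, train_majority, predict_majority)  # averages to about 0.70
```

Swapping `train_fn`/`predict_fn` for real tree learners reproduces the comparison protocol: each algorithm is trained and tested on the same ten splits, and its per-fold accuracies are averaged into the reported system accuracy rate.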