A RULE BASED STEMMER FOR NEPALI
Publisher
Pulchowk Campus
Abstract
Stemming is an integral part of Natural Language Processing. It’s a preprocessing
step in almost every NLP application. Arguably, the most important usage of
stemming is in Information Retrieval. While a great deal of work has been done on
stemming in languages like English, Nepali stemming has seen only a few notable
works. This study focuses on creating a rule-based stemmer for Nepali text.
Specifically, it is an affix-stripping system that identifies two different types of suffixes
in Nepali grammar and strips them separately. Only a single negation prefix, न, is
identified and stripped. The study applies a number of techniques, such as exception
word identification, morphological normalization, word transformation and stemming
limit enforcement, to increase stemming performance. The stemmer is also tested
intrinsically using Paice’s method and extrinsically on a basic tf-idf based IR system.
Upon testing, the under-stemming error was found to be 5.27% and the over-stemming error 0.2%, which is better than existing works. The IR system was tested on stemmed versus non-stemmed documents and queries using 14 queries, and the stemming scheme was found to increase the average relevance of retrieved documents by 18.6%.
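The affix-stripping idea summarized above can be illustrated with a minimal Python sketch. It is not the stemmer developed in the study: the suffix lists, the grouping into two suffix types, the exception words, and the stemming limit of two code points are all assumptions chosen for illustration, since the abstract does not list the actual rules. Morphological normalization and word transformation are omitted.

```python
# Illustrative rules only: the study's actual suffix lists, exception words,
# and stemming limit are not given in the abstract. All values are assumed.
CASE_SUFFIXES = ("लाई", "बाट", "ले", "को", "मा")  # assumed first suffix type
PLURAL_SUFFIXES = ("हरू",)                         # assumed second suffix type
EXCEPTION_WORDS = {"नदी", "नाम"}                   # words returned unchanged
NEGATION_PREFIX = "न"
MIN_STEM_LENGTH = 2  # assumed stemming limit, in Unicode code points


def strip_suffix(word, suffixes):
    """Strip the longest matching suffix if the remainder respects the limit."""
    for suffix in sorted(suffixes, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM_LENGTH:
            return word[: -len(suffix)]
    return word


def stem(word):
    """Return a stem for a single Nepali word using the assumed rules above."""
    # Exception words look like affixed forms but must not be stripped.
    if word in EXCEPTION_WORDS:
        return word
    # Strip the two suffix types separately, one pass each.
    word = strip_suffix(word, CASE_SUFFIXES)
    word = strip_suffix(word, PLURAL_SUFFIXES)
    # Strip the negation prefix only if enough of the word remains.
    if word.startswith(NEGATION_PREFIX) and len(word) - 1 >= MIN_STEM_LENGTH:
        word = word[1:]
    return word
```

Under these assumed rules, stem("केटाहरूले") first drops ले and then हरू, leaving केटा, while नदी is protected by the exception list and को is left intact by the stemming limit.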