Foreign Word Extraction in Nepali Texts

Publisher

Department of Computer Science and Information Technology

Abstract

Foreign words, mostly transliterations of English words, are frequently used in Nepali text. They are usually important index terms in information retrieval, since most of them are technical terms or names, so accurate foreign word extraction is important for high retrieval performance. In this study we present a foreign word extraction method for Nepali text documents. To extract foreign words accurately, we developed a framework based on rule-based syllabification. The performance analysis considers several factors, including known words, unknown words, and the size of the training data. The supervised rule-based syllabification approach is limited because Nepali and English words can share the same syllable structure, and because it uses a small dictionary, which affects its performance. The method was evaluated on over 12,000 syllabified words taken from different daily online news sites, using precision and recall as the measures. In this dissertation, we present a syllabification algorithm for the Nepali language. Syllabification is the task of identifying the syllables in a word. Correct syllabification rules and algorithms are mainly used in text-to-speech systems to improve the naturalness of the synthesized speech. We propose an algorithm based on syllable rule matching, which achieved a precision of 83% and a recall of 63%.
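For readers unfamiliar with the two measures reported above, the following is a minimal sketch of how precision and recall could be computed for a foreign word extractor. The function name `precision_recall` and the toy word lists are hypothetical and used only for illustration; they are not taken from the dissertation's data or code.

```python
def precision_recall(extracted, gold):
    """Return (precision, recall) for a set of extracted words.

    extracted: words the extractor labelled as foreign (hypothetical output)
    gold:      words a human annotator marked as foreign (hypothetical reference)
    """
    extracted, gold = set(extracted), set(gold)
    true_positives = len(extracted & gold)
    # Precision: share of extracted words that are really foreign.
    precision = true_positives / len(extracted) if extracted else 0.0
    # Recall: share of the truly foreign words that were extracted.
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Toy example (assumed data, not from the study).
gold_foreign = {"कम्प्युटर", "इन्टरनेट", "मोबाइल", "स्कूल"}
extracted = {"कम्प्युटर", "इन्टरनेट", "किताब"}

p, r = precision_recall(extracted, gold_foreign)
print(f"precision={p:.2f} recall={r:.2f}")
```

In this toy case two of the three extracted words are correct and two of the four foreign words are found. A recall noticeably lower than precision, as in the reported 83% and 63%, means the extractor misses more foreign words than it wrongly flags.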
