Text similarity using corpus based semantic word similarity and string similarity for short Nepali texts
Publisher
Department of Computer Science and Information Technology
Abstract
Similarity measures for long texts and documents have been studied for a long time, but
similarity measures for short texts have received little attention. Similarity measures for
short texts and sentences are now considered an important research topic because of their
many applications in natural language processing and information retrieval. The
need to determine the semantic similarity, or semantic distance, between two lexically
expressed concepts is a problem that pervades much of natural language processing. This
thesis deals with one of the major interests of information retrieval: textual similarity. It
includes the study and implementation of a short-text similarity measure for the Nepali
language. Semantic text similarity has not yet been studied for Nepali text. This thesis deals with two
main challenges. The first is to determine the similarity of two short texts that use different
lexical terms, and the second is to determine semantic similarity with the help of string
similarity, so that minor spelling mistakes in the words of a sentence are tolerated. Such
measures matter most in web retrieval, since users may not always give the right
spelling for the words. The Nepali language is written in the Devanagari script and has its
own distinct literature. This thesis includes the implementation and analysis of string
similarity measures (a modified version of Longest Common Subsequence and string edit
distance) and a corpus-based word similarity measure (Second Order Co-occurrence Pointwise
Mutual Information) for overall semantic text similarity. The integration of the word
similarity measure and the string similarity measures has also been improved.
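
To make the string similarity component concrete, the following is a minimal Python sketch of the two textbook measures named above: longest common subsequence and string edit distance (Levenshtein). It is not the modified version used in the thesis, which the abstract does not specify. The function names, the squared-length normalization of the LCS score, and the per-code-point comparison are assumptions made for illustration. In particular, Python strings iterate over Unicode code points, so a Devanagari vowel sign counts as its own symbol here.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b, by code point."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        curr = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            curr.append(min(prev[j] + 1,          # delete from a
                            curr[j - 1] + 1,      # insert into a
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def normalized_lcs(a: str, b: str) -> float:
    """LCS length squared over the product of lengths (assumed normalization)."""
    if not a or not b:
        return 0.0
    n = lcs_length(a, b)
    return n * n / (len(a) * len(b))


def edit_similarity(a: str, b: str) -> float:
    """Map edit distance to a score in [0, 1]; 1.0 means identical strings."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(a, b) / longest


if __name__ == "__main__":
    # A correctly spelled word and a close variant, as a user might type it.
    print(normalized_lcs("नेपाल", "नेपाली"))
    print(edit_similarity("नेपाल", "नेपाली"))
```

Both scores fall between 0 and 1, which makes them straightforward to combine with a corpus-based word similarity score. How the thesis weights and integrates the two is not stated in the abstract.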