Creation of parallel corpus from comparable corpus (for English-Nepali language pair)

Pant, Hari Prashad

Files

Full thesis.pdf (1.48 MB)

Date

2011

Authors

Pant, Hari Prashad

Publisher

Department of Computer Science and Information Technology

Abstract

Statistical Machine Translation (SMT) systems are badly needed in multilingual countries like Nepal. However, one of the major bottlenecks in developing SMT systems for many language pairs is the lack of bilingual parallel data for training them. Such data, called a parallel corpus, pairs each source-language sentence with a more or less exact translation in the target language. Parallel corpora are available for relatively few language pairs, in few domains, and in limited sizes. Constructing them manually for different language pairs and domains, in sufficient size and quality, is costly in both human effort and money. Parallel corpora may be a scarce resource, but comparable corpora are a rich, diverse resource readily available in several domains and language pairs. A comparable corpus consists of a set of documents in two different languages that are not exact translations of each other but contain related and similar information on the same topic. Such texts can be found in large quantities on the Web; good examples are online news agencies such as CNN and BBC. This dissertation proposes a method that exploits comparable corpora to extract parallel data from them automatically. The proposed method first tokenizes the documents at the paragraph level and then obtains candidate target sentences for each source sentence using a sentence-length-based method. The best match among the candidate sentences is then selected using a bilingual dictionary. It has been observed that the quality of the bilingual dictionary and the number of words it contains improve the accuracy of the model for creating a parallel corpus from comparable corpora.
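The two-stage idea in the abstract (length-based candidate selection followed by dictionary-based matching) can be illustrated with a minimal Python sketch. This is not the thesis's implementation: the abstract does not specify the length criterion or the scoring function, so the word-count ratio bounds (`low`, `high`), the overlap score, the `min_score` threshold, and all function names here are assumptions chosen for illustration only.

```python
from typing import Dict, List, Optional, Set, Tuple


def length_candidates(src: str, targets: List[str],
                      low: float = 0.5, high: float = 2.0) -> List[str]:
    """Keep targets whose word-count ratio to the source lies in [low, high].

    The ratio bounds are illustrative assumptions, not values from the thesis.
    """
    n_src = len(src.split())
    if n_src == 0:
        return []
    return [t for t in targets
            if t.split() and low <= len(t.split()) / n_src <= high]


def dictionary_score(src: str, tgt: str,
                     lexicon: Dict[str, Set[str]]) -> float:
    """Fraction of source words with at least one listed translation in tgt."""
    src_words = src.lower().split()
    if not src_words:
        return 0.0
    tgt_words = set(tgt.split())
    hits = sum(1 for w in src_words if lexicon.get(w, set()) & tgt_words)
    return hits / len(src_words)


def best_match(src: str, targets: List[str], lexicon: Dict[str, Set[str]],
               min_score: float = 0.5) -> Optional[Tuple[str, float]]:
    """Return the best-scoring length-compatible target, or None if none qualifies."""
    scored = [(t, dictionary_score(src, t, lexicon))
              for t in length_candidates(src, targets)]
    scored = [pair for pair in scored if pair[1] >= min_score]
    return max(scored, key=lambda pair: pair[1]) if scored else None


# Toy English-Nepali lexicon (hypothetical, for demonstration only).
toy_lexicon = {
    "water": {"पानी"},
    "house": {"घर"},
    "cold": {"चिसो"},
    "is": {"छ"},
}
```

In this sketch a larger or more accurate lexicon raises the overlap score of true translation pairs relative to unrelated sentences, which mirrors the abstract's observation that dictionary quality and size affect accuracy.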