Please use this identifier to cite or link to this item: https://elibrary.tucl.edu.np/handle/123456789/20597
Title: Creation of parallel corpus from comparable corpus (for English-Nepali language pair)
Authors: Pant, Hari Prashad
Keywords: Parallel corpus; Translation system; Statistical machine
Issue Date: 2011
Publisher: Department of Computer Science and Information Technology
Institute Name: Central Department of Computer Science and Information Technology
Level: Masters
Abstract: Statistical Machine Translation (SMT) systems are badly needed in multilingual countries such as Nepal. However, one of the major bottlenecks in developing SMT systems for many language pairs is the lack of bilingual parallel data for training them. Such data, called a parallel corpus, pairs each source-language sentence with a more or less exact translation in the target language. Parallel corpora are available for relatively few language pairs, in few domains, and in limited size. Constructing them manually for different language pairs and domains, at sufficient size and quality, is costly in both human effort and money. While parallel corpora are a scarce resource, comparable corpora are a rich and diverse resource, readily available in many domains and language pairs. A comparable corpus consists of a set of documents in two languages that are not exact translations of each other but contain related and similar information on the same topic. Such texts can be found in large quantities on the Web; good examples are online news agencies such as CNN and BBC. This dissertation proposes a method that exploits comparable corpora to extract parallel data from them automatically. The method first tokenizes the documents at the paragraph level and then obtains candidate target sentences for each source sentence using a sentence-length-based method. The best match among the candidates is then selected using a bilingual dictionary. It has been observed that the quality of the bilingual dictionary and the number of words it contains improve the accuracy of the model for creating a parallel corpus from comparable corpora.
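The two-stage matching described in the abstract (length-based candidate selection followed by a bilingual-dictionary match) can be illustrated with a minimal Python sketch. This is not the thesis implementation: the abstract does not state how sentence length is measured, what length bounds are used, or how dictionary matches are scored. The word-count ratio, the bounds 0.5 and 2.0, the coverage score, the 0.5 threshold, the function names, and the romanized toy lexicon below are all assumptions made for illustration.

```python
from typing import Dict, List, Optional, Set


def length_candidates(src: str, targets: List[str],
                      min_ratio: float = 0.5,
                      max_ratio: float = 2.0) -> List[str]:
    """Keep target sentences whose word count is close to the source's.

    Assumption: length is measured in whitespace-separated words and the
    ratio bounds are arbitrary illustrative values.
    """
    src_len = len(src.split())
    if src_len == 0:
        return []
    kept = []
    for tgt in targets:
        tgt_len = len(tgt.split())
        if tgt_len and min_ratio <= tgt_len / src_len <= max_ratio:
            kept.append(tgt)
    return kept


def dictionary_score(src: str, tgt: str,
                     lexicon: Dict[str, Set[str]]) -> float:
    """Fraction of source words with at least one listed translation in tgt."""
    src_words = src.lower().split()
    if not src_words:
        return 0.0
    tgt_words = set(tgt.split())
    hits = sum(1 for w in src_words if lexicon.get(w, set()) & tgt_words)
    return hits / len(src_words)


def best_match(src: str, targets: List[str],
               lexicon: Dict[str, Set[str]],
               threshold: float = 0.5) -> Optional[str]:
    """Return the best-scoring length candidate, or None if below threshold."""
    best, best_score = None, 0.0
    for tgt in length_candidates(src, targets):
        score = dictionary_score(src, tgt, lexicon)
        if score > best_score:
            best, best_score = tgt, score
    return best if best_score >= threshold else None


# Toy data: romanized placeholder tokens, not a real bilingual dictionary.
toy_lexicon = {"i": {"ma"}, "eat": {"khanchu"}, "rice": {"bhat"}}
toy_targets = ["ma bhat khanchu", "aaja mausam ramro cha", "ma"]
match = best_match("I eat rice", toy_targets, toy_lexicon)
```

In this toy run the one-word target is discarded by the length filter, and the remaining two candidates are ranked by dictionary coverage. A larger lexicon lets more source words find a translation, which is consistent with the abstract's observation that dictionary size and quality affect accuracy.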
URI: https://elibrary.tucl.edu.np/handle/123456789/20597
Appears in Collections: Computer Science & Information Technology

Files in This Item:
File: Full thesis.pdf
Size: 1.52 MB
Format: Adobe PDF


Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.