Creation of parallel corpus from comparable corpus (for English-Nepali language pair)
Date
2011
Publisher
Department of Computer Science and Information Technology
Abstract
Multilingual countries such as Nepal have a pressing need for Statistical Machine
Translation systems. However, one of the major bottlenecks in developing such systems
for different language pairs is the lack of bilingual parallel data for training them.
Such parallel data contains more or less exact translations of source-language sentences
into target-language sentences; this is what we call a parallel corpus. Parallel corpora
are available for relatively few language pairs, in few domains, and in limited size.
Manually constructing parallel data of sufficient size and quality for different language
pairs and domains is very costly in terms of both human effort and money.
Parallel corpora may be a scarce resource, but comparable corpora are a rich, diverse
resource readily available in several domains and language pairs. A comparable corpus
consists of a set of documents in two different languages that are not exact translations
of each other but contain related and similar information on the same topic. Such texts
can be found in large quantities on the Web; good examples are online news agencies such
as CNN and BBC.
In this dissertation, a method is proposed that exploits comparable corpora to extract
parallel data from them automatically. The proposed method first tokenizes the documents
at the paragraph level; candidate target sentences for each source sentence are then
obtained using a sentence-length-based method. After that, the best match among the
candidate sentences is selected using a bilingual dictionary. It has been observed that
the quality of the bilingual dictionary and the number of words it contains improve the
accuracy of the model for creating a parallel corpus from comparable corpora.
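
The two-stage idea described above (length-based candidate filtering, then
dictionary-based selection) can be illustrated with a minimal Python sketch. This is not
the dissertation's implementation: the function names, the length-ratio bounds (0.5 to
2.0), the overlap score, the 0.5 acceptance threshold, and the romanized toy dictionary
are all assumptions chosen only for illustration.

```python
def length_candidates(src, targets, low=0.5, high=2.0):
    """Keep target sentences whose token-count ratio to src lies in [low, high].

    The ratio bounds are illustrative assumptions, not values from the dissertation.
    """
    n = len(src.split())
    if n == 0:
        return []
    return [t for t in targets if low <= len(t.split()) / n <= high]


def dictionary_score(src, tgt, bilingual_dict):
    """Fraction of source tokens with at least one dictionary translation in tgt."""
    src_tokens = src.lower().split()
    tgt_tokens = set(tgt.lower().split())
    if not src_tokens:
        return 0.0
    hits = sum(1 for w in src_tokens if bilingual_dict.get(w, set()) & tgt_tokens)
    return hits / len(src_tokens)


def align(src_sents, tgt_sents, bilingual_dict, threshold=0.5):
    """Pair each source sentence with its best-scoring length-compatible target."""
    pairs = []
    for s in src_sents:
        cands = length_candidates(s, tgt_sents)
        if not cands:
            continue
        best = max(cands, key=lambda t: dictionary_score(s, t, bilingual_dict))
        # Accept the pair only if enough source words are covered (assumed threshold).
        if dictionary_score(s, best, bilingual_dict) >= threshold:
            pairs.append((s, best))
    return pairs


# Hypothetical toy data: English source, romanized Nepali target.
toy_dict = {
    "i": {"ma"}, "eat": {"khanchu"}, "rice": {"bhat"},
    "he": {"u"}, "goes": {"jancha"}, "home": {"ghar"},
}
english = ["i eat rice", "he goes home"]
nepali = ["ma bhat khanchu", "u ghar jancha", "aaja mausam ramro cha"]

pairs = align(english, nepali, toy_dict)
for src, tgt in pairs:
    print(src, "<->", tgt)
```

In this sketch, a larger or more accurate dictionary raises the overlap score of true
translations relative to unrelated sentences, which is consistent with the observation
that dictionary quality and coverage affect accuracy.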
Keywords
Parallel corpus, Translation system, Statistical machine