Attention And WaveNet Vocoder Based Nepali Text-To-Speech Synthesis
Publisher
Pulchowk Campus
Abstract
Since the rise of artificial intelligence, researchers in audio AI have worked to
make text-to-speech systems sound more natural, with the goal of building speech
synthesis networks that reach human-level voice quality. The main objective of
text-to-speech synthesis is to generate spoken language from written text, and
such networks have broad scope in the field of human-computer interaction.
Research on deep learning has shown that near-human, natural-sounding speech can
be inferred from input text. This work presents an end-to-end Nepali speech
synthesis network that uses an encoder-decoder architecture conditioned on
attention mechanisms, followed by WaveNet as the vocoder. The RNN-based
sequence-to-sequence feature prediction network maps the input character
embeddings into a latent space representation, which is decoded into a
mel-spectrogram. The mel-spectrogram is then converted into an audio waveform by
a WaveNet vocoder model trained to synthesize human speech. The main challenges
of the work are the need for high computational power and for a large amount of
high-quality transcribed audio. Here the network is trained on the Nepali speech
dataset from OpenSLR, which has 157,000 utterances totaling 165 hours from 527
speakers. The synthesized speech is clear and can be understood by the listener.
Its quality was evaluated by listening, i.e. by a Mean Opinion Score (MOS) test.
The synthesized speech samples attained an MOS of 3.07 when 40 samples were rated
by 10 volunteers. The deep neural network can be trained directly from the data
without relying on complex feature engineering, and achieves acceptable audio
quality.
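
The Mean Opinion Score evaluation mentioned in the abstract can be illustrated with a minimal sketch. It assumes each volunteer gives every sample an integer score from 1 to 5 and that the MOS is the plain mean over all ratings; the abstract states neither the rating scale nor the aggregation, so both are assumptions. The 95% interval (a normal approximation that treats ratings as independent) is an addition not reported in the abstract. The function name `mean_opinion_score` is hypothetical, and the ratings are random placeholders, not the study's data.

```python
import math
import random
import statistics


def mean_opinion_score(ratings):
    """Return (mos, half_width) for a list of per-sample rating lists.

    Assumption: each rating is a score on a 1-5 scale. half_width is the
    half-width of a 95% normal-approximation confidence interval.
    """
    flat = [score for sample in ratings for score in sample]
    if not flat:
        raise ValueError("no ratings supplied")
    if any(not 1 <= score <= 5 for score in flat):
        raise ValueError("ratings must lie on the assumed 1-5 scale")
    mos = statistics.fmean(flat)
    if len(flat) > 1:
        half_width = 1.96 * statistics.stdev(flat) / math.sqrt(len(flat))
    else:
        half_width = 0.0
    return mos, half_width


# Placeholder data shaped like the evaluation in the abstract:
# 40 samples, each scored by 10 volunteers (400 ratings in total).
rng = random.Random(0)
ratings = [[rng.randint(1, 5) for _ in range(10)] for _ in range(40)]

mos, half_width = mean_opinion_score(ratings)
total = sum(len(sample) for sample in ratings)
print(f"MOS = {mos:.2f} +/- {half_width:.2f} over {total} ratings")
```

With real listener scores in place of the placeholder matrix, the same call would give the pooled MOS; reporting the interval alongside it indicates how much the score could shift with a different panel of listeners.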
Citation
MASTER OF SCIENCE IN COMPUTER SYSTEM AND KNOWLEDGE ENGINEERING