Attention And Wave Net Vocoder Based Nepali Text-To-Speech Synthesis

Basnet, Ashok

Files

Final Report.pdf (2.15 MB)

Date

2021-08

Authors

Basnet, Ashok

Publisher

Pulchowk Campus

Abstract

Since the rise of artificial intelligence, researchers in audio AI have worked to make text-to-speech (TTS) systems sound more natural, with the goal of building networks that synthesize speech at a human level. The main objective of TTS synthesis is to produce spoken language from written text, and such networks have broad scope in human-computer interaction. Research on deep learning has shown that near-human, natural speech can be inferred from input text. This work presents an end-to-end Nepali speech synthesis network that uses an encoder-decoder architecture conditioned on attention mechanisms, followed by WaveNet as the vocoder. The RNN-based sequence-to-sequence feature prediction network maps the input character embeddings into a latent representation, which is decoded into a mel-spectrogram. The mel-spectrogram is then converted into an audio waveform by a WaveNet vocoder trained to synthesize human speech. The main challenges of the work are the need for high computational power and for a large amount of high-quality transcribed audio. The network is trained on the Nepali speech dataset from OpenSLR, which contains 157,000 utterances totaling 165 hours from 527 speakers. The synthesized speech is clear and can be understood by listeners. Its quality was evaluated by a listening test using the Mean Opinion Score (MOS): 40 synthesized samples rated by 10 volunteers attained a MOS of 3.07. The deep neural network can be trained directly from the data without relying on complex feature engineering, and it achieves acceptable audio quality.
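As a rough illustration of how a Mean Opinion Score such as the one reported above is typically computed, the sketch below averages listener ratings on a 1 to 5 scale and adds a normal-approximation 95% confidence interval. The function name `mean_opinion_score`, the rating scale bounds, the confidence interval, and the randomly generated placeholder ratings are all assumptions made for illustration; they are not the author's evaluation script or data, and the interval treats ratings as independent, which is a simplification.

```python
import math
import random
import statistics


def mean_opinion_score(ratings):
    """Return (MOS, 95% CI half-width) for per-listener lists of 1-5 scores.

    Hypothetical helper: the report's actual evaluation procedure is not
    described in the abstract beyond "40 samples, 10 volunteers".
    """
    # Pool every (listener, sample) rating into one flat list.
    flat = [score for listener in ratings for score in listener]
    if not flat:
        raise ValueError("no ratings supplied")
    if any(not 1 <= score <= 5 for score in flat):
        raise ValueError("ratings must lie on the 1-5 opinion scale")

    mos = statistics.fmean(flat)
    if len(flat) < 2:
        return mos, 0.0
    # Normal-approximation interval; assumes ratings are independent.
    half_width = 1.96 * statistics.stdev(flat) / math.sqrt(len(flat))
    return mos, half_width


# Placeholder ratings only: 10 listeners x 40 samples, randomly generated.
# These are NOT the study's data and will not reproduce its MOS.
rng = random.Random(0)
ratings = [[rng.randint(1, 5) for _ in range(40)] for _ in range(10)]

mos, half_width = mean_opinion_score(ratings)
total = sum(len(listener) for listener in ratings)
print(f"MOS = {mos:.2f} +/- {half_width:.2f} over {total} ratings")
```

With 10 listeners each rating 40 samples, the score is the mean of 400 individual ratings; reporting the interval alongside it gives a sense of how much the score might vary with a different listener panel.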
