Please use this identifier to cite or link to this item: https://elibrary.tucl.edu.np/handle/123456789/7668
Title: Attention And Wave Net Vocoder Based Nepali Text-To-Speech Synthesis
Authors: Basnet, Ashok
Keywords: Text-to-Speech synthesis; Attention Mechanism; WaveNet; Recurrent Neural Network (RNN); Seq-to-Seq; Vocoder
Issue Date: Aug-2021
Publisher: Pulchowk Campus
Institute Name: Institute of Engineering
Level: Masters
Citation: MASTER OF SCIENCE IN COMPUTER SYSTEM AND KNOWLEDGE ENGINEERING
Abstract: Since the advent of artificial intelligence, researchers in audio AI have sought to make text-to-speech systems sound more natural, with the aim of building voice synthesis networks that reach human-level quality. The main objective of text-to-speech synthesis is to generate spoken language from written text. Such networks have broad scope in the field of human-computer interaction. Research on deep learning has shown that near-human-level natural speech can be inferred from input text. This work presents an approach to developing an end-to-end Nepali speech synthesis network using an encoder-decoder architecture with an attention mechanism, followed by WaveNet as the vocoder. The RNN-based seq-to-seq feature prediction network maps input character embeddings into a latent-space representation, which is decoded into a mel-spectrogram. The mel-spectrogram is then converted into an audio waveform by a WaveNet vocoder model trained to synthesize human speech. The main challenges of the work are the need for high computational power and for a large amount of high-quality transcribed audio. The network is trained on the Nepali speech dataset from OpenSLR, which has 157,000 utterances totalling 165 hours from 527 speakers. The synthesized speech is clear and can be understood by the listener. Its quality was evaluated with a listening test (Mean Opinion Score, MOS). The synthesized speech attained a MOS of 3.07 when 40 samples were rated by 10 volunteers. The deep neural network can be trained directly from data without relying on complex feature engineering, and it achieves acceptable audio quality.
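The Mean Opinion Score mentioned in the abstract is the arithmetic mean of listener ratings on a 1-5 scale. The sketch below is illustrative only and is not the author's evaluation script. It assumes every one of the 10 volunteers rated all 40 samples (the abstract does not state this), uses randomly generated placeholder ratings rather than the thesis data, and adds a normal-approximation 95% confidence interval that the abstract does not report. The function name `mean_opinion_score` is hypothetical.

```python
import math
import random
from statistics import mean, stdev


def mean_opinion_score(ratings):
    """Return (MOS, 95% CI half-width) for a flat list of 1-5 ratings."""
    if len(ratings) < 2:
        raise ValueError("need at least two ratings")
    if any(not 1 <= r <= 5 for r in ratings):
        raise ValueError("ratings must lie on the 1-5 scale")
    mos = mean(ratings)
    # Normal-approximation interval: 1.96 * standard error of the mean.
    half_width = 1.96 * stdev(ratings) / math.sqrt(len(ratings))
    return mos, half_width


# Placeholder data (ASSUMPTION): 40 samples x 10 listeners, random scores.
# These are NOT the ratings collected in the thesis.
rng = random.Random(0)
ratings = [rng.randint(1, 5) for _ in range(40 * 10)]

mos, ci = mean_opinion_score(ratings)
print(f"MOS = {mos:.2f} +/- {ci:.2f} (n = {len(ratings)})")
```

Pooling all ratings into one list treats every rating as independent. A more careful analysis might average per listener or per sample first.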
URI: https://elibrary.tucl.edu.np/handle/123456789/7668
Appears in Collections: Electronics and Computer Engineering

Files in This Item:
File: Final Report.pdf
Size: 2.2 MB
Format: Adobe PDF

