Attention And WaveNet Vocoder Based Nepali Text-To-Speech Synthesis

dc.contributor.authorBasnet, Ashok
dc.date.accessioned2022-01-25T09:38:23Z
dc.date.available2022-01-25T09:38:23Z
dc.date.issued2021-08
dc.descriptionSince the evolution of Artificial Intelligence, researchers in the field of audio AI have been trying to make text-to-speech systems sound more natural, working toward voice synthesis networks that approach human-level speech.en_US
dc.description.abstractSince the evolution of Artificial Intelligence, researchers in the field of audio AI have been trying to make text-to-speech systems sound more natural, working toward voice synthesis networks that approach human-level speech. The synthesis of spoken language from written text is the central objective of text-to-speech (TTS) synthesis, and such systems have broad scope in human-computer interaction. Research on deep learning has shown that near human-level natural speech can be inferred from input text. This work presents an end-to-end Nepali speech synthesis network using an encoder-decoder architecture conditioned on an attention mechanism, followed by WaveNet as the vocoder. The RNN-based seq-to-seq feature prediction network maps the input character embeddings into a latent-space representation, which is decoded into a mel-spectrogram representation. The mel-spectrogram is then converted into an audio waveform by a WaveNet vocoder model trained to synthesize human speech. The main challenges of the work are the need for high computational power and a large amount of high-quality transcribed audio. The network is trained on the Nepali speech dataset from OpenSLR, comprising 157,000 utterances (165 hours) from 527 speakers. The synthesized speech is clear and intelligible to listeners. Its quality was evaluated by listening tests using the Mean Opinion Score (MOS): 40 synthesized samples rated by 10 volunteers attained a MOS of 3.07. The deep neural network can be trained directly from data without complex feature engineering and achieves acceptable audio quality.en_US
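The abstract describes a Tacotron-style pipeline: character embeddings are encoded by an RNN, an attention-based RNN decoder predicts mel-spectrogram frames, and a neural vocoder (WaveNet in the thesis) converts those frames to a waveform. The sketch below is a minimal illustration of that encoder-decoder-with-attention stage only, not the thesis code; all module names, layer sizes, and the simple dot-product attention are illustrative assumptions, and the WaveNet vocoder is omitted.

```python
# Minimal sketch of an attention-based seq-to-seq mel-spectrogram predictor.
# NOT the author's implementation: sizes, modules, and the attention form are
# assumptions made for illustration only.
import torch
import torch.nn as nn

class CharEncoder(nn.Module):
    def __init__(self, n_chars=80, emb_dim=256, hid_dim=256):
        super().__init__()
        self.embed = nn.Embedding(n_chars, emb_dim)
        self.rnn = nn.GRU(emb_dim, hid_dim, batch_first=True, bidirectional=True)

    def forward(self, char_ids):                      # (B, T_text)
        x = self.embed(char_ids)                      # (B, T_text, emb_dim)
        out, _ = self.rnn(x)                          # (B, T_text, 2*hid_dim)
        return out

class AttentionDecoder(nn.Module):
    """Predicts one mel frame per step, attending over the encoder outputs."""
    def __init__(self, enc_dim=512, hid_dim=512, n_mels=80):
        super().__init__()
        self.attn_query = nn.Linear(hid_dim, enc_dim)
        self.rnn = nn.GRUCell(enc_dim + n_mels, hid_dim)
        self.mel_out = nn.Linear(hid_dim, n_mels)
        self.n_mels = n_mels

    def forward(self, enc_out, n_frames):
        B = enc_out.size(0)
        h = enc_out.new_zeros(B, self.rnn.hidden_size)
        prev_mel = enc_out.new_zeros(B, self.n_mels)
        mels = []
        for _ in range(n_frames):
            # Dot-product attention over encoder time steps.
            scores = torch.bmm(enc_out, self.attn_query(h).unsqueeze(2))  # (B, T_text, 1)
            weights = torch.softmax(scores, dim=1)
            context = (weights * enc_out).sum(dim=1)                      # (B, enc_dim)
            h = self.rnn(torch.cat([context, prev_mel], dim=1), h)
            prev_mel = self.mel_out(h)
            mels.append(prev_mel)
        return torch.stack(mels, dim=1)               # (B, n_frames, n_mels)

# Example: 2 character sequences of length 30, decoded to 100 mel frames.
enc, dec = CharEncoder(), AttentionDecoder()
chars = torch.randint(0, 80, (2, 30))
mel = dec(enc(chars), n_frames=100)   # these frames would then be fed to a vocoder
print(mel.shape)                      # torch.Size([2, 100, 80])
```

In the thesis, the predicted mel-spectrogram is passed to a separately trained WaveNet vocoder to generate the final audio waveform.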
dc.identifier.citationMASTER OF SCIENCE IN COMPUTER SYSTEM AND KNOWLEDGE ENGINEERINGen_US
dc.identifier.urihttps://hdl.handle.net/20.500.14540/7668
dc.language.isoenen_US
dc.publisherPulchowk Campusen_US
dc.subjectText-to-Speech synthesisen_US
dc.subjectAttention Mechanismen_US
dc.subjectWaveNeten_US
dc.subjectRecurrent Neural Network (RNN)en_US
dc.subjectSeq-to-Seqen_US
dc.subjectVocoderen_US
dc.titleAttention And WaveNet Vocoder Based Nepali Text-To-Speech Synthesisen_US
dc.typeThesisen_US
local.academic.levelMastersen_US
local.affiliatedinstitute.titlePulchowk Campusen_US
local.institute.titleInstitute of Engineeringen_US

Files

Original bundle
Name: Final Report.pdf
Size: 2.15 MB
Format: Adobe Portable Document Format

License bundle
Name: license.txt
Size: 1.71 KB
Format: Item-specific license agreed upon to submission