GESTURE SYNTHESIS USING MULTIMODAL SUPERVISED LEARNING
Date
2023-05
Publisher
I.O.E. Pulchowk Campus
Abstract
One of the long-standing ambitions of modern science and engineering has been to create
a non-human entity that manifests human-like intelligence and behavior. One step toward
achieving this goal is communicating just as humans do. Human speech is often accompanied
by a variety of gestures that add rich non-verbal information to the message the speaker
is trying to convey. Gestures clarify the speaker's intention and emotions and enhance
speech by adding visual cues alongside the audio signal. Our project aims to synthesize
co-speech gestures by learning from an individual speaker's style. We follow a data-driven
approach rather than a rule-based one, since the audio-gesture relation is poorly captured
by rule-based systems owing to issues such as asynchrony and multimodality. Following the
current trend, we train the model on in-the-wild videos with accompanying audio rather than
relying on motion capture of subjects in a lab for annotation. To establish the ground
truth for the dataset of video frames, we rely on an automatic pose detection system.
Although this ground-truth signal is not as accurate as manually annotated frames, the
approach spares us considerable time and labor. We perform cross-modal translation from
the monologue speech of a single speaker to their hand and arm motion, based on the learned
temporal correlation between pose sequences and audio samples.
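As a rough illustration of the pipeline described above (not the project's actual implementation), the following Python sketch trains a simple temporal regression model that maps per-frame audio features to upper-body keypoint sequences, with supervision targets taken from an automatic pose detector run on the video frames. All module names, feature dimensions, and hyperparameters here are illustrative assumptions.

# Minimal sketch (assumptions only): regress hand/arm keypoint sequences
# from per-frame audio features, supervised by automatically detected poses.
import torch
import torch.nn as nn

class AudioToPose(nn.Module):
    def __init__(self, audio_dim=128, hidden_dim=256, num_keypoints=12):
        super().__init__()
        # Temporal encoder over per-frame audio features (e.g. log-mel slices)
        self.encoder = nn.GRU(audio_dim, hidden_dim, num_layers=2,
                              batch_first=True, bidirectional=True)
        # Regress 2D coordinates for each upper-body keypoint at every frame
        self.decoder = nn.Linear(2 * hidden_dim, num_keypoints * 2)

    def forward(self, audio_feats):          # (batch, frames, audio_dim)
        h, _ = self.encoder(audio_feats)     # (batch, frames, 2*hidden_dim)
        poses = self.decoder(h)              # (batch, frames, num_keypoints*2)
        return poses.view(*poses.shape[:2], -1, 2)

def train_step(model, optimizer, audio_feats, detected_poses):
    # detected_poses are keypoints from an off-the-shelf pose detector,
    # serving as (noisy) ground truth for the regression loss.
    pred = model(audio_feats)
    loss = nn.functional.l1_loss(pred, detected_poses)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

A recurrent encoder is used here only as a placeholder for learning the temporal correlation between audio and pose; the actual architecture, loss, and feature extraction in the project may differ.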
Keywords
Gesture synthesis, Supervised learning, Human Computer Interaction