GESTURE SYNTHESIS USING MULTIMODAL SUPERVISED LEARNING
Date
2023-05
Publisher
I.O.E. Pulchowk Campus
Abstract
One of the long-standing ambitions of modern science and engineering has been to create
a non-human entity that manifests human-like intelligence and behavior. One step toward
achieving this goal is communicating just as humans do. Human speech is often accompanied
by a variety of gestures that add rich non-verbal information to the message the speaker
is trying to convey. Gestures clarify the speaker's intention and emotions and enhance
speech by adding visual cues alongside the audio signal. Our project aims to synthesize
co-speech gestures by learning from an individual speaker's style. We follow a data-driven
approach rather than a rule-based one, since the audio-gesture relation is poorly captured
by rule-based systems owing to issues such as asynchrony and multimodality. Following the
current trend, we train the model on in-the-wild videos with accompanying audio rather than
relying on motion capture of subjects in a lab for annotation. To establish the ground
truth for the dataset of video frames, we rely on an automatic pose detection system.
Although this ground-truth signal is not as accurate as manually annotated frames, the
approach spares us considerable time and labor. We perform cross-modal translation from
the monologue speech of a single speaker to their hand and arm motion, based on the learned
temporal correlation between pose sequences and audio samples.
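As a rough illustration of the pipeline described above (not the project's actual implementation), the following Python sketch trains a simple temporal regression model that maps per-frame audio features to upper-body keypoint sequences, with supervision targets taken from an automatic pose detector run on the video frames. All module names, feature dimensions, and hyperparameters here are illustrative assumptions.

# Minimal sketch (assumptions only): regress hand/arm keypoint sequences
# from per-frame audio features, supervised by automatically detected poses.
import torch
import torch.nn as nn

class AudioToPose(nn.Module):
    def __init__(self, audio_dim=128, hidden_dim=256, num_keypoints=12):
        super().__init__()
        # Temporal encoder over per-frame audio features (e.g. log-mel slices)
        self.encoder = nn.GRU(audio_dim, hidden_dim, num_layers=2,
                              batch_first=True, bidirectional=True)
        # Regress 2D coordinates for each upper-body keypoint at every frame
        self.decoder = nn.Linear(2 * hidden_dim, num_keypoints * 2)

    def forward(self, audio_feats):          # (batch, frames, audio_dim)
        h, _ = self.encoder(audio_feats)     # (batch, frames, 2*hidden_dim)
        poses = self.decoder(h)              # (batch, frames, num_keypoints*2)
        return poses.view(*poses.shape[:2], -1, 2)

def train_step(model, optimizer, audio_feats, detected_poses):
    # detected_poses are keypoints from an off-the-shelf pose detector,
    # serving as (noisy) ground truth for the regression loss.
    pred = model(audio_feats)
    loss = nn.functional.l1_loss(pred, detected_poses)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

A recurrent encoder is used here only as a placeholder for learning the temporal correlation between audio and pose; the actual architecture, loss, and feature extraction in the project may differ.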
Keywords
Gesture synthesis, Supervised learning, Human Computer Interaction