September 15, 2023

17 Min Read

ZeroEGGS: Zero-Shot Example-Based Gesture Generation From Speech

Link to the paper: https://onlinelibrary.wiley.com/doi/10.1111/cgf.14734

Open world game maps are constantly growing, and with them the number of digital beings living in these worlds. Keeping the general population of these games realistic and lifelike is part of keeping the player engaged. However, with more characters, animals, and non-playable characters (NPCs), animators must get creative to keep each character lifelike and unique while still meeting that scale.

To help with this problem, Ubisoft La Forge, in collaboration with Ubisoft's in-house capture studio and animators, is currently researching different animation systems and prototyping based on recent academic findings.

Our mission is to provide animators with alternative ways of animating that can increase production speed, while fostering the uniqueness of each digital being in our games.

In this work we present ZeroEGGS, a machine-learning, speech-driven gesture generation framework with zero-shot style control by example.

Short motion clips selected by the artist are used as style examples, and gestures are synthesized respecting the artist's input even for styles never seen during training.

ZeroEGGS will also generate a variety of outputs for the same input: narrative designers and animators can further control gesture characteristics by blending or scaling styles depending on the required output.

Because ZeroEGGS is also very lightweight in memory compared to current tools, it offers scalability in terms of data and potential real-time usage!

Current Tools

A common method for automating gesture animation is to trigger pre-recorded animations from a database arranged by pre-set tags defining emotion, through manual programming and labeling. This is usually done by first collecting a mocap database of gesture animation along with the corresponding audio and transcripts. Then, at runtime, the main keywords are extracted from the transcripts along with their timings, and a list of animations that match each keyword (as well as similar words) and the emotion defined by the user is pooled from the database. Each candidate is scored based on factors such as the length of the animation or the energy of the gesture, and the candidate with the highest score is selected. These high-scoring candidates are then put together as a sequence of segments which are blended to remove the discontinuities between them.

This covers the core of such approaches; however, some manual pre- and post-processing stages are usually needed as well. For a detailed explanation of one way of doing this, you can watch the talk by Francois Paradis on Procedural Generation of Cinematic Dialogues in Assassin's Creed Odyssey.
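As a rough illustration of the selection step described above, the sketch below scores candidate clips against a target segment. The clip attributes, scoring terms, and function names are hypothetical and only meant to show the shape of the logic, not an actual production system.

```python
# Hypothetical sketch of scoring and selecting a pre-recorded gesture clip.
from dataclasses import dataclass

@dataclass
class Candidate:
    clip_id: str
    length_s: float   # clip length in seconds
    energy: float     # rough measure of gesture energy

def score(c: Candidate, target_length_s: float, target_energy: float) -> float:
    """Higher is better: penalize mismatch in length and gesture energy."""
    return -(abs(c.length_s - target_length_s) + abs(c.energy - target_energy))

def pick_clip(candidates: list[Candidate], target_length_s: float, target_energy: float) -> Candidate:
    # The highest-scoring clip is kept; consecutive selections are later
    # blended to hide discontinuities between segments.
    return max(candidates, key=lambda c: score(c, target_length_s, target_energy))
```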

While effective, this method requires significant amounts of time and manpower in order to produce a variety of engaging and realistic animations. Another downside is that such approaches are heavily data-dependent: the amount of required data depends on the design and on the needed quality and flexibility, for example, which types of emotion should be supported or how closely the animations should match the given keywords. For a AAA game production, this dependency quickly translates into a large amount of data. At the same time, MoCap sessions are very resource intensive, and keeping all this animation data in memory while searching a large database means poorer runtime performance. This presents an interesting dilemma between diversity of animation and production budget. The same issue exists in other direct data-driven approaches such as motion matching, which scales poorly in terms of data (you can read our blog on learned motion matching and how the problem of scalability was addressed). Finally, it is worth mentioning that such approaches do not combine (blend) data for you, pre-recorded animation may not be synchronized with the speech rhythm, and it is not easy to control low-level characteristics of the motion such as hand height or body movement.

ML-based Approaches

These problems of scaling and synchronicity have motivated research into methods for automatic generation of gestures using machine learning. ML-based models are usually trained on a corpus of speech and/or transcripts paired with corresponding animation to learn the mapping. Then, at runtime, the model can generate gestures without needing access to the database. This solves the problem of scalability, as all we need to store is a model whose size is independent of the size of the training data. However, despite recent research efforts [1, 2], generating realistic gesture motion using machine learning remains a difficult problem, and addressing the expressivity of speaker state and identity by providing control over style is even more challenging.

ZeroEGGS: Zero-shot Example-based Gesture Generation from Speech

Our goal in this research was to design an ML-based framework which:

  1. generates high-quality gestures based on speech,
  2. is efficient and scalable at runtime, and
  3. encodes gesture style with some additional control over lower-level features such as hand and body movement.

System Overview

Let’s first start with a simple scenario where we have a neural network called the Speech Encoder. The Speech Encoder takes raw speech as input and generates a sequence of the same length, called the Speech Embedding. The speech embedding sequence contains the information from the audio required for generating gestures. This is followed by another network, called the Gesture Generator, which takes the speech embedding sequence and generates gestures that should match the given speech.
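The sketch below shows what this two-network pipeline could look like in PyTorch; the layer types and sizes are illustrative assumptions, not the exact architecture used in the paper.

```python
# Minimal sketch of a Speech Encoder followed by a Gesture Generator (assumed sizes).
import torch
import torch.nn as nn

class SpeechEncoder(nn.Module):
    def __init__(self, n_audio_feats: int = 80, d_embed: int = 128):
        super().__init__()
        self.rnn = nn.GRU(n_audio_feats, d_embed, batch_first=True)

    def forward(self, audio_feats):             # (batch, frames, n_audio_feats)
        speech_embedding, _ = self.rnn(audio_feats)
        return speech_embedding                 # same number of frames as the input

class GestureGenerator(nn.Module):
    def __init__(self, d_embed: int = 128, d_pose: int = 75):
        super().__init__()
        self.rnn = nn.GRU(d_embed, 256, batch_first=True)
        self.out = nn.Linear(256, d_pose)

    def forward(self, speech_embedding):
        h, _ = self.rnn(speech_embedding)
        return self.out(h)                      # one pose vector per speech frame
```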

Now we might be able to generate gestures from pure speech. However, there is no additional control over the stylistic features of the generated gestures. In addition, if we train such a model on a dataset containing different styles, our network gets confused during training and usually generates the average of these features, something we call mean collapse.

Figure 2 (Generating gesture only from speech)


One way to add control over styles is to modulate the generated gestures by conditioning the gesture generator network on a style label given by the user.
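In practice, label conditioning can be as simple as concatenating a one-hot style vector to the speech embedding at every frame. The snippet below is an assumed illustration of that idea, not the paper's exact mechanism.

```python
# Concatenate a one-hot style label to each frame of the speech embedding (illustrative).
import torch
import torch.nn.functional as F

def condition_on_label(speech_embedding: torch.Tensor, style_id: torch.Tensor, n_styles: int = 19):
    # speech_embedding: (batch, frames, d_embed); style_id: (batch,) integer labels
    label = F.one_hot(style_id, n_styles).float()               # (batch, n_styles)
    label = label.unsqueeze(1).expand(-1, speech_embedding.shape[1], -1)
    return torch.cat([speech_embedding, label], dim=-1)         # input to the gesture generator
```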

Figure 3 (Label-based gesture generation)


However, these approaches lack flexibility because they are limited by the content of the training data. They require examples of every style prior to training the model and cannot generalize outside of this range. This means a specific dataset needs to be captured for each style, which leads to a prohibitive amount of work in larger-scale applications. Secondly, motion style can be difficult to capture in words; rather, it can be easier to describe style by providing an example of the desired motion. We might not have an exact definition of a “Happy” style, as it differs from person to person. Therefore, to scale up, a gesture generation method must be able to capture an individual style with a very limited amount of data, ideally only one example.

To address this, we can add another network, called the Style Encoder, to our model. It takes a style example as input and outputs a fixed-length vector that summarizes the style features in the given example. Then, instead of providing the gesture generator with a style label, we can provide it with this embedding. The style encoder learns to provide the best features required for expressing different styles as a fixed-length vector from a low-dimensional latent space.

Figure 4 (Example-based gesture generation)


Implementation

Here is a brief description of how each block was designed.

The Speech Encoder is composed of two main blocks: a Speech Feature Extractor, which extracts useful features such as the spectrogram and energy from raw audio samples, followed by a neural network which converts the extracted features into the speech embedding sequence.
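As a rough sketch of the kind of frame-wise features such a Speech Feature Extractor might compute, the snippet below uses librosa to obtain a log-mel spectrogram and per-frame energy; the exact features and hyperparameters used in ZeroEGGS may differ.

```python
# Assumed example of extracting spectrogram and energy features from raw audio.
import librosa
import numpy as np

def extract_speech_features(path: str, sr: int = 16000, hop: int = 160) -> np.ndarray:
    y, _ = librosa.load(path, sr=sr)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=80, hop_length=hop)
    log_mel = np.log(mel + 1e-5)                        # (80, frames) log-mel spectrogram
    energy = librosa.feature.rms(y=y, hop_length=hop)   # (1, frames) per-frame energy
    return np.concatenate([log_mel, energy], axis=0).T  # (frames, 81): one feature row per frame
```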

Figure 5 (Speech Encoder)


Figure 6 (Speech Feature Extractor)


The Style Encoder is also made of two blocks. The first block is the Animation Feature Extractor, which extracts pose features represented by joint local transformations, joint local translational and rotational velocities, and the root translational and rotational velocity local to the character root transform. The extracted frame-wise features are then fed to a neural network to obtain the parameters of the VAE posterior distribution. Finally, the style embedding is sampled from the posterior distribution. During the training phase, the style example sequence is sampled from the same animation clip as the target sequence. Therefore, this can be interpreted as a sequence-to-sequence model conditioned on a compressed representation of style, or as an autoencoder conditioned on speech features at each frame. One downside of autoencoders (AE) for such an application is that their latent space is non-regularized and therefore not organized in a dense and structured way. A non-regularized latent space tends to overfit, which may produce meaningless content if we sample from it. On the other hand, we want our model to learn a dense and disentangled latent space. This way we can sample from subspaces which represent similar styles, and we also gain the power of going beyond the existing styles or interpolating between different styles.

One clear option for us is to convert our autoencoding architecture to a Variational Auto-Encoder (VAE) framework, which is known to learn latent spaces that can be disentangled and interpolated. Instead of outputting a single vector, the encoder of a VAE outputs the parameters of a pre-defined distribution (the posterior distribution). VAE models then address the issue of a non-regularized latent space by adding a regularization term that imposes constraints on the latent distribution, forcing it to stay close to a simple distribution such as a normal distribution (the prior distribution). Now we can explore and sample from this dense latent space and generate new styles for our gestures. That is why we call this model Zero-shot Example-based Gesture Generation from Speech: it can generate new styles at runtime from examples of styles that have not been seen during training.
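A minimal sketch of this idea is shown below, assuming a GRU summarizer and invented dimensions: the encoder outputs the mean and log-variance of a Gaussian posterior, the style embedding is sampled via the reparameterization trick, and a KL divergence term regularizes the posterior toward a standard normal prior.

```python
# Assumed sketch of a variational style encoder (not the exact ZeroEGGS architecture).
import torch
import torch.nn as nn

class StyleEncoder(nn.Module):
    def __init__(self, d_pose_feats: int = 256, d_style: int = 64):
        super().__init__()
        self.rnn = nn.GRU(d_pose_feats, 256, batch_first=True)
        self.to_mu = nn.Linear(256, d_style)
        self.to_logvar = nn.Linear(256, d_style)

    def forward(self, pose_feats):                # (batch, frames, d_pose_feats)
        _, h = self.rnn(pose_feats)               # final hidden state summarizes the clip
        h = h[-1]                                 # (batch, 256)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        std = torch.exp(0.5 * logvar)
        style = mu + std * torch.randn_like(std)  # reparameterization trick
        # KL divergence against a standard normal prior, averaged over the batch
        kl = -0.5 * torch.mean(torch.sum(1 + logvar - mu.pow(2) - logvar.exp(), dim=-1))
        return style, kl
```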

One last takeaway is that sampling randomly from the posterior distribution adds another advantage to our model: it enables the generation of a variety of renditions given the exact same speech and style while sticking to the same energy and intonations, addressing the stochastic nature of gesture motion.

Figure 7 (Style Encoder)


The core of the gesture generator is the Recurrent Decoder, an auto-regressive neural network built from two layers of Gated Recurrent Units (GRU). It produces the pose encoding for a new frame from the corresponding speech frame, the reference style embedding vector, and the previous pose state vector. The Update Character State block formats the Recurrent Decoder output, computes the pose state, and updates the character facing direction. In addition to the last pose encoding, we condition the Recurrent Decoder on a fixed target facing direction to avoid rotational drifting over time. The Hidden State Initializer is a separate neural network that provides hidden states for the GRU layers based on the initial pose, the character facing direction, and the style embedding. We found that using a separate initializing network improved the quality of our results.
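The simplified sketch below illustrates the autoregressive loop with a separate hidden-state initializer; the dimensions are assumptions, and details such as the target facing direction and the Update Character State step are omitted.

```python
# Assumed sketch of an autoregressive recurrent decoder with a hidden-state initializer.
import torch
import torch.nn as nn

class RecurrentDecoder(nn.Module):
    def __init__(self, d_speech: int = 128, d_style: int = 64, d_pose: int = 256, d_hidden: int = 512):
        super().__init__()
        self.gru = nn.GRU(d_speech + d_style + d_pose, d_hidden, num_layers=2, batch_first=True)
        self.out = nn.Linear(d_hidden, d_pose)
        # Separate network that produces the initial GRU hidden states
        # from the initial pose and the style embedding.
        self.init_net = nn.Linear(d_pose + d_style, 2 * d_hidden)

    def forward(self, speech_embedding, style, init_pose):
        batch, frames, _ = speech_embedding.shape
        h = self.init_net(torch.cat([init_pose, style], dim=-1))
        h = h.view(batch, 2, -1).transpose(0, 1).contiguous()    # (layers, batch, d_hidden)
        pose, poses = init_pose, []
        for t in range(frames):                                  # generate one frame at a time
            x = torch.cat([speech_embedding[:, t], style, pose], dim=-1).unsqueeze(1)
            y, h = self.gru(x, h)
            pose = self.out(y[:, 0])                             # pose encoding for frame t
            poses.append(pose)
        return torch.stack(poses, dim=1)                         # (batch, frames, d_pose)
```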

Figure 8 (Recurrent Decoder)


Dataset


To construct a nicely structured and dense latent space, we need a training dataset which covers different characteristics of gesture style. Some of these characteristics are overall posture (e.g. neutral vs. curved spine) or dynamic features such as hand and hip movement. These are just the obvious ones, and we expect our style encoder to learn other nuances of style as well. To this end, we collected a dataset of stylized speech and gesture with 19 different styles covering varied characteristics. The ZEGGS dataset contains 67 sequences of monologues performed by a female actor speaking in English. Below is a list of these styles along with their length in our dataset:

| Style        | Length (mins) | Style       | Length (mins) |
|--------------|---------------|-------------|---------------|
| Agreement    | 5.25          | Pensive     | 6.21          |
| Angry        | 7.95          | Relaxed     | 10.81         |
| Disagreement | 5.33          | Sad         | 11.80         |
| Distracted   | 5.29          | Sarcastic   | 6.52          |
| Flirty       | 3.27          | Scared      | 5.58          |
| Happy        | 10.08         | Sneaky      | 6.27          |
| Laughing     | 3.85          | Still       | 5.33          |
| Oration      | 3.98          | Threatening | 5.84          |
| Neutral      | 11.13         | Tired       | 7.13          |
| Old          | 11.37         | Total       | 134.65        |

Here are some statistics for some of the main style characteristics extracted from the collected data.

Figure 9 (Hump Average)


Figure 10 (Hand Velocity Average)


Results

Now let’s see what the results look like. For testing, we retained one recorded sequence from each style, plus all “Oration” samples, as our test set to investigate whether our model can generalize to completely new styles. All the video samples are generated using this unseen test dataset (audio and animations).

Different Styles

Below are some samples of generated gestures for different styles. Again, all the audio and style examples are from the retained test dataset. Please note that we completely removed the “Oration” style from the training data and put it in the test set to evaluate the generalizability of our model to new styles at runtime.

Modelling the Stochastic Nature of Human Motion

As we mentioned, the probabilistic nature of our style encoder enables the generation of a variety of outputs given the same input. The two samples below were generated from the same speech and style example. This gives narrative designers more power to easily iterate over the same speech and style until they get their favourite generated gesture.

Different Speakers & Different Languages

Because ZeroEGGS relies solely on the speech spectrum amplitude, it can be used with different speaker voices and different languages unseen in the training set.

Blending Styles

The variational framework used in our Style Encoder provides a morphable and continuous style embedding space. This allows us to mix the styles of multiple samples via linear interpolation. The video below illustrates interpolations between the Old and Neutral styles. As can be seen, the character’s posture and hand position gradually change as we interpolate.
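Since a style is just a fixed-length vector, blending reduces to a linear interpolation between two embeddings, as in the small sketch below (the encoder call and example clips in the comment are placeholders).

```python
# Blend two style embeddings by linear interpolation (illustrative).
import torch

def blend_styles(style_a: torch.Tensor, style_b: torch.Tensor, alpha: float) -> torch.Tensor:
    """alpha = 0.0 returns style_a, alpha = 1.0 returns style_b, values in between mix them."""
    return (1.0 - alpha) * style_a + alpha * style_b

# e.g. a half-way blend between an "Old" and a "Neutral" style example:
# blended = blend_styles(style_encoder(old_clip)[0], style_encoder(neutral_clip)[0], 0.5)
```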

Control via PCA Components

We can control some of the low-level style characteristics by projecting the style embedding vector onto a PCA space and manipulating its components. The figure below shows a scatter plot of the first two principal components for non-overlapping samples in different styles. We observe that the first principal component roughly corresponds to body sway: the more static styles, such as Still and Sad, are located on the left side of the plot, while more dynamic styles, such as Happy and Angry, are located on the right side. Similarly, the second principal component is associated with hand motion height and radius. For example, Oration samples, for which the hands are usually above the shoulders, are located on the upper part of the plot. On the other hand, styles such as Tired, during which the actor put her hands on her knees, are on the lower part of the plot. We can modify these gesture characteristics in the PCA space and project them back into the original style embedding space. The video below shows a Neutral style example and two versions obtained by changing its first principal component. We can see that modifying the first principal component affects root velocity.
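A possible sketch of this manipulation, assuming a set of precomputed style embeddings, is shown below: fit a PCA, shift one component of a given style, and map the result back into the embedding space. The names and values are illustrative only.

```python
# Shift one principal component of a style embedding and project it back (illustrative).
import numpy as np
from sklearn.decomposition import PCA

def shift_principal_component(style_embeddings: np.ndarray, style: np.ndarray,
                              component: int, delta: float) -> np.ndarray:
    """style_embeddings: (n_samples, d_style) used to fit the PCA; style: (d_style,) to edit."""
    pca = PCA().fit(style_embeddings)         # keep all components so nothing else is altered
    coords = pca.transform(style[None, :])    # project the style into PCA space
    coords[0, component] += delta             # e.g. component 0 ~ body sway, component 1 ~ hand height
    return pca.inverse_transform(coords)[0]   # back to the original style embedding space
```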

Now let’s see how it looks if we change the second principal component. The video below shows how hand movements can be changed by manipulating the second principal component.

FINAL NOTES

This work represents a considerable step in the automatic generation of stylized gestures from pure speech. Our system can be used as an addition to the current tools available to narrative designers and animators, with a much lower cost in terms of data and better runtime efficiency. Production-wise, we plan for this first version to be used by Narrative Designers to allow for rapid prototyping and iteration on different styles or outcomes, testing the waters before continuing to work with animators to identify improvements and directions to be explored.



  1. ALEXANDERSON S., HENTER G. E., KUCHERENKO T., BESKOW J.: Style-controllable speech-driven gesture synthesis using normalising flows. In Computer Graphics Forum (2020), vol. 39, Wiley Online Library, pp. 487–496.
  2. YOON Y., CHA B., LEE J.-H., JANG M., LEE J., KIM J., LEE G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Transactions on Graphics (TOG) 39, 6 (2020), 1–16.