in2IN: Leveraging Individual Information to Generate Human INteractions

HuMoGen CVPRW 2024
1University of Alicante, 2University of Barcelona, 3Computer Vision Center

📚 Abstract 📚

Generating human-human motion interactions conditioned on textual descriptions is useful in many areas, such as robotics, gaming, animation, and the metaverse. Alongside this utility comes the difficulty of modeling the high-dimensional inter-personal dynamics. In addition, properly capturing the intra-personal diversity of interactions is challenging. Current methods generate interactions with limited diversity of intra-person dynamics due to the limitations of the available datasets and conditioning strategies. To address this, we introduce in2IN, a novel diffusion model for human-human motion generation conditioned not only on the textual description of the overall interaction but also on individual descriptions of the actions performed by each person involved. To train this model, we use a large language model to extend the InterHuman dataset with individual descriptions. As a result, in2IN achieves state-of-the-art performance on the InterHuman dataset. Furthermore, to increase the intra-personal diversity in existing interaction datasets, we propose DualMDM, a model composition technique that combines motions generated with in2IN and motions generated by a single-person motion prior pre-trained on HumanML3D. As a result, DualMDM generates motions with higher individual diversity and improves control over the intra-person dynamics while maintaining inter-personal coherence.

👫 in2IN: Interaction Diffusion Model 👫

in2IN is a novel diffusion model architecture conditioned not only on the overall interaction description but also on descriptions of the individual motion performed by each interactant. To this end, we extend the InterHuman dataset with LLM-generated textual descriptions of the individual human motions involved in each interaction. Our approach enables more precise interaction generation and achieves 🥇 state-of-the-art 🥇 results on InterHuman.

Architecture

Our proposed architecture consists of a Siamese Transformer that generates the denoised motion of each individual in the interaction (\(x^0_a\) and \(x^0_b\)). First, a self-attention layer models the intra-personal dependencies using the encoded individual condition and noisy motion of each person (\(x^t_a\) and \(x^t_b\)). Then, a cross-attention module models the inter-personal dynamics using the encoded interaction description, the self-attention output, and the noisy motion from the other interacting person.
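To make the two attention stages concrete, the sketch below shows one such shared-weight block in PyTorch. It is only an illustration of the described data flow, not the authors' implementation: layer sizes, how the text conditions are injected, and tensor shapes are all assumptions.

```python
# Illustrative sketch of one in2IN-style denoising block (not the authors' code).
# Assumed shapes: motions are (batch, frames, dim); encoded text conditions are (batch, dim).
import torch
import torch.nn as nn


class SiameseInteractionBlock(nn.Module):
    """Shared-weight block applied to both interactants.

    Self-attention models intra-personal dependencies (own noisy motion plus the
    individual text condition); cross-attention models inter-personal dynamics
    (interaction text condition plus the other person's motion).
    """

    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.norm1, self.norm2, self.norm3 = (nn.LayerNorm(dim) for _ in range(3))

    def forward(self, x_self, x_other, c_individual, c_interaction):
        # Intra-personal: attend over own sequence prepended with the individual condition.
        h = torch.cat([c_individual.unsqueeze(1), x_self], dim=1)
        attn, _ = self.self_attn(h, h, h)
        x = self.norm1(x_self + attn[:, 1:])

        # Inter-personal: attend to the other person's motion plus the interaction condition.
        kv = torch.cat([c_interaction.unsqueeze(1), x_other], dim=1)
        attn, _ = self.cross_attn(x, kv, kv)
        x = self.norm2(x + attn)
        return self.norm3(x + self.ff(x))


# Usage sketch: the same block (shared weights) processes person a and person b.
if __name__ == "__main__":
    block = SiameseInteractionBlock(dim=256)
    x_a, x_b = torch.randn(2, 64, 256), torch.randn(2, 64, 256)
    c_i_a, c_i_b, c_I = (torch.randn(2, 256) for _ in range(3))
    out_a = block(x_a, x_b, c_i_a, c_I)   # denoised features for person a
    out_b = block(x_b, x_a, c_i_b, c_I)   # denoised features for person b
```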

Multi-Weight Classifier Free Guidance

Our conditioning strategy for the diffusion model builds upon Classifier-Free Guidance (CFG), initially proposed by Ho et al. Diffusion models generally depend heavily on CFG to generate high-quality samples; however, incorporating multiple conditions via CFG is not trivial. We address this by employing a distinct weight for each condition. Our model's sampling function, denoted \(G^{I}(x^{t}, t, c)\), is defined as follows:

\[ \begin{aligned} G^{I}\left(x^{t}, t, c\right) &= G\left(x^{t}, t, \emptyset\right) \\ & + w_c \cdot \left(G\left(x^{t}, t, c\right) - G\left(x^{t}, t, \emptyset\right)\right) \\ & + w_I \cdot \left(G\left(x^{t}, t, c_{\text{I}}\right) - G\left(x^{t}, t, \emptyset\right)\right) \\ & + w_i \cdot \left(G\left(x^{t}, t, c_{\text{i}}\right) - G\left(x^{t}, t, \emptyset\right)\right), \end{aligned} \]

where \(G(x^{t}, t, \emptyset)\) is the unconditional output of the model, and \(G(x^{t}, t, c)\), \(G(x^{t}, t, c_{\text{I}})\), and \(G(x^{t}, t, c_{\text{i}})\) denote the model outputs conditioned on the whole conditioning \(c = \{c_I,c_i\}\), only the interaction, and only the individual, respectively. The weights \(w_c\), \(w_I\), and \(w_i\) \(\in \mathbb{R}\) adjust the influence of each conditioned output relative to the unconditional baseline. A notable limitation of this approach is the necessity to perform quadruple sampling from the denoiser, as opposed to the double sampling required in a conventional CFG methodology. In exchange, this method allows for more refined control over the generation process, ensuring that the model can effectively capture and express the nuances of both individual and interaction-specific conditions. If a weight is set to 0, then that particular conditioning is ignored during the generation process.
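In practice, this combination amounts to four forward passes through the denoiser whose deviations from the unconditional output are weighted independently. The sketch below mirrors the equation above under an assumed `model(x_t, t, cond)` interface, where `None` entries mark dropped conditions; the interface and names are illustrative, not the authors' API.

```python
# Minimal sketch of the multi-weight classifier-free guidance combination.
# `model(x_t, t, cond)` is a hypothetical denoiser interface; None means "condition dropped".
import torch


def multi_weight_cfg(model, x_t, t, c_interaction, c_individual,
                     w_c: float, w_I: float, w_i: float) -> torch.Tensor:
    """G^I(x^t, t, c) with separate weights for the jointly, interaction-only,
    and individual-only conditioned outputs (four denoiser evaluations)."""
    g_uncond = model(x_t, t, {"interaction": None, "individual": None})
    g_full   = model(x_t, t, {"interaction": c_interaction, "individual": c_individual})
    g_inter  = model(x_t, t, {"interaction": c_interaction, "individual": None})
    g_indiv  = model(x_t, t, {"interaction": None, "individual": c_individual})

    # Setting a weight to 0 removes the influence of that conditioned output.
    return (g_uncond
            + w_c * (g_full - g_uncond)
            + w_I * (g_inter - g_uncond)
            + w_i * (g_indiv - g_uncond))
```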

💃 DualMDM: Model Composition 👫

DualMDM is a new motion composition technique that further increases individual diversity and control. By combining our in2IN interaction model with a single-person (individual) motion prior, we generate interactions with more diverse intra-personal dynamics.

Schedulers

We propose a motion model composition technique that combines interactions generated by an interaction model with motions generated by an individual motion prior trained on a single-person motion dataset. The single-person prior endows the generated human-human interactions with a higher diversity of intra-personal dynamics.

\[ \begin{aligned} G^{I,i}(x^t, t, c) &= G^{\text{I}}(x^t, t, c) \\ &+ w \cdot (G^{\text{i}}(x^t, t, c_i) - G^{\text{I}}(x^t, t, c)), \end{aligned} \]

where \(G^{\text{I}}(x^t, t, c)\) is the output of the interaction diffusion model, \(G^{\text{i}}(x^t, t, c_i)\) is the output of the individual motion prior, and \(w \in \mathbb{R}\) is the blending weight. By keeping \(w\) constant, the authors of DiffusionBlending assume that the optimal blending weight is the same along the whole sampling process. However, we argue that the optimal blending weight may vary along the denoising chain, depending on the particularities of each scenario. To account for this, we replace the constant \(w\) with a weight scheduler \(w(t)\) that parameterizes the blending weight used to combine the denoised motions from both models, making it variable throughout the sampling process. As a generalization of the DiffusionBlending technique, DualMDM is a more flexible and powerful strategy for combining two diffusion models.
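A minimal sketch of this blending step with a schedule \(w(t)\) is shown below. The linear decay is only one illustrative choice of scheduler, not necessarily the one used in the paper.

```python
# Sketch of the DualMDM blending step, assuming `g_interaction` is the in2IN
# output (e.g. the multi-weight CFG combination above) and `g_individual` is
# the single-person prior's output at the same timestep.
import torch


def linear_scheduler(t: int, T: int, w_start: float = 1.0, w_end: float = 0.0) -> float:
    """Illustrative schedule: w(t) decays linearly from w_start at t = T
    (pure noise) to w_end at t = 0 (fully denoised)."""
    return w_end + (w_start - w_end) * (t / T)


def dualmdm_blend(g_interaction: torch.Tensor,
                  g_individual: torch.Tensor,
                  t: int, T: int,
                  scheduler=linear_scheduler) -> torch.Tensor:
    """G^{I,i}(x^t, t, c) = G^I + w(t) * (G^i - G^I)."""
    w = scheduler(t, T)
    return g_interaction + w * (g_individual - g_interaction)
```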

BibTeX

@InProceedings{Ruiz-Ponce_2024_CVPR,
    author    = {Ruiz-Ponce, Pablo and Barquero, German and Palmero, Cristina and Escalera, Sergio and Garc{\'\i}a-Rodr{\'\i}guez, Jos{\'e}},
    title     = {in2IN: Leveraging Individual Information to Generate Human INteractions},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops},
    month     = {June},
    year      = {2024},
    pages     = {1941-1951}
}