Our conditioning strategy for the diffusion model builds on classifier-free guidance (CFG), initially proposed by Ho et al. Diffusion models generally rely heavily on CFG to generate high-quality samples. However, combining multiple conditions under CFG is not trivial. We address this by assigning a distinct guidance weight to each condition. Our model's sampling function, denoted \(G^{I}(x^{t}, t, c)\), is:
\[
\begin{aligned}
G^{I}\left(x^{t}, t, c\right) &= G\left(x^{t}, t, \emptyset\right) \\
& + w_c \cdot \left(G\left(x^{t}, t, c\right) - G\left(x^{t}, t, \emptyset\right)\right) \\
& + w_I \cdot \left(G\left(x^{t}, t, c_{\text{I}}\right) - G\left(x^{t}, t, \emptyset\right)\right) \\
& + w_i \cdot \left(G\left(x^{t}, t, c_{\text{i}}\right) - G\left(x^{t}, t, \emptyset\right)\right),
\end{aligned}
\]
where \(G(x^{t}, t, \emptyset)\) is the unconditional output of the model, and \(G(x^{t}, t, c)\), \(G(x^{t}, t, c_{\text{I}})\), and \(G(x^{t}, t, c_{\text{i}})\) denote the model outputs conditioned on the full conditioning \(c = \{c_{\text{I}}, c_{\text{i}}\}\), on the interaction only, and on the individual only, respectively. The weights \(w_c, w_I, w_i \in \mathbb{R}\) scale the influence of each conditioned output relative to the unconditional baseline. Setting a weight to 0 removes the corresponding conditioning term from the generation process. A notable limitation of this approach is that it requires four denoiser evaluations per sampling step, as opposed to the two required by conventional CFG. In exchange, it allows finer control over generation, since the full, interaction-specific, and individual-specific conditions can each be weighted separately.
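To make the weighted combination concrete, the following is a minimal Python sketch of a single guided evaluation. It is an illustration under stated assumptions, not our implementation: the names `guided_output` and `toy_denoiser`, the dictionary representation of the conditions, and the list-of-floats representation of \(x^{t}\) are all hypothetical, and skipping zero-weight branches is an optional shortcut that the formulation above does not require.

```python
from typing import Callable, Dict, List, Optional

Vector = List[float]
# Hypothetical denoiser signature: (x_t, t, condition or None) -> prediction.
Denoiser = Callable[[Vector, int, Optional[Dict[str, float]]], Vector]


def guided_output(
    denoiser: Denoiser,
    x_t: Vector,
    t: int,
    c_interaction: float,
    c_individual: float,
    w_c: float,
    w_I: float,
    w_i: float,
) -> Vector:
    """Combine up to four denoiser evaluations as in the weighted guidance equation."""
    uncond = denoiser(x_t, t, None)  # unconditional output G(x^t, t, empty)
    branches = [
        (w_c, {"interaction": c_interaction, "individual": c_individual}),  # full c
        (w_I, {"interaction": c_interaction}),  # interaction only
        (w_i, {"individual": c_individual}),  # individual only
    ]
    out = list(uncond)
    for weight, cond in branches:
        if weight == 0:
            continue  # optional shortcut: a zero-weight term contributes nothing
        pred = denoiser(x_t, t, cond)
        # Add weight * (conditioned output - unconditional output), element-wise.
        out = [o + weight * (p - u) for o, p, u in zip(out, pred, uncond)]
    return out


def toy_denoiser(x_t: Vector, t: int, cond: Optional[Dict[str, float]]) -> Vector:
    """Stand-in for a trained network (assumption): shifts x_t by a condition-dependent offset."""
    cond = cond or {}
    shift = cond.get("interaction", 0.0) + 2.0 * cond.get("individual", 0.0)
    return [x + shift for x in x_t]


if __name__ == "__main__":
    x = [0.0, 1.0]
    print(guided_output(toy_denoiser, x, 10, 1.0, 1.0, w_c=2.0, w_I=0.5, w_i=0.25))
```

With \(w_I = w_i = 0\), the combination reduces to the conventional CFG form \(G(x^{t}, t, \emptyset) + w_c \cdot (G(x^{t}, t, c) - G(x^{t}, t, \emptyset))\), and with all three weights at 0 it returns the unconditional output.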