Optimizing SAGE Net: Sequential Training of Stratified Diffusion Models and Full-Body Decoder

22 Oct 2025

Abstract and 1. Introduction

  2. Related Work

    2.1. Motion Reconstruction from Sparse Input

    2.2. Human Motion Generation

  3. SAGE: Stratified Avatar Generation and 3.1. Problem Statement and Notation

    3.2. Disentangled Motion Representation

    3.3. Stratified Motion Diffusion

    3.4. Implementation Details

  4. Experiments and Evaluation Metrics

    4.1. Dataset and Evaluation Metrics

    4.2. Quantitative and Qualitative Results

    4.3. Ablation Study

  5. Conclusion and References

Supplementary Material

A. Extra Ablation Studies

B. Implementation Details

B. Implementation Details

B.1 Disentangled VQ-VAE

B.2 Stratified Diffusion

In our transformer-based models for upper-body and lower-body diffusion, we integrate an additional DiT block as described in [29]. Each model comprises 12 DiT blocks, each with 8 attention heads and an input embedding dimension of 512. The full-body decoder is structured with 6 transformer layers.
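As a concrete illustration, below is a minimal PyTorch sketch of how a DiT-style half-body denoiser with this configuration could be assembled. The adaptive-LayerNorm conditioning and all module names (`DiTBlock`, `HalfBodyDenoiser`) are our assumptions for illustration, not the released implementation, and the 6-layer full-body decoder is omitted.

```python
import torch
import torch.nn as nn

class DiTBlock(nn.Module):
    """Transformer block conditioned on the diffusion timestep via
    adaptive LayerNorm, in the spirit of DiT [29]."""
    def __init__(self, dim: int = 512, heads: int = 8):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim, elementwise_affine=False)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim, elementwise_affine=False)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )
        # Predict per-block scale/shift pairs from the timestep embedding.
        self.ada = nn.Linear(dim, 4 * dim)

    def forward(self, x: torch.Tensor, t_emb: torch.Tensor) -> torch.Tensor:
        s1, b1, s2, b2 = self.ada(t_emb).unsqueeze(1).chunk(4, dim=-1)
        h = self.norm1(x) * (1 + s1) + b1
        x = x + self.attn(h, h, h, need_weights=False)[0]
        h = self.norm2(x) * (1 + s2) + b2
        return x + self.mlp(h)

class HalfBodyDenoiser(nn.Module):
    """Stack of 12 DiT blocks with 8 heads and width 512,
    matching the configuration reported above."""
    def __init__(self, dim: int = 512, depth: int = 12, heads: int = 8):
        super().__init__()
        self.blocks = nn.ModuleList(DiTBlock(dim, heads) for _ in range(depth))

    def forward(self, z: torch.Tensor, t_emb: torch.Tensor) -> torch.Tensor:
        # z: noised motion latents (B, T, 512); t_emb: timestep embedding (B, 512)
        for block in self.blocks:
            z = block(z, t_emb)
        return z
```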

B.3 Refiner

Table A. Ablation of the input sequence length. The purple background color denotes the motion length used in the original methods. Computational cost is directly proportional to the input sequence length, so we select 20 frames as the best trade-off between performance and computational cost.

Table B. Ablation of the diffusion formulation: predicting the original latent z versus predicting the residual noise ε. Predicting the clean latent z achieves superior performance. The purple background color denotes our choice.
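The two formulations in Table B differ only in the regression target of the denoiser. Below is a minimal sketch of a single training step under both options, assuming a standard DDPM-style forward process; the function signature and argument names are illustrative, not taken from the paper's code.

```python
import torch
import torch.nn.functional as F

def diffusion_training_loss(model, z0, t, alpha_bar, predict_clean=True):
    """One training step of latent diffusion.

    z0:        clean motion latents, shape (B, T, D)
    t:         sampled timesteps, shape (B,), dtype long
    alpha_bar: cumulative product of the noise schedule, shape (num_steps,)
    """
    eps = torch.randn_like(z0)
    a = alpha_bar[t].view(-1, 1, 1)
    z_t = a.sqrt() * z0 + (1.0 - a).sqrt() * eps  # forward noising
    pred = model(z_t, t)
    # Regress either the clean latent z (Table B's stronger formulation)
    # or the injected noise eps, depending on the flag.
    target = z0 if predict_clean else eps
    return F.mse_loss(pred, target)
```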

The complete loss for training the refiner is a weighted combination of its constituent terms. We set the weights α, β, γ, and δ to 0.01, 10, 0.05, and 0.01, respectively, so that the refiner focuses more on motion smoothness during training.
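A hedged sketch of how such a weighted combination might be assembled follows. The individual term definitions (first-frame anchor, acceleration, magnitude regularizer) are illustrative assumptions consistent with the stated coefficients, not the paper's exact formulation; only the dominant β-weighted velocity term, which enforces smoothness, is directly suggested by the text.

```python
import torch

def refiner_loss(pred, gt, alpha=0.01, beta=10.0, gamma=0.05, delta=0.01):
    """Weighted multi-term refiner loss; the dominant beta term emphasizes
    motion smoothness. Term definitions are illustrative assumptions,
    not the paper's exact formulation.

    pred, gt: full-body motion sequences, shape (B, T, D)
    """
    l_rec = (pred - gt).abs().mean()                # frame-wise reconstruction
    l_first = (pred[:, 0] - gt[:, 0]).abs().mean()  # first-frame anchor (assumed)
    vel_p = pred[:, 1:] - pred[:, :-1]
    vel_g = gt[:, 1:] - gt[:, :-1]
    l_vel = (vel_p - vel_g).abs().mean()            # velocity: smoothness term
    acc_p = vel_p[:, 1:] - vel_p[:, :-1]
    acc_g = vel_g[:, 1:] - vel_g[:, :-1]
    l_acc = (acc_p - acc_g).abs().mean()            # acceleration (assumed)
    l_reg = pred.pow(2).mean()                      # magnitude regularizer (assumed)
    return l_rec + alpha * l_first + beta * l_vel + gamma * l_acc + delta * l_reg
```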

All experiments can be carried out on a single NVIDIA GeForce RTX 3090 GPU using the PyTorch framework.

Authors:

(1) Han Feng, Wuhan University (equal contribution, authors ordered alphabetically);

(2) Wenchao Ma, Pennsylvania State University (equal contribution, authors ordered alphabetically);

(3) Quankai Gao, University of Southern California;

(4) Xianwei Zheng, Wuhan University;

(5) Nan Xue, Ant Group ([email protected]);

(6) Huijuan Xu, Pennsylvania State University.


This paper is available on arXiv under the CC BY 4.0 DEED license.