Optimizing SAGE Net: Sequential Training of Stratified Diffusion Models and Full-Body Decoder

22 Oct 2025

Abstract and 1. Introduction

  2. Related Work

    2.1. Motion Reconstruction from Sparse Input

    2.2. Human Motion Generation

  3. SAGE: Stratified Avatar Generation and 3.1. Problem Statement and Notation

    3.2. Disentangled Motion Representation

    3.3. Stratified Motion Diffusion

    3.4. Implementation Details

  4. Experiments and Evaluation Metrics

    4.1. Dataset and Evaluation Metrics

    4.2. Quantitative and Qualitative Results

    4.3. Ablation Study

  5. Conclusion and References

Supplementary Material

A. Extra Ablation Studies

B. Implementation Details

B. Implementation Details

B.1 Disentangled VQ-VAE

B.2 Stratified Diffusion

In our transformer-based models for upper-body and lower-body diffusion, we integrate an additional DiT block as described in [29]. Each model comprises 12 DiT blocks, each with 8 attention heads and an input embedding dimension of 512. The full-body decoder is structured with 6 transformer layers.
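As a concrete illustration, below is a minimal PyTorch sketch of how a DiT-style half-body denoiser with this configuration could be assembled. The adaptive-LayerNorm conditioning and all module names (`DiTBlock`, `HalfBodyDenoiser`) are our assumptions for illustration, not the released implementation, and the 6-layer full-body decoder is omitted.

```python
import torch
import torch.nn as nn

class DiTBlock(nn.Module):
    """Transformer block conditioned on the diffusion timestep via
    adaptive LayerNorm, in the spirit of DiT [29]."""
    def __init__(self, dim: int = 512, heads: int = 8):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim, elementwise_affine=False)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim, elementwise_affine=False)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )
        # Predict per-block scale/shift pairs from the timestep embedding.
        self.ada = nn.Linear(dim, 4 * dim)

    def forward(self, x: torch.Tensor, t_emb: torch.Tensor) -> torch.Tensor:
        s1, b1, s2, b2 = self.ada(t_emb).unsqueeze(1).chunk(4, dim=-1)
        h = self.norm1(x) * (1 + s1) + b1
        x = x + self.attn(h, h, h, need_weights=False)[0]
        h = self.norm2(x) * (1 + s2) + b2
        return x + self.mlp(h)

class HalfBodyDenoiser(nn.Module):
    """Stack of 12 DiT blocks with 8 heads and width 512,
    matching the configuration reported above."""
    def __init__(self, dim: int = 512, depth: int = 12, heads: int = 8):
        super().__init__()
        self.blocks = nn.ModuleList(DiTBlock(dim, heads) for _ in range(depth))

    def forward(self, z: torch.Tensor, t_emb: torch.Tensor) -> torch.Tensor:
        # z: noised motion latents (B, T, 512); t_emb: timestep embedding (B, 512)
        for block in self.blocks:
            z = block(z, t_emb)
        return z
```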

B.3 Refiner

Table A. Ablation of the input sequence length. The purple background color denotes the motion length used in the original methods. Computational cost is directly proportional to the input sequence length, so we select 20 frames as the best trade-off between performance and computational cost.

Table B. Ablation of the diffusion formulation: predicting the original latent z versus predicting the residual noise ε. Predicting the clean latent z achieves superior performance. The purple background color denotes our choice.
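The two formulations in Table B differ only in the regression target of the denoiser. Below is a minimal sketch of a single training step under both options, assuming a standard DDPM-style forward process; the function signature and argument names are illustrative, not taken from the paper's code.

```python
import torch
import torch.nn.functional as F

def diffusion_training_loss(model, z0, t, alpha_bar, predict_clean=True):
    """One training step of latent diffusion.

    z0:        clean motion latents, shape (B, T, D)
    t:         sampled timesteps, shape (B,), dtype long
    alpha_bar: cumulative product of the noise schedule, shape (num_steps,)
    """
    eps = torch.randn_like(z0)
    a = alpha_bar[t].view(-1, 1, 1)
    z_t = a.sqrt() * z0 + (1.0 - a).sqrt() * eps  # forward noising
    pred = model(z_t, t)
    # Regress either the clean latent z (Table B's stronger formulation)
    # or the injected noise eps, depending on the flag.
    target = z0 if predict_clean else eps
    return F.mse_loss(pred, target)
```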

The complete loss for training the refiner is a weighted combination of its constituent terms. We set the weights α, β, γ, and δ to 0.01, 10, 0.05, and 0.01, respectively, so that the refiner focuses more on motion smoothness during training.
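A hedged sketch of how such a weighted combination might be assembled follows. The individual term definitions (first-frame anchor, acceleration, magnitude regularizer) are illustrative assumptions consistent with the stated coefficients, not the paper's exact formulation; only the dominant β-weighted velocity term, which enforces smoothness, is directly suggested by the text.

```python
import torch

def refiner_loss(pred, gt, alpha=0.01, beta=10.0, gamma=0.05, delta=0.01):
    """Weighted multi-term refiner loss; the dominant beta term emphasizes
    motion smoothness. Term definitions are illustrative assumptions,
    not the paper's exact formulation.

    pred, gt: full-body motion sequences, shape (B, T, D)
    """
    l_rec = (pred - gt).abs().mean()                # frame-wise reconstruction
    l_first = (pred[:, 0] - gt[:, 0]).abs().mean()  # first-frame anchor (assumed)
    vel_p = pred[:, 1:] - pred[:, :-1]
    vel_g = gt[:, 1:] - gt[:, :-1]
    l_vel = (vel_p - vel_g).abs().mean()            # velocity: smoothness term
    acc_p = vel_p[:, 1:] - vel_p[:, :-1]
    acc_g = vel_g[:, 1:] - vel_g[:, :-1]
    l_acc = (acc_p - acc_g).abs().mean()            # acceleration (assumed)
    l_reg = pred.pow(2).mean()                      # magnitude regularizer (assumed)
    return l_rec + alpha * l_first + beta * l_vel + gamma * l_acc + delta * l_reg
```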

All experiments can be carried out on a single NVIDIA GeForce RTX 3090 GPU using the PyTorch framework.

Authors:

(1) Han Feng, Wuhan University (equal contribution, authors ordered alphabetically);

(2) Wenchao Ma, Pennsylvania State University (equal contribution, authors ordered alphabetically);

(3) Quankai Gao, University of Southern California;

(4) Xianwei Zheng, Wuhan University;

(5) Nan Xue, Ant Group ([email protected]);

(6) Huijuan Xu, Pennsylvania State University.


This paper is available on arXiv under the CC BY 4.0 DEED license.