Generating high-resolution 3D shapes using volumetric representations such as Signed Distance Functions (SDFs) presents substantial computational and memory challenges. We introduce Direct3D-S2, a scalable 3D generation framework based on sparse volumes that achieves superior output quality with dramatically reduced training costs. Our key innovation is the Spatial Sparse Attention (SSA) mechanism, which greatly enhances the efficiency of Diffusion Transformer (DiT) computations on sparse volumetric data. SSA allows the model to effectively process large token sets within sparse volumes, yielding a 3.9× speed-up in the forward pass and a 9.6× speed-up in the backward pass. The framework also includes a variational autoencoder (VAE) that maintains a consistent sparse volumetric format across input, latent, and output stages. Compared with prior 3D VAEs that rely on heterogeneous representations, this unified design markedly improves training efficiency and stability. Trained on publicly available datasets, Direct3D-S2 not only surpasses state-of-the-art methods in generation quality and efficiency, but also enables training at 1024³ resolution with just 8 GPUs—a task that previously required at least 32 GPUs for 256³ volumetric training—making gigascale 3D generation both practical and accessible.
Overview of the Direct3D-S2 framework. We propose a fully end-to-end sparse SDF VAE (SS-VAE) that employs a symmetric encoder-decoder network to efficiently encode high-resolution sparse SDF volumes into sparse latent representations. We then train an image-conditioned diffusion transformer (SS-DiT) with a novel Spatial Sparse Attention (SSA) mechanism, which significantly improves the training and inference efficiency of the DiT.
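The released SS-VAE operates directly on sparse volumes; as a rough illustration of the symmetric encoder-decoder idea, the following is a minimal dense-grid stand-in. The class name `SymmetricSDFVAESketch`, the layer widths, the 8× per-axis downsampling factor, and the latent size are illustrative assumptions, not the actual implementation.

```python
# A minimal, dense stand-in for the symmetric SS-VAE described above: a 3D
# convolutional encoder compresses an SDF grid into a latent grid, and a
# mirrored decoder upsamples it back. The real model works on sparse volumes;
# all hyperparameters here are assumptions made for illustration.
import torch
import torch.nn as nn

class SymmetricSDFVAESketch(nn.Module):
    def __init__(self, latent_dim: int = 8, width: int = 32):
        super().__init__()
        # encoder: three stride-2 stages compress the volume by 8x per axis
        self.encoder = nn.Sequential(
            nn.Conv3d(1, width, 3, stride=2, padding=1), nn.SiLU(),
            nn.Conv3d(width, width, 3, stride=2, padding=1), nn.SiLU(),
            nn.Conv3d(width, 2 * latent_dim, 3, stride=2, padding=1),
        )
        # decoder mirrors the encoder with transposed convolutions
        self.decoder = nn.Sequential(
            nn.ConvTranspose3d(latent_dim, width, 4, stride=2, padding=1), nn.SiLU(),
            nn.ConvTranspose3d(width, width, 4, stride=2, padding=1), nn.SiLU(),
            nn.ConvTranspose3d(width, 1, 4, stride=2, padding=1),
        )

    def forward(self, sdf: torch.Tensor):
        # sdf: (B, 1, D, H, W) signed distance grid
        mu, logvar = self.encoder(sdf).chunk(2, dim=1)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterize
        return self.decoder(z), mu, logvar

# usage: reconstruct a toy 64^3 SDF grid
# vae = SymmetricSDFVAESketch()
# recon, mu, logvar = vae(torch.randn(1, 1, 64, 64, 64))
```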
Design of our Spatial Sparse Attention (SSA). We partition the input tokens into blocks based on their 3D coordinates and construct key-value pairs through three distinct modules. For each query token, a sparse 3D compression module captures global information, a spatial blockwise selection module selects important blocks based on the compression attention scores to extract fine-grained features, and a sparse 3D window module injects local features. Finally, the outputs of the three modules are aggregated using predicted gate scores.
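To make the three-branch design concrete, here is a minimal single-head PyTorch sketch of the gated aggregation. It builds dense attention masks for readability rather than efficiency, assumes non-negative voxel coordinates, and every name and detail (block hashing, `block_size`, `top_k`, mean pooling for compression, sigmoid gates) is an illustrative assumption rather than the released kernel, which is a fused sparse implementation.

```python
# Sketch of SSA: compression, blockwise-selection and window branches combined
# with per-query gate scores. Single head, dense masks, illustration only.
import torch
import torch.nn as nn


class SpatialSparseAttentionSketch(nn.Module):
    def __init__(self, dim: int, block_size: int = 8, top_k: int = 4):
        super().__init__()
        self.block_size = block_size          # voxels per spatial block edge
        self.top_k = top_k                    # selected blocks per query
        self.qkv = nn.Linear(dim, 3 * dim)
        self.gate = nn.Linear(dim, 3)         # one gate score per branch
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor, coords: torch.Tensor) -> torch.Tensor:
        # x: (N, dim) sparse voxel tokens, coords: (N, 3) integer coordinates
        q, k, v = self.qkv(x).chunk(3, dim=-1)

        # 1) sparse 3D compression: pool keys/values per spatial block
        block_id = self._block_ids(coords)               # (N,)
        uniq, inv = torch.unique(block_id, return_inverse=True)
        k_cmp = self._segment_mean(k, inv, len(uniq))    # (B, dim)
        v_cmp = self._segment_mean(v, inv, len(uniq))    # (B, dim)
        attn_cmp = torch.softmax(q @ k_cmp.t() / q.shape[-1] ** 0.5, dim=-1)
        out_cmp = attn_cmp @ v_cmp                       # coarse global context

        # 2) spatial blockwise selection: fine attention within top-k blocks,
        #    ranked by the compression attention scores
        top_blocks = attn_cmp.topk(min(self.top_k, len(uniq)), dim=-1).indices
        sel_mask = (inv.unsqueeze(0) == top_blocks.unsqueeze(-1)).any(dim=1)
        out_sel = self._masked_attention(q, k, v, sel_mask)

        # 3) sparse 3D window: local attention over tokens in the same block
        win_mask = inv.unsqueeze(0) == inv.unsqueeze(1)
        out_win = self._masked_attention(q, k, v, win_mask)

        # gated aggregation of the three branches
        gates = torch.sigmoid(self.gate(x))              # (N, 3)
        out = (gates[:, 0:1] * out_cmp
               + gates[:, 1:2] * out_sel
               + gates[:, 2:3] * out_win)
        return self.proj(out)

    def _block_ids(self, coords: torch.Tensor) -> torch.Tensor:
        # hash (non-negative) block coordinates into a single integer id
        b = coords // self.block_size
        return b[:, 0] * 1_000_000 + b[:, 1] * 1_000 + b[:, 2]

    @staticmethod
    def _segment_mean(x, index, num_segments):
        out = torch.zeros(num_segments, x.shape[-1], device=x.device, dtype=x.dtype)
        out.index_add_(0, index, x)
        counts = torch.bincount(index, minlength=num_segments).clamp(min=1)
        return out / counts.unsqueeze(-1)

    @staticmethod
    def _masked_attention(q, k, v, mask):
        scores = q @ k.t() / q.shape[-1] ** 0.5
        scores = scores.masked_fill(~mask, float("-inf"))
        return torch.softmax(scores, dim=-1) @ v


# usage: 512 sparse tokens inside a 64^3 volume
# attn = SpatialSparseAttentionSketch(dim=64)
# out = attn(torch.randn(512, 64), torch.randint(0, 64, (512, 3)))
```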
Qualitative comparison gallery (images omitted). Columns: input image, Trellis, Hunyuan-2.0, TripoSG, Hi3DGen, and Ours; an additional gallery compares against closed-source models.
@article{wu2025direct3ds2gigascale3dgeneration,
title={Direct3D-S2: Gigascale 3D Generation Made Easy with Spatial Sparse Attention},
author={Shuang Wu and Youtian Lin and Feihu Zhang and Yifei Zeng and Yikang Yang and Yajie Bao and Jiachen Qian and Siyu Zhu and Philip Torr and Xun Cao and Yao Yao},
journal={arXiv preprint arXiv:2505.17412},
year={2025}
}