Global-correlated 3D-decoupling Transformer for Clothed Avatar Reconstruction

Zhang, Z; Sun, L; Yang, Z; Chen, L; Yang, Y

Publication Type: Conference Proceeding
Citation: Advances in Neural Information Processing Systems, 2023, 36
Issue Date: 2023-01-01

Full metadata record

Field	Value	Language
dc.contributor.author	Zhang, Z
dc.contributor.author	Sun, L
dc.contributor.author	Yang, Z
dc.contributor.author	Chen, L https://orcid.org/0000-0002-6468-5729
dc.contributor.author	Yang, Y https://orcid.org/0000-0002-0512-880X
dc.date	2023-12-10
dc.date.accessioned	2024-06-17T05:07:47Z
dc.date.available	2024-06-17T05:07:47Z
dc.date.issued	2023-01-01
dc.identifier.citation	Advances in Neural Information Processing Systems, 2023, 36
dc.identifier.issn	1049-5258
dc.identifier.uri	http://hdl.handle.net/10453/179531
dc.description.abstract	Reconstructing 3D clothed avatars from single images is a challenging task, especially when encountering complex poses and loose clothing. Current methods exhibit limitations in performance, largely attributable to their dependence on insufficient 2D image features and inconsistent query methods. To address this, we present Global-correlated 3D-decoupling Transformer for clothed Avatar reconstruction (GTA), a novel transformer-based architecture that reconstructs clothed human avatars from monocular images. Our approach leverages transformer architectures by utilizing a Vision Transformer model as an encoder for capturing global-correlated image features. Subsequently, our innovative 3D-decoupling decoder employs cross-attention to decouple tri-plane features, using learnable embeddings as queries for cross-plane generation. To effectively enhance feature fusion with the tri-plane 3D feature and human body prior, we propose a hybrid prior fusion strategy combining spatial and prior-enhanced queries, leveraging the benefits of spatial localization and human body prior knowledge. Comprehensive experiments on the CAPE and THuman2.0 datasets show that our method outperforms state-of-the-art approaches in both geometry and texture reconstruction, exhibiting high robustness to challenging poses and loose clothing, and producing higher-resolution textures. Code is available at https://github.com/River-Zhang/GTA.
dc.language	en
dc.relation.ispartof	Advances in Neural Information Processing Systems
dc.relation.ispartof	Conference on Neural Information Processing Systems
dc.rights	info:eu-repo/semantics/openAccess
dc.subject	1701 Psychology, 1702 Cognitive Sciences
dc.subject.classification	4611 Machine learning
dc.title	Global-correlated 3D-decoupling Transformer for Clothed Avatar Reconstruction
dc.type	Conference Proceeding
utslib.citation.volume	36
utslib.location.activity	New Orleans, USA
utslib.for	1701 Psychology
utslib.for	1702 Cognitive Sciences
pubs.organisational-group	University of Technology Sydney
pubs.organisational-group	University of Technology Sydney/Faculty of Engineering and Information Technology
pubs.organisational-group	University of Technology Sydney/Strength - AAII - Australian Artificial Intelligence Institute
pubs.organisational-group	University of Technology Sydney/Faculty of Engineering and Information Technology/School of Computer Science
utslib.copyright.status	open_access	*
pubs.consider-herdc	false
dc.date.updated	2024-06-17T05:07:45Z
pubs.finish-date	2023-12-16
pubs.publication-status	Published
pubs.start-date	2023-12-10
pubs.volume	36

Abstract:

Reconstructing 3D clothed avatars from single images is a challenging task, especially when encountering complex poses and loose clothing. Current methods exhibit limitations in performance, largely attributable to their dependence on insufficient 2D image features and inconsistent query methods. To address this, we present Global-correlated 3D-decoupling Transformer for clothed Avatar reconstruction (GTA), a novel transformer-based architecture that reconstructs clothed human avatars from monocular images. Our approach leverages transformer architectures by utilizing a Vision Transformer model as an encoder for capturing global-correlated image features. Subsequently, our innovative 3D-decoupling decoder employs cross-attention to decouple tri-plane features, using learnable embeddings as queries for cross-plane generation. To effectively enhance feature fusion with the tri-plane 3D feature and human body prior, we propose a hybrid prior fusion strategy combining spatial and prior-enhanced queries, leveraging the benefits of spatial localization and human body prior knowledge. Comprehensive experiments on the CAPE and THuman2.0 datasets show that our method outperforms state-of-the-art approaches in both geometry and texture reconstruction, exhibiting high robustness to challenging poses and loose clothing, and producing higher-resolution textures. Code is available at https://github.com/River-Zhang/GTA.
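
The abstract describes two mechanisms that a short sketch may help make concrete: cross-attention in which learnable embeddings act as queries over image tokens to produce plane features, and a spatial query that reads a 3D point's feature from a tri-plane. The NumPy code below is an illustrative sketch only and does not reproduce the GTA implementation. Everything in it is an assumption made for illustration: the function names (`cross_attention`, `bilinear_sample`, `query_triplane`), the sizes, the single-head attention without learned projections, the random stand-ins for the Vision Transformer output and the trained embeddings, the choice to generate all three planes the same way, and the fusion of plane features by concatenation. The prior-enhanced query and the hybrid prior fusion strategy are not shown.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, tokens):
    # Single-head scaled dot-product cross-attention (no learned projections,
    # an assumption): queries (Q, D) attend to image tokens (N, D).
    scale = np.sqrt(queries.shape[-1])
    weights = softmax(queries @ tokens.T / scale, axis=-1)  # (Q, N)
    return weights @ tokens, weights

def bilinear_sample(plane, u, v):
    # plane: (R, R, D). u indexes columns and v indexes rows, both in [-1, 1].
    res = plane.shape[0]
    x = np.clip((u + 1.0) * 0.5 * (res - 1), 0, res - 1)
    y = np.clip((v + 1.0) * 0.5 * (res - 1), 0, res - 1)
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    x1, y1 = min(x0 + 1, res - 1), min(y0 + 1, res - 1)
    wx, wy = x - x0, y - y0
    top = (1 - wx) * plane[y0, x0] + wx * plane[y0, x1]
    bottom = (1 - wx) * plane[y1, x0] + wx * plane[y1, x1]
    return (1 - wy) * top + wy * bottom

def query_triplane(planes, point):
    # Spatial query: project the 3D point onto each axis-aligned plane,
    # sample a feature from each, and concatenate (fusion is an assumption).
    x, y, z = point
    return np.concatenate([
        bilinear_sample(planes["xy"], x, y),
        bilinear_sample(planes["yz"], y, z),
        bilinear_sample(planes["xz"], x, z),
    ])

rng = np.random.default_rng(0)
num_tokens, dim, res = 16, 8, 4  # hypothetical sizes
image_tokens = rng.normal(size=(num_tokens, dim))  # stand-in for encoder output

planes, attn = {}, {}
for name in ("xy", "yz", "xz"):
    # Random stand-in for embeddings that would be learned during training.
    learnable_queries = rng.normal(size=(res * res, dim))
    out, weights = cross_attention(learnable_queries, image_tokens)
    planes[name] = out.reshape(res, res, dim)
    attn[name] = weights

point_feature = query_triplane(planes, (0.25, -0.5, 1.0))
print(point_feature.shape)  # (24,)
```

Each plane here holds `res * res` feature vectors, one per learnable query, so the number of queries fixes the plane resolution. A point's feature is three sampled vectors of length `dim` joined end to end, which is why the printed shape is `(24,)` for `dim = 8`.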

Please use this identifier to cite or link to this item:

http://hdl.handle.net/10453/179531