Amodal3R: Amodal 3D Reconstruction from Occluded 2D Images

Tianhao Wu\(^{1*}\), Chuanxia Zheng\(^{2\dagger}\), Frank Guan\(^3\), Andrea Vedaldi\(^2\), Tat-Jen Cham\(^1\)

\(^1\)S-Lab, Nanyang Technological University; \(^2\)Visual Geometry Group, University of Oxford; \(^3\)Singapore Institute of Technology

\(^{\dagger}\)Project Lead



TL;DR: Given partially visible objects within images, Amodal3R reconstructs semantically meaningful 3D assets with reasonable geometry and plausible appearance.

Abstract: Most image-based 3D object reconstructors assume that objects are fully visible, ignoring occlusions that commonly occur in real-world scenarios. In this paper, we introduce Amodal3R, a conditional 3D generative model designed to reconstruct 3D objects from partial observations. We start from a "foundation" 3D generative model and extend it to recover plausible 3D geometry and appearance from occluded objects. We introduce a mask-weighted multi-head cross-attention mechanism followed by an occlusion-aware attention layer that explicitly leverages occlusion priors to guide the reconstruction process. We demonstrate that, by training solely on synthetic data, Amodal3R learns to recover full 3D objects even in the presence of occlusions in real scenes. It substantially outperforms existing methods that independently perform 2D amodal completion followed by 3D reconstruction, thereby establishing a new benchmark for occlusion-aware 3D reconstruction.

Examples


Compared with pipelines that perform 2D amodal completion followed by 3D reconstruction, Amodal3R achieves higher 3D reconstruction quality from occluded objects. The target objects and occluders are outlined in red and green, respectively.


[Interactive gallery: paired panels showing the input image with an occluded target and the corresponding Amodal3R reconstruction]

Methodology

Methodology Illustration

Overview: Given an image and point prompts in the regions of interest, Amodal3R first extracts the partially visible target object, along with visibility and occlusion masks, using an off-the-shelf 2D segmenter. It then applies DINOv2 to extract features \(c_{dino}\) as additional conditioning for the 3D reconstructor. To enhance occlusion reasoning, each transformer block incorporates a mask-weighted cross-attention layer (via \(c_{vis}\)) and an occlusion-aware attention layer (via \(c_{occ}\)), ensuring the 3D reconstructor accurately perceives visible information while effectively inferring the occluded parts.
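To make the conditioning pathway concrete, below is a minimal PyTorch sketch of one transformer block that combines mask-weighted cross-attention over DINOv2 features \(c_{dino}\) (weighted by the visibility mask \(c_{vis}\)) with an occlusion-aware attention step driven by the occlusion mask \(c_{occ}\). All class and argument names, dimensions, and the exact weighting scheme are illustrative assumptions, not the released Amodal3R implementation.

```python
# Minimal sketch of a transformer block with mask-weighted cross-attention and
# occlusion-aware self-attention. Names, shapes, and the weighting scheme are
# illustrative assumptions, not the released Amodal3R code.
import torch
import torch.nn as nn


class MaskWeightedBlock(nn.Module):
    def __init__(self, dim=768, heads=12):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.norm1, self.norm2, self.norm3 = (nn.LayerNorm(dim) for _ in range(3))
        # Learned embedding that tags latent tokens lying in occluded regions.
        self.occ_embed = nn.Parameter(torch.zeros(1, 1, dim))

    def forward(self, x, c_dino, vis_weight, occ_mask):
        # x:          (B, N, dim)  latent 3D tokens being decoded
        # c_dino:     (B, M, dim)  DINOv2 patch features of the segmented target
        # vis_weight: (B, M)       per-patch visibility weight in [0, 1]  (c_vis)
        # occ_mask:   (B, N)       1 where a token falls in an occluded region  (c_occ)

        # Occlusion-aware self-attention: occluded tokens are tagged so the block
        # can separate regions to be hallucinated from directly observed ones.
        h = self.norm1(x + occ_mask.unsqueeze(-1) * self.occ_embed)
        x = x + self.self_attn(h, h, h, need_weights=False)[0]

        # Mask-weighted cross-attention: down-weight image features from invisible
        # patches so only reliably observed evidence conditions the reconstruction.
        c = c_dino * vis_weight.unsqueeze(-1)
        h = self.norm2(x)
        x = x + self.cross_attn(h, c, c, need_weights=False)[0]

        return x + self.mlp(self.norm3(x))
```

In this sketch, suppressing invisible patches keeps unreliable pixels from steering the cross-attention, while the occlusion embedding lets the attention layers treat regions that must be inferred differently from those that are directly observed.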

Related Links

Check out these related and concurrent works, which offer thought-provoking ideas in this direction:
  • TRELLIS. A native 3D generative model built on a unified Structured Latent representation and Rectified Flow Transformers, enabling versatile and high-quality 3D asset creation.
  • GaussianAnything. GaussianAnything generates high-quality and editable surfel Gaussians through a cascaded 3D diffusion pipeline, given single-view images or texts as the conditions.
  • Real3D. Real3D scales up training data of single-view LRMs by enabling self-training on in-the-wild images.
  • LaRa. LaRa is a feed-forward 2DGS model trained in two days using 4 GPUs.