Amodal3R: Amodal 3D Reconstruction from Occluded 2D Images

Tianhao Wu\(^{1*}\), Chuanxia Zheng\(^{2\dagger}\), Frank Guan\(^3\), Andrea Vedaldi\(^2\), Tat-Jen Cham\(^1\)

\(^1\)S-Lab, Nanyang Technological University; \(^2\)Visual Geometry Group, University of Oxford; \(^3\)Singapore Institute of Technology

\(^{\dagger}\)Project Lead



TL;DR: Given partially visible objects within images, Amodal3R reconstructs semantically meaningful 3D assets with reasonable geometry and plausible appearance.

Abstract: Most image-based 3D object reconstructors assume that objects are fully visible, ignoring occlusions that commonly occur in real-world scenarios. In this paper, we introduce Amodal3R, a conditional 3D generative model designed to reconstruct 3D objects from partial observations. We start from a "foundation" 3D generative model and extend it to recover plausible 3D geometry and appearance from occluded objects. We introduce a mask-weighted multi-head cross-attention mechanism followed by an occlusion-aware attention layer that explicitly leverages occlusion priors to guide the reconstruction process. We demonstrate that, by training solely on synthetic data, Amodal3R learns to recover full 3D objects even in the presence of occlusions in real scenes. It substantially outperforms existing methods that independently perform 2D amodal completion followed by 3D reconstruction, thereby establishing a new benchmark for occlusion-aware 3D reconstruction.

Examples


Compared with pipelines that perform 2D amodal completion followed by 3D reconstruction, Amodal3R achieves higher 3D reconstruction quality from occluded objects. The target objects and occluders are outlined in red and green, respectively.


[Interactive gallery: paired panels showing the input image with an occluded target and the corresponding Amodal3R reconstruction]

Methodology

Methodology Illustration

Overview: Given an image and point prompts in the regions of interest, Amodal3R first extracts the partially visible target object, along with visibility and occlusion masks, using an off-the-shelf 2D segmenter. It then applies DINOv2 to extract features \(c_{dino}\) as additional conditioning for the 3D reconstructor. To enhance occlusion reasoning, each transformer block incorporates a mask-weighted cross-attention layer (via \(c_{vis}\)) and an occlusion-aware attention layer (via \(c_{occ}\)), ensuring the 3D reconstructor accurately perceives visible information while effectively inferring the occluded parts.
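To make the conditioning pathway concrete, below is a minimal PyTorch sketch of one transformer block that combines mask-weighted cross-attention over DINOv2 features \(c_{dino}\) (weighted by the visibility mask \(c_{vis}\)) with an occlusion-aware attention step driven by the occlusion mask \(c_{occ}\). All class and argument names, dimensions, and the exact weighting scheme are illustrative assumptions, not the released Amodal3R implementation.

```python
# Minimal sketch of a transformer block with mask-weighted cross-attention and
# occlusion-aware self-attention. Names, shapes, and the weighting scheme are
# illustrative assumptions, not the released Amodal3R code.
import torch
import torch.nn as nn


class MaskWeightedBlock(nn.Module):
    def __init__(self, dim=768, heads=12):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.norm1, self.norm2, self.norm3 = (nn.LayerNorm(dim) for _ in range(3))
        # Learned embedding that tags latent tokens lying in occluded regions.
        self.occ_embed = nn.Parameter(torch.zeros(1, 1, dim))

    def forward(self, x, c_dino, vis_weight, occ_mask):
        # x:          (B, N, dim)  latent 3D tokens being decoded
        # c_dino:     (B, M, dim)  DINOv2 patch features of the segmented target
        # vis_weight: (B, M)       per-patch visibility weight in [0, 1]  (c_vis)
        # occ_mask:   (B, N)       1 where a token falls in an occluded region  (c_occ)

        # Occlusion-aware self-attention: occluded tokens are tagged so the block
        # can separate regions to be hallucinated from directly observed ones.
        h = self.norm1(x + occ_mask.unsqueeze(-1) * self.occ_embed)
        x = x + self.self_attn(h, h, h, need_weights=False)[0]

        # Mask-weighted cross-attention: down-weight image features from invisible
        # patches so only reliably observed evidence conditions the reconstruction.
        c = c_dino * vis_weight.unsqueeze(-1)
        h = self.norm2(x)
        x = x + self.cross_attn(h, c, c, need_weights=False)[0]

        return x + self.mlp(self.norm3(x))
```

In this sketch, suppressing invisible patches keeps unreliable pixels from steering the cross-attention, while the occlusion embedding lets the attention layers treat regions that must be inferred differently from those that are directly observed.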

Related Links

Check out these related and concurrent works, which offer thought-provoking ideas in this direction:
  • TRELLIS. A native 3D generative model built on a unified Structured Latent representation and Rectified Flow Transformers, enabling versatile and high-quality 3D asset creation.
  • GaussianAnything. GaussianAnything generates high-quality and editable surfel Gaussians through a cascaded 3D diffusion pipeline, given single-view images or texts as the conditions.
  • Real3D. Real3D scales up training data of single-view LRMs by enabling self-training on in-the-wild images.
  • LaRa. LaRa is a feed-forward 2DGS model trained in two days using 4 GPUs.