ClusteringSDF is designed to fuse inconsistent 2D segments into 3D while reconstructing the object geometry by predicting their SDFs. To achieve this, we sample rays from individual 2D segment maps and split them into N groups corresponding to the N distinct 2D labels \(\{C_1,...,C_N\}\). The proposed \(L_{diff}\) then leverages normalized SDF distributions encompassing c channels for individual objects (represented by different colors) and keeps the clustering centers apart, while \(L_{onehot}\) further encourages the predicted clusters to be in one-hot format.
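To make the two objectives above concrete, here is a minimal sketch of what they could look like on a batch of rays. This is an illustrative stand-in, not the paper's exact formulation: `clustering_losses`, the dot-product overlap for \(L_{diff}\), and the max-probability penalty for \(L_{onehot}\) are all assumptions for demonstration.

```python
import numpy as np

def clustering_losses(probs, labels):
    """Hedged sketch of the two clustering objectives (not the paper's exact losses).

    probs:  (R, C) softmax-normalized per-ray distributions over C SDF channels
    labels: (R,)   2D segment label for each sampled ray (N distinct values)
    """
    # L_onehot-style term: push each ray's distribution toward one-hot
    # by penalizing the probability mass outside the dominant channel.
    l_onehot = float(np.mean(1.0 - probs.max(axis=1)))

    # L_diff-style term: average the distributions of rays sharing a 2D
    # label to get a cluster center per label, then penalize overlap
    # (dot product) between centers of different labels.
    centers = np.stack([probs[labels == k].mean(axis=0)
                        for k in np.unique(labels)])
    overlap = centers @ centers.T
    n = len(centers)
    l_diff = float(overlap[~np.eye(n, dtype=bool)].mean()) if n > 1 else 0.0
    return l_diff, l_onehot
```

When rays from different 2D labels concentrate on different SDF channels, both terms are small; mixed or flat distributions drive both up, which is the behavior the figure describes.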
3D decomposition/segmentation remains a challenge because large-scale 3D annotated data is not readily available. Contemporary approaches typically leverage 2D machine-generated segments, integrating them for 3D consistency. While the majority of these methods are based on NeRFs, they share a potential weakness: the instance/semantic embedding features are derived from independent MLPs, preventing the segmentation network from learning the geometric details of the objects directly through radiance and density. Therefore, in this paper, we propose ClusteringSDF, a novel approach that achieves both segmentation and reconstruction in 3D via a neural implicit surface representation, specifically the Signed Distance Function (SDF), where segmentation rendering is directly integrated with the volume rendering of neural implicit surfaces. Although built on ObjectSDF++, ClusteringSDF no longer requires ground-truth segments for supervision, yet it retains the capability of reconstructing individual object surfaces, relying purely on the noisy and inconsistent labels from pre-trained models. At the core of ClusteringSDF is a highly efficient clustering mechanism for lifting 2D labels to 3D. Experimental results on challenging scenes from the ScanNet and Replica datasets show that ClusteringSDF achieves performance competitive with the state of the art while significantly reducing training time.
Columns 2-4: semantic segmentation. Columns 5-7: instance segmentation. Segmentation comparisons from multiple camera views demonstrate that our segmentation is consistent across views. Moreover, the segmentation results of ClusteringSDF align more closely with the actual geometry of the objects (the areas marked by the red boxes) than those of the state-of-the-art Contrastive Lift.
As ClusteringSDF is built upon an object-compositional neural implicit surface representation, and our \(L_{onehot}\) encourages the model to assign each pixel to distinct SDF channels, it retains the capability of reconstructing the surfaces of individual objects in the scene. Note that since ObjectSDF++ uses ground-truth labels for supervision, it imposes more fine-grained segmentation on some objects, which can make some of its results appear inferior to our approach.
@misc{wu2024clusteringsdf,
title={ClusteringSDF: Self-Organized Neural Implicit Surfaces for 3D Decomposition},
author={Tianhao Wu and Chuanxia Zheng and Tat-Jen Cham and Qianyi Wu},
year={2024},
eprint={2403.14619},
archivePrefix={arXiv},
primaryClass={cs.CV}
}