ClusteringSDF is designed to fuse inconsistent 2D segments into 3D while reconstructing the object geometry by predicting their SDFs. To achieve this, we sample rays from individual 2D segment maps and split them into N groups corresponding to the N distinct 2D labels \(\{C_1,...,C_N\}\). The proposed \(L_{diff}\) then leverages normalized SDF distributions encompassing c channels for individual objects (represented by different colors) and keeps the clustering centers apart, while \(L_{onehot}\) further encourages the predicted clusters to be in one-hot format.
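To make the two objectives above concrete, here is a minimal sketch of what they could look like on a batch of rays. This is an illustrative stand-in, not the paper's exact formulation: `clustering_losses`, the dot-product overlap for \(L_{diff}\), and the max-probability penalty for \(L_{onehot}\) are all assumptions for demonstration.

```python
import numpy as np

def clustering_losses(probs, labels):
    """Hedged sketch of the two clustering objectives (not the paper's exact losses).

    probs:  (R, C) softmax-normalized per-ray distributions over C SDF channels
    labels: (R,)   2D segment label for each sampled ray (N distinct values)
    """
    # L_onehot-style term: push each ray's distribution toward one-hot
    # by penalizing the probability mass outside the dominant channel.
    l_onehot = float(np.mean(1.0 - probs.max(axis=1)))

    # L_diff-style term: average the distributions of rays sharing a 2D
    # label to get a cluster center per label, then penalize overlap
    # (dot product) between centers of different labels.
    centers = np.stack([probs[labels == k].mean(axis=0)
                        for k in np.unique(labels)])
    overlap = centers @ centers.T
    n = len(centers)
    l_diff = float(overlap[~np.eye(n, dtype=bool)].mean()) if n > 1 else 0.0
    return l_diff, l_onehot
```

When rays from different 2D labels concentrate on different SDF channels, both terms are small; mixed or flat distributions drive both up, which is the behavior the figure describes.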
3D decomposition/segmentation remains a challenge because large-scale 3D annotated data is not readily available. Contemporary approaches typically leverage 2D machine-generated segments, integrating them for 3D consistency. While the majority of these methods are based on NeRFs, they share a potential weakness: the instance/semantic embedding features are derived from independent MLPs, preventing the segmentation network from learning the geometric details of the objects directly through radiance and density. Therefore, in this paper, we propose ClusteringSDF, a novel approach that achieves both segmentation and reconstruction in 3D via a neural implicit surface representation, specifically the Signed Distance Function (SDF), where segmentation rendering is directly integrated with the volume rendering of neural implicit surfaces. Although built on ObjectSDF++, ClusteringSDF no longer requires ground-truth segments for supervision, yet it retains the capability of reconstructing individual object surfaces, relying purely on the noisy and inconsistent labels from pre-trained models. At the core of ClusteringSDF is a highly efficient clustering mechanism for lifting 2D labels to 3D. Experimental results on challenging scenes from the ScanNet and Replica datasets show that ClusteringSDF achieves performance competitive with the state of the art while significantly reducing training time.
Columns 2-4: semantic segmentation. Columns 5-7: instance segmentation. Segmentation comparisons from multiple camera views demonstrate that our segmentation is consistent across views. Moreover, the segmentation results of ClusteringSDF align more closely with the actual geometry of the objects (the areas marked by the red boxes) than those of the state-of-the-art Contrastive Lift.
As ClusteringSDF is built upon an object-compositional neural implicit surface representation, and our \(L_{onehot}\) encourages the model to assign each pixel to distinct SDF channels, it retains the capability of reconstructing the surfaces of individual objects in the scene. Note that since ObjectSDF++ uses ground-truth labels for supervision, it imposes more fine-grained segmentation on some objects, which can make some of its results appear inferior to our approach.
@misc{wu2024clusteringsdf,
title={ClusteringSDF: Self-Organized Neural Implicit Surfaces for 3D Decomposition},
author={Tianhao Wu and Chuanxia Zheng and Tat-Jen Cham and Qianyi Wu},
year={2024},
eprint={2403.14619},
archivePrefix={arXiv},
primaryClass={cs.CV}
}