Hier-SLAM++: Neuro-Symbolic Semantic SLAM with a Hierarchically Categorical Gaussian Splatting

1Faculty of Information Technology, Monash University 2VinUniversity 3Mohamed bin Zayed University of Artificial Intelligence
Teaser Image

We present Hier-SLAM++, a neuro-symbolic semantic 3D Gaussian Splatting SLAM system supporting RGB-D and monocular inputs, utilizing a compact hierarchical symbolic representation that encodes semantic and geometric information via LLMs and 3D generative models.

Abstract

We propose Hier-SLAM++, a comprehensive Neuro-Symbolic semantic 3D Gaussian Splatting SLAM method supporting both RGB-D and monocular inputs and featuring an advanced hierarchical categorical representation, which enables accurate pose estimation as well as global 3D semantic mapping.

The parameter usage in semantic SLAM systems increases significantly with the growing complexity of the environment, making scene understanding particularly challenging and costly. To address this problem, we introduce a novel and general hierarchical representation that encodes both semantic and geometric information in a compact form into 3D Gaussian Splatting, leveraging the capabilities of large language models (LLMs) as well as the 3D generative model. By utilizing the proposed hierarchical tree structure, semantic information is symbolically represented and learned in an end-to-end manner. We further introduce a novel semantic loss designed to optimize hierarchical semantic information through both inter-level and cross-level optimization.
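The two loss terms can be sketched as follows. This is a minimal illustration under assumed simplifications: a toy two-level tree, per-level logits attached to each Gaussian, the inter-level term taken as independent per-level cross-entropies, and the cross-level term as a cross-entropy over the joint root-to-leaf distribution. The tree, loss forms, and numbers are invented for illustration and are not the paper's implementation.

```python
import math

# Hypothetical two-level tree with 2 coarse nodes, each holding 2 leaf
# classes. TREE and both loss forms are illustrative only; they sketch
# the inter-level / cross-level idea, not the paper's actual code.
TREE = {0: [0, 1], 1: [2, 3]}

def log_softmax(logits):
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

def hier_semantic_loss(coarse_logits, fine_logits, gt_path, w_cross=1.0):
    """Sum of an inter-level term (independent cross-entropy per tree
    level) and a cross-level term (cross-entropy over the joint
    root-to-leaf distribution, which ties the levels together)."""
    gt_coarse, gt_fine = gt_path
    lp_c = log_softmax(coarse_logits)
    lp_f = log_softmax(fine_logits)
    inter = -(lp_c[gt_coarse] + lp_f[gt_fine])
    # Joint log-prob of each full root-to-leaf path.
    joint = [lp_c[c] + lp_f[f] for c, leaves in TREE.items() for f in leaves]
    cross = -log_softmax(joint)[gt_fine]  # leaf ids index paths in order here
    return inter + w_cross * cross

# Logits that favor the ground-truth path give a lower loss.
good = hier_semantic_loss([5.0, 0.0], [5.0, 0.0, 0.0, 0.0], (0, 0))
bad = hier_semantic_loss([0.0, 5.0], [0.0, 0.0, 5.0, 0.0], (0, 0))
assert good < bad
```

In practice these per-level predictions would be rendered per pixel and supervised against 2D semantic labels; the sketch only shows how the two terms combine for a single primitive.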

Additionally, we propose an improved SLAM system to support both RGB-D and monocular inputs using a feed-forward model. To the best of our knowledge, this is the first semantic monocular Gaussian Splatting SLAM system, significantly reducing sensor requirements for 3D semantic understanding and broadening the applicability of semantic Gaussian SLAM systems. We conduct experiments on both synthetic and real-world datasets, demonstrating superior or on-par performance with state-of-the-art NeRF-based and Gaussian-based SLAM systems, while significantly reducing storage and training time requirements.

Pipeline Overview

Within Hier-SLAM++, we model the semantic space as a hierarchical tree, where semantic information is represented as a root-to-leaf hierarchical symbolic structure. This representation not only comprehensively captures the hierarchical attributes of semantic information, but also effectively compresses storage requirements, while enhancing semantic understanding with valuable scaling-up capability. To construct an effective tree, we leverage both LLMs and 3D generative models to integrate semantic and geometric information. Next, each 3D Gaussian primitive is augmented with a hierarchical semantic embedding code, which can be learned in an end-to-end manner through the proposed Inter-level and Cross-level semantic losses. Furthermore, our method is extended to support monocular input by incorporating geometric priors extracted from the feed-forward method, thereby eliminating the dependency on depth information.
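The storage-compression claim can be made concrete with a back-of-envelope sketch: a flat representation needs one logit per class, while a balanced hierarchical code needs only one logit per branch at each level. The branching factor and depth below are illustrative assumptions, not the paper's actual tree.

```python
# Back-of-envelope sketch of the storage saving from hierarchical codes.
# Branching factor and depth are illustrative, not the paper's tree.

def flat_dims(num_classes):
    # Flat representation: one logit per semantic class.
    return num_classes

def hier_dims(branching, depth):
    # Hierarchical code: one logit per branch at each tree level.
    return branching * depth

# A balanced tree with branching factor 5 and depth 3 addresses
# 5 ** 3 = 125 >= 102 leaf classes with only 15 logits per Gaussian.
assert 5 ** 3 >= 102
assert hier_dims(5, 3) < flat_dims(102)
```

The gap widens as the class vocabulary grows, since the hierarchical code scales roughly logarithmically with the number of leaf classes while the flat representation scales linearly.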

Overview of the Hier-SLAM++ pipeline. Left: Hierarchical representation of semantic information. The Tree Generation process leverages the capabilities of both LLMs and 3D Generative Models. This hierarchical tree is used to establish a symbolic coding for each Gaussian primitive. Additionally, we introduce a novel loss function that combines Inter-level Loss and Cross-level Loss to optimize the hierarchical semantic representation. Bottom Left: An example of hierarchical semantic rendering. Right: The global 3D Gaussian map is initialized using the first frame of the video stream input. The system then alternates between the Tracking and Mapping steps as new frames are processed. In the RGB-D setting, depth is directly obtained from sensor input, whereas in the monocular setting the 3D feed-forward method (DUSt3R) is used to generate a geometric prior for depth estimation.

Rendering results on Replica scenes (Room 0, Room 1, Room 2, and Office 0), each shown as paired RGB and semantic views.

Image Rendering

The visual comparison clearly demonstrates that our method achieves superior rendering quality, more closely resembling the ground-truth images than state-of-the-art approaches on the Replica dataset.

Semantic Rendering

The first four rows demonstrate rendered semantic segmentation in a coarse-to-fine manner. The fifth row shows the finest semantic rendering, equivalent to the flat representation with the 102 original semantic classes of the Replica dataset.
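Coarse-to-fine rendering follows directly from the symbolic codes: truncating each root-to-leaf code at level k merges all leaves that share a prefix into one coarser class, while keeping the full code recovers the flat class set. A minimal sketch with a hypothetical tree (labels and codes are invented for illustration, not taken from Replica's taxonomy):

```python
# Hypothetical root-to-leaf codes: each flat class maps to a tuple of
# per-level branch indices (a made-up 3-level tree, purely illustrative).
CODES = {
    "chair":  (0, 0, 0),   # furniture -> seating -> chair
    "sofa":   (0, 0, 1),   # furniture -> seating -> sofa
    "table":  (0, 1, 0),   # furniture -> surface -> table
    "window": (1, 0, 0),   # structure -> opening -> window
}

def render_at_level(label, level):
    """Truncate the symbolic code at `level` to get a coarser class id."""
    return CODES[label][:level]

# Level 1 (coarsest): chair and table merge into the same group.
assert render_at_level("chair", 1) == render_at_level("table", 1)
# Level 3 (finest): full codes are distinct, matching the flat class set.
assert render_at_level("chair", 3) != render_at_level("sofa", 3)
```

Each row of the figure corresponds to one such truncation level, with the fifth row using the full-depth codes.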

BibTeX

@misc{li2025hierslamneurosymbolicsemanticslam,
      title={Hier-SLAM++: Neuro-Symbolic Semantic SLAM with a Hierarchically Categorical Gaussian Splatting}, 
      author={Boying Li and Vuong Chi Hao and Peter J. Stuckey and Ian Reid and Hamid Rezatofighi},
      year={2025},
      eprint={2502.14931},
      archivePrefix={arXiv},
      primaryClass={cs.RO},
      url={https://arxiv.org/abs/2502.14931}, 
}