Real-time Spatial AI: Multi-object tracking, mapping and recognition in dynamic environments


Research team:

The Ph.D. student's activities will take place jointly in the Robot Vision group at the I3S-CNRS laboratory of the University of Côte d'Azur (UCA) and in the Pattern Analysis and Computer Vision (PAVIS) Research Line at Istituto Italiano di Tecnologia (IIT), under the supervision of Dr. Alessio Del Bue (IIT) and Dr. Andrew Comport (UCA).

Context:

Deep learning is at the heart of recent advances in visual perception, allowing large amounts of training data to be distilled into compact forms that can be reused online as prior knowledge. Real-time visual sensing algorithms form the basis for spatially intelligent systems such as augmented reality interfaces or autonomous robots. This project is situated at the intersection of visual localization, mapping and lifelong learning.

Color and depth (RGB-D) sensors have recently proven essential for spatially aware systems. This project therefore aims to exploit RGB-D cameras directly, in a sensor-based approach that combines end-to-end training of deep convolutional neural networks (CNNs) with unsupervised 3D reconstruction systems.

The research objective is to propose and study new paradigms and concepts for real-time spatial intelligence in dynamic environments. This involves developing techniques to segment, map and track multiple objects in 3D environments, and to use prior knowledge of the environment to reason about the 3D scene layout. Target applications include human-based navigation, object re-localisation, interactive augmented reality and robot navigation in crowded environments.


PhD Subject:

Spatial intelligence deals with spatial judgment and the ability to visualize with the mind's eye. It underpins navigation, visualizing objects from different angles and in different spaces, scene recognition and noticing fine details. The aim of this PhD is therefore to tackle this challenging area through the fundamentals of visual object reconstruction, whilst taking into account the added difficulty of real-time computational efficiency.

This research will focus on improving 3D object reconstruction with semantic reasoning. Classically, visual simultaneous localization and mapping (SLAM) solutions have focused on directly exploiting dense point clouds provided by RGB-D sensors [1]. On the other hand, object recognition approaches based on machine learning have been extensively studied [2].

A fundamental objective of this thesis is to develop an efficient 3D map representation for multiple objects that takes incremental sensor pose uncertainty into account. This will involve taking a live stream of 3D point clouds from a moving RGB-D sensor and mapping the segmented objects. This multi-object map representation should support incremental reconstruction of the scene and its semantics, support multiple resolutions and be able to incorporate prior knowledge. A more compact representation should allow for improved real-time performance.
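
As a purely illustrative sketch of what such a multi-object map could look like (it is not the representation to be developed in the thesis), the Python snippet below fuses labelled 3D points from successive frames into one sparse voxel grid per object. All names (MultiObjectMap, ObjectModel, transform, voxel_size) are hypothetical. The sketch assumes that each point already carries an object label from some segmentation step and that the camera pose is known exactly, so it ignores pose uncertainty, multiple resolutions and prior knowledge.

```python
import math
from dataclasses import dataclass, field
from typing import Dict, Sequence, Tuple

Point = Tuple[float, float, float]
Voxel = Tuple[int, int, int]


def transform(pose: Sequence[Sequence[float]], p: Point) -> Point:
    """Apply a 3x4 rigid transform [R | t] (camera to world) to a point."""
    x, y, z = p
    return tuple(r[0] * x + r[1] * y + r[2] * z + r[3] for r in pose)


@dataclass
class ObjectModel:
    """Sparse voxel grid for one object (hypothetical structure)."""
    voxel_size: float
    # voxel index -> (number of fused points, running mean position)
    voxels: Dict[Voxel, Tuple[int, Point]] = field(default_factory=dict)

    def integrate(self, p: Point) -> None:
        key = tuple(math.floor(c / self.voxel_size) for c in p)
        n, mean = self.voxels.get(key, (0, (0.0, 0.0, 0.0)))
        n += 1
        # Incremental mean: no need to store the raw points.
        mean = tuple(m + (c - m) / n for m, c in zip(mean, p))
        self.voxels[key] = (n, mean)


@dataclass
class MultiObjectMap:
    """One ObjectModel per segmentation label (hypothetical structure)."""
    voxel_size: float = 0.05
    objects: Dict[int, ObjectModel] = field(default_factory=dict)

    def integrate_frame(self, pose, points, labels) -> None:
        # Assumption: labels[i] is the object id of points[i],
        # and pose is the exact camera-to-world transform of this frame.
        for p, label in zip(points, labels):
            model = self.objects.setdefault(label, ObjectModel(self.voxel_size))
            model.integrate(transform(pose, p))
```

Calling integrate_frame once per incoming frame grows each object model incrementally, and storing one running mean per voxel rather than every raw point is one simple way to keep the map compact.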

Whilst 3D object tracking and mapping is a widely studied problem, very few works exploit large-scale prior information about the scene within the 3D reconstruction process. In this thesis, the central objective will be to exploit this prior knowledge about parts of the scene to assist the mapping process. A state-of-the-art deep learning technique for semantic localization and mapping [3] will provide the basis for injecting prior information about the scene objects in order to improve the precision, efficiency and robustness of pose estimation and reconstruction. The completion of hidden object parts, along with a generative approach for completing known objects [4], will form the basis of a new approach to mapping that integrates prior knowledge of the environment and objects.

Work plan:

The main goals of this thesis can be broken into the following three stages:

  • Real-time 3D mapping: developing techniques to acquire a compact multi-object 3D environment representation.

  • Spatial Learning: taking into account prior knowledge of the environment and using it to infer higher level semantics and occluded scene elements.

  • Online Learning: combining 3D mapping with learnt spatial knowledge for large-scale environment mapping.


[1] Maxime Meilland and Andrew I. Comport, On unifying key-frame and voxel-based dense visual SLAM at large scales, International Conference on Intelligent Robots and Systems, 2013, Tokyo, Japan.

[2] C. Rubino, M. Crocco and A. Del Bue, 3D object localisation from multi-view image detections, IEEE Transactions on Pattern Analysis and Machine Intelligence 40 (6), 1281-1294.

[3] X. Chen, Z. Dong, J. Song, A. Geiger and O. Hilliges, Category Level Object Pose Estimation via Neural Analysis-by-Synthesis, European Conference on Computer Vision (ECCV), 2020.

[4] Dario Rethage, Federico Tombari, Felix Achilles and Nassir Navab, Deep Learned Full-3D Object Completion from Single View, arXiv, 2018.