Camera Motion Estimation from RGB-D-Inertial Scene Flow

To be presented at the CVPR 2024 Workshop on Visual Odometry and Computer Vision Applications.

Abstract

In this paper, we introduce a novel formulation for camera motion estimation that integrates RGB-D images and inertial data through scene flow. Our goal is to accurately estimate the camera motion in a rigid 3D environment, along with the state of the inertial measurement unit (IMU). Our proposed method offers the flexibility to operate as a multiframe optimization or to marginalize older data, thus effectively utilizing past measurements. To assess the performance of our method, we conducted evaluations using both synthetic data from the ICL-NUIM dataset and real data sequences from the OpenLORIS-Scene dataset. Our results show that the fusion of these two sensors enhances the accuracy of camera motion estimation when compared to using only visual data.

Figure. Motion estimation in an office scene from the ICL-NUIM dataset at two different times $t_1$ and $t_2$. (a) 3D representation of the scene. (b) Estimated motion of the points in the scene; each point's velocity is shown as a red arrow. (c) Zoomed-in areas.

Optimization problem

To estimate the motion field, we formulate an optimization problem over the state $\mathbf{x}$ in which camera velocity consistency is imposed together with terms corresponding to the pre-integration of the IMU readings. The joint optimization consists of minimizing a cost function $J(\mathbf{x})$ that is the sum of terms associated with the inertial measurements, $J_{i}$, and with the camera measurements, $J_{c}$. Our state estimate $\hat{\mathbf{x}}$ is the one that minimizes $J(\mathbf{x})$:

\begin{align} \hat{\mathbf{x}} = \arg\min_{\mathbf{x}} J(\mathbf{x}) = \arg\min_{\mathbf{x}} \left( J_{c}(\mathbf{x}) + J_{i}(\mathbf{x}) \right) \end{align}

As we add more frames, the result is a sliding window of $N$ frames moving along the camera trajectory. In the general case, the cost function $J(\mathbf{x})$ can be expressed compactly as follows: \begin{equation} J(\mathbf{x})= \sum_{p=i}^{i+N-1}\left(\mathbf{r}_{c_{p}}^\top{\boldsymbol\Sigma}_{c_{p}}^{-1}\mathbf{r}_{c_{p}} + \mathbf{r}_{\Delta \mathbf{v}_{p}}^\top{\boldsymbol\Sigma}_{\Delta \mathbf{v}_{p}}^{-1} \mathbf{r}_{\Delta \mathbf{v}_{p}}\right) + \mathbf{r}_{bg}^\top{\boldsymbol\Sigma}_{\boldsymbol\omega}^{-1} \mathbf{r}_{bg} + \mathbf{r}_{ba}^\top{\boldsymbol\Sigma}_{\mathbf{a}}^{-1} \mathbf{r}_{ba} + \sum_{l=i}^{i+N}\mathbf{r}_{\omega_l }^\top{\boldsymbol\Sigma}_{\boldsymbol\omega}^{-1}\mathbf{r}_{\omega_l} \end{equation} and the state $\mathbf{x} \in \mathbb{R}^{6N+8}$ is defined as: \begin{equation} \mathbf{x} = \left[ {\mathbf{v}}_i^\top,{\boldsymbol\omega}_i^\top,\dots,{\mathbf{v}}_{i+N-1}^\top,{\boldsymbol\omega}_{i+N-1}^\top,{\mathbf{g}}^\top,{\mathbf{b}^g}^\top, {\mathbf{b}^a}^\top \right]^\top \end{equation}
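To make the structure of the cost concrete, the following is a minimal NumPy sketch that evaluates $J(\mathbf{x})$ over a sliding window. The residual callables (`camera`, `delta_v`, `bias_gyro`, `bias_acc`, `omega`) are hypothetical placeholders for the scene-flow and pre-integrated IMU terms of the paper, and the index ranges are simplified for illustration; it is not the actual implementation.

```python
import numpy as np

def mahalanobis(r, Sigma):
    """Squared Mahalanobis norm r^T Sigma^{-1} r of a residual."""
    return float(r.T @ np.linalg.solve(Sigma, r))

def cost_J(frames, g, bg, ba, terms):
    """Evaluate J(x) over a sliding window (illustrative sketch).

    frames : list of dicts {'v': ..., 'w': ...}, one per frame in the window.
    g      : gravity estimate; bg, ba: gyroscope / accelerometer biases.
    terms  : dict of callables returning (residual, covariance) pairs; these
             stand in for the scene-flow and pre-integrated IMU residuals.
    """
    J = 0.0
    # Scene-flow and pre-integrated velocity terms over consecutive frame pairs.
    for fp, fq in zip(frames[:-1], frames[1:]):
        r_c, S_c = terms["camera"](fp)
        r_dv, S_dv = terms["delta_v"](fp, fq, g, bg, ba)
        J += mahalanobis(r_c, S_c) + mahalanobis(r_dv, S_dv)
    # Bias random-walk terms (one per window).
    r_bg, S_bg = terms["bias_gyro"](bg)
    r_ba, S_ba = terms["bias_acc"](ba)
    J += mahalanobis(r_bg, S_bg) + mahalanobis(r_ba, S_ba)
    # Angular-velocity terms over every frame in the window.
    for f in frames:
        r_w, S_w = terms["omega"](f, bg)
        J += mahalanobis(r_w, S_w)
    return J
```

In practice this cost would be minimized iteratively (e.g., with Gauss-Newton or Levenberg-Marquardt), with each residual weighted by the inverse of its covariance as in the equation above.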

Marginalization

Consider the case in the following figure, using a 3-frame sliding window. When frame $l$ arrives, the optimization is performed. However, in order to keep a 3-frame window, when the next frame $l+1$ arrives, we need to marginalize out $\mathbf{v}_i$ and $\boldsymbol{\omega}_i$.

The Hessian $\mathbf{H}$ contains the second derivatives of the cost function with respect to the state variables and encodes how each state variable affects the others. We denote by $\alpha$ the block of variables we want to marginalize and by $\beta$ the block of variables we want to keep. When marginalizing the set $\alpha$ of variables, we gather all factors that depend on them as well as the connected variables $\beta$. This is done by means of the Schur complement, defined as follows: \begin{equation} \mathbf{H}^* = \mathbf{H}_{\beta\beta} - \mathbf{H}_{\alpha\beta}^\top\mathbf{H}_{\alpha\alpha}^{-1}\mathbf{H}_{\alpha\beta} \end{equation}
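As a minimal sketch of this step, the NumPy function below applies the Schur complement to a dense Hessian $\mathbf{H}$ and the corresponding gradient vector $\mathbf{b}$ (which is marginalized in the same way, although only $\mathbf{H}$ appears in the equation above). Here `alpha_idx` would be the indices of $\mathbf{v}_i$ and $\boldsymbol{\omega}_i$ in the state vector; details such as how the resulting prior is re-linearized are omitted.

```python
import numpy as np

def marginalize(H, b, alpha_idx):
    """Marginalize the variables indexed by alpha_idx out of the normal
    equations H dx = b via the Schur complement.

    Returns the reduced system (H_star, b_star) on the remaining (beta)
    variables, which acts as a prior in subsequent window optimizations.
    """
    n = H.shape[0]
    alpha = np.asarray(alpha_idx)
    beta = np.setdiff1d(np.arange(n), alpha)

    H_aa = H[np.ix_(alpha, alpha)]   # block of marginalized variables
    H_ab = H[np.ix_(alpha, beta)]    # cross block (alpha rows, beta columns)
    H_bb = H[np.ix_(beta, beta)]     # block of kept variables

    # Schur complement: H* = H_bb - H_ab^T H_aa^{-1} H_ab (and likewise for b).
    H_star = H_bb - H_ab.T @ np.linalg.solve(H_aa, H_ab)
    b_star = b[beta] - H_ab.T @ np.linalg.solve(H_aa, b[alpha])
    return H_star, b_star
```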

Experiments

We chose to evaluate our proposal on an extended version of the living room sequences in the ICL-NUIM dataset. ICL-NUIM is a synthetic photorealistic dataset that provides ground-truth poses as well as 3D scene models to benchmark reconstruction and/or localization approaches. As ICL-NUIM does not provide IMU data, we fit splines to the ground-truth poses to obtain continuous trajectories and simulated IMU measurements from them. We also evaluated our RGB-D-inertial flow on the OpenLORIS-Scene datasets, in which data are collected in real-world indoor scenes, multiple times in each place, to include natural scene changes in everyday scenarios. RGB-D images and IMU measurements from a RealSense D435i are provided. The ground-truth trajectory was recorded by an OptiTrack motion capture system that tracked artificial markers mounted on the Segway robot used to record the data.
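As an illustration of how IMU readings can be derived from spline-fitted ground-truth poses, below is a minimal SciPy sketch. The function name `simulate_imu`, its arguments, and the chosen conventions are assumptions for the example; the paper's exact noise, bias, and frame conventions are not reproduced here.

```python
import numpy as np
from scipy.interpolate import CubicSpline
from scipy.spatial.transform import Rotation, RotationSpline

def simulate_imu(t_gt, p_gt, q_gt_xyzw, t_imu, g_world=np.array([0.0, 0.0, -9.81])):
    """Simulate accelerometer and gyroscope readings from ground-truth poses.

    t_gt      : (N,)   timestamps of the ground-truth poses.
    p_gt      : (N, 3) positions in the world frame.
    q_gt_xyzw : (N, 4) world-from-body orientations as xyzw quaternions.
    t_imu     : (M,)   timestamps at which to sample the simulated IMU.
    """
    pos_spline = CubicSpline(t_gt, p_gt, axis=0)
    rot_spline = RotationSpline(t_gt, Rotation.from_quat(q_gt_xyzw))

    # World-frame acceleration from the second derivative of the position spline.
    a_world = pos_spline(t_imu, 2)
    R_wb = rot_spline(t_imu)  # world-from-body rotations at the IMU timestamps
    # Specific force sensed by the accelerometer: rotate (a - g) into the body frame.
    acc_body = R_wb.inv().apply(a_world - g_world)
    # Angular rate from the rotation spline (the frame convention should be
    # checked against the desired IMU model; noise and biases can be added on top).
    gyro = rot_spline(t_imu, 1)
    return acc_body, gyro
```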

Conclusions

In this work we present a novel camera motion estimation method based on RGB-D-inertial scene flow. Specifically, we formulate the fusion of RGB-D and inertial data as a joint optimization using scene flow residuals and pre-integrated IMU residuals, weighted by their corresponding covariances. We also consider the marginalization of old states in order to keep the optimization compact. We evaluated our approach on a synthetic dataset, ICL-NUIM, and on a real dataset, OpenLORIS, both publicly available. Our results quantify the improvement that inertial fusion can offer to RGB-D scene flow techniques.

Article

This work will be presented at the CVPR 2024 Workshop on Visual Odometry and Computer Vision Applications in Seattle, USA. The article is available on arXiv.
