iCaps: Iterative Category-Level Object Pose and Shape Estimation

Published by Junyi Geng


Tracking the 6D poses of objects provides rich information to robots performing tasks such as manipulation and navigation. Most techniques so far have dealt with instance-level 6D pose estimation, where a set of 3D CAD models of known instances is given as a prior. This significantly limits practical robotic applications, since it can be expensive or even impossible to acquire CAD models of all the objects.

In category-level 6D object pose estimation, the target object may be unseen during training and its 3D CAD model is not available. Without 3D CAD models, correspondence matching approaches face a significant challenge due to the considerable shape variation among objects. The major challenge is therefore how to handle this intra-class variability.

While considerable attention has been paid to improving 6D pose estimation, object shape estimation has been less explored. Intuitively, good shape estimation can assist object pose estimation, especially at the category level where intra-class variability is the major challenge.

Category-level Auto-encoder

We propose a pose tracking framework that uses a Rao-Blackwellized particle filter (RBPF) with an auto-encoder network for the measurement update. The auto-encoder takes a normalized depth image of an arbitrary object in a category, rendered with domain randomization, maps it to a low-dimensional code, and reconstructs a normalized depth image of a canonical object of the same category at the same orientation, without domain randomization. The code in the middle is therefore independent of the object instance and robust to intra-class variation. With this auto-encoder, we can pre-compute a codebook of codes for the canonical object at discretized orientations; the similarity between the code of the current observation and each codebook entry indicates how close the current object orientation is to that orientation in the codebook, which we use as our observation prediction.
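To make the codebook lookup concrete, here is a minimal sketch of how such a comparison could work. The code dimension, number of discretized orientations, and encoder interface are assumptions for illustration, not values from the paper's released implementation.

```python
import torch
import torch.nn.functional as F

CODE_DIM = 128        # assumed latent code dimension
NUM_ROTATIONS = 4000  # assumed number of discretized orientations

def build_codebook(encoder, canonical_depth_crops):
    """Encode normalized depth renderings of the canonical object at each
    discretized orientation into a codebook of latent codes."""
    with torch.no_grad():
        codes = encoder(canonical_depth_crops)   # (NUM_ROTATIONS, CODE_DIM)
    return F.normalize(codes, dim=1)             # unit norm for cosine similarity

def rotation_similarities(encoder, observed_crop, codebook):
    """Cosine similarity between the observed crop's code and every codebook
    entry; high similarity means the observed orientation is close to that
    codebook orientation."""
    with torch.no_grad():
        code = F.normalize(encoder(observed_crop.unsqueeze(0)), dim=1)
    sims = codebook @ code.squeeze(0)            # (NUM_ROTATIONS,)
    return torch.clamp(sims, min=0.0)            # use as unnormalized likelihood
```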

Category-level Auto-encoder

Category-level 6D Pose Tracking

We factorize the state into the translation, the object size, and the orientation distribution conditioned on translation and size. The translation and size determine the bounding boxes used to generate the normalized depth maps. These are fed to the auto-encoder network, and the observation likelihood is computed by comparing the resulting embedding against a precomputed codebook that encodes the canonical object in all discretized rotations. Intuitively, if the translation and size are sampled incorrectly, the auto-encoder will not generate a meaningful embedding, which results in a low matching score against all embeddings in the codebook.
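A rough sketch of this measurement update, building on the codebook sketch above; `crop_and_normalize_depth` is a hypothetical helper standing in for the RoI cropping and depth normalization step, and weighting each particle by its best codebook match is one simple choice.

```python
import torch

def rbpf_measurement_update(particles, encoder, codebook, depth_image, cam_K):
    """One RBPF measurement update (sketch). `particles` is an (N, 4) tensor
    of [x, y, z, size] hypotheses."""
    weights = torch.zeros(particles.shape[0])
    rotation_dists = []
    for i, (x, y, z, s) in enumerate(particles):
        # Project the hypothesized translation/size to a bounding box and
        # produce a normalized depth crop for the auto-encoder.
        crop = crop_and_normalize_depth(depth_image, (x, y, z), s, cam_K)
        sims = rotation_similarities(encoder, crop, codebook)
        rotation_dists.append(sims / (sims.sum() + 1e-8))  # conditional rotation dist.
        # A wrong translation/size yields a crop that matches no codebook
        # entry well, so the best similarity is a natural particle weight.
        weights[i] = sims.max()
    return weights / (weights.sum() + 1e-8), rotation_dists
```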

Architecture of the proposed category-level pose and shape estimation framework.

Shape Estimation and Pose Refinement

After estimating the object's pose and size, the point cloud is transformed into the object body frame and normalized by the size. A PointNet++-based deep neural network (LatentNet) then predicts the shape latent. The predicted shape latent is used together with a DeepSDF decoder network to compute SDF values, and the pose and size can be further refined by minimizing these SDF values at the observed surface points. Since a good pose and size estimate and a good shape estimate improve each other, pose refinement and shape estimation are performed iteratively.
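The sketch below illustrates this alternation, assuming pretrained LatentNet and DeepSDF-style decoder modules; the interfaces, learning rates, and the direct optimization of the rotation matrix are simplifications (a real implementation would parameterize the rotation, e.g., as axis-angle, and keep it on SO(3)).

```python
import torch

def refine_pose_and_shape(points_cam, R, t, size, latent_net, sdf_decoder,
                          outer_iters=3, inner_steps=50, lr=1e-2):
    """Alternate shape-latent prediction and SDF-based pose/size refinement
    (sketch). `points_cam` is an (N, 3) object point cloud in the camera
    frame, with p_cam = R @ p_obj + t."""
    R = R.detach().clone().requires_grad_(True)
    t = t.detach().clone().requires_grad_(True)
    s = torch.tensor(float(size), requires_grad=True)
    for _ in range(outer_iters):
        # 1) Shape estimation: normalize points into the object frame and
        #    predict the shape latent with LatentNet.
        with torch.no_grad():
            pts_obj = ((points_cam - t) @ R) / s   # R^T (p - t) / s
            latent = latent_net(pts_obj.unsqueeze(0)).squeeze(0)
        # 2) Pose refinement: drive the SDF of observed surface points to zero.
        opt = torch.optim.Adam([R, t, s], lr=lr)
        for _ in range(inner_steps):
            opt.zero_grad()
            pts_obj = ((points_cam - t) @ R) / s
            lat = latent.unsqueeze(0).expand(pts_obj.shape[0], -1)
            sdf = sdf_decoder(lat, pts_obj)        # DeepSDF-style decoder
            loss = sdf.abs().mean()                # surface points: SDF == 0
            loss.backward()
            opt.step()
    return R.detach(), t.detach(), s.item(), latent
```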

Visualization of the category-level pose and shape estimation.

Video

Publication

@article{deng2022icaps,
  title={iCaps: Iterative category-level object pose and shape estimation},
  author={Deng, Xinke and Geng, Junyi and Bretl, Timothy and Xiang, Yu and Fox, Dieter},
  journal={IEEE Robotics and Automation Letters},
  volume={7},
  number={2},
  pages={1784--1791},
  year={2022},
  publisher={IEEE}
}