E2E feature and matching learning for camera localization (2022)

We propose an end-to-end framework that jointly learns keypoint detection, descriptor representation, and cross-frame matching for image-based 3D localization. We design a self-supervised image-warping correspondence loss for both feature detection and matching, a weakly supervised epipolar-constraint loss for relative camera pose learning, and a directional matching scheme that detects keypoint features in a source image and performs a coarse-to-fine correspondence search in the target image. Benchmarking on public datasets shows that this end-to-end framework yields more accurate localization than both traditional methods and state-of-the-art weakly supervised methods. (In submission for review.)
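To illustrate what an epipolar-constraint loss measures, here is a minimal sketch, not the loss used in the work itself. It assumes matched points are given in normalized image coordinates and that the relative pose (R, t) maps source-camera coordinates to target-camera coordinates; the function names `skew`, `essential_matrix`, and `epipolar_loss` are hypothetical.

```python
def skew(t):
    """Return the 3x3 cross-product matrix [t]_x of a 3-vector."""
    tx, ty, tz = t
    return [[0.0, -tz, ty],
            [tz, 0.0, -tx],
            [-ty, tx, 0.0]]


def matmul3(a, b):
    """Multiply two 3x3 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def essential_matrix(R, t):
    """E = [t]_x R for a pose with X_target = R X_source + t (assumed convention)."""
    return matmul3(skew(t), R)


def epipolar_loss(E, matches):
    """Mean squared algebraic residual x_t^T E x_s over matched points.

    matches: list of ((xs, ys), (xt, yt)) in normalized image coordinates.
    """
    total = 0.0
    for (xs, ys), (xt, yt) in matches:
        s = (xs, ys, 1.0)
        t = (xt, yt, 1.0)
        Es = [sum(E[i][k] * s[k] for k in range(3)) for i in range(3)]
        r = sum(t[i] * Es[i] for i in range(3))
        total += r * r
    return total / len(matches)


# Toy example: pure sideways translation, one 3D point at (0.5, 0.2, 2.0).
R_identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
E = essential_matrix(R_identity, (1.0, 0.0, 0.0))
good_match = [((0.25, 0.1), (0.75, 0.1))]  # consistent with the pose
bad_match = [((0.25, 0.1), (0.75, 0.3))]   # violates the epipolar constraint
loss_good = epipolar_loss(E, good_match)
loss_bad = epipolar_loss(E, bad_match)
```

A correct correspondence drives the residual to zero, while a wrong one is penalized, so the loss needs only relative pose as supervision and no ground-truth correspondences.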

Single view category level pose estimation & reconstruction (2021)

We present a deep learning model (Glissando-Net) to simultaneously estimate the pose and reconstruct the 3D shape of objects at the category level from a single RGB image. The network is composed of two auto-encoders that are jointly trained, one for RGB images and the other for point clouds. We embrace two key design choices in Glissando-Net to achieve a more accurate prediction of the 3D shape and pose of the object given a single RGB image as input. Extensive experiments, including ablation studies and comparisons with competing methods, demonstrate the efficacy of the proposed method, which compares favorably with the state of the art. (In submission to TPAMI for review.)

Camera network position optimization via deep learning (2020)

Efficient 3D space sampling to represent an underlying 3D object/scene is essential for 3D vision, robotics, and beyond. A standard approach is to explicitly sample a dense collection of views and formulate the task as a view selection problem, or a set cover problem. We introduce a novel approach that avoids dense view sampling. The key is to learn a view prediction network and a trainable aggregation module that takes the predicted views as input and outputs an approximation of their generic scores (e.g., surface coverage, viewing angle from surface normals). This methodology turns the set cover problem (or multi-view representation optimization) into a continuous optimization problem, arriving at solutions similar to or better than the baseline with about a 10x speedup in running time compared with standard methods.
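For context, the discrete set cover formulation mentioned above as the standard approach is commonly solved with a greedy heuristic. The sketch below shows that generic heuristic, not the learned method and not necessarily the baseline used in the experiments; the function name `greedy_view_selection` and the toy coverage table are hypothetical.

```python
def greedy_view_selection(coverage, budget):
    """Greedy set cover over candidate views.

    coverage: dict mapping a view id to the set of surface patches it sees.
    budget:   maximum number of views to select.
    Returns the chosen view ids (in pick order) and the set of covered patches.
    """
    covered = set()
    chosen = []
    remaining = dict(coverage)
    for _ in range(budget):
        # Pick the view that adds the most not-yet-covered patches.
        best = max(remaining, key=lambda v: len(remaining[v] - covered),
                   default=None)
        if best is None or not (remaining[best] - covered):
            break  # no candidates left, or no view adds new coverage
        chosen.append(best)
        covered |= remaining.pop(best)
    return chosen, covered


# Toy example: four densely sampled candidate views over seven surface patches.
candidate_views = {
    "v0": {0, 1, 2},
    "v1": {2, 3},
    "v2": {3, 4, 5, 6},
    "v3": {0, 6},
}
chosen, covered = greedy_view_selection(candidate_views, budget=3)
```

The cost of this formulation grows with the number of densely sampled candidates, since every candidate view must be scored at every step.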

Probabilistic 3D occupancy inference (2010)

This paper shows that occluders in the interaction space of dynamic objects can be detected, and their 3D shape fully recovered, as a byproduct of shape-from-silhouette (SfS) analysis. The setting is a calibrated, inward-looking multi-camera setup in natural, uncontrolled environments, where occlusions are common and unavoidable for SfS techniques. We provide a Bayesian sensor fusion formulation to process all occlusion cues occurring in a multi-view sequence. Experiments on several outdoor natural-environment datasets and an indoor dataset show that the shape of static occluders can be robustly recovered from pure dynamic object motion, and that this information can be used for online self-correction and consolidation of dynamic object shape reconstruction. The result was accepted at CVPR 2007 as an oral presentation. Please check out the publication page for details. The dataset sculpture2ppl used in the paper is available online.
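As a rough illustration of Bayesian fusion on an occupancy grid, the sketch below updates a single voxel from several views in log-odds form. It is a generic textbook update that assumes conditionally independent observations, not the occlusion model of the paper; the function name `fuse_occupancy` and the example likelihood values are hypothetical.

```python
import math


def fuse_occupancy(prior, likelihoods):
    """Posterior probability that one voxel is occupied.

    prior:       P(occupied) before any observation, strictly between 0 and 1.
    likelihoods: list of (P(obs | occupied), P(obs | free)) pairs, one per view,
                 assumed conditionally independent given the voxel state.
    """
    log_odds = math.log(prior / (1.0 - prior))
    for p_occupied, p_free in likelihoods:
        # Each view contributes its log-likelihood ratio.
        log_odds += math.log(p_occupied / p_free)
    return 1.0 / (1.0 + math.exp(-log_odds))


# Three views that each favor "occupied" 4:1, starting from an uninformed prior.
posterior = fuse_occupancy(0.5, [(0.8, 0.2)] * 3)
# A view whose observation is equally likely either way leaves the prior unchanged.
unchanged = fuse_occupancy(0.5, [(0.5, 0.5)])
```

Accumulating evidence over a sequence in this way is what lets weak per-frame cues consolidate into a confident estimate over time.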

Multi-object shape estimation and tracking (2009)

We propose a new algorithm to automatically detect and reconstruct scenes with a variable number of dynamic objects. Our formulation distinguishes between m different silhouettes in the scene by using automatically learnt view-specific object appearance models, eliminating the color calibration requirement. Bayesian reasoning is then applied to solve the m-shape occupancy problem, with m updated as objects enter or leave the scene.

The result has been accepted by CVPR 2008, and an extended version of this paper has been submitted to IJCV for review. Please check out the publication page for details.

Heterogeneous sensor network calibration (2008)

I propose a unified calibration technique for a heterogeneous sensor network of video camcorders and Time-of-Flight (ToF) cameras. By moving a spherical calibration target around the commonly observed scene, one can robustly and conveniently extract the sphere centers in the observed images and recover the geometric extrinsics for both types of sensors. This framework uses a space occupancy grid as a probabilistic 3D representation of scene contents. The result has been accepted by 3DPVT 2008 and M2SFA2 2008 (in conjunction with ECCV 2008), both as oral presentations. Please check out the publication page for details. The MATLAB reconstruction code is available now.