With the rapid development of 3D acquisition technology, 3D sensors have become increasingly available and affordable, including various types of 3D scanners, LiDAR, and RGB-D cameras (such as Kinect, RealSense, and Apple depth cameras). The 3D data acquired by these sensors provides rich geometry, shape, and scale information. Complementing 2D images, 3D data offers an opportunity to better understand the environment around the machine. It has applications in many fields, including autonomous driving, robotics, remote sensing, medical treatment, and design.

Following the tremendous advances in deep learning methods for computer vision, a large body of literature has investigated to what extent this technology can be applied to object detection from LiDAR point clouds.

Compared with images, LiDAR point cloud data is three-dimensional and sparse. How the point cloud is encoded before detection is therefore particularly important. At present, most algorithms perform point cloud object detection in a bird's-eye view. There are two main families of point cloud encoding/preprocessing methods:

  • The point cloud is voxelized at a fixed resolution, and the set of voxels in each vertical column is encoded into a fixed-length, hand-crafted feature, finally forming a multi-channel pseudo-image. Representative methods are MV3D, AVOD, PIXOR, and Complex-YOLO;
  • PointNet-style processing of the unordered point cloud, represented by Frustum PointNets, VoxelNet, and SECOND; the latter two encode in a bird's-eye view and require 3D convolution operations.
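
Both families start from the same preprocessing step: grouping the raw points into a bird's-eye-view grid. A minimal sketch of this binning into vertical columns is shown below; the range and resolution values mimic common KITTI-style settings but are assumptions here, not fixed by the text.

```python
import numpy as np

def bin_points_into_pillars(points, x_range=(0.0, 69.12), y_range=(-39.68, 39.68),
                            pillar_size=0.16, max_points=100):
    """Group LiDAR points of shape (N, 3) into vertical columns (pillars).

    Illustrative sketch only: ranges, pillar size, and the per-pillar
    point cap are assumed, KITTI-style values.
    """
    points = np.asarray(points, dtype=np.float32)
    # Discard points outside the bird's-eye-view region of interest.
    keep = ((points[:, 0] >= x_range[0]) & (points[:, 0] < x_range[1]) &
            (points[:, 1] >= y_range[0]) & (points[:, 1] < y_range[1]))
    pts = points[keep]
    # Pillar index on the 2D grid: no binning along z, so each cell is
    # a full-height vertical column of space.
    ix = ((pts[:, 0] - x_range[0]) // pillar_size).astype(int)
    iy = ((pts[:, 1] - y_range[0]) // pillar_size).astype(int)
    pillars = {}
    for key, p in zip(zip(ix, iy), pts):
        bucket = pillars.setdefault(key, [])
        if len(bucket) < max_points:  # cap points per pillar (sampled randomly in practice)
            bucket.append(p)
    return {k: np.stack(v) for k, v in pillars.items()}
```

The voxel-based methods above would further split each column along z; the pillar-based encoding described next deliberately skips that step.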

PointPillars, proposed in this paper, continues the line of work of VoxelNet and SECOND. VoxelNet introduces the PointNet idea into voxel feature encoding after voxelization, then uses 3D convolution for feature extraction, and finally applies a traditional 2D convolutional detection head. SECOND exploits the sparsity of point cloud features and replaces dense convolution with sparse 3D convolution, which gives a large boost in speed. PointPillars, in contrast, does not partition the vertical columns into voxels along the height axis at all, thereby removing the 3D convolution operation entirely.
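
The pillar encoding step can be sketched as follows. Each pillar's points are decorated with offsets to the pillar's point mean and to the pillar's x/y center, pushed through one shared linear layer (random weights stand in here for the learned PointNet), max-pooled per pillar, and scattered back to a BEV pseudo-image that a standard 2D CNN can consume. This is a simplified illustration, not the paper's implementation: reflectance is omitted, so decorated points have 8 dimensions rather than the paper's 9, and the grid shape is an assumed KITTI-style value.

```python
import numpy as np

def pillars_to_pseudo_image(pillars, pillar_size=0.16, x_min=0.0, y_min=-39.68,
                            grid=(432, 496), n_out=64, seed=0):
    """Simplified PointPillars-style encoder (sketch with assumed shapes).

    `pillars` maps (ix, iy) grid indices to arrays of shape (M, 3).
    """
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((8, n_out)).astype(np.float32) * 0.1  # stand-in for the learned layer
    canvas = np.zeros(grid + (n_out,), dtype=np.float32)  # (n_x, n_y, C)
    for (ix, iy), pts in pillars.items():
        pts = np.asarray(pts, dtype=np.float32)
        mean = pts.mean(axis=0)                    # arithmetic mean of the pillar's points
        cx = x_min + (ix + 0.5) * pillar_size      # pillar center, x
        cy = y_min + (iy + 0.5) * pillar_size      # pillar center, y
        # Decorated point: [x, y, z, offsets to mean, offsets to pillar center] -> 8 dims.
        decorated = np.concatenate(
            [pts, pts - mean, pts[:, :1] - cx, pts[:, 1:2] - cy], axis=1)
        # Shared linear layer + ReLU, then a symmetric max over the pillar's points.
        feat = np.maximum(decorated @ W, 0.0).max(axis=0)  # (C,)
        canvas[ix, iy] = feat                      # scatter back onto the BEV grid
    return canvas.transpose(2, 0, 1)               # (C, H, W) for a 2D conv backbone
```

Because empty pillars simply stay zero on the canvas, all of the expensive work happens only on occupied columns, and everything downstream is ordinary 2D convolution.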

In [1], the authors propose PointPillars, a novel encoder that utilizes PointNets to learn a representation of point clouds organized in vertical columns (pillars).

#computer-vision #object-detection #machine-learning #point-cloud

Fast Encoders for Object Detection From Point Clouds