PointCloud Semantic Processing

While lidar and RGB-D cameras are widely used for obstacle avoidance and mapping, their use for semantic understanding of the environment is still relatively unexplored.

  1. VoxNet: A 3D Convolutional Neural Network for Real-time Object Recognition - Maturana and Scherer (Robotics Institute, CMU)

    First, they convert the input pointcloud into a volumetric occupancy grid: a 3D lattice that stores a probabilistic estimate of occupancy in each cell as a function of incoming sensor data and prior knowledge. This representation is richer than one with only free vs. occupied cells. They try two kinds of resolution: fixed, where the network can use the relative scale of objects (e.g., humans and cars have fairly consistent physical sizes), and relative, where loss of information for very small objects is avoided. They feed this voxel grid to a 3D CNN to obtain classification scores. After training, the CNN's filters are able to recognize spatial structures such as planes or corners at different orientations, and by stacking multiple layers of such filters the CNN can detect a hierarchy of more complex features.
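
    A minimal sketch of this pipeline, assuming a 32^3 grid and a simplified binary occupancy (the paper keeps a probabilistic estimate); the layer sizes are illustrative, not an exact reproduction of the paper's network:

```python
import numpy as np
import torch
import torch.nn as nn

def occupancy_grid(points, grid=32, bounds=1.0):
    """Binary occupancy grid from an (N, 3) point cloud in [-bounds, bounds]^3."""
    vox = np.zeros((grid, grid, grid), dtype=np.float32)
    idx = ((points + bounds) / (2 * bounds) * grid).astype(int)
    idx = np.clip(idx, 0, grid - 1)
    vox[idx[:, 0], idx[:, 1], idx[:, 2]] = 1.0
    return vox

class VoxNetLike(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(1, 32, kernel_size=5, stride=2), nn.ReLU(),  # 32 -> 14
            nn.Conv3d(32, 32, kernel_size=3), nn.ReLU(),           # 14 -> 12
            nn.MaxPool3d(2),                                       # 12 -> 6
        )
        self.classifier = nn.Sequential(
            nn.Flatten(), nn.Linear(32 * 6 * 6 * 6, 128), nn.ReLU(),
            nn.Linear(128, num_classes),
        )

    def forward(self, x):                 # x: (B, 1, 32, 32, 32)
        return self.classifier(self.features(x))

points = np.random.rand(2048, 3) * 2 - 1                    # toy point cloud
vox = torch.from_numpy(occupancy_grid(points))[None, None]  # (1, 1, 32, 32, 32)
logits = VoxNetLike()(vox)
```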

  2. Multi-view Convolutional Neural Networks for 3D Shape Recognition - Su et al (University of Massachusetts, Amherst)

    While voxels are better at providing explicit depth information, projected 2D images have higher spatial resolution, can be processed faster, and make it easier to learn generic 2D features thanks to massive image databases. First, they render the shape from multiple views and feed each view through a common CNN architecture to get a feature map per view. In the view-pooling layer, they take an element-wise maximum across the views' feature maps. This max-pooled feature map is then fed through the rest of the CNN.
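
    A small sketch of the view-pooling idea (illustrative layer sizes, not the paper's VGG-scale trunk): each view passes through a shared 2D CNN, the per-view feature maps are reduced with an element-wise max, and the pooled map goes through the remaining layers:

```python
import torch
import torch.nn as nn

class MVCNNLike(nn.Module):
    def __init__(self, num_classes=40):
        super().__init__()
        # shared per-view trunk (much smaller than the paper's backbone)
        self.cnn1 = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(4),
        )
        # layers applied after view pooling
        self.cnn2 = nn.Sequential(nn.Flatten(), nn.Linear(32 * 4 * 4, num_classes))

    def forward(self, views):                         # views: (B, V, 3, H, W)
        b, v = views.shape[:2]
        feats = self.cnn1(views.flatten(0, 1))        # (B*V, 32, 4, 4)
        feats = feats.view(b, v, *feats.shape[1:])    # (B, V, 32, 4, 4)
        pooled = feats.max(dim=1).values              # element-wise max over views
        return self.cnn2(pooled)

logits = MVCNNLike()(torch.randn(2, 12, 3, 64, 64))   # 12 rendered views per shape
```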

  3. Volumetric and Multi-View CNNs for Object Classification on 3D Data - Qi et al (Stanford)

    Recently, two types of CNNs have been developed for pointclouds: CNNs based on volumetric representations and CNNs based on multi-view representations. The fact that CNNs based on multiple 2D views have achieved better performance indicates that existing volumetric CNNs were unable to exploit the power of 3D representations. To improve on them, this paper introduces two distinct network architectures.
    i) The first network mitigates overfitting by introducing auxiliary training tasks (sketched below). The network outputs a 2x2x2 feature map with 512 channels. They take the 512-dimensional vector at each of the 8 locations (each of which has only a partial receptive field) and use it to classify the object. This forces the network to predict the object class from partial subvolumes alone.

    ii) In the second network architecture, they project the 3D volumetric information to 2D, similar to X-ray scanning, and use this projection for classification (see the sketch below). Unlike standard rendering, this 2D feature map can capture the internal structure of objects.

    To counter orientation sensitivity in both networks, they use multi-orientation volumetric CNNs: the network aggregates information from different orientations using multi-orientation pooling, similar to the view pooling in "Multi-view Convolutional Neural Networks for 3D Shape Recognition".
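
    A minimal sketch of the auxiliary subvolume task from the first network (a simplification with assumed head sizes): the 2x2x2 x 512 output map is split into its eight partial-receptive-field vectors, and each one must classify the object on its own:

```python
import torch
import torch.nn as nn

class SubvolumeAuxHead(nn.Module):
    def __init__(self, num_classes=40, dim=512):
        super().__init__()
        self.aux = nn.Linear(dim, num_classes)   # shared across the 8 locations

    def forward(self, fmap):                     # fmap: (B, 512, 2, 2, 2)
        vecs = fmap.flatten(2).transpose(1, 2)   # (B, 8, 512), one vector per location
        return self.aux(vecs)                    # (B, 8, num_classes) auxiliary logits

fmap = torch.randn(4, 512, 2, 2, 2)              # pretend volumetric CNN output
aux_logits = SubvolumeAuxHead()(fmap)
# Training would add a cross-entropy loss on each of the 8 auxiliary predictions
# in addition to the main classification loss.
```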
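
    And a rough sketch of the X-ray-like projection idea from the second network (a deliberate simplification: the paper uses learned anisotropic probing kernels, while here the occupancy volume is simply summed along one axis, so interior voxels still contribute, and then classified with a plain 2D CNN):

```python
import torch
import torch.nn as nn

class ProjectionCNN(nn.Module):
    def __init__(self, num_classes=40):
        super().__init__()
        self.cnn2d = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(4),
            nn.Flatten(), nn.Linear(32 * 4 * 4, num_classes),
        )

    def forward(self, vox):                      # vox: (B, 1, D, H, W) occupancy
        xray = vox.sum(dim=2)                    # integrate along depth -> (B, 1, H, W)
        return self.cnn2d(xray)

logits = ProjectionCNN()(torch.rand(2, 1, 32, 32, 32))
```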

  4. PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation - Qi et al (Stanford)

    In this paper, they design a neural network architecture that directly consumes pointclouds. The network has three key properties: it is invariant to permutations of the input set, it combines local and global structure for point-wise segmentation, and it can learn to be invariant to rotations, translations, etc. For classification, a global feature [f1, f2, ...] = g(MAX_{i=1..n}{h(x_i)}) is computed (g and h are approximated by deep neural networks) and fed through another network to get class scores. For segmentation, local and global knowledge must be combined, so they concatenate the global pointcloud feature vector to each point's features and extract new per-point features that are aware of both local and global information. To help the network be invariant to geometric transformations of the pointcloud, they learn an alignment network that applies an affine transformation to the input points and to the per-point feature vectors.
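
    A compact PointNet-style sketch (layer sizes are illustrative and the alignment network is omitted): a shared per-point MLP h, a symmetric max over the point set, a classifier g on the global feature, and a segmentation head on the concatenation of per-point and global features:

```python
import torch
import torch.nn as nn

class TinyPointNet(nn.Module):
    def __init__(self, num_classes=40, num_seg_classes=13):
        super().__init__()
        self.h1 = nn.Sequential(nn.Linear(3, 64), nn.ReLU())   # shared per-point MLP, part 1
        self.h2 = nn.Linear(64, 1024)                          # shared per-point MLP, part 2
        self.g = nn.Sequential(                                 # classifier on the global feature
            nn.Linear(1024, 256), nn.ReLU(), nn.Linear(256, num_classes)
        )
        self.seg = nn.Sequential(                               # per-point head on [local, global]
            nn.Linear(64 + 1024, 256), nn.ReLU(), nn.Linear(256, num_seg_classes)
        )

    def forward(self, pts):                          # pts: (B, N, 3)
        local = self.h1(pts)                         # (B, N, 64) intermediate per-point features
        per_point = self.h2(local)                   # (B, N, 1024)
        global_feat = per_point.max(dim=1).values    # symmetric max over the point set
        cls_logits = self.g(global_feat)             # (B, num_classes)
        expanded = global_feat[:, None].expand(-1, pts.shape[1], -1)
        seg_logits = self.seg(torch.cat([local, expanded], dim=-1))  # (B, N, num_seg_classes)
        return cls_logits, seg_logits

cls_logits, seg_logits = TinyPointNet()(torch.rand(2, 1024, 3))
```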

  5. PointNet++: Deep Hierarchical Feature Learning on Point Sets in a Metric Space - Qi et al (Stanford)

    The basic idea of PointNet is to learn a spatial encoding of each point and then aggregate all individual point features into a global pointcloud signature. Because PointNet cannot capture local structure, PointNet++ (similar to CNNs) extracts local features that capture fine geometric structures from small neighborhoods; these local features are further grouped into larger units and processed to produce higher-level features. At each level, they sample centroids using iterative farthest point sampling (unlike a voxel grid, this sampling captures more detail in dense clusters and less detail in sparse regions), assign points to overlapping clusters, and use PointNet to compute a feature vector for each cluster. Instead of extracting features at multiple scales around every centroid (multi-scale grouping), they use multi-resolution grouping, which is more computationally efficient.
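
    A small numpy sketch of iterative farthest point sampling, the centroid-selection step described above (details such as the random seed point are assumptions):

```python
import numpy as np

def farthest_point_sampling(points, k):
    """Pick k centroid indices from an (N, 3) cloud; each new pick is the point
    farthest from everything already chosen."""
    n = points.shape[0]
    chosen = np.zeros(k, dtype=int)
    dist = np.full(n, np.inf)
    chosen[0] = np.random.randint(n)                 # arbitrary seed point
    for i in range(1, k):
        diff = points - points[chosen[i - 1]]
        dist = np.minimum(dist, np.einsum('ij,ij->i', diff, diff))
        chosen[i] = int(np.argmax(dist))             # farthest from the current set
    return chosen

pts = np.random.rand(2048, 3)
centroid_idx = farthest_point_sampling(pts, 128)
# Each centroid then gathers a local neighborhood (e.g. a ball query), and a
# small PointNet summarizes that neighborhood into one feature vector.
```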

Accuracy Comparison for the above Networks
