We investigate the role of representations and architectures for classifying 3D shapes in terms of their computational efficiency, generalization, and robustness to adversarial transformations. By varying the number of training examples and employing cross-modal transfer learning we study the role of initialization of existing deep architectures for 3D shape classification. Our analysis shows that multiview methods continue to offer the best generalization even without pretraining on large labeled image datasets, and even when trained on simplified inputs such as binary silhouettes. Furthermore, the performance of voxel-based 3D convolutional networks and point-based architectures can be improved via cross-modal transfer from image representations. Finally, we analyze the robustness of 3D shape classifiers to adversarial transformations and present a novel approach for generating adversarial perturbations of a 3D shape for multiview classifiers using a differentiable renderer. We find that point-based networks are more robust to point-position perturbations, while voxel-based and multiview networks are easily fooled by the addition of imperceptible noise to the input.

Techniques for analyzing 3D shapes are becoming increasingly important due to the vast number of sensors capturing 3D data, as well as numerous computer graphics applications. In recent years a variety of deep architectures have been proposed for classifying 3D shapes. These range from multiview approaches that render a shape from a set of views and deploy image-based classifiers, to voxel-based approaches that analyze shapes represented as a 3D occupancy grid, to point-based approaches that classify shapes represented as a collection of points. However, there is relatively little work that studies the tradeoffs offered by these modalities and their associated techniques. This paper studies three of these tradeoffs: the ability to generalize from a few examples, computational efficiency, and robustness to adversarial transformations.

We pick a representative technique for each modality. For the multiview representation we choose the Multiview CNN (MVCNN) architecture; for the voxel-based representation we choose VoxNet, constructed using convolution and pooling operations on a 3D grid; and for the point-based representation we choose the PointNet architecture. The analysis is done on the widely used ModelNet40 shape classification benchmark.

Some of our analysis leads to surprising results. For example, with deeper architectures and a modified rendering technique that uses a black background and better centers the object in the image, the performance of a vanilla MVCNN can be improved to 95.0% per-instance accuracy on the benchmark, outperforming several recent approaches. Another example: while it is widely believed that the strong performance of MVCNN is due to the use of networks pretrained on large image datasets (e.g., ImageNet), we find that even without such pretraining MVCNN obtains 91.3% accuracy, outperforming several voxel-based and point-based counterparts that also do not rely on such pretraining. Furthermore, the performance of MVCNN remains at 93.6% even when trained with binary silhouettes (instead of shaded images) of shapes, suggesting that shading offers relatively little extra information on this benchmark for MVCNN.

We then systematically analyze the generalization ability of the models. First we analyze the accuracy of the various models while varying the number of training examples per category. We find that the multiview approaches generalize faster, obtaining near-optimal performance with far fewer examples than the other approaches. We then analyze the role of initialization of these networks. As 3D shape datasets are small in comparison to large image datasets, we employ cross-modal distillation techniques to guide learning. In particular, we use representations extracted from pretrained MVCNNs to guide the learning of voxel-based and point-based networks (see the first sketch below). Cross-modal distillation improves the performance of VoxNet and PointNet, especially when training data is limited.

Finally, we analyze the robustness of these classifiers to adversarial perturbations. While generating adversarial inputs to VoxNet and PointNet is straightforward, this is not the case for multiview methods due to the rendering step. To this end we design an end-to-end differentiable MVCNN that takes a voxel representation as input and generates a set of views using a differentiable renderer (see the second sketch below).
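To make the cross-modal distillation step concrete, here is a minimal PyTorch sketch of one common formulation, soft-target distillation in the style of Hinton et al.; the model names (`teacher_mvcnn`, `student_net`), input tensors, and hyperparameters (`T`, `alpha`) are illustrative assumptions rather than the paper's exact setup.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    # Soft targets from the teacher (KL between temperature-softened
    # distributions) combined with the usual hard-label cross-entropy.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

def train_step(teacher_mvcnn, student_net, optimizer, views, points, labels):
    # The pretrained MVCNN teacher is frozen; only the point/voxel
    # student is updated. `views` are rendered images of the same
    # shapes that `points` describe.
    teacher_mvcnn.eval()
    with torch.no_grad():
        teacher_logits = teacher_mvcnn(views)
    student_logits = student_net(points)
    loss = distillation_loss(student_logits, teacher_logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Because the teacher is frozen, only the student receives gradients; the temperature `T` controls how much of the teacher's inter-class similarity structure is transferred.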
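The second sketch shows how adversarial perturbations can be generated for a multiview classifier once the rendering step is differentiable. This is a generic PGD-style attack under stated assumptions, not necessarily the paper's exact procedure: `render` stands in for a differentiable voxel-to-views renderer and `mvcnn` for the multiview classifier under attack.

```python
import torch
import torch.nn.functional as F

def adversarial_voxels(voxels, label, render, mvcnn,
                       steps=20, step_size=0.01, epsilon=0.05):
    # PGD-style attack through a differentiable rendering step:
    # gradients flow from the classification loss, through the
    # rendered views, back to the voxel occupancies.
    delta = torch.zeros_like(voxels, requires_grad=True)
    for _ in range(steps):
        views = render(voxels + delta)            # differentiable rendering
        loss = F.cross_entropy(mvcnn(views), label)
        loss.backward()                           # gradient w.r.t. delta
        with torch.no_grad():
            delta += step_size * delta.grad.sign()  # ascend the loss
            delta.clamp_(-epsilon, epsilon)         # keep perturbation small
        delta.grad.zero_()
    # Clamp occupancies back to a valid [0, 1] grid.
    return (voxels + delta).clamp(0.0, 1.0).detach()
```

The sign-of-gradient update and the `epsilon` ball are standard choices for keeping the perturbation imperceptible; attacks on VoxNet or PointNet are simpler because the loss can be differentiated with respect to the input directly, with no rendering in between.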