Accelerating the inference of convolutional neural networks (CNNs) is a fundamental problem in computer vision and deep learning and has attracted considerable attention. The goal is to narrow the gap between the high performance expected of deep CNNs and the large amount of computation they require. Various techniques have been explored to address this challenge, including network compression and quantization, lightweight networks targeting resource-constrained platforms, dynamic computation graphs that provide efficient early exits, and, on top of all these techniques, network parallelization. Although a large body of work exists on neural network parallelization, most of it focuses on the training phase only or on distributing a batch of instances across multiple computation cores. None of these parallelization approaches, however, reduces the inference latency of a single instance (such as an image), which is critical for real-time applications. Existing techniques for single-image inference include operator parallelism and model parallelism. The former exploits concurrency within operators such as convolution, and the latter distributes the kernels of convolutional layers across multiple cores. These approaches do not scale well and usually cannot fully utilize the large number of cores available in modern high-performance computing platforms.
In this dissertation, we introduce a new general framework, the statistical convolutional neural network (SCNN), which uses independent component analysis (ICA) to speed up instance inference and can be applied to various tasks. As a general framework, SCNN can be implemented on top of different neural network backbones, and it is orthogonal to existing speedup methods. We use ICA to decompose spatio-temporally correlated data, and the learning process propagates the extracted essential features. Performance evaluations of SCNN on video object detection and 3D cardiac cine MRI segmentation show that SCNN can achieve a large speedup over existing methods. We further introduce ICA-Net, which improves on SCNN in accuracy, throughput, and latency. The improved framework is implemented and verified on multiple tasks, including image classification, image object detection, video object detection, and 3D cardiac cine MRI segmentation. In all cases, the latency of the models is substantially reduced while accuracy reaches the state of the art.