Hardware-Aware Quantization for Biologically Inspired Machine Learning and Inference
The high computing and memory cost of modern machine learning (ML), especially deep neural networks (DNNs), often precludes their use in resource-constrained devices. However, the promise of ML deployed on always-on edge devices requires on-device, low-latency, low-power, high-accuracy DNN inference and training. Realizing this promise entails developing techniques that reduce the precision of neural network weights, activations, or errors to improve the energy efficiency of deployment and training. This thesis focuses on hardware-software co-design optimizations that alleviate the computational cost of DNN training and inference, enabling their use in resource-constrained environments.

First, we focus on post-training quantization, proposing a method that combines second-order information (Hessians) with inter-layer dependencies to guide a bisection search for quantization configurations that stay within a user-configurable model accuracy degradation range. This method achieves latency reductions of 25.48% (ResNet50), 21.69% (MobileNetV2), and 33.28% (BERT) while maintaining model accuracy within 99.99% of the baseline.

Next, we propose a new quantization-aware training approach for mixed-precision convolutional neural networks (CNNs) targeting edge computing, including gradient scaling and channel-wise learned precision. We demonstrate the effectiveness of these techniques on the ImageNet dataset across a range of models, including EfficientNet-Lite0 (e.g., 4.14 MB of weights and activations at 67.66% accuracy) and MobileNetV2 (e.g., 3.51 MB of weights and activations at 65.39% accuracy).

We then propose a quantization method for continual learning that leverages the Hadamard domain to make efficient use of quantization ranges in the backward pass; this technique beats a floating-point baseline model by 1% when using 4-bit inputs and 12-bit accumulators for all matrix multiplications in the model.

Pushing quantization further, we explore spiking neural networks (SNNs), which use binary activations in stateful models. We start with the energy efficiency of neuromorphic hardware, which is greatly affected by the energy of storing, accessing, and updating synaptic parameters, and propose schemes for quantizing and accessing these parameters. We then study the trade-offs between learning performance and the quantization of neural dynamics, weights, and learning components in SNNs, demonstrating a 73.78% memory reduction at the cost of a 1.04% increase in test error on the dynamic vision sensor (DVS) gesture dataset. Finally, we study combinations of pruning and quantization applied in isolation, cumulatively, and simultaneously (jointly) to a state-of-the-art SNN targeting DVS gestures, showing that a modern SNN suffers no loss in accuracy down to ternary weights.
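As a rough illustration of the sensitivity-guided bisection search described above, the following minimal sketch assumes the caller supplies per-layer Hessian-based sensitivity scores and an accuracy evaluation routine; the function name, the 8-bit/32-bit split, and the helper arguments are illustrative assumptions, not the thesis's exact method.

```python
# Minimal sketch: bisection over how many of the least-sensitive layers can be
# quantized to low precision while staying within an accuracy-drop tolerance.
# `sensitivity` and `evaluate` are assumed to be provided by the user.

def bisection_quantization_search(layers, sensitivity, evaluate, tolerance):
    baseline = evaluate({name: 32 for name in layers})           # full-precision reference
    order = sorted(layers, key=lambda name: sensitivity[name])   # least sensitive first

    lo, hi = 0, len(order)
    best = {name: 32 for name in layers}
    while lo <= hi:
        mid = (lo + hi) // 2
        low_prec = set(order[:mid])
        config = {name: (8 if name in low_prec else 32) for name in layers}
        if baseline - evaluate(config) <= tolerance:
            best, lo = config, mid + 1    # within budget: try quantizing more layers
        else:
            hi = mid - 1                  # too much degradation: quantize fewer layers
    return best
```

Because accuracy degradation tends to grow monotonically as more sensitive layers are pushed to low precision, a bisection over the sensitivity-ordered layer list converges in logarithmically many accuracy evaluations.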
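For the Hadamard-domain quantization idea, the sketch below shows the basic mechanism, assuming NumPy, power-of-two dimensions, and simple uniform fake-quantization; it is not the thesis's exact training pipeline.

```python
import numpy as np

def hadamard(n):
    """Sylvester construction of an n x n Hadamard matrix (n assumed a power of two)."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H

def fake_quantize(x, bits):
    """Uniform symmetric quantize/dequantize to the given bit width."""
    qmax = 2.0 ** (bits - 1) - 1
    peak = np.abs(x).max()
    scale = peak / qmax if peak > 0 else 1.0
    return np.clip(np.round(x / scale), -qmax - 1, qmax) * scale

def hadamard_quantized_matmul(a, b, bits=4):
    """Rotate both operands into the Hadamard domain before quantizing.
    Since H is orthonormal, (aH)(H^T b) = ab up to quantization error, and the
    rotation spreads outliers so the narrow quantization range is used evenly."""
    n = a.shape[1]
    H = hadamard(n) / np.sqrt(n)          # orthonormal rotation
    a_q = fake_quantize(a @ H, bits)
    b_q = fake_quantize(H.T @ b, bits)
    return a_q @ b_q
```

The design choice is that the orthonormal rotation cancels in exact arithmetic, so the only change to the matrix product is where the quantization error lands, which is what makes low-bit inputs and accumulators viable in the backward pass.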
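Finally, as a hedged sketch of what quantizing neural dynamics in an SNN can look like, the snippet below implements one leaky integrate-and-fire step with a fixed-point membrane potential; the bit widths, soft reset, and function names are illustrative assumptions rather than the configurations evaluated in the thesis.

```python
import numpy as np

def to_fixed_point(x, bits, frac_bits):
    """Round to a signed fixed-point grid with `frac_bits` fractional bits."""
    scale = 2.0 ** frac_bits
    qmax = 2 ** (bits - 1) - 1
    return np.clip(np.round(x * scale), -qmax - 1, qmax) / scale

def quantized_lif_step(v, spikes_in, w, beta=0.9, threshold=1.0,
                       state_bits=8, frac_bits=4):
    """One time step of a leaky integrate-and-fire layer with a quantized
    membrane potential. Spikes are already binary, so the memory and access
    cost is dominated by the state and the synaptic weights."""
    current = spikes_in @ w                               # binary spikes gate weight rows
    v = to_fixed_point(beta * v + current, state_bits, frac_bits)
    spikes_out = (v >= threshold).astype(v.dtype)
    v = v - spikes_out * threshold                        # soft reset after firing
    return v, spikes_out
```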
History
Defense Date
- 2023-09-29
CIP Code
- 40.0501
Research Director(s)
- Siddharth Joshi
Committee Members
- Yiyu Shi
- Walter Scheirer
- Michael Niemier
Degree
- Doctor of Philosophy
Degree Level
- Doctoral Dissertation
Library Record
- 6514455
Additional Groups
- Computer Science and Engineering
Program Name
- Computer Science and Engineering