University of Notre Dame

Hardware-Aware Quantization for Biologically Inspired Machine Learning and Inference

thesis
posted on 2023-12-21, 20:14 authored by Clemens Schafer

The high computing and memory cost of modern machine learning (ML), especially deep neural networks (DNNs), often precludes their use in resource-constrained devices. However, the promise of ML deployed on always-on edge devices requires on-device, low-latency, low-power, high-accuracy DNN inference and training. Materializing this promise entails developing techniques that reduce the precision of neural network weights, activations, or errors to improve deployment and training energy efficiency. This thesis focuses on hardware-software codesign optimizations that alleviate the computational cost of DNN training and inference, enabling their use in resource-constrained environments. First, we focus on post-training quantization, proposing a method that combines second-order information (Hessians) and inter-layer dependencies to guide a bisection search for quantization configurations within a user-configurable model accuracy degradation range, achieving latency reductions of 25.48% (ResNet50), 21.69% (MobileNetV2), and 33.28% (BERT) while maintaining model accuracy within 99.99% of the baseline. Next, we propose a new quantization-aware training approach for mixed-precision convolutional neural networks (CNNs) targeting edge computing, including gradient scaling and channel-wise learned precision. We demonstrate the effectiveness of these techniques on the ImageNet dataset across a range of models, including EfficientNet-Lite0 (e.g., 4.14 MB of weights and activations at 67.66% accuracy) and MobileNetV2 (e.g., 3.51 MB of weights and activations at 65.39% accuracy). We then propose a quantization method for continual learning that leverages the Hadamard domain to make efficient use of quantization ranges in the backward pass; this technique outperforms a floating-point baseline by 1% when using 4-bit inputs and 12-bit accumulators for all matrix multiplications in the model. Pushing quantization further, we explore spiking neural networks (SNNs), i.e., stateful models with binary activations. We first examine the energy efficiency of neuromorphic hardware, which is greatly affected by the energy cost of storing, accessing, and updating synaptic parameters, and propose quantization and access schemes to reduce this cost. We then study the trade-offs between learning performance and the quantization of neural dynamics, weights, and learning components in SNNs, demonstrating a 73.78% memory reduction at the cost of a 1.04% increase in test error on the dynamic vision sensor (DVS) gesture dataset. Finally, we study pruning and quantization applied in isolation, cumulatively, and jointly to a state-of-the-art SNN targeting DVS gestures, showing that a modern SNN suffers no loss in accuracy down to ternary weights.
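
The quantization contributions summarized above all build on the same primitive: a uniform quantizer whose bit-width can differ per layer or per channel. As a minimal sketch of that primitive (not the thesis code; the function name, tensor shapes, and bit-width assignments are illustrative assumptions), the NumPy snippet below fake-quantizes a weight tensor per output channel, the operation for which a Hessian-guided bisection search or channel-wise learned precision would choose the bit-widths.

import numpy as np

def fake_quantize_per_channel(w: np.ndarray, bits: np.ndarray) -> np.ndarray:
    """Symmetric uniform "fake" quantization: each output channel of `w` is
    quantized to its own bit-width and immediately dequantized.

    w    : weight tensor, shape (out_channels, ...)
    bits : per-channel bit-widths (>= 2), shape (out_channels,)
    """
    w_q = np.empty_like(w)
    for c in range(w.shape[0]):
        n_levels = 2 ** (int(bits[c]) - 1) - 1            # signed symmetric range
        max_abs = float(np.max(np.abs(w[c])))
        scale = max_abs / n_levels if max_abs > 0 else 1.0
        w_q[c] = np.clip(np.round(w[c] / scale), -n_levels, n_levels) * scale
    return w_q

# A 4-channel layer where a (hypothetical) precision search assigned 8 bits to
# two channels and 4 bits to the other two.
rng = np.random.default_rng(0)
weights = rng.normal(size=(4, 16))
bitwidths = np.array([8, 8, 4, 4])
quantized = fake_quantize_per_channel(weights, bitwidths)
print(np.abs(weights - quantized).mean(axis=1))  # per-channel quantization error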
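
The continual-learning contribution quantizes matrix-multiplication operands in the Hadamard domain so that low-bit quantization ranges are used more efficiently. The sketch below illustrates only that core idea under simplifying assumptions: a naive Sylvester-construction Walsh-Hadamard transform, symmetric 4-bit operand quantization, and no modeling of the 12-bit accumulators mentioned in the abstract. It is not the thesis implementation.

import numpy as np

def hadamard(n: int) -> np.ndarray:
    """Unnormalized Walsh-Hadamard matrix of size n x n (n a power of two)."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H

def quantize(x: np.ndarray, bits: int) -> np.ndarray:
    """Symmetric uniform quantization to `bits` bits (returns dequantized values)."""
    n_levels = 2 ** (bits - 1) - 1
    scale = np.max(np.abs(x)) / n_levels
    return np.clip(np.round(x / scale), -n_levels, n_levels) * scale

def hadamard_domain_matmul(a: np.ndarray, b: np.ndarray, bits: int = 4) -> np.ndarray:
    """Rotate the shared (inner) dimension into the Hadamard domain, quantize both
    operands there, multiply, and undo the rotation.  Since H is symmetric with
    H @ H = n * I, dividing the product by n recovers an approximation of a @ b."""
    n = a.shape[1]
    H = hadamard(n)
    a_h = quantize(a @ H, bits)   # rotate rows of `a`, then quantize
    b_h = quantize(H @ b, bits)   # rotate columns of `b`, then quantize
    return (a_h @ b_h) / n

rng = np.random.default_rng(1)
a, b = rng.normal(size=(8, 16)), rng.normal(size=(16, 8))
print(np.abs(a @ b - hadamard_domain_matmul(a, b, bits=4)).mean())  # approximation error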

History

Defense Date

2023-09-29

CIP Code

  • 40.0501

Research Director(s)

Siddharth Joshi

Committee Members

Yiyu Shi
Walter Scheirer
Michael Niemier

Degree

  • Doctor of Philosophy

Degree Level

  • Doctoral Dissertation

Library Record

6514455

Additional Groups

  • Computer Science and Engineering

Program Name

  • Computer Science and Engineering
