Static Block Floating-Point Quantization for Convolutional Neural Networks on FPGA

Abstract

Convolutional neural networks (CNNs) have been widely applied in various computer vision and speech processing applications. However, the algorithmic complexity of CNNs hinders their deployment in embedded systems with limited memory and computational resources. This paper proposes static block floating-point (BFP) quantization, an effective approach that uses Kullback-Leibler divergence to determine static shared exponents. Without the need for retraining, the proposed approach is able to quantize CNNs to 8 bits with negligible accuracy loss. An FPGA-based hardware design with static BFP quantization is also proposed. Compared with 8-bit integer linear quantization, our experiments show that the hardware kernel based on static BFP quantization achieves over a 50% reduction in logic resources on an FPGA. Based on static BFP quantization, a tool implemented in the PyTorch framework is developed that automatically generates optimized configurations for given CNN models according to user requirements; the entire optimization process takes only a few minutes on an Intel Xeon Silver 4110 CPU.
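To make the idea concrete, below is a minimal PyTorch sketch of static BFP quantization with KL-divergence-based exponent calibration. It is not the paper's actual tool; the function names, candidate exponent range, and histogram settings are illustrative assumptions.

    import torch

    def bfp_quantize(x, shared_exp, mantissa_bits=8):
        # Block floating-point: every value in the block shares one exponent
        # and keeps a signed mantissa of `mantissa_bits` bits.
        scale = 2.0 ** (shared_exp - (mantissa_bits - 1))
        qmax = 2 ** (mantissa_bits - 1) - 1
        mantissa = torch.clamp(torch.round(x / scale), -qmax - 1, qmax)
        return mantissa * scale

    def kl_divergence(p, q, eps=1e-10):
        # KL(p || q) between two histograms (normalized here).
        p = p / p.sum()
        q = q / q.sum()
        return torch.sum(p * torch.log((p + eps) / (q + eps)))

    def choose_shared_exponent(calib_tensor, mantissa_bits=8, bins=2048, search=4):
        # Pick the static shared exponent whose quantized value distribution
        # stays closest (in KL divergence) to the calibration distribution.
        x = calib_tensor.flatten().float()
        lo, hi = x.min().item(), x.max().item()
        ref_hist = torch.histc(x, bins=bins, min=lo, max=hi)

        # Candidate exponents around the one implied by the block's max magnitude.
        max_exp = int(torch.ceil(torch.log2(x.abs().max() + 1e-12)).item())
        best_exp, best_kl = max_exp, float("inf")
        for exp in range(max_exp - search, max_exp + 1):
            q_hist = torch.histc(bfp_quantize(x, exp, mantissa_bits),
                                 bins=bins, min=lo, max=hi)
            kl = kl_divergence(ref_hist, q_hist).item()
            if kl < best_kl:
                best_exp, best_kl = exp, kl
        return best_exp

In this sketch, a layer's weights (or activations gathered from calibration data) would be passed to choose_shared_exponent once, offline; the resulting exponent is then fixed and reused at inference time, which is what makes the scheme static rather than dynamic.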

Publication
2019 International Conference on Field-Programmable Technology (ICFPT)