Low Power Quantized YOLO for Face Detection on FPGA


Günay B., Okçu S. B., Bilge H. Ş.

International Graduate Research Symposium - IGRS'2022, İstanbul, Türkiye, 1-3 June 2022, p. 358

  • Publication Type: Conference Paper / Abstract
  • City of Publication: İstanbul
  • Country of Publication: Türkiye
  • Page Numbers: p. 358
  • Gazi University Affiliated: Yes

Abstract

With the latest advances in embedded systems, artificial intelligence applications on edge devices have increased. In older systems, data were collected from edge devices and decision making was performed on servers, so low network speed or network outages limited system performance. Modern embedded systems, however, allow smarter applications to be developed and run directly on the device. System-on-Chip (SoC) architectures, which combine a CPU and an FPGA (Field Programmable Gate Array) on a single chip, offer low power consumption while running Convolutional Neural Networks (CNNs). In this paper, we modified, trained, and deployed the TinyYOLOv3 architecture for face detection using Brevitas and the FINN framework on the PYNQ-Z2, a low-cost development board built around the Xilinx Zynq-7020 SoC. Using Brevitas, which provides quantized versions of the convolution, fully connected, and activation layers of a CNN in PyTorch, we created modified versions of TinyYOLOv3 with various integer bit precisions for weights (W) and activations (A), such as 2W4A, 3W5A, 4W2A, 4W4A, 6W4A, and 8W3A. We then trained the network in quantized form on the WiderFace dataset. To reduce power consumption and increase speed, we optimized the logic resource allocation and stored the weights and activations in the FPGA's on-chip memory. Additionally, we replaced the Sigmoid activation function in the last layer with a rescaled HardTanh. To run the trained backbone CNN on the FPGA, we synthesized it with Vitis HLS and Vivado using the FINN-HLS library, which contains C++ definitions of the model's layers. We also utilized the CPU of the SoC for preprocessing, postprocessing, and TCP/IP streaming of the results in a multithreaded design to increase throughput.
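The Sigmoid-to-HardTanh swap mentioned above can be sketched in plain Python. The helper names below (`hardtanh`, `rescaled_hardtanh`) are illustrative, not the paper's actual code; the sketch only shows how HardTanh's [-1, 1] output is shifted and scaled onto Sigmoid's (0, 1) range so the last layer remains a cheap piecewise-linear operation in hardware.

```python
def hardtanh(x, min_val=-1.0, max_val=1.0):
    # Piecewise-linear saturation: no exponentials, so it maps onto
    # FPGA fabric far more cheaply than Sigmoid.
    return max(min_val, min(max_val, x))

def rescaled_hardtanh(x):
    # Shift and scale HardTanh's [-1, 1] output to Sigmoid's [0, 1]
    # range: still saturating at both ends, linear in between.
    return 0.5 * (hardtanh(x) + 1.0)
```

Like Sigmoid, the rescaled function passes through 0.5 at the origin and saturates at 0 and 1, which keeps the detection-head outputs in the expected range without a hardware-expensive exponential.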
As a result, with the 4W4A bit precision we observed a throughput of 18 Frames Per Second (FPS), a total power consumption of 2.4 W on the PYNQ-Z2, 70% utilization of the FPGA's resources, and only a 3% drop in Mean Average Precision (mAP) compared to the non-quantized version of the model.
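To make the "nW" notation concrete, the sketch below shows symmetric uniform quantization of a weight tensor to signed n-bit integers, e.g. n = 4 for the 4W4A configuration. This is a minimal illustration of the general idea only; the per-tensor scale and rounding scheme here are assumptions, not Brevitas's exact implementation.

```python
def quantize_symmetric(weights, bits):
    # Map float weights to signed `bits`-bit integers with one
    # per-tensor scale (an illustrative choice, not Brevitas's scheme).
    qmax = 2 ** (bits - 1) - 1                      # e.g. +7 for 4 bits
    scale = max(abs(w) for w in weights) / qmax     # largest weight -> qmax
    quantized = [max(-qmax - 1, min(qmax, round(w / scale)))
                 for w in weights]
    return quantized, scale

def dequantize(quantized, scale):
    # Recover approximate float weights: integer code times scale.
    return [q * scale for q in quantized]
```

Only the integer codes and the scale need to be stored, which is what makes it feasible to keep all weights and activations in the FPGA's limited on-chip memory.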