Convolutional Neural Network (CNN) architectures have become increasingly popular in image-processing
applications such as object detection and remote sensing. Some of these applications require CNN methods to run in
real time. Embedded devices such as Field Programmable Gate Arrays (FPGAs) are a favorable platform for
implementing CNN-based algorithms. However, FPGAs have drawbacks such as limited resources and memory
bottlenecks, so mapping an entire CNN with many layers onto an FPGA without optimization is difficult. Hardware
optimization techniques are therefore essential. In this study, an FPGA-based CNN architecture designed with
high-level synthesis (HLS) is demonstrated, and a synthesis report is generated for the Xilinx Zynq-7000
xc7z020-clg484-1 target FPGA. Implementing the CNN architecture on the FPGA platform accelerates it, and to
improve throughput the proposed design optimizes the convolutional layers. The main contribution of this study is
optimizing the convolution layer by unrolling the loops over kernels and input feature maps and examining the
effects on throughput, latency, and hardware resources. With the proposed method, the first convolution layer
reaches a throughput of 15.6 GOP/s, approximately a 2.6x improvement in latency and throughput over the baseline
design.
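As a minimal sketch of the kind of loop unrolling the abstract describes, the fragment below computes one output pixel of a convolution layer with HLS UNROLL pragmas on the kernel and input-feature-map loops. The layer sizes, names, and pragma placement are illustrative assumptions, not the paper's actual configuration; in plain C++ the pragmas are ignored, so the function can be tested off-board.

```cpp
#include <cassert>

// Illustrative sizes (assumptions, not the paper's actual layer dimensions).
constexpr int K   = 3;        // kernel height/width
constexpr int CIN = 4;        // number of input feature maps (channels)
constexpr int H = 8, W = 8;   // input feature-map size

// One output pixel of one output channel. In Vitis/Vivado HLS, UNROLL on the
// channel and kernel loops replicates the multiply-accumulate hardware so the
// K*K*CIN products can be evaluated in parallel instead of sequentially,
// trading FPGA resources (DSPs, LUTs) for lower latency and higher throughput.
float conv_pixel(const float in[CIN][H][W],
                 const float wgt[CIN][K][K],
                 int row, int col) {
    float acc = 0.0f;
    for (int c = 0; c < CIN; ++c) {
#pragma HLS UNROLL
        for (int ki = 0; ki < K; ++ki) {
#pragma HLS UNROLL
            for (int kj = 0; kj < K; ++kj) {
#pragma HLS UNROLL
                acc += in[c][row + ki][col + kj] * wgt[c][ki][kj];
            }
        }
    }
    return acc;
}
```

With all three loops fully unrolled, the synthesizer can schedule all 36 multiplications (for these sizes) in the same cycle, which is the mechanism behind the latency and throughput gains the abstract reports; partial unrolling (an unroll factor) is the usual way to stay within the resource budget of a device like the xc7z020.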