© 2021 IEEE. The Convolutional Neural Network (CNN) is considered one of the most significant technological innovations in computer vision. As CNNs have become popular, research has focused on improving network accuracy by increasing the number of layers. As the number of layers grows, however, so do the time required to run the network and the computational load, and making the network run efficiently becomes the main challenge. Alongside accuracy, speed and power consumption have become important design parameters, so dedicated platforms are needed to implement deep learning algorithms. Reconfigurable Field Programmable Gate Arrays (FPGAs) have recently been adopted to implement and accelerate CNN algorithms. This paper studies the convolutional layer in detail, covering its nested loops, the applicable optimization techniques, and the resulting variations of the hardware implementation. Its main contribution is to optimize the convolution layer step by step and to examine the effect of each step on latency and hardware resources. The convolution layers of a CNN are computationally intensive, and the convolution operation relies on nested loops, so these loops have the greatest impact on performance and latency. The CNN architecture is developed in the high-level language C. The accelerator designs are built with High-Level Synthesis (HLS) and rely on 32-bit floating-point arithmetic. Performance is evaluated in terms of latency and resource utilization. This study presents a CNN inference accelerator with various optimizations that exploit FPGA parallelism.
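
To make the nested-loop structure of a convolution layer concrete, the following is a minimal C sketch using 32-bit floating-point arithmetic. It is not the accelerator described in the paper: the function name `conv_layer`, the layer dimensions, and the choice of stride 1 with no padding are assumptions made only for illustration. The abstract does not name an HLS tool, so the directive mentioned in the comment (written in Vitis HLS syntax) is likewise an assumption about where a loop optimization would typically be applied.

```c
#include <assert.h>

/* Hypothetical layer dimensions, chosen only for illustration. */
#define IN_CH  3                /* number of input feature maps  */
#define OUT_CH 4                /* number of output feature maps */
#define IN_H   8                /* input height                  */
#define IN_W   8                /* input width                   */
#define K      3                /* square kernel size            */
#define OUT_H  (IN_H - K + 1)   /* stride 1, no padding          */
#define OUT_W  (IN_W - K + 1)

/* Direct convolution written as six nested loops over fixed bounds. */
void conv_layer(float in[IN_CH][IN_H][IN_W],
                float w[OUT_CH][IN_CH][K][K],
                float bias[OUT_CH],
                float out[OUT_CH][OUT_H][OUT_W])
{
    /* Sanity check on the illustrative dimensions. */
    assert(K <= IN_H && K <= IN_W);

    for (int oc = 0; oc < OUT_CH; oc++) {           /* output feature maps */
        for (int oy = 0; oy < OUT_H; oy++) {        /* output rows         */
            for (int ox = 0; ox < OUT_W; ox++) {    /* output columns      */
                float acc = bias[oc];
                for (int ic = 0; ic < IN_CH; ic++) {        /* input maps   */
                    for (int ky = 0; ky < K; ky++) {        /* kernel rows  */
                        for (int kx = 0; kx < K; kx++) {    /* kernel cols  */
                            /* Assumption: an HLS flow would place a loop
                             * directive around here, for example
                             * "#pragma HLS PIPELINE II=1" in Vitis HLS.
                             * It is left as a comment so the code stays
                             * plain, portable C. */
                            acc += in[ic][oy + ky][ox + kx] * w[oc][ic][ky][kx];
                        }
                    }
                }
                out[oc][oy][ox] = acc;
            }
        }
    }
}
```

Each output element requires IN_CH × K × K multiply-accumulate operations, so the three inner loops dominate the cycle count. This is why loop-level optimizations such as pipelining and unrolling tend to have the largest effect on latency, at the cost of additional hardware resources.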