Neural Computing and Applications, vol. 37, no. 22, pp. 17825-17857, 2025 (Scopus)
The transformer is a type of deep neural network that relies on the self-attention mechanism and was initially used in the field of natural language processing. Researchers have adopted transformers for computer vision (CV) applications because of their strong data representation capabilities. Transformer-based models match or surpass other network architectures, including convolutional and recurrent neural networks, on a variety of visual benchmarks. In this work, we survey recent methods for video anomaly detection (VAD) that use vision transformer models. The main topics we explore comprise vision transformers in CV applications, with a special focus on VAD methods that leverage the transformer architecture. We also briefly present anomaly detection methods based on transformers. Additionally, we address the advantages, challenges, and current limitations of the transformer architecture, as well as potential solutions to these technical challenges. In the concluding section of this study, we offer avenues for further investigation into the use of vision transformers in VAD tasks.
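For reference, below is a minimal sketch of the scaled dot-product self-attention operation that the abstract identifies as the core of the transformer. The single-head setup, matrix names, and dimensions are illustrative assumptions, not the architecture of any specific model from the surveyed literature.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention.

    X:          (n_tokens, d_model) input token embeddings
    Wq, Wk, Wv: (d_model, d_k) learned projection matrices
    Returns:    (n_tokens, d_k) attention output.
    """
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # pairwise token affinities
    weights = softmax(scores, axis=-1)  # each row sums to 1
    return weights @ V                  # weighted mix of value vectors

# Toy usage: 4 tokens (e.g., patch embeddings of a video frame), model dim 8.
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)
print(out.shape)  # (4, 8)
```

Because every token attends to every other token, this operation captures global dependencies across image patches or video frames, which is the representational property the surveyed VAD methods build on.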