Comparison of DETR and YOLO11x for Small Object Detection and Motion Direction Estimation

Ozden G. U., YILMAZ D.

5th International Conference on Informatics and Software Engineering, IISEC 2026, Ankara, Türkiye, 5-6 February 2026, pp. 220-225, (Full Text Paper)

Publication Type: Conference Paper / Full Text Paper
DOI Number: 10.1109/iisec69317.2026.11418472
City of Publication: Ankara
Country of Publication: Türkiye
Page Numbers: pp. 220-225
Keywords: DETR, Motion Direction Estimation, Object Detection, Small Object Recognition, Sports Analytics, Transformer, YOLO11x
Affiliated with Gazi Üniversitesi: Yes

Abstract

Small object detection remains a persistent problem that limits the performance of deep learning models, owing to low pixel density, motion blur, and scale variation. This makes detection stability critical in applications such as sports analytics, automated referee assistance systems, and video-based tracking. This study examines how modern detection models perceive small, fast-moving objects, and specifically evaluates the behavior of transformer-based approaches in such scenes. To this end, a dataset was constructed from professional match recordings and annotated in COCO format. On this dataset, the transformer-based DETR model (ResNet-50 backbone) was compared with YOLO11x, a multi-scale model based on convolutional neural networks (CNNs). The evaluation covered AP/AR trends, precision-recall behavior, and qualitative analysis of the outputs. The results show that DETR, thanks to its global context modeling, detects small objects more consistently and tends to produce more stable confidence scores across frames. YOLO11x, by contrast, provides tighter bounding box localization due to its multi-scale architecture, but its confidence scores are less consistent than those of DETR. These findings highlight the importance of contextual information in small object detection and underscore the potential of transformer-based models in this domain. Additionally, the motion direction of the ball was estimated along its trajectory, yielding consistent and reliable directional trends across consecutive frames. From an application perspective, YOLO11x is preferable when precise localization at higher IoU thresholds is prioritized, whereas DETR can be advantageous when contextual cues support stable detections.
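
The abstract does not describe how motion direction was computed. As an illustration only, the following is a minimal sketch of one common approach, assuming the direction is taken as the angle of the displacement between the centers of consecutive ball detections. The function names `bbox_center` and `motion_directions` are hypothetical and are not taken from the paper; the boxes are assumed to use the COCO `[x, y, width, height]` layout.

```python
import math


def bbox_center(bbox):
    """Return the center (cx, cy) of a COCO-format box [x, y, width, height]."""
    x, y, w, h = bbox
    return (x + w / 2.0, y + h / 2.0)


def motion_directions(bboxes):
    """Estimate the direction of motion between consecutive detections.

    Assumption (not from the paper): direction is the angle, in degrees,
    of the displacement between consecutive box centers. Angles are in
    image coordinates, where y grows downward, so 90 degrees means
    "moving down" in the frame. A stationary step yields None.
    """
    centers = [bbox_center(b) for b in bboxes]
    angles = []
    for (x0, y0), (x1, y1) in zip(centers, centers[1:]):
        dx, dy = x1 - x0, y1 - y0
        if dx == 0 and dy == 0:
            angles.append(None)  # no displacement, direction undefined
        else:
            angles.append(math.degrees(math.atan2(dy, dx)))
    return angles


# Hypothetical ball detections in three consecutive frames.
example_boxes = [[0, 0, 10, 10], [10, 0, 10, 10], [10, 10, 10, 10]]
example_angles = motion_directions(example_boxes)
```

In this sketch the ball first moves to the right and then downward in the frame. In practice, per-frame angles from small boxes are noisy, so smoothing over several frames would likely be needed before reading off a directional trend.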