Applied Sciences (Switzerland), vol. 16, no. 2, 2026 (SCI-Expanded, Scopus)
Fingerprint recognition systems have traditionally relied on fragile workflows built around minutiae extraction, which suffer significant performance losses under real-world conditions such as sensor diversity and low image quality. This study introduces a fully minutiae-free fingerprint recognition framework based on self-supervised Vision Transformers. Multiple DINOv2 model variants are evaluated systematically, and the DINOv2-Base Vision Transformer is adopted as the primary configuration because it offers the best generalization trade-off under limited fingerprint data; larger variants are additionally analyzed to assess scalability and capacity limits. The pretrained DINOv2 network is fine-tuned via self-supervised domain adaptation on 64,801 fingerprint images, eliminating all classical enhancement, binarization, and minutiae extraction steps. Unlike the single-sensor protocols common in the literature, the proposed approach is evaluated extensively on a heterogeneous testbed spanning a wide range of sensors, image qualities, and acquisition methods, comprising 1,631 unique fingers from 12 datasets. Under these challenging conditions, the achieved EER of 5.56% demonstrates clear cross-sensor superiority over traditional systems such as VeriFinger (26.90%) and SourceAFIS (41.95%) on the same testbed. A systematic comparison across model capacities confirms that moderate-scale ViT models generalize best under limited-data conditions. Explainability analyses show that the attention maps of the model, trained without any minutiae information, overlap meaningfully with classical structural regions (IoU = 0.41 ± 0.07). The full implementation and evaluation infrastructure are openly shared, making the study reproducible and providing a standardized benchmark for future research.
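As a rough illustration of the minutiae-free pipeline the abstract describes, the sketch below loads the publicly released DINOv2-Base (ViT-B/14) checkpoint via torch.hub, extracts a global embedding per fingerprint image, and scores a pair by cosine similarity. The cosine-similarity matching rule, the 224×224 resize, and the ImageNet normalization are assumptions made for illustration; the paper additionally fine-tunes the backbone on fingerprint data via self-supervised domain adaptation, which is omitted here.

```python
# Minimal sketch of minutiae-free matching with a DINOv2-Base backbone.
# Assumptions (not specified in the abstract): cosine-similarity matching,
# 224x224 input size, ImageNet normalization, off-the-shelf hub weights.
import torch
import torch.nn.functional as F
from torchvision import transforms
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
# Publicly released DINOv2-Base (ViT-B/14); the paper fine-tunes it further.
model = torch.hub.load("facebookresearch/dinov2", "dinov2_vitb14")
model = model.to(device).eval()

preprocess = transforms.Compose([
    transforms.Grayscale(num_output_channels=3),  # fingerprints are single-channel
    transforms.Resize((224, 224)),                # 224 is divisible by the 14-px patch
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def embed(path: str) -> torch.Tensor:
    """Return an L2-normalized global (CLS token) embedding, 768-dim for ViT-B."""
    x = preprocess(Image.open(path)).unsqueeze(0).to(device)
    return F.normalize(model(x), dim=-1)

def match_score(path_a: str, path_b: str) -> float:
    """Cosine similarity in [-1, 1]; threshold it to accept or reject a pair."""
    return float(embed(path_a) @ embed(path_b).T)
```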
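The headline numbers (5.56% vs. 26.90% and 41.95%) are equal error rates. The EER is the operating point at which the false accept rate (impostor pairs scoring above the decision threshold) equals the false reject rate (genuine pairs scoring below it); the snippet below computes it from raw score arrays in the standard way, with synthetic toy scores standing in for real matcher output.

```python
import numpy as np

def equal_error_rate(genuine: np.ndarray, impostor: np.ndarray) -> float:
    """EER: the error rate at the threshold where FAR equals FRR."""
    thresholds = np.sort(np.concatenate([genuine, impostor]))
    far = np.array([(impostor >= t).mean() for t in thresholds])  # false accepts
    frr = np.array([(genuine < t).mean() for t in thresholds])    # false rejects
    idx = np.argmin(np.abs(far - frr))                            # crossing point
    return float((far[idx] + frr[idx]) / 2)

# Toy similarity scores for demonstration only (not the paper's data).
rng = np.random.default_rng(0)
genuine = rng.normal(0.7, 0.1, 1000)   # same-finger pairs
impostor = rng.normal(0.3, 0.1, 1000)  # different-finger pairs
print(f"EER = {equal_error_rate(genuine, impostor):.2%}")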
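The reported IoU = 0.41 ± 0.07 compares the model's attention maps against classical structural (minutiae) regions. A plausible reading, sketched below under assumptions the abstract does not spell out, is to binarize the attention map at a top-quantile threshold and intersect it with a precomputed binary minutiae-region mask; the 0.9 quantile and the mask source are hypothetical choices for illustration.

```python
import numpy as np

def attention_minutiae_iou(attn: np.ndarray, minutiae_mask: np.ndarray,
                           quantile: float = 0.9) -> float:
    """IoU between the top-quantile attention region and a binary mask of
    classical structural (minutiae) regions; both arrays share one shape."""
    attn_region = attn >= np.quantile(attn, quantile)  # keep strongest attention
    inter = np.logical_and(attn_region, minutiae_mask).sum()
    union = np.logical_or(attn_region, minutiae_mask).sum()
    return float(inter / union) if union else 0.0
```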