Challenges and enhancements in Turkish automatic lip reading using deep learning models


Sabaz F., Atila Ü., Dörterler M., Uçan A.

SIGNAL IMAGE AND VIDEO PROCESSING, vol.20, no.4, 2026 (SCI-Expanded, Scopus)

  • Publication Type: Article / Full Article
  • Volume: 20 Issue: 4
  • Publication Date: 2026
  • DOI: 10.1007/s11760-026-05252-2
  • Journal Name: SIGNAL IMAGE AND VIDEO PROCESSING
  • Journal Indexes: Science Citation Index Expanded (SCI-EXPANDED), Scopus, Compendex, INSPEC, zbMATH
  • Gazi University Affiliated: Yes

Abstract

This study delves into the intricacies of automatic lip reading (ALR) in the Turkish language, employing a deep learning model that integrates convolutional neural networks (CNNs) and long short-term memory (LSTM) units. By analyzing a comprehensive dataset comprising 111 words and 113 sentences, encompassing 67,080 video samples, the research identifies several key challenges inherent in Turkish ALR. The introduction of a novel "Sentences with Derived Words" (SDW) dataset underscores the impact of Turkish agglutinative morphology on ALR performance. Through rigorous analysis of commonly misclassified words, word lengths, phonetic resemblances, and consonant-vowel interactions, the study reveals that morphological diversity, phonetic similarities, and word lengths significantly influence model accuracy. Words containing bilabial consonants exhibit higher recognition rates, whereas shorter words and those with similar structures are more prone to misclassification. The SDW dataset, in particular, highlights the challenges posed by derived words, as evidenced by a notable decrease in word recognition rate. This work not only identifies critical limitations but also proposes recommendations for improving ALR models in Turkish, emphasizing the incorporation of phonetic features and the enhancement of dataset diversity. The findings offer profound insights into advancing ALR technologies for diverse applications, including assistive communication, forensic analysis, and human-computer interaction.