A comprehensive experimental study for analyzing the effects of data augmentation techniques on voice classification


Bakir H., Cayir A. N., Navruz T. S.

MULTIMEDIA TOOLS AND APPLICATIONS, vol.83, no.6, pp.17601-17628, 2024 (SCI-Expanded)

  • Publication Type: Article
  • Volume: 83 Issue: 6
  • Publication Date: 2024
  • Doi Number: 10.1007/s11042-023-16200-4
  • Journal Name: MULTIMEDIA TOOLS AND APPLICATIONS
  • Journal Indexes: Science Citation Index Expanded (SCI-EXPANDED), Scopus, FRANCIS, ABI/INFORM, Applied Science & Technology Source, Compendex, Computer & Applied Sciences, INSPEC, zbMATH
  • Page Numbers: pp.17601-17628
  • Keywords: Convolutional neural networks, Data augmentation, Hyperparameters tuning, Random search, Voice recognition
  • Gazi University Affiliated: Yes

Abstract

It is not always possible to find enough data for deep learning studies, so various data augmentation techniques have been developed to increase the success of deep learning models. In this work, a comprehensive study has been conducted to evaluate the efficiency of different data augmentation techniques in improving the performance of voice classification models. To this end, we propose extracting MFCC features from the audio files, converting them into RGB images, and using a CNN deep learning model to classify the constructed RGB images into 12 classes. Moreover, a random search algorithm has been adopted for tuning the hyperparameters and selecting the best CNN model that can perform this task with as high performance as possible. After that, several voice augmentation and image augmentation techniques have been used to increase the number of samples in the original dataset, and 17 different datasets have been constructed and used for training the proposed model. In particular, 5 different voice augmentation techniques have been used to construct 5 different datasets from the original dataset. When the proposed model was trained using these voice augmentation-based datasets, the validation accuracy and F1-score exceeded 95%. Then, image augmentation techniques were used to construct another dataset from the original dataset; training the model on this dataset reduced the validation accuracy and F1-score of the proposed model to 81.92% and 81.67%, respectively. Afterward, image augmentation techniques were applied to the voice augmentation-based datasets, and 5 different datasets were constructed in this way. The results obtained by training the proposed model on these datasets showed that the classification accuracy and F1-score dropped to 77.25% and 77.08%, respectively. We therefore concluded that image augmentation techniques are not well suited to voice recognition and classification tasks. To prove this hypothesis, we merged the image augmentation-based dataset with the voice augmentation-based datasets (i.e., the datasets that gave the best results in this study), and 5 different datasets, each containing 18,000 samples, were constructed using this method. When the proposed model was trained on these datasets, the classification accuracy and F1-score did not exceed 84.33% and 84.41%, respectively, in the best case. Furthermore, to confirm the obtained results, the proposed method was tested on one more dataset from a different language, namely a Turkish-language dataset, on which the proposed approach achieved more than 96% classification accuracy. The obtained results show that, to improve the performance of voice classification deep learning models, it is preferable to apply voice augmentation techniques to the raw audio files rather than applying image augmentation techniques to the extracted feature matrices.
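
The sketch below illustrates the kind of pipeline the abstract describes: a waveform-level (voice) augmentation applied to the raw audio, followed by MFCC extraction and conversion of the coefficient matrix into an RGB image suitable for a CNN. The library choices (librosa, NumPy, Matplotlib) and all parameter values (sample rate, number of MFCCs, noise amplitude, pitch steps, colormap) are illustrative assumptions, not the paper's actual settings.

```python
# Minimal sketch, assuming librosa for audio processing and a Matplotlib
# colormap for the MFCC-to-RGB conversion; parameters are placeholders.
import numpy as np
import librosa
import matplotlib.cm as cm

def augment_waveform(y, sr, method="noise"):
    """Apply one simple voice (waveform-level) augmentation."""
    if method == "noise":
        # Additive Gaussian noise at an assumed small amplitude.
        return y + 0.005 * np.random.randn(len(y))
    if method == "pitch":
        # Pitch shift by an assumed +2 semitones.
        return librosa.effects.pitch_shift(y, sr=sr, n_steps=2)
    if method == "stretch":
        # Time stretch by an assumed factor of 1.1.
        return librosa.effects.time_stretch(y, rate=1.1)
    return y

def mfcc_to_rgb(y, sr, n_mfcc=40):
    """Extract MFCCs and map the coefficient matrix to an RGB image."""
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    # Min-max normalize to [0, 1] so the matrix can be mapped to colors.
    mfcc_norm = (mfcc - mfcc.min()) / (mfcc.max() - mfcc.min() + 1e-8)
    # Apply a colormap to obtain a 3-channel (RGB) image for the CNN.
    rgb = cm.viridis(mfcc_norm)[..., :3]  # drop the alpha channel
    return (rgb * 255).astype(np.uint8)

# Example usage with a hypothetical audio file.
y, sr = librosa.load("sample.wav", sr=22050)
y_aug = augment_waveform(y, sr, method="pitch")
image = mfcc_to_rgb(y_aug, sr)  # shape: (n_mfcc, n_frames, 3)
```

In this reading, "voice augmentation" operates on the waveform before feature extraction (as in `augment_waveform`), whereas "image augmentation" would instead transform the already-extracted RGB MFCC image, which is the variant the study found less effective.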