Publication Type : Journal Article
Publisher : International Journal of Speech Technology
Source : International Journal of Speech Technology, Springer, Volume 24, Issue 2, p. 303-314 (2021)
Url : https://doi.org/10.1007/s10772-020-09792-x
Campus : Bengaluru
School : Department of Computer Science and Engineering, School of Engineering
Department : Computer Science
Year : 2021
Abstract : Emotions play a significant role in human life. Recognizing human emotions involves numerous tasks, among them identifying the emotional features of speech signals. In this regard, Speech Emotion Recognition (SER) has applications in education, health, forensics, defense, robotics, and scientific research. However, SER is limited by difficulties in data labeling, misinterpretation of speech, audio annotation, and time complexity. This work evaluates SER based on Mel Frequency Cepstral Coefficient (MFCC) and Gammatone Frequency Cepstral Coefficient (GFCC) features in order to study emotions in different versions of audio signals. In the feature extraction stage, the sound signals are segmented by extracting and parametrizing each frequency call using MFCC, GFCC, and combined (M-GFCC) features. Building on recent advances in deep learning, this paper proposes a Deep Convolutional-Recurrent Neural Network (Deep C-RNN) approach that learns and classifies emotion variations in the classification stage. A fusion of Mel and Gammatone filters is used in the convolutional layers to first extract high-level spectral features; recurrent layers are then adopted to learn the long-term temporal context from these high-level features. The proposed work also differentiates emotional speech from neutral speech, illustrated with binary tree diagrams. The methodology is applied to a large dataset, the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS). Finally, the results, which reach an accuracy above 80% with lower loss, are compared with state-of-the-art approaches, and the experiments provide evidence that the fused features perform better at recognizing emotions from speech signals.
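The abstract does not spell out how the Mel and Gammatone features are combined, so the following is only a minimal sketch under stated assumptions, not the authors' implementation. It shows the two frequency scales that MFCC and GFCC filterbanks are commonly spaced on (the widely used 2595·log10 mel formula and the Glasberg and Moore ERB-rate formula) and one plausible reading of the combined M-GFCC features as frame-wise concatenation. The function names, band counts, and the concatenation strategy are all hypothetical.

```python
import math


def hz_to_mel(f_hz):
    # Common mel-scale formula; 1000 Hz maps to roughly 1000 mel.
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def hz_to_erb_rate(f_hz):
    # Glasberg and Moore ERB-rate scale, often used to space gammatone filters.
    return 21.4 * math.log10(1.0 + 0.00437 * f_hz)


def erb_rate_to_hz(erb):
    return (10.0 ** (erb / 21.4) - 1.0) / 0.00437


def center_frequencies(to_scale, from_scale, f_min, f_max, n_bands):
    # Place n_bands centers uniformly on the warped scale, then map back to Hz.
    if n_bands < 2:
        raise ValueError("n_bands must be at least 2")
    lo, hi = to_scale(f_min), to_scale(f_max)
    step = (hi - lo) / (n_bands - 1)
    return [from_scale(lo + i * step) for i in range(n_bands)]


def fuse_frames(mfcc_frames, gfcc_frames):
    # ASSUMPTION: fusion is modeled as per-frame concatenation of the two
    # coefficient vectors; the paper may combine them differently.
    if len(mfcc_frames) != len(gfcc_frames):
        raise ValueError("both feature sequences need the same number of frames")
    return [list(m) + list(g) for m, g in zip(mfcc_frames, gfcc_frames)]


if __name__ == "__main__":
    mel_centers = center_frequencies(hz_to_mel, mel_to_hz, 0.0, 8000.0, 8)
    erb_centers = center_frequencies(hz_to_erb_rate, erb_rate_to_hz, 0.0, 8000.0, 8)
    for m, e in zip(mel_centers, erb_centers):
        print(f"mel center {m:8.1f} Hz | ERB center {e:8.1f} Hz")
```

Both scales compress high frequencies, but with different curvature, so the two filterbanks sample the spectrum at different points. That difference is the usual motivation for treating the resulting coefficients as complementary and feeding a fused vector to a classifier.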
Cite this Research Publication : Kumaran U., S. Rammohan, R., Nagarajan, S. Murugan, and Prathik, A., “Fusion of mel and gammatone frequency cepstral coefficients for speech emotion recognition using deep C-RNN”, International Journal of Speech Technology, vol. 24, no. 2, pp. 303 - 314, 2021.