Publication Type : Conference Paper
Publisher : IEEE
Source : In 2022 International Conference on Disruptive Technologies for Multi-Disciplinary Research and Applications (CENTCON) (Vol. 2, pp. 185-190). IEEE.
Url : https://ieeexplore.ieee.org/document/10051557
Campus : Bengaluru
School : School of Engineering
Department : Electronics and Communication
Year : 2022
Abstract : Speech Emotion Recognition (SER) is the task of detecting emotion in speech independently of semantic content. In this paper we implement and compare five models for SER: CNN, BiLSTM with attention, CNN + BiLSTM with attention, Time Distributed CNN with LSTM, and Time Distributed CNN with BiLSTM. The dataset used is the Multimodal EmotionLines Dataset (MELD). Since the dataset is unbalanced, data augmentation is performed to balance it. Feature extraction techniques such as MFCC, zero-crossing rate (ZCR), and the mel scale are used to capture the differences between the audio of different emotions. The Time Distributed CNN with BiLSTM model outperformed the other models, achieving accuracy, precision, recall, and F1 scores of 92%, 92%, 93%, and 91%, respectively. Despite starting from an unbalanced dataset, we achieve results superior to state-of-the-art models trained on balanced datasets. This demonstrates the value of data augmentation and time-distributed layers.
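As an illustration of one of the features named in the abstract, the sketch below computes a frame-level zero-crossing rate (ZCR) in plain NumPy; the frame length, hop size, and test tone are illustrative assumptions, not parameters taken from the paper.

```python
import numpy as np

def frame_zcr(signal, frame_len=400, hop=160):
    """Frame-level zero-crossing rate: fraction of adjacent-sample
    sign changes in each frame (frame/hop sizes are illustrative,
    e.g. 25 ms frames with a 10 ms hop at 16 kHz)."""
    rates = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len]
        # np.sign changes value at every zero crossing; count those changes
        crossings = np.sum(np.abs(np.diff(np.sign(frame))) > 0)
        rates.append(crossings / frame_len)
    return np.array(rates)

# 100 Hz sine at 16 kHz (phase offset keeps samples off exact zeros):
# ~200 crossings per second, so per-sample ZCR is about 200/16000 = 0.0125
sr = 16000
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 100 * t + 0.3)
zcr = frame_zcr(tone)  # mean(zcr) ≈ 0.0125
```

Higher-pitched or noisier signals (e.g. anger, excitement) tend to yield higher ZCR values than calm speech, which is why ZCR is a common complement to MFCCs in SER pipelines.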
Cite this Research Publication : Prasanna, Y. L., Tarakaram, Y., Mounika, Y., Palaniswamy, S., & Vekkot, S. (2022, December). Comparative Deep Network Analysis of Speech Emotion Recognition Models using Data Augmentation. In 2022 International Conference on Disruptive Technologies for Multi-Disciplinary Research and Applications (CENTCON) (Vol. 2, pp. 185-190). IEEE.