Publication Type : Journal Article
Publisher : Elsevier
Source : Knowledge-Based Systems on 9th February 2022
Url : https://doi.org/10.1016/j.knosys.2022.108360
Campus : Bengaluru
School : Department of Computer Science and Engineering, Department of Electronics and Communication Engineering, School of Engineering
Department : Computer Science, Electronics and Communication
Verified : No
Year : 2022
Abstract : The paper proposes an integrated speech emotion conversion framework developed using speaker-independent, mixed-lingual training. The key contribution of the work is non-parallel training using i-vector probabilistic linear discriminant analysis (PLDA) modelling to estimate emotion-dependent latent vectors for three archetypal emotions (anger, fear, and happiness) in three datasets covering different languages: EmoDB (German), IITKGP (Telugu), and SAVEE (English). The unified model integrates fundamental frequency (F0) and spectral modifications for neutral-to-emotional speech conversion. Wavelet synchrosqueezed decomposition of F0, followed by training with a particle swarm optimized artificial neural network (PSO-ANN), provides improved performance, with an overall average mel-cepstral distortion (MCD) of 4.72 dB and an F0-RMSE of 25.91 Hz. Subjective testing revealed an overall average mean opinion score (MOS) of 3.4, a comparative mean opinion score (CMOS) of 3.57, and a speaker similarity score of 3.72, on a scale of 1–5. A detailed comparative analysis against state-of-the-art methods for emotion conversion in English is also performed. The evaluations revealed that the proposed framework gave perceptually relevant expressive enrichment in neutral speech with optimum training data.
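The two objective metrics reported in the abstract, MCD (in dB) and F0-RMSE (in Hz), can be illustrated with a minimal sketch. It uses the commonly cited definitions of these metrics and is not taken from the paper: the function names `mel_cepstral_distortion` and `f0_rmse` are hypothetical, and the sketch assumes the reference and converted feature sequences are already time-aligned, that the energy (0th) cepstral coefficient has been dropped, and that unvoiced frames are marked with F0 = 0. The authors' actual evaluation protocol may differ.

```python
import numpy as np


def mel_cepstral_distortion(ref_mcep, conv_mcep):
    """Mean MCD in dB between two time-aligned mel-cepstral sequences.

    Both inputs have shape (frames, dims). Assumption: the 0th (energy)
    coefficient has already been removed and the frames are aligned.
    """
    ref = np.asarray(ref_mcep, dtype=float)
    conv = np.asarray(conv_mcep, dtype=float)
    if ref.shape != conv.shape:
        raise ValueError("sequences must be time-aligned with equal shapes")
    diff = ref - conv
    # Common per-frame definition: (10 / ln 10) * sqrt(2 * sum_d diff_d^2)
    per_frame = (10.0 / np.log(10.0)) * np.sqrt(2.0 * np.sum(diff ** 2, axis=1))
    return float(np.mean(per_frame))


def f0_rmse(ref_f0, conv_f0):
    """RMSE in Hz over frames that are voiced (F0 > 0) in both contours."""
    ref = np.asarray(ref_f0, dtype=float)
    conv = np.asarray(conv_f0, dtype=float)
    if ref.shape != conv.shape:
        raise ValueError("contours must have the same number of frames")
    voiced = (ref > 0) & (conv > 0)
    if not voiced.any():
        raise ValueError("no frames are voiced in both contours")
    return float(np.sqrt(np.mean((ref[voiced] - conv[voiced]) ** 2)))
```

Lower values of both metrics indicate that the converted speech is closer to the target emotional speech; restricting F0-RMSE to frames voiced in both contours is one common convention, chosen here as an assumption.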
Cite this Research Publication : Susmitha Vekkot, Deepa Gupta, "Fusion of spectral and prosody modelling for multilingual speech emotion conversion", Knowledge-Based Systems, Volume 242, 108360, ISSN 0950-7051, 2022.