Publication Type : Conference Paper
Publisher : IEEE
Source : 2022 4th International Conference on Artificial Intelligence and Speech Technology (AIST), Delhi, India, 2022, pp. 1-4, doi: 10.1109/AIST55798.2022.10065019 (IEEE Xplore)
Url : https://ieeexplore.ieee.org/document/10065019
Campus : Coimbatore
School : School of Computing
Year : 2022
Abstract : In recent years, multimodal fusion using deep learning has proliferated across tasks such as emotion recognition and speech recognition, drastically enhancing overall system performance. However, existing unimodal audio speech recognition systems face several challenges: they struggle with ambient noise and varied pronunciations, and they remain inaccessible to hearing-impaired people. To address these limitations of audio-only speech recognizers, this paper explores an intermediate-level fusion framework that combines multimodal information from audio and visual lip movements. We analyzed the performance of the transformer-based audio-visual model under noisy audio conditions and evaluated it on two benchmark datasets, LRS2 and GRID. Overall, we found that multimodal learning for speech achieves a lower word error rate (WER) than the baseline systems.
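Note : The sketch below is not the authors' implementation; it is a minimal, illustrative example of the intermediate-level audio-visual fusion idea described in the abstract. All module names, feature dimensions, and the vocabulary size are assumptions chosen for the example.

    # Minimal sketch of intermediate-level audio-visual fusion with a
    # transformer encoder (hypothetical dimensions and vocabulary size).
    import torch
    import torch.nn as nn

    class AVFusionASR(nn.Module):
        def __init__(self, audio_dim=80, visual_dim=512, d_model=256,
                     nhead=4, num_layers=4, vocab_size=40):
            super().__init__()
            # Modality-specific front-ends project each stream to a shared space.
            self.audio_proj = nn.Linear(audio_dim, d_model)
            self.visual_proj = nn.Linear(visual_dim, d_model)
            # Intermediate fusion: concatenate per-frame audio and visual
            # embeddings, then let a shared transformer encoder attend jointly.
            self.fuse = nn.Linear(2 * d_model, d_model)
            encoder_layer = nn.TransformerEncoderLayer(
                d_model=d_model, nhead=nhead, batch_first=True)
            self.encoder = nn.TransformerEncoder(encoder_layer, num_layers)
            # CTC-style character output head (vocab size is a placeholder).
            self.classifier = nn.Linear(d_model, vocab_size)

        def forward(self, audio_feats, visual_feats):
            # audio_feats:  (batch, time, audio_dim)  e.g. log-mel filterbanks
            # visual_feats: (batch, time, visual_dim) e.g. lip-region embeddings
            a = self.audio_proj(audio_feats)
            v = self.visual_proj(visual_feats)
            fused = self.fuse(torch.cat([a, v], dim=-1))
            enc = self.encoder(fused)
            return self.classifier(enc).log_softmax(dim=-1)

    # Toy usage with synchronized 100-frame audio and visual streams.
    model = AVFusionASR()
    audio = torch.randn(2, 100, 80)
    visual = torch.randn(2, 100, 512)
    log_probs = model(audio, visual)   # (2, 100, 40), suitable for a CTC loss

In this sketch, fusion happens after the modality-specific projections but before the transformer encoder, so the self-attention layers operate on joint audio-visual representations rather than on either modality alone.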
Cite this Research Publication : A. Kumar, D. K. Renuka, S. L. Rose and M. C. Shunmugapriya, "Attention based Multi Modal Learning for Audio Visual Speech Recognition," 2022 4th International Conference on Artificial Intelligence and Speech Technology (AIST), Delhi, India, 2022, pp. 1-4, doi: 10.1109/AIST55798.2022.10065019.