Syllabus
Unit 1
Introduction to speech processing; overview of the human speech production system; acoustic and physiological mechanisms of speech production; glottal signal characteristics and source features; significance of glottal activity regions; speech signal characteristics; acoustic/articulatory characteristics of different speech sounds (vowels and consonants).
Unit 2
Short-time processing of speech for estimation of excitation and vocal tract features; time-domain processing: energy, magnitude, zero-crossing rate, short-time autocorrelation function (STACF), linear prediction analysis; frequency-domain processing and spectro-temporal representation of the speech signal: narrowband and wideband spectrograms, cepstral analysis, mel spectrogram, MFCC feature extraction.
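As an illustration of the time-domain measures listed in this unit, the following is a minimal NumPy sketch of short-time energy, magnitude, and zero-crossing rate. The function names, the 25 ms frame and 10 ms hop at an assumed 16 kHz sampling rate, the rectangular window, and the synthetic test signals are all illustrative assumptions, not part of the course material.

```python
import numpy as np

def frame_signal(x, frame_len, hop):
    """Split a 1-D signal into overlapping frames (one per row); the ragged tail is dropped."""
    n_frames = 1 + (len(x) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return x[idx]

def short_time_features(x, frame_len=400, hop=160):
    """Return per-frame energy, magnitude, and zero-crossing rate (rectangular window)."""
    frames = frame_signal(np.asarray(x, dtype=float), frame_len, hop)
    energy = np.sum(frames ** 2, axis=1)        # short-time energy
    magnitude = np.sum(np.abs(frames), axis=1)  # short-time magnitude
    signs = np.sign(frames)
    signs[signs == 0] = 1                       # treat exact zeros as positive
    # fraction of adjacent sample pairs whose sign differs
    zcr = np.sum(signs[:, 1:] != signs[:, :-1], axis=1) / (frame_len - 1)
    return energy, magnitude, zcr

# Synthetic stand-ins (assumed): a 100 Hz tone as a crude "voiced" signal,
# low-level white noise as a crude "unvoiced" signal.
fs = 16000
t = np.arange(fs) / fs
rng = np.random.default_rng(0)
tone = np.sin(2 * np.pi * 100 * t)
noise = 0.1 * rng.standard_normal(fs)

e_t, m_t, z_t = short_time_features(tone)
e_n, m_n, z_n = short_time_features(noise)
print("tone : mean energy %.1f, mean ZCR %.3f" % (e_t.mean(), z_t.mean()))
print("noise: mean energy %.1f, mean ZCR %.3f" % (e_n.mean(), z_n.mean()))
```

The tone shows high energy and a low zero-crossing rate, while the noise shows the opposite pattern. This contrast is the usual motivation for using the two measures together in voiced/unvoiced decisions.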
Unit 3
Speech data preparation and feature engineering; machine learning versus deep learning models in speech classification tasks (age, gender, dialect/accent); automatic speech recognition (ASR): statistical models, Hidden Markov Models (HMMs) for ASR; deep learning speech recognition pipeline (end-to-end models); overview of other speech technology applications such as emotion recognition, speaker recognition, speech synthesis, and speech pathology detection.
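To make the HMM topic in this unit concrete, here is a small sketch of the forward algorithm, which computes the likelihood of an observation sequence under a discrete HMM. The two-state model and all of its probabilities are made-up toy values chosen only for illustration. The sketch omits scaling and log-domain arithmetic, which long sequences generally need to avoid numerical underflow.

```python
import itertools
import numpy as np

def forward_likelihood(pi, A, B, obs):
    """P(obs | model) for a discrete HMM via the forward algorithm.

    pi : (N,)   initial state probabilities
    A  : (N, N) transitions, A[i, j] = P(state j at t+1 | state i at t)
    B  : (N, M) emissions,   B[i, k] = P(symbol k | state i)
    obs: sequence of symbol indices
    """
    alpha = pi * B[:, obs[0]]            # initialisation
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]    # induction step
    return float(alpha.sum())            # termination

# Toy two-state, two-symbol model (hypothetical values)
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3],
              [0.4, 0.6]])
B = np.array([[0.5, 0.5],
              [0.1, 0.9]])

print(forward_likelihood(pi, A, B, [0, 1, 1]))

# Sanity check: likelihoods over all length-3 sequences should sum to 1
total = sum(forward_likelihood(pi, A, B, list(s))
            for s in itertools.product([0, 1], repeat=3))
print(total)
```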
Objectives and Outcomes
Course Objectives
- To understand the principles of speech processing and the human speech production and perception systems.
- To estimate excitation and vocal tract features using time-domain and frequency-domain processing techniques.
- To explore conventional, machine learning, and deep learning models for speech classification, recognition, synthesis, and detection tasks.
Course Outcomes
After completing this course, students will be able to:
CO1 | Analyse the acoustic/articulatory characteristics of different speech regions and speech sounds
CO2 | Apply time and frequency domain processing techniques to speech signals
CO3 | Analyse and extract relevant spectral parameters and temporal parameters of speech signal
CO4 | Evaluate the performance of a model or algorithm (conventional/Machine learning/Deep learning) developed for a speech technology application
CO-PO Mapping
CO \ PO/PSO | PO1 | PO2 | PO3 | PO4 | PO5 | PO6 | PO7 | PO8 | PO9 | PO10 | PO11 | PO12 | PSO1 | PSO2 | PSO3
CO1 | 3 | 3 | 3 | 3 | 2 | 1 | 1 | 1 | 2 | 3 | 1 | 2 | 1 | 1 | 1
CO2 | 3 | 3 | 3 | 3 | 3 | 1 | 1 | 1 | 2 | 2 | 1 | 2 | 3 | 2 | 1
CO3 | 3 | 3 | 3 | 3 | 3 | 1 | 1 | 1 | 2 | 2 | 1 | 2 | 3 | 2 | 2
CO4 | 3 | 3 | 3 | 3 | 3 | 2 | 1 | 2 | 2 | 2 | 1 | 3 | 3 | 3 | 3
Text Books / References
‘Fundamentals of Speech Recognition’, L. Rabiner, Biing-Hwang Juang and B. Yegnanarayana, Pearson Education Inc., 2009
‘Speech Communication’, Douglas O’Shaughnessy, University Press, 2001
‘Discrete Time Speech Signal Processing’, Thomas F Quatieri, Pearson Education Inc., 2004
Hannun, Awni, et al. “Deep speech: Scaling up end-to-end speech recognition.” arXiv preprint arXiv:1412.5567 (2014).
Collobert, Ronan, Christian Puhrsch, and Gabriel Synnaeve. “Wav2letter: an end-to-end convnet-based speech recognition system.” arXiv preprint arXiv:1609.03193 (2016).
Gulati, Anmol, et al. “Conformer: Convolution-augmented transformer for speech recognition.” arXiv preprint arXiv:2005.08100 (2020).
Shen, Jonathan, et al. “Natural TTS synthesis by conditioning wavenet on mel spectrogram predictions.” 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018.