Clippers: Deblin Bagchi on Multilingual Speech Recognition

Learning from the best: A teacher-student framework for multilingual models in low-resource languages.

Automatic Speech Recognition (ASR) for low-resource languages is difficult because transcribed speech is scarce: the training data available for any one language in this category rarely exceeds 100 hours. Recent work has shown that knowledge obtained from a large multilingual dataset (~1500 hours) benefits ASR systems in low-resource settings: neural speech recognition models pre-trained on this dataset and then fine-tuned on language-specific data outperform models trained on language-specific data alone. Pre-training these models, however, demands substantial time and compute, especially for models with recurrent connections. This work investigates the effectiveness of Teacher-Student (TS) learning for transferring knowledge from a recurrent speech recognition model (a TDNN-LSTM) to a non-recurrent one (a TDNN) in the context of multilingual speech recognition. Our results are interesting on more than one level. First, student TDNN models trained with TS learning from a recurrent TDNN-LSTM teacher perform much better than their counterparts pre-trained with supervised learning. Second, these student models are trained on language-specific data only, rather than the bulky multilingual dataset. Finally, the TS architecture lets us leverage untranscribed data (untouched during supervised training), further improving the performance of the student TDNNs.
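As a rough illustration of the idea behind TS learning (a generic sketch, not the authors' implementation), the student is trained to match the teacher's soft output distribution over senones or phones rather than hard labels, which is also why untranscribed audio can be used: the teacher's posteriors serve as the targets. The function names and the temperature value below are illustrative assumptions:

```python
import math

def softmax(logits, temperature=1.0):
    # Temperature-scaled softmax; higher temperature yields softer
    # (more informative) target distributions from the teacher.
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    # Cross-entropy between the teacher's soft posteriors and the
    # student's posteriors: the quantity minimized in TS training.
    # No ground-truth transcript is needed, only the teacher's outputs.
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    return -sum(p * math.log(q) for p, q in zip(p_teacher, p_student))
```

A student whose outputs already agree with the teacher incurs a lower loss than one that disagrees, so gradient descent on this objective pulls the non-recurrent student toward the recurrent teacher's behavior.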