Clippers Tuesday: Deblin Bagchi on mimic loss for robust ASR

For the task of speech enhancement, local learning objectives are agnostic to phonetic structures helpful for speech recognition. We propose to add a global criterion to speech enhancement that allows the model to learn these high-level abstractions. We first train a spectral classifier on clean speech to predict senone labels. Then, the spectral classifier is joined with our speech enhancer as a noisy speech recognizer. This model is taught to mimic the output of the spectral classifier alone on clean speech. This mimic loss is combined with the traditional local criterion to train the speech enhancer to produce de-noised speech. Feeding the de-noised speech to an off-the-shelf Kaldi training recipe for the CHiME-2 corpus shows significant improvements in Word Error Rate (WER).
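
As a rough illustration of how the two criteria might be combined, the sketch below adds a local fidelity loss on the enhanced spectrum to a mimic loss that compares the classifier's outputs on enhanced speech with its outputs on clean speech. The abstract does not give the distance measure, the weighting, or the network architectures, so the mean squared error, the `alpha` weight, the toy one-layer `classifier`, and all function names here are assumptions made for illustration only.

```python
# Minimal sketch of a joint enhancement objective: local loss + mimic loss.
# ASSUMPTIONS (not from the abstract): MSE for both terms, the alpha weight,
# and a toy linear layer standing in for the trained spectral classifier.

def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def classifier(frame, weights):
    """Hypothetical frozen spectral classifier: one linear layer that maps
    a spectral frame to one score per senone."""
    return [sum(w * x for w, x in zip(row, frame)) for row in weights]

def joint_loss(enhanced, clean, weights, alpha=1.0):
    """Return (total, local, mimic) for one frame.

    local: distance between the enhanced and clean spectra.
    mimic: distance between classifier outputs on enhanced speech and
           classifier outputs on clean speech (the behaviour to imitate).
    """
    local = mse(enhanced, clean)
    mimic = mse(classifier(enhanced, weights), classifier(clean, weights))
    return local + alpha * mimic, local, mimic

# Toy example: 3 spectral bins, 2 senone scores.
weights = [[1.0, 0.0, -1.0],
           [0.5, 0.5, 0.5]]
clean = [1.0, 2.0, 3.0]
enhanced = [1.0, 2.0, 2.0]  # enhancer output with an error in the last bin

total, local, mimic = joint_loss(enhanced, clean, weights, alpha=1.0)
print(total, local, mimic)
```

In an actual training setup, only the enhancer's parameters would presumably be updated by this loss while the classifier stays fixed, so that the mimic term pushes the enhancer toward outputs the classifier treats like clean speech.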