MIDI Audio Alignment
Today’s class will cover the material in the literature review section of Devaney, J. (2014). “Estimating onset and offset asynchronies in polyphonic audio-to-score alignment.” Journal of New Music Research 43(3): 266–275.
MIDI-audio alignment can be used instead of blind onset detection for signals on which onset detection algorithms fail (namely instruments with non-percussive onsets and the singing voice). For this task, non-realtime (offline) alignment algorithms are appropriate. One of the most robust (and simplest) approaches to offline alignment is dynamic time warping (DTW).
DTW, a type of dynamic programming, allows for the alignment of similar linear patterns, or sequences, evolving at different rates. Through DTW, the two sequences are aligned by warping them to minimize a cost function that penalizes both local and sequential mismatch. Below is a dynamic time warping similarity matrix, where the y-axis is the number of audio frames and the x-axis is the number of MIDI frames. Black indicates maximum similarity and white indicates maximum dissimilarity; shades of grey indicate intermediate degrees of similarity. The black line indicates the optimal path through the similarity matrix: a warping from note events in the MIDI to their occurrences in the audio, which is used to align the timing of the audio and the MIDI with each other.
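The dynamic programming step described above can be sketched in a few lines. This is a minimal, hypothetical illustration (not Ellis’ implementation): it takes two 1-D feature sequences, builds a local cost matrix from their pairwise differences, accumulates costs so each cell holds the cheapest alignment ending there, and backtraces to recover the warping path.

```python
import numpy as np

def dtw(seq_a, seq_b):
    """Minimal DTW sketch: align two 1-D sequences.

    Returns the optimal warping path as (i, j) index pairs and the
    total alignment cost. Illustrative only; real MIDI-audio alignment
    would compare spectral feature vectors per frame, not scalars.
    """
    n, m = len(seq_a), len(seq_b)
    # Local cost: absolute difference between frames (white = dissimilar).
    cost = np.abs(np.subtract.outer(np.asarray(seq_a, dtype=float),
                                    np.asarray(seq_b, dtype=float)))

    # Accumulated-cost matrix with an infinite border so the path
    # is forced to start at the top-left corner.
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i, j] = cost[i - 1, j - 1] + min(D[i - 1, j],      # step in audio only
                                               D[i, j - 1],      # step in MIDI only
                                               D[i - 1, j - 1])  # step in both

    # Backtrace from the bottom-right corner to recover the path.
    path = [(n - 1, m - 1)]
    i, j = n, m
    while (i, j) != (1, 1):
        _, (i, j) = min((D[i - 1, j - 1], (i - 1, j - 1)),
                        (D[i - 1, j], (i - 1, j)),
                        (D[i, j - 1], (i, j - 1)))
        path.append((i - 1, j - 1))
    return path[::-1], D[n, m]
```

For example, aligning `[1, 2, 3]` against `[1, 2, 2, 3]` stretches the middle element of the shorter sequence across both middle elements of the longer one, at zero total cost, because the repeated frame matches exactly.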
Today we will be using Dan Ellis’ MATLAB implementation of Orio and Schwarz’s DTW algorithm.