Clippers 1/31: David Palzer on N-Pathic Speaker Diarization

Title: N-Pathic Speaker Diarization
Abstract: Speaker diarization is mainly studied through clustering of speaker embeddings. However, the clustering approach has two major limitations: it does not directly minimize diarization errors, and it cannot handle overlapping speech. To address these problems, End-to-End Neural Diarization (EEND) was introduced. The Encoder-Decoder-Attractor (EDA) was also proposed for recordings with an unknown number of speakers. In this paper, we present two improvements: (1) N-Pathic, a base model that processes chunked data to shorten the sequences seen by the attention mechanism and reduce memory usage, and (2) an improved EDA architecture with increased data efficiency through non-sequence-dependent modules. Our proposed method was evaluated on simulated mixtures, real telephone calls, and real dialogue recordings.
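
To give a rough sense of why chunking can reduce attention memory, here is a minimal back-of-the-envelope sketch. It is not the N-Pathic model itself: the abstract does not say how chunks are formed or combined, so the sketch assumes that each chunk attends only within itself. The function name and the frame counts are hypothetical, chosen for illustration.

```python
from typing import Optional


def attention_score_entries(seq_len: int, chunk_len: Optional[int] = None) -> int:
    """Count the entries in the self-attention score matrices for one head.

    Full attention over seq_len frames needs a seq_len x seq_len matrix.
    Assumption (not from the abstract): with chunking, each chunk attends
    only within itself, so every chunk needs its own chunk_len x chunk_len
    matrix, plus a smaller one for any leftover frames.
    """
    if chunk_len is None:
        return seq_len * seq_len
    full_chunks, remainder = divmod(seq_len, chunk_len)
    return full_chunks * chunk_len * chunk_len + remainder * remainder


# Hypothetical recording of 5000 frames, split into chunks of 500 frames.
full = attention_score_entries(5000)
chunked = attention_score_entries(5000, chunk_len=500)
print(full, chunked, full // chunked)
```

Under this assumption the score-matrix size grows linearly with recording length for a fixed chunk size, instead of quadratically as in full self-attention.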