Title: Simplifying End-to-End Neural Diarization: Generic Speaker Attractors Are Enough
Abstract: In this work, we propose a simplified approach to neural speaker diarization that removes the Encoder-Decoder Attractor (EDA) mechanism and replaces it with a linear layer. This modification significantly reduces the model's parameter count, allowing us to deepen the backbone network by stacking additional Conformer blocks. To further improve efficiency, we replace Shaw's relative positional encoding in the Conformer blocks with the ALiBi positional bias, which improves the handling of both short- and long-range dependencies while decreasing computational complexity. Our results show that this streamlined model achieves performance comparable to previous diarization systems that rely on dynamic attractors, suggesting that Generic Speaker Attractors (global, static, learned attractors) can represent speakers as effectively as dynamic attractors. Furthermore, we observe that the clustering effect, a key feature of previous EDA-based models, is preserved in our approach. These findings indicate that the EDA mechanism may not be necessary for high-quality speaker diarization, and that a more straightforward architecture can yield competitive results.
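The following PyTorch sketch illustrates the two substitutions described above at a high level. It is not the paper's exact implementation: the class and function names, the embedding dimension, the maximum speaker count, and the symmetric (bidirectional) form of the ALiBi bias are illustrative assumptions.

```python
import torch
import torch.nn as nn


class GenericAttractorHead(nn.Module):
    """Static, learned attractors realized as a single linear layer
    (an illustrative stand-in for the removed EDA module)."""

    def __init__(self, embed_dim: int = 256, max_speakers: int = 4):
        super().__init__()
        # Each row of this weight matrix acts as one generic speaker attractor.
        self.attractors = nn.Linear(embed_dim, max_speakers, bias=False)

    def forward(self, frame_embeddings: torch.Tensor) -> torch.Tensor:
        # frame_embeddings: (batch, time, embed_dim) from the Conformer backbone.
        # Output: per-frame, per-speaker activity logits of shape
        # (batch, time, max_speakers); a sigmoid turns them into probabilities.
        return self.attractors(frame_embeddings)


def alibi_bias(num_heads: int, seq_len: int) -> torch.Tensor:
    """Symmetric (bidirectional) ALiBi: a head-specific slope times the
    negative absolute frame distance, added to pre-softmax attention scores."""
    # Standard geometric slope schedule, e.g. 1/2, 1/4, ..., 1/256 for 8 heads.
    slopes = torch.tensor(
        [2.0 ** (-8.0 * (h + 1) / num_heads) for h in range(num_heads)]
    )
    positions = torch.arange(seq_len)
    distance = (positions.unsqueeze(0) - positions.unsqueeze(1)).abs().float()
    return -slopes.view(num_heads, 1, 1) * distance  # (num_heads, T, T)
```

In this sketch, the bias tensor would be added to the attention logits inside each Conformer self-attention layer in place of Shaw-style relative position embeddings, and the attractor head replaces the LSTM encoder-decoder of EDA at the output of the backbone.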