Literature Review
Speaker diarization is the process of partitioning an audio recording into segments according to who is speaking, often summarised as answering the question "who spoke when". It is widely used alongside transcription (the task of converting spoken audio into written text) and voice biometrics (the task of uniquely identifying a person by analysing characteristics of their voice).
1. PyAnnote-Audio
Developed by: Hervé Bredin et al.
Platform: Python (PyTorch-based)
License: MIT
Repository: https://github.com/pyannote/pyannote-audio
Methodology:
PyAnnote-Audio adopts a modular pipeline:
- Voice Activity Detection (VAD) using a neural network.
- Speaker Embedding extraction using pretrained models (e.g., ECAPA-TDNN).
- Clustering with Agglomerative Hierarchical Clustering (AHC), spectral clustering, or Bayesian HMM-based diarization.
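The clustering stage listed above can be illustrated with a minimal, pure-Python sketch of average-linkage AHC. This is not PyAnnote-Audio's implementation or API: the function names, the toy 3-dimensional "embeddings", the use of cosine distance, and the stopping threshold of 0.3 are all assumptions chosen for illustration. In a real pipeline the vectors would come from the embedding model and the threshold would be tuned on development data.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def ahc(embeddings, threshold):
    """Average-linkage agglomerative clustering (illustrative only).

    Repeatedly merges the two closest clusters until the smallest
    average pairwise cosine distance exceeds `threshold`.
    Returns one integer cluster label per embedding.
    """
    clusters = [[i] for i in range(len(embeddings))]

    def linkage(c1, c2):
        dists = [cosine_distance(embeddings[i], embeddings[j])
                 for i in c1 for j in c2]
        return sum(dists) / len(dists)

    while len(clusters) > 1:
        # Find the closest pair of clusters.
        best = None
        for p in range(len(clusters)):
            for q in range(p + 1, len(clusters)):
                d = linkage(clusters[p], clusters[q])
                if best is None or d < best[0]:
                    best = (d, p, q)
        d, p, q = best
        if d > threshold:
            break  # remaining clusters are treated as distinct speakers
        clusters[p] = clusters[p] + clusters[q]
        del clusters[q]

    labels = [0] * len(embeddings)
    for label, members in enumerate(clusters):
        for i in members:
            labels[i] = label
    return labels


# Hypothetical embeddings for five speech segments (two speakers).
segment_embeddings = [
    [1.0, 0.1, 0.0],
    [0.9, 0.2, 0.1],
    [0.0, 1.0, 0.1],
    [0.1, 0.9, 0.0],
    [0.95, 0.15, 0.05],
]
labels = ahc(segment_embeddings, threshold=0.3)
print(labels)
```

Segments 0, 1 and 4 end up in one cluster and segments 2 and 3 in another. Because the number of clusters is decided by the threshold rather than fixed in advance, this style of clustering suits recordings where the number of speakers is unknown.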
Features:
- Highly modular and configurable.
- Accurate diarization across languages and domains.
- Provides pretrained models for all components.
- Integration with the Hugging Face Hub (for distributing pretrained models) and pyannote-metrics (for evaluation).
Evaluation:
- Reports state-of-the-art or competitive results on benchmarks such as AMI, DIHARD, and VoxConverse.
- Offers low Diarization Error Rate (DER) with appropriate hyperparameter tuning.
- Community-supported with ongoing updates and extensions.
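For reference, DER is conventionally defined as the sum of false-alarm speech, missed speech, and speaker-confusion time, divided by the total duration of reference speech. The sketch below computes it from those four durations; the function name and the example values are hypothetical and are not taken from any benchmark. In practice the component durations are derived by aligning a system's output against a reference annotation, which evaluation toolkits such as pyannote-metrics handle.

```python
def diarization_error_rate(false_alarm, missed, confusion, total_reference):
    """Compute DER from component durations (all in seconds).

    DER = (false alarm + missed speech + speaker confusion)
          / total reference speech time
    """
    if total_reference <= 0:
        raise ValueError("total_reference must be positive")
    return (false_alarm + missed + confusion) / total_reference


# Hypothetical error durations for a 100-second reference.
der = diarization_error_rate(
    false_alarm=3.0, missed=5.0, confusion=4.0, total_reference=100.0
)
print(f"DER = {der:.1%}")
```

Note that DER can exceed 100% when a system produces a large amount of false-alarm speech, since the numerator is not bounded by the reference duration.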
2. SpeechBrain
Developed by: Mirco Ravanelli, Titouan Parcollet et al. (project initiated at Mila, Quebec AI Institute)
Platform: Python (PyTorch)
License: Apache 2.0
Repository: https://github.com/speechbrain/speechbrain
Methodology:
SpeechBrain provides an extensible diarization pipeline built on:
- Pretrained Embedding Extractors (like ECAPA-TDNN).
- Cosine similarity + KMeans clustering.
- Optional VAD using integrated models or external sources like Silero.
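To make the "cosine similarity + KMeans" step concrete, the following is a minimal pure-Python sketch of k-means on length-normalised embeddings, where each segment is assigned to the centroid with the highest cosine similarity. It does not use SpeechBrain's API: the function names, the deterministic farthest-point initialisation, and the toy 2-dimensional vectors are assumptions made for illustration, and the number of speakers `k` is assumed to be known in advance.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine_similarity(a, b):
    # Inputs are unit-length, so the dot product is the cosine similarity.
    return sum(x * y for x, y in zip(a, b))


def cosine_kmeans(embeddings, k, iterations=20):
    """k-means with cosine similarity on unit-normalised vectors."""
    points = [normalize(e) for e in embeddings]

    # Deterministic initialisation: start from the first point, then
    # repeatedly add the point least similar to the chosen centroids.
    centroids = [points[0]]
    while len(centroids) < k:
        farthest = min(
            points,
            key=lambda p: max(cosine_similarity(p, c) for c in centroids),
        )
        centroids.append(farthest)

    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: most similar centroid wins.
        labels = [
            max(range(k), key=lambda j: cosine_similarity(p, centroids[j]))
            for p in points
        ]
        # Update step: re-normalised mean of each cluster's members.
        new_centroids = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if not members:
                new_centroids.append(centroids[j])  # keep empty cluster's centroid
                continue
            mean = [sum(dim) / len(members) for dim in zip(*members)]
            new_centroids.append(normalize(mean))
        if new_centroids == centroids:
            break  # converged
        centroids = new_centroids
    return labels


# Hypothetical embeddings for six speech segments (two speakers).
toy_embeddings = [
    [0.9, 0.1], [0.8, 0.3], [0.1, 0.9],
    [0.2, 0.8], [0.85, 0.2], [0.15, 0.95],
]
kmeans_labels = cosine_kmeans(toy_embeddings, k=2)
print(kmeans_labels)
```

Segments 0, 1 and 4 are grouped together, as are segments 2, 3 and 5. Unlike threshold-based agglomerative clustering, this approach needs the speaker count up front, so it fits scenarios where that number is known or estimated separately.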
Features:
- End-to-end ASR + diarization pipelines.
- Easily swappable models.
- Embedding extraction and clustering tools compatible with scikit-learn.
Evaluation:
- While its diarization performance is solid, it is often slightly behind PyAnnote in DER.
- Its main strengths are flexibility and its place within a broader speech ecosystem (ASR, speaker recognition, etc.).
Of the toolkits considered, PyAnnote-Audio currently provides the most accurate and comprehensive solution, particularly for offline diarization. SpeechBrain offers flexibility and integration with a broader speech processing pipeline, while Resemblyzer and ClusteringDiarizer cater to lightweight and GPU-accelerated use cases, respectively.