Literature Review

Speaker diarization is the task of determining "who spoke when" in an audio recording: the audio is segmented into speech regions and each region is attributed to a speaker. It is widely used alongside transcription (converting spoken audio into written text) and voice biometrics (identifying a person by analysing characteristics of their voice).

1. PyAnnote-Audio

Developed by: Hervé Bredin et al.

Platform: Python (PyTorch-based)

License: MIT

Repository: https://github.com/pyannote/pyannote-audio

Methodology:

PyAnnote-Audio adopts a modular pipeline:

Voice Activity Detection (VAD) using a neural network.
Speaker Embedding extraction using pretrained models (e.g., ECAPA-TDNN).
Clustering with Agglomerative Hierarchical Clustering (AHC), spectral clustering, or Bayesian HMM-based diarization.
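The clustering stage above can be illustrated with a minimal sketch of average-linkage AHC over cosine distance. This is not PyAnnote-Audio's implementation: the function names, the toy 2-D "embeddings", and the 0.5 stopping threshold are all assumptions chosen for illustration, and real speaker embeddings have hundreds of dimensions.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def agglomerative_cluster(embeddings, threshold=0.5):
    """Average-linkage AHC (illustrative sketch).

    Starts with one cluster per segment and repeatedly merges the two
    closest clusters until the closest pair is farther apart than
    `threshold` (a hypothetical tuning parameter).
    Returns one integer speaker label per input embedding.
    """
    clusters = [[i] for i in range(len(embeddings))]

    def linkage(c1, c2):
        # Average pairwise distance between members of the two clusters.
        dists = [cosine_distance(embeddings[i], embeddings[j])
                 for i in c1 for j in c2]
        return sum(dists) / len(dists)

    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = linkage(clusters[a], clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        if best[0] > threshold:
            break  # remaining clusters are treated as distinct speakers
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]

    labels = [0] * len(embeddings)
    for label, members in enumerate(clusters):
        for i in members:
            labels[i] = label
    return labels


# Toy embeddings: segments 0, 1 and 4 point one way, segments 2 and 3 another.
segments = [(1.0, 0.1), (0.9, 0.2), (0.1, 1.0), (0.2, 0.9), (1.0, 0.0)]
labels = agglomerative_cluster(segments, threshold=0.5)
print(labels)
```

Unlike k-means, this approach does not need the number of speakers in advance; the threshold decides when to stop merging, which is why it usually has to be tuned per domain.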

Features:

Highly modular and configurable.
Accurate diarization across languages and domains.
Provides pretrained models for all components.
Integration with the Hugging Face Hub (for distributing pretrained models) and pyannote.metrics (for evaluation).

Evaluation:

Reports state-of-the-art or competitive results on the AMI, DIHARD, and VoxConverse benchmarks.
Reaches a low Diarization Error Rate (DER) when its hyperparameters are tuned to the target domain.
Community-supported with ongoing updates and extensions.
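Since DER is the metric used to compare the tools in this review, a simplified frame-level sketch may help make it concrete. DER is the sum of missed speech, false-alarm speech, and speaker confusion, divided by the total reference speech. The function below is an illustration only: it assumes one label per fixed-length frame, ignores overlapping speech and forgiveness collars, and finds the speaker mapping by brute force. Dedicated tools such as pyannote.metrics handle those cases properly.

```python
from itertools import permutations


def diarization_error_rate(reference, hypothesis):
    """Simplified frame-level DER (illustrative sketch).

    `reference` and `hypothesis` are equal-length lists with one speaker
    label per frame; None marks a non-speech frame.
    """
    if len(reference) != len(hypothesis):
        raise ValueError("reference and hypothesis must have the same length")

    pairs = list(zip(reference, hypothesis))
    total_speech = sum(r is not None for r, _ in pairs)
    if total_speech == 0:
        raise ValueError("reference contains no speech frames")

    missed = sum(r is not None and h is None for r, h in pairs)
    false_alarm = sum(r is None and h is not None for r, h in pairs)
    both_speech = sum(r is not None and h is not None for r, h in pairs)

    ref_speakers = sorted({r for r in reference if r is not None})
    hyp_speakers = sorted({h for h in hypothesis if h is not None})

    # Hypothesis labels are arbitrary, so try every one-to-one mapping onto
    # reference speakers (padding with None for unmatched hypothesis
    # speakers) and keep the mapping with the most correctly labelled frames.
    candidates = ref_speakers + [None] * len(hyp_speakers)
    best_correct = 0
    for perm in permutations(candidates, len(hyp_speakers)):
        mapping = dict(zip(hyp_speakers, perm))
        correct = sum(
            r is not None and h is not None and mapping[h] == r
            for r, h in pairs
        )
        best_correct = max(best_correct, correct)

    confusion = both_speech - best_correct
    return (missed + false_alarm + confusion) / total_speech


# Ten frames: 7 reference speech frames, with 1 missed frame,
# 1 false-alarm frame, and 1 frame attributed to the wrong speaker.
reference = ["A", "A", "A", None, "B", "B", "B", "B", None, None]
hypothesis = ["x", "x", "y", None, "y", "y", "y", None, "y", None]
der = diarization_error_rate(reference, hypothesis)
print(f"DER = {der:.3f}")
```

The brute-force mapping is only practical for a handful of speakers; production scorers typically solve the same assignment problem with the Hungarian algorithm.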

2. SpeechBrain

Developed by: Mirco Ravanelli, Titouan Parcollet, et al. (project initiated at Mila, Québec AI Institute)

Platform: Python (PyTorch)

License: Apache 2.0

Repository: https://github.com/speechbrain/speechbrain

Methodology:

SpeechBrain provides an extensible diarization pipeline built on:

Pretrained Embedding Extractors (like ECAPA-TDNN).
Cosine similarity + KMeans clustering.
Optional VAD using integrated models or external sources like Silero.
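The "cosine similarity + KMeans" step can be sketched as k-means on length-normalised embeddings, where each segment is assigned to the centroid with the highest cosine similarity. This is a standalone illustration, not SpeechBrain's code: the function names, the farthest-first initialisation (used here so the result is deterministic), and the toy 3-D embeddings are assumptions.

```python
import math


def normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def cosine_similarity(a, b):
    """Dot product; equals cosine similarity for unit-length inputs."""
    return sum(x * y for x, y in zip(a, b))


def cosine_kmeans(embeddings, num_speakers, iterations=20):
    """K-means with cosine similarity (illustrative sketch).

    `num_speakers` must be known or estimated beforehand.
    Returns one integer speaker label per input embedding.
    """
    points = [normalize(e) for e in embeddings]

    # Farthest-first initialisation: start from the first point, then keep
    # adding the point least similar to every centroid chosen so far.
    centroids = [points[0]]
    while len(centroids) < num_speakers:
        farthest = min(
            points,
            key=lambda p: max(cosine_similarity(p, c) for c in centroids),
        )
        centroids.append(farthest)

    labels = None
    for _ in range(iterations):
        # Assignment step: most similar centroid for each segment.
        new_labels = [
            max(range(num_speakers),
                key=lambda k: cosine_similarity(p, centroids[k]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: re-normalised mean of each cluster's members.
        for k in range(num_speakers):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                mean = [sum(dim) / len(members) for dim in zip(*members)]
                centroids[k] = normalize(mean)
    return labels


# Toy embeddings for six segments from three speakers (two segments each).
segments = [
    (0.9, 0.1, 0.0), (0.8, 0.2, 0.1),
    (0.0, 0.1, 0.9), (0.1, 0.0, 0.8),
    (0.1, 0.9, 0.1), (0.0, 0.8, 0.2),
]
labels = cosine_kmeans(segments, num_speakers=3)
print(labels)
```

The main practical limitation of this design is that the speaker count has to be supplied, whereas threshold-based approaches such as AHC can estimate it from the data.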

Features:

End-to-end ASR + diarization pipelines.
Easily swappable models.
Embedding extraction and clustering tools compatible with scikit-learn.

Evaluation:

Its diarization performance is solid but often slightly behind PyAnnote-Audio in DER.
Its main strengths are flexibility and being part of a broader speech ecosystem (ASR, speaker recognition, etc.).

PyAnnote-Audio currently provides the most accurate and comprehensive solution, particularly for offline diarization. SpeechBrain offers flexibility and integration with a broader speech processing pipeline, while Resemblyzer and ClusteringDiarizer cater to lightweight and GPU-accelerated use cases, respectively.
