Literature Review

Speaker diarization is the process of partitioning an audio recording into segments according to who is speaking, that is, answering the question "who spoke when". It is commonly used alongside transcription (converting spoken audio into written text) and voice biometrics (uniquely identifying a person by analysing characteristics of their voice).

1. PyAnnote-Audio

Developed by: Hervé Bredin et al.

Platform: Python (PyTorch-based)

License: MIT

Repository: https://github.com/pyannote/pyannote-audio

Methodology:

PyAnnote-Audio adopts a modular pipeline:

  • Voice Activity Detection (VAD) using a neural network.
  • Speaker Embedding extraction using pretrained models (e.g., ECAPA-TDNN).
  • Clustering with Agglomerative Hierarchical Clustering (AHC), spectral clustering, or Bayesian HMM-based diarization.
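
The three stages above can be sketched in a few lines. This is a minimal illustration rather than PyAnnote-Audio's actual implementation: the segment boundaries and 3-dimensional embeddings are made-up values standing in for the output of a VAD model and an embedding extractor, the function names are hypothetical, and the clustering is a plain average-linkage AHC on cosine distance with an assumed stopping threshold of 0.5.

```python
import math

def cosine_distance(a, b):
    """Return 1 - cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def agglomerative_cluster(embeddings, threshold):
    """Average-linkage AHC: repeatedly merge the closest pair of clusters
    until no pair is closer than `threshold`. Returns one label per embedding."""
    clusters = [[i] for i in range(len(embeddings))]

    def linkage(c1, c2):
        # Average pairwise distance between the members of two clusters
        dists = [cosine_distance(embeddings[i], embeddings[j])
                 for i in c1 for j in c2]
        return sum(dists) / len(dists)

    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = linkage(clusters[a], clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        if d > threshold:
            break  # remaining clusters are treated as distinct speakers
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]

    labels = [0] * len(embeddings)
    for label, members in enumerate(clusters):
        for i in members:
            labels[i] = label
    return labels

# Hypothetical speech segments (start, end in seconds) kept by the VAD stage
segments = [(0.0, 2.1), (2.4, 5.0), (5.3, 7.8), (8.0, 9.5)]
# Hypothetical embeddings, one per segment (real ones have hundreds of dimensions)
embeddings = [
    [0.90, 0.10, 0.00],
    [0.10, 0.90, 0.10],
    [0.85, 0.15, 0.05],
    [0.05, 0.95, 0.00],
]

labels = agglomerative_cluster(embeddings, threshold=0.5)
for (start, end), label in zip(segments, labels):
    print(f"{start:4.1f}-{end:4.1f}s  SPEAKER_{label:02d}")
```

Because AHC stops at a distance threshold rather than a fixed cluster count, the number of speakers does not need to be known in advance; the threshold is the parameter that typically needs tuning per domain.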

Features:

  • Highly modular and configurable.
  • Accurate diarization across languages and domains.
  • Provides pretrained models for all components.
  • Integration with the Hugging Face Hub (for distributing pretrained models) and pyannote.metrics (for evaluation).

Evaluation:

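  • Reports results competitive with the state of the art on AMI, DIHARD, and VoxConverse.
  • Offers low DER (Diarization Error Rate: the fraction of reference speech time that is missed, falsely detected, or attributed to the wrong speaker) with appropriate tuning.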
  • Community-supported with ongoing updates and extensions.

2. SpeechBrain

Developed by: Mirco Ravanelli, Titouan Parcollet et al. (project initiated at Mila, Montreal)

Platform: Python (PyTorch)

License: Apache 2.0

Repository: https://github.com/speechbrain/speechbrain

Methodology:

SpeechBrain provides an extensible diarization pipeline built on:

  • Pretrained Embedding Extractors (like ECAPA-TDNN).
  • Clustering of embeddings by cosine similarity (e.g., k-means or spectral clustering).
  • Optional VAD using integrated models or external sources like Silero.
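
The embedding-plus-clustering step listed above can be illustrated with a small k-means that uses cosine similarity as its closeness measure. This is a sketch under stated assumptions, not SpeechBrain's code: the embeddings are made-up 3-dimensional vectors, the function names are hypothetical, and the deterministic farthest-point initialisation is chosen here only to make the example reproducible.

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine_similarity(a, b):
    """For unit-length vectors the dot product is the cosine similarity."""
    return sum(x * y for x, y in zip(a, b))

def kmeans_cosine(embeddings, k, iterations=20):
    """k-means on unit-normalised embeddings, assigning each point to the
    centroid with the highest cosine similarity. Returns one label per point."""
    points = [normalize(e) for e in embeddings]

    # Farthest-point initialisation (an assumption, for reproducibility)
    centroids = [points[0]]
    while len(centroids) < k:
        farthest = min(
            points,
            key=lambda p: max(cosine_similarity(p, c) for c in centroids),
        )
        centroids.append(farthest)

    labels = None
    for _ in range(iterations):
        new_labels = [
            max(range(k), key=lambda j: cosine_similarity(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Move each centroid to the normalised mean of its members
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:
                mean = [sum(dim) / len(members) for dim in zip(*members)]
                centroids[j] = normalize(mean)
    return labels

# Hypothetical embeddings for five speech segments from two speakers
embeddings = [
    [0.90, 0.10, 0.00],
    [0.10, 0.90, 0.10],
    [0.85, 0.15, 0.05],
    [0.05, 0.95, 0.00],
    [0.80, 0.20, 0.10],
]

labels = kmeans_cosine(embeddings, k=2)
print(labels)
```

Unlike threshold-based agglomerative clustering, k-means requires the number of speakers k up front, so it suits recordings where the speaker count is known or can be estimated separately.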

Features:

  • End-to-end ASR + diarization pipelines.
  • Easily swappable models.
  • Embedding extraction and clustering tools compatible with scikit-learn.

Evaluation:

  • Diarization performance is solid, though its DER is often slightly higher than that of PyAnnote-Audio.
  • It shines in flexibility and being part of a broader speech ecosystem (ASR, speaker recognition, etc.).

PyAnnote-Audio currently provides the most accurate and comprehensive solution, particularly for offline diarization. SpeechBrain offers flexibility and integration with a broader speech processing pipeline, while Resemblyzer and ClusteringDiarizer cater to lightweight and GPU-accelerated use cases.