Metrics (DER)
Detection & Segmentation Metrics
Detection metrics for voice activity or overlapped speech detection include:
- precision
- recall
- F1-score
Segmentation metrics for speaker change detection (e.g., detecting the change from adult speech to child speech) include:
- purity
- coverage
These metrics are implemented in the pyannote.metrics toolkit.
Source: github.com/pyannote/pyannote-metrics
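A minimal sketch of computing these detection and segmentation metrics with pyannote.metrics; the adult/child segments and speaker labels below are invented for illustration:

from pyannote.core import Annotation, Segment
from pyannote.metrics.detection import DetectionPrecision, DetectionRecall
from pyannote.metrics.segmentation import SegmentationCoverage, SegmentationPurity

# Invented ground truth: adult speech followed by child speech
reference = Annotation()
reference[Segment(0.0, 5.0)] = "adult"
reference[Segment(5.0, 9.0)] = "child"

# Invented system output with a slightly late speaker change
hypothesis = Annotation()
hypothesis[Segment(0.0, 5.5)] = "spk0"
hypothesis[Segment(5.5, 9.0)] = "spk1"

print("precision:", DetectionPrecision()(reference, hypothesis))
print("recall:   ", DetectionRecall()(reference, hypothesis))
print("purity:   ", SegmentationPurity()(reference, hypothesis))
print("coverage: ", SegmentationCoverage()(reference, hypothesis))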
# DER implementation from pyannote.metrics
from pyannote.metrics.diarization import DiarizationErrorRate
DER (Diarization Error Rate)
The Diarization Error Rate (DER) is the most common metric for evaluating speaker diarization systems. DER is computed by summing the durations of three distinct error types (speaker confusion, false alarms, and missed detections) and dividing by the total duration of speech in the reference.
- False alarm: the model detects speech where there is none in the ground truth.
- Missed detection: the model does not detect speech where there is speech in the ground truth.
- Speaker confusion: the model assigns the wrong speaker label to a speech segment.
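Written out, the standard definition is:

\mathrm{DER} = \frac{T_{\text{false alarm}} + T_{\text{missed detection}} + T_{\text{speaker confusion}}}{T_{\text{total reference speech}}}

A minimal sketch with pyannote.metrics; the segments and labels are invented, and collar=0.5 is assumed to correspond to a 250 ms collar on each side of reference boundaries (worth verifying against your pyannote.metrics version):

from pyannote.core import Annotation, Segment
from pyannote.metrics.diarization import DiarizationErrorRate

# Invented ground truth and system output for illustration
reference = Annotation()
reference[Segment(0.0, 10.0)] = "adult"
reference[Segment(12.0, 20.0)] = "child"

hypothesis = Annotation()
hypothesis[Segment(0.0, 11.0)] = "spk0"   # 1 s of false alarm
hypothesis[Segment(12.0, 20.0)] = "spk1"

metric = DiarizationErrorRate(collar=0.5)
print(f"DER = {metric(reference, hypothesis):.3f}")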
JER (Jaccard Error Rate)
JER averages, over reference speakers, one minus the Jaccard index of each matched reference/hypothesis speaker pair:

\mathrm{JER} = \frac{1}{N} \sum_{s=1}^{N} \left(1 - \frac{|R_s \cap H_s|}{|R_s \cup H_s|}\right)

Where:
- R_s is the total time of speech from speaker s in the reference RTTM.
- H_s is the total time of speech assigned to speaker s in the hypothesis RTTM (after speaker matching).
- N is the number of reference speakers.
- \cap = overlap (intersection), and \cup = union of speech segments.

Interpretation:
- JER = 0: perfect diarization (hypothesis matches reference completely).
- JER = 1: complete mismatch.
- It is symmetric and per-speaker, making it better for speaker-level analysis than DER.
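A minimal pure-Python sketch of this formula, assuming speaker matching has already been performed and each speaker's segments are disjoint (start, end) tuples in seconds; the helper names below are invented for illustration:

def total(segments):
    # Total duration of a list of disjoint (start, end) segments
    return sum(end - start for start, end in segments)

def intersection(a, b):
    # Total overlap duration between two disjoint segment lists
    return sum(max(0.0, min(e1, e2) - max(s1, s2))
               for s1, e1 in a for s2, e2 in b)

def jer(matched_pairs):
    # Mean over reference speakers of 1 - |R ∩ H| / |R ∪ H|
    errors = []
    for ref_segs, hyp_segs in matched_pairs:
        inter = intersection(ref_segs, hyp_segs)
        union = total(ref_segs) + total(hyp_segs) - inter
        errors.append(1.0 - inter / union if union > 0 else 1.0)
    return sum(errors) / len(errors)

# Two matched (reference, hypothesis) speakers with Jaccard 0.9 and 0.5
pairs = [([(0.0, 10.0)], [(0.0, 9.0)]),
         ([(12.0, 20.0)], [(15.0, 22.0)])]
print(jer(pairs))  # (0.1 + 0.5) / 2 = 0.3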
- DER and JER rely on alignment of time segments.
- NMI and F1 use speaker label clustering.
- You can tune overlap tolerance, ignore short segments, or post-process with smoothing or resegmentation to improve fairness.
- For more precision: restrict scoring with a UEM, apply a collar tolerance (e.g., 250 ms), and handle overlapping speech explicitly (see the sketch after this list).
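A sketch of these options with pyannote.metrics; the annotations and the UEM region are invented, and the uem keyword and collar/skip_overlap parameters are assumptions worth checking against your installed version:

from pyannote.core import Annotation, Segment, Timeline
from pyannote.metrics.diarization import DiarizationErrorRate

# Invented annotations for illustration
reference = Annotation()
reference[Segment(0.0, 30.0)] = "A"
hypothesis = Annotation()
hypothesis[Segment(0.0, 29.0)] = "spk0"

# collar=0.5 removes 250 ms on each side of reference boundaries;
# skip_overlap=True excludes overlapped speech from scoring
metric = DiarizationErrorRate(collar=0.5, skip_overlap=True)

# Score only the first 60 seconds (UEM = evaluated region)
uem = Timeline([Segment(0.0, 60.0)])
print(metric(reference, hypothesis, uem=uem))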
If your ground truth is highly sensitive and models under-segment (miss faint speech), DER will penalize them heavily. In such cases:
- Compare VAD-only DER vs. full DER.
- Use a relaxed collar tolerance (250 ms) or a sliding-window comparison.
- Use B-cubed precision/recall if you want frame-level evaluation (a minimal sketch follows this list).
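A minimal sketch of frame-level B-cubed precision/recall, assuming both diarization outputs have been discretized into parallel per-frame label sequences; the b_cubed helper and the example labels are invented for illustration:

from collections import Counter

def b_cubed(ref_labels, hyp_labels):
    # Frame-level B-cubed precision/recall over parallel label sequences
    pair = Counter(zip(ref_labels, hyp_labels))
    ref_count = Counter(ref_labels)
    hyp_count = Counter(hyp_labels)
    n = len(ref_labels)
    precision = sum(c * c / hyp_count[h] for (r, h), c in pair.items()) / n
    recall = sum(c * c / ref_count[r] for (r, h), c in pair.items()) / n
    return precision, recall

# Invented 10-frame example: two reference speakers, two hypothesis clusters
ref = ["A"] * 5 + ["B"] * 5
hyp = ["x"] * 6 + ["y"] * 4
print(b_cubed(ref, hyp))  # (0.833..., 0.84)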
Toolkits
- pyannote.metrics: for DER, JER, purity
- scikit-learn: for clustering metrics such as NMI (see the sketch after this list)
- dscore: NIST-style DER scorer (if needed for benchmarks)
- [tune_th]: smoothing thresholds and voice activity calibration
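A minimal sketch of a clustering-style metric with scikit-learn, assuming diarization output has been discretized into per-frame integer speaker labels; the label sequences are invented:

from sklearn.metrics import normalized_mutual_info_score

# Invented per-frame speaker labels; NMI compares the groupings,
# not the label names themselves
reference_labels = [0, 0, 0, 1, 1, 1, 1]
hypothesis_labels = [0, 0, 1, 1, 1, 1, 1]
print(normalized_mutual_info_score(reference_labels, hypothesis_labels))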
Fine-tuning refers to …