Metrics (DER)
Detection & Segmentation Metrics
Detection metrics for voice activity or overlapped speech detection include:
- precision
- recall
- F1-score
Segmentation metrics for speaker change detection (e.g., detecting the change from adult speech to child speech) include:
- purity
- coverage
These metrics are implemented in the pyannote.metrics toolkit.
Source: github.com/pyannote/pyannote-metrics
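A minimal sketch of computing these detection and segmentation metrics with pyannote.metrics; the adult/child segments and speaker labels below are invented for illustration:

from pyannote.core import Annotation, Segment
from pyannote.metrics.detection import DetectionPrecision, DetectionRecall
from pyannote.metrics.segmentation import SegmentationCoverage, SegmentationPurity

# Invented ground truth: adult speech followed by child speech
reference = Annotation()
reference[Segment(0.0, 5.0)] = "adult"
reference[Segment(5.0, 9.0)] = "child"

# Invented system output with a slightly late speaker change
hypothesis = Annotation()
hypothesis[Segment(0.0, 5.5)] = "spk0"
hypothesis[Segment(5.5, 9.0)] = "spk1"

print("precision:", DetectionPrecision()(reference, hypothesis))
print("recall:   ", DetectionRecall()(reference, hypothesis))
print("purity:   ", SegmentationPurity()(reference, hypothesis))
print("coverage: ", SegmentationCoverage()(reference, hypothesis))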
# DER implementation from pyannote.metrics
from pyannote.metrics.diarization import DiarizationErrorRate
DER (Diarization Error Rate)
The Diarization Error Rate (DER) is the most common metric for evaluating speaker diarization systems. DER is computed by summing the durations of three distinct error types (speaker confusion, false alarms, and missed detections) and dividing by the total duration of speech in the reference.
- False alarm: the model detects speech where there is none in the ground truth.
- Missed detection: the model does not detect speech where there is speech in the ground truth.
- Speaker confusion: the model assigns the wrong speaker label to a speech segment.
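Written out, the standard definition is:

\mathrm{DER} = \frac{T_{\text{false alarm}} + T_{\text{missed detection}} + T_{\text{speaker confusion}}}{T_{\text{total reference speech}}}

A minimal sketch with pyannote.metrics; the segments and labels are invented, and collar=0.5 is assumed to correspond to a 250 ms collar on each side of reference boundaries (worth verifying against your pyannote.metrics version):

from pyannote.core import Annotation, Segment
from pyannote.metrics.diarization import DiarizationErrorRate

# Invented ground truth and system output for illustration
reference = Annotation()
reference[Segment(0.0, 10.0)] = "adult"
reference[Segment(12.0, 20.0)] = "child"

hypothesis = Annotation()
hypothesis[Segment(0.0, 11.0)] = "spk0"   # 1 s of false alarm
hypothesis[Segment(12.0, 20.0)] = "spk1"

metric = DiarizationErrorRate(collar=0.5)
print(f"DER = {metric(reference, hypothesis):.3f}")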
JER (Jaccard Error Rate)
JER averages, over reference speakers, one minus the Jaccard index of each matched reference/hypothesis speaker pair:

\mathrm{JER} = \frac{1}{N} \sum_{s=1}^{N} \left(1 - \frac{|R_s \cap H_s|}{|R_s \cup H_s|}\right)

Where:
- R_s is the total time of speech from speaker s in the reference RTTM.
- H_s is the total time of speech assigned to speaker s in the hypothesis RTTM (after speaker matching).
- N is the number of reference speakers.
- \cap = overlap (intersection), and \cup = union of speech segments.

Interpretation:
- JER = 0: perfect diarization (hypothesis matches reference completely).
- JER = 1: complete mismatch.
- It is symmetric and per-speaker, making it better for speaker-level analysis than DER.
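A minimal pure-Python sketch of this formula, assuming speaker matching has already been performed and each speaker's segments are disjoint (start, end) tuples in seconds; the helper names below are invented for illustration:

def total(segments):
    # Total duration of a list of disjoint (start, end) segments
    return sum(end - start for start, end in segments)

def intersection(a, b):
    # Total overlap duration between two disjoint segment lists
    return sum(max(0.0, min(e1, e2) - max(s1, s2))
               for s1, e1 in a for s2, e2 in b)

def jer(matched_pairs):
    # Mean over reference speakers of 1 - |R ∩ H| / |R ∪ H|
    errors = []
    for ref_segs, hyp_segs in matched_pairs:
        inter = intersection(ref_segs, hyp_segs)
        union = total(ref_segs) + total(hyp_segs) - inter
        errors.append(1.0 - inter / union if union > 0 else 1.0)
    return sum(errors) / len(errors)

# Two matched (reference, hypothesis) speakers with Jaccard 0.9 and 0.5
pairs = [([(0.0, 10.0)], [(0.0, 9.0)]),
         ([(12.0, 20.0)], [(15.0, 22.0)])]
print(jer(pairs))  # (0.1 + 0.5) / 2 = 0.3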
- DER and JER rely on alignment of time segments.
- NMI and F1 use speaker label clustering.
- You can tune overlap tolerance, ignore short segments, or post-process with smoothing or resegmentation to improve fairness.
- For more precision: restrict scoring with a UEM, apply a collar tolerance (e.g., 250 ms), and handle overlapping speech explicitly (see the sketch after this list).
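A sketch of these options with pyannote.metrics; the annotations and the UEM region are invented, and the uem keyword and collar/skip_overlap parameters are assumptions worth checking against your installed version:

from pyannote.core import Annotation, Segment, Timeline
from pyannote.metrics.diarization import DiarizationErrorRate

# Invented annotations for illustration
reference = Annotation()
reference[Segment(0.0, 30.0)] = "A"
hypothesis = Annotation()
hypothesis[Segment(0.0, 29.0)] = "spk0"

# collar=0.5 removes 250 ms on each side of reference boundaries;
# skip_overlap=True excludes overlapped speech from scoring
metric = DiarizationErrorRate(collar=0.5, skip_overlap=True)

# Score only the first 60 seconds (UEM = evaluated region)
uem = Timeline([Segment(0.0, 60.0)])
print(metric(reference, hypothesis, uem=uem))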
If your ground truth is highly sensitive and models under-segment (miss faint speech), DER will penalize them heavily. In such cases:
- Compare VAD-only DER vs. full DER.
- Use a relaxed collar tolerance (250 ms) or a sliding-window comparison.
- Use B-cubed precision/recall if you want frame-level evaluation (a minimal sketch follows this list).
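A minimal sketch of frame-level B-cubed precision/recall, assuming both diarization outputs have been discretized into parallel per-frame label sequences; the b_cubed helper and the example labels are invented for illustration:

from collections import Counter

def b_cubed(ref_labels, hyp_labels):
    # Frame-level B-cubed precision/recall over parallel label sequences
    pair = Counter(zip(ref_labels, hyp_labels))
    ref_count = Counter(ref_labels)
    hyp_count = Counter(hyp_labels)
    n = len(ref_labels)
    precision = sum(c * c / hyp_count[h] for (r, h), c in pair.items()) / n
    recall = sum(c * c / ref_count[r] for (r, h), c in pair.items()) / n
    return precision, recall

# Invented 10-frame example: two reference speakers, two hypothesis clusters
ref = ["A"] * 5 + ["B"] * 5
hyp = ["x"] * 6 + ["y"] * 4
print(b_cubed(ref, hyp))  # (0.833..., 0.84)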
Toolkits
- pyannote.metrics: for DER, JER, purity
- scikit-learn: for clustering metrics such as NMI (see the sketch after this list)
- dscore: NIST-style DER scorer (if needed for benchmarks)
- [tune_th]: smoothing thresholds and voice activity calibration
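A minimal sketch of a clustering-style metric with scikit-learn, assuming diarization output has been discretized into per-frame integer speaker labels; the label sequences are invented:

from sklearn.metrics import normalized_mutual_info_score

# Invented per-frame speaker labels; NMI compares the groupings,
# not the label names themselves
reference_labels = [0, 0, 0, 1, 1, 1, 1]
hypothesis_labels = [0, 0, 1, 1, 1, 1, 1]
print(normalized_mutual_info_score(reference_labels, hypothesis_labels))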
Fine-tuning refers to …