Audio Technique

Source: https://colab.research.google.com/github/snakers4/silero-vad/blob/master/silero-vad.ipynb#scrollTo=aI_eydBPjsrx

Found this Silero-VAD playground, which takes an example.wav and, based on the detected speech timestamps, splits the audio file to create a new one without the "empty" non-speech parts.
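For reference, the core of the notebook is roughly this (function and helper names are as published in the Silero-VAD repo; example.wav stands in for the input file):

```python
import torch

# Load Silero-VAD and its helper utilities from torch.hub
model, utils = torch.hub.load('snakers4/silero-vad', 'silero_vad')
get_speech_timestamps, save_audio, read_audio, VADIterator, collect_chunks = utils

SAMPLING_RATE = 16000
wav = read_audio('example.wav', sampling_rate=SAMPLING_RATE)

# Speech timestamps as sample indices, e.g. [{'start': 0, 'end': 31200}, ...]
speech_timestamps = get_speech_timestamps(wav, model, sampling_rate=SAMPLING_RATE)

# Keep only the speech chunks and write them out as a single file
save_audio('only_speech.wav',
           collect_chunks(speech_timestamps, wav),
           sampling_rate=SAMPLING_RATE)
```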

Example Audio

Silero-VAD

PyannoteAI API (hello_world.py)

pyannote (open-source on huggingface)

Note: The background noise (laughter, silence, etc.) is removed.

reverb

(Reverb didn't change much from the original.)

So, based on that, I thought it would be a good idea to take the timestamps and use them to split the speakers completely, creating a separate audio file per speaker. This makes it easy to compare models by how cleanly the audio was split, i.e. whether one speaker's voice can still be heard overlapping in another speaker's file. A minimal sketch of the splitting step is below.
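Assuming diarization has already produced (start, end, speaker) segments in seconds (the segment values here are made up), the per-speaker split is just slicing and concatenation:

```python
import numpy as np
import soundfile as sf

# Hypothetical diarization output: (start_sec, end_sec, speaker) tuples
segments = [(0.0, 3.2, 'SPEAKER_00'), (3.2, 7.9, 'SPEAKER_01'), (7.9, 11.4, 'SPEAKER_00')]

audio, sr = sf.read('example.wav')

# Group each speaker's segments, then concatenate them into one file per speaker
per_speaker = {}
for start, end, speaker in segments:
    per_speaker.setdefault(speaker, []).append(audio[int(start * sr):int(end * sr)])

for speaker, chunks in per_speaker.items():
    sf.write(f'{speaker}.wav', np.concatenate(chunks), sr)
```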

Silero-VAD + speechbrain

This pipeline is still a work in progress. I tried combining Silero-VAD with a speaker-embedding model for diarization, choosing SpeechBrain's ECAPA-TDNN to produce the speaker embeddings.
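The rough shape of the combination: embed every VAD speech segment, then cluster the embeddings into speakers. The clustering method and the two-speaker assumption below are my simplifications, not necessarily what the final pipeline will use (wav and speech_timestamps come from the Silero-VAD step above):

```python
import torch
from sklearn.cluster import AgglomerativeClustering
from speechbrain.pretrained import EncoderClassifier

# ECAPA-TDNN speaker-embedding model pretrained on VoxCeleb
encoder = EncoderClassifier.from_hparams(source='speechbrain/spkrec-ecapa-voxceleb')

# Embed each VAD speech segment
embeddings = []
for ts in speech_timestamps:
    chunk = wav[ts['start']:ts['end']].unsqueeze(0)        # shape (1, num_samples)
    embeddings.append(encoder.encode_batch(chunk).squeeze())

# Cluster the embeddings into speakers (assumes exactly 2 speakers)
X = torch.stack(embeddings).detach().numpy()
labels = AgglomerativeClustering(n_clusters=2).fit_predict(X)

for ts, label in zip(speech_timestamps, labels):
    print(ts['start'], ts['end'], f'SPEAKER_{label}')
```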

==Note: bad performance==

PyannoteAI API (hello_world.py)

A well-performing model is the (non-open-source) pyannoteAI API.

speaker 1:

speaker 2:
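For reference, hello_world.py boils down to a single REST call. I'm writing the endpoint and payload from memory of the pyannoteAI docs, so treat them as assumptions; the API key and media URL are placeholders:

```python
import requests

API_KEY = 'PYANNOTEAI_API_KEY'  # placeholder

# Submit a diarization job (endpoint/payload assumed from the pyannoteAI docs)
response = requests.post(
    'https://api.pyannote.ai/v1/diarize',
    headers={'Authorization': f'Bearer {API_KEY}'},
    json={'url': 'https://example.com/example.wav'},  # placeholder media URL
)
response.raise_for_status()
print(response.json())  # job info; results are delivered asynchronously
```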

pyannote (open-source on huggingface)

The open-source pyannote model performs slightly worse than the API (as expected), but it still massively outperforms the Silero-VAD + SpeechBrain pipeline.
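Running the open-source pipeline is only a few lines (the model version here is an assumption; the checkpoint is gated on Hugging Face, so you need to accept its terms and pass an access token):

```python
from pyannote.audio import Pipeline

pipeline = Pipeline.from_pretrained(
    'pyannote/speaker-diarization-3.1',
    use_auth_token='HF_TOKEN',  # placeholder Hugging Face token
)

# Speaker-labeled turns, ready to feed into the per-speaker splitting above
diarization = pipeline('example.wav')
for turn, _, speaker in diarization.itertracks(yield_label=True):
    print(f'{turn.start:.1f}s - {turn.end:.1f}s {speaker}')
```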

reverb

==Note: bad performance==

Babaloon Comparison

Results do seem better on Babaloon than on the video example above, for reasons not obvious to me (it might have something to do with the sample rate (16 kHz) or the channel layout (mono)). One way to test that is sketched below.
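To rule out sample-rate and channel differences, both inputs could be normalized to 16 kHz mono before diarizing (file names are placeholders):

```python
import torchaudio

wav, sr = torchaudio.load('video_example.wav')         # (channels, samples)
wav = wav.mean(dim=0, keepdim=True)                    # downmix to mono
wav = torchaudio.functional.resample(wav, sr, 16000)   # resample to 16 kHz
torchaudio.save('video_example_16k_mono.wav', wav, 16000)
```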

Not all models achieve perfect separation (clear from listening to the audio files in my repo), but overall the results are not too bad.

Now I need to find a way to show the strengths and weaknesses of each diarization model numerically.

Possible approaches:

(Look at the Silero-VAD GitHub repo, which shows a precision-recall curve comparing it with other VAD models.)
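A minimal sketch of that idea for the VAD part: rasterize reference and predicted timestamps into per-frame speech labels and compute precision/recall. Both timestamp lists here are made-up placeholders in Silero-VAD's sample-index format:

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

def to_frames(timestamps, num_samples, hop=160):
    """Binary per-frame labels (1 = speech) from sample-index timestamps."""
    y = np.zeros(num_samples // hop, dtype=int)
    for ts in timestamps:
        y[ts['start'] // hop : ts['end'] // hop] = 1
    return y

num_samples = 16000 * 10  # 10 s at 16 kHz; use len(wav) with real audio

# Hypothetical hand-labeled reference vs. model output
reference_ts = [{'start': 0, 'end': 32000}, {'start': 64000, 'end': 128000}]
predicted_ts = [{'start': 1600, 'end': 33600}, {'start': 67200, 'end': 120000}]

ref = to_frames(reference_ts, num_samples)
hyp = to_frames(predicted_ts, num_samples)
print('precision:', precision_score(ref, hyp))
print('recall:   ', recall_score(ref, hyp))
```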