Notes

Notes from 1st meeting:

Open source diarization models
How to evaluate a diarization model

Boundary Recall & F1 score

Diarization measures
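One of the measures mentioned above, boundary recall/F1, can be sketched in plain Python: predicted boundary timestamps count as hits when they fall within a tolerance of an unmatched reference boundary. The function name, greedy matching, and the 0.25 s tolerance are illustrative choices, not a standard implementation.

```python
# Sketch: boundary precision/recall/F1 for diarization output.
# `reference` and `hypothesis` are lists of boundary times in seconds.
# A predicted boundary is a hit if it lands within `tol` seconds of a
# reference boundary that has not already been matched (greedy matching).

def boundary_f1(reference, hypothesis, tol=0.25):
    unmatched = list(reference)
    hits = 0
    for b in hypothesis:
        # closest still-unmatched reference boundary
        best = min(unmatched, key=lambda r: abs(r - b), default=None)
        if best is not None and abs(best - b) <= tol:
            hits += 1
            unmatched.remove(best)
    precision = hits / len(hypothesis) if hypothesis else 0.0
    recall = hits / len(reference) if reference else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return precision, recall, f1

# 2 of 3 predicted boundaries match within tolerance
p, r, f = boundary_f1([1.0, 3.5, 7.2], [1.1, 3.4, 6.0])
```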

Use Whisper to transcribe, then feed the transcript to an LLM to separate who is speaking (sort of like a dialogue between two people)
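The Whisper-to-LLM idea above could start with something like this: format Whisper's timestamped segments into a prompt asking an LLM to label each line with a speaker. The segment keys mirror Whisper's output (`start`, `end`, `text`); the prompt wording and speaker labels are illustrative, and the actual LLM call is left out.

```python
# Sketch: turn Whisper segments into an LLM prompt for speaker attribution.
# Segment dicts follow Whisper's output shape: {"start", "end", "text"}.

def build_speaker_prompt(segments, n_speakers=2):
    lines = [
        f"[{seg['start']:.1f}-{seg['end']:.1f}] {seg['text'].strip()}"
        for seg in segments
    ]
    return (
        f"The following is a transcript of a conversation between "
        f"{n_speakers} people. Label each line with the most likely "
        f"speaker (SPEAKER_1, SPEAKER_2, ...):\n\n" + "\n".join(lines)
    )

segments = [
    {"start": 0.0, "end": 2.1, "text": "Hi, how are you?"},
    {"start": 2.3, "end": 4.0, "text": "Fine, thanks. And you?"},
]
prompt = build_speaker_prompt(segments)
```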

Learn about TextGrids (Praat)
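As a first look at TextGrids: diarization output can be written as a minimal Praat TextGrid with one IntervalTier, one interval per speaker turn. This is a simplified hand-rolled writer (assuming sorted, contiguous intervals), not the full format; libraries like `textgrid` or `praatio` handle the general case.

```python
# Sketch: write a minimal Praat TextGrid (full text format, one IntervalTier)
# from (start, end, label) tuples, e.g. diarization output.
# Assumes intervals are sorted and contiguous.

def to_textgrid(intervals, tier_name="speaker"):
    xmin, xmax = intervals[0][0], intervals[-1][1]
    lines = [
        'File type = "ooTextFile"',
        'Object class = "TextGrid"',
        '',
        f'xmin = {xmin}',
        f'xmax = {xmax}',
        'tiers? <exists>',
        'size = 1',
        'item []:',
        '    item [1]:',
        '        class = "IntervalTier"',
        f'        name = "{tier_name}"',
        f'        xmin = {xmin}',
        f'        xmax = {xmax}',
        f'        intervals: size = {len(intervals)}',
    ]
    for i, (start, end, label) in enumerate(intervals, 1):
        lines += [
            f'        intervals [{i}]:',
            f'            xmin = {start}',
            f'            xmax = {end}',
            f'            text = "{label}"',
        ]
    return "\n".join(lines)

tg = to_textgrid([(0.0, 2.1, "SPEAKER_1"), (2.1, 4.0, "SPEAKER_2")])
```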

Babaloon (Look at )

For next meeting: explain my topic

Tried so far:

  • Tried to get voice_type_classifier running on macOS (might just be some outdated dependencies in the GitHub repo)
  • ALICE for counting the number of words, syllables and phonemes in adult speakers
  • WhisperX requires NVIDIA CUDA; it does not work on CPU
  • PyannoteAI activated for a month. The best diarization model so far (not open source)
  • So far, pyannote, reverb and VBx are working

Diarization models:

  • pyannote
  • reverb
  • silero-vad - only detects when someone is speaking, not who is speaking. https://colab.research.google.com/github/snakers4/silero-vad/blob/master/silero-vad.ipynb#scrollTo=5w5AkskZ2Fwr. Found this silero-vad playground, which takes an example.wav and, based on the timestamps, splits the audio file to create a new one without “empty noises”. Based on that, it could be a good idea to take the timestamps and use them to split the speakers completely into their own audio files. This would allow easy comparison of models by how well the audio was split, so that no overlapping voices can be heard in each other's audio
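The splitting idea in the last bullet can be sketched with only the stdlib `wave` module: given diarization timestamps, cut a mono WAV into one file per speaker by concatenating that speaker's segments. The `(speaker, start, end)` segment format and file naming are illustrative; the demo at the bottom uses synthetic audio instead of a real recording.

```python
# Sketch: split a WAV into per-speaker files from diarization timestamps.
import wave

def split_by_speaker(wav_path, segments, out_prefix="speaker"):
    """segments: list of (speaker_label, start_sec, end_sec)."""
    with wave.open(wav_path, "rb") as src:
        params = src.getparams()
        rate = src.getframerate()
        frame_bytes = src.getsampwidth() * src.getnchannels()
        audio = src.readframes(src.getnframes())

    # concatenate each speaker's byte ranges
    per_speaker = {}
    for label, start, end in segments:
        a = int(start * rate) * frame_bytes
        b = int(end * rate) * frame_bytes
        per_speaker[label] = per_speaker.get(label, b"") + audio[a:b]

    paths = []
    for label, data in per_speaker.items():
        path = f"{out_prefix}_{label}.wav"
        with wave.open(path, "wb") as dst:
            dst.setparams(params)  # nframes is corrected on close
            dst.writeframes(data)
        paths.append(path)
    return paths

# demo: 1 second of synthetic mono 16-bit audio at a 1 kHz sample rate
import os, struct, tempfile
tmpdir = tempfile.mkdtemp()
src_path = os.path.join(tmpdir, "mix.wav")
with wave.open(src_path, "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)
    w.setframerate(1000)
    w.writeframes(struct.pack("<1000h", *range(1000)))

paths = split_by_speaker(src_path, [("A", 0.0, 0.5), ("B", 0.5, 1.0)],
                         out_prefix=os.path.join(tmpdir, "spk"))
```

With per-speaker files like these, models could be compared by listening for leakage of the other speaker, as described above.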

Complete diarization models

PyannoteAI API (monthly subscription)

Source: https://dashboard.pyannote.ai/

Pyannote (open-source)

Source: https://huggingface.co/pyannote/speaker-diarization-3.1

Reverb-diarization-v2

Source: https://huggingface.co/Revai/reverb-diarization-v2

Possible other models

Silero-vad + speechbrain

Silero-vad

Source: https://github.com/snakers4/silero-vad/?tab=readme-ov-file

Speechbrain

Source: https://github.com/speechbrain/speechbrain