Meetings

Meeting 1

10 June

To-Do:

  • Open-source diarization models
  • How to evaluate
  • Writing up the evaluation
  • Abstract
  • Learn about Praat and TextGrids
  • Look at data

Meeting 2

27 June

To-Do:

  • What is DER? Write up in the report
  • Other error metrics
  • Label adult ground truth
  • Literature review (research) of open-source models
  • Get VTC running
  • Evaluation code
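As a reference point for the "What is DER" item: DER is (missed speech + false alarm + speaker confusion) divided by total reference speech. A minimal frame-level sketch, with made-up labels and assuming hypothesis speakers are already mapped to reference speakers (real scorers such as pyannote.metrics also do the optimal speaker mapping and apply a collar around reference boundaries):

```python
# Toy frame-level DER. Labels are per-frame speaker IDs; None marks silence.
# DER = (missed speech + false alarms + speaker confusion) / total reference speech.
# Assumes hypothesis speaker labels are already mapped onto reference labels.

def frame_der(reference, hypothesis):
    assert len(reference) == len(hypothesis)
    missed = false_alarm = confusion = ref_speech = 0
    for ref, hyp in zip(reference, hypothesis):
        if ref is not None:
            ref_speech += 1
        if ref is not None and hyp is None:
            missed += 1          # speech frame scored as silence
        elif ref is None and hyp is not None:
            false_alarm += 1     # silence frame scored as speech
        elif ref is not None and ref != hyp:
            confusion += 1       # speech attributed to the wrong speaker
    return (missed + false_alarm + confusion) / ref_speech

# hypothetical 10-frame example: 1 miss, 1 false alarm, 1 confusion over 7 speech frames
ref = ["A", "A", "A", None, "B", "B", "B", "B", None, None]
hyp = ["A", "A", None, None, "B", "A", "B", "B", "B", None]
print(f"DER = {frame_der(ref, hyp):.2%}")
```

The collar reported later (0.10 s / 0.25 s / 0.50 s) would correspond to ignoring frames within that distance of a reference segment boundary before counting errors.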

Meeting 3

23 July

To-Do:

  • Send TextGrids for VAD (ground truths)
  • Child vs adult discrimination
  • Label child/adult speech segments
  • Evaluation code - Figure out DER
  • Train on our data
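For the TextGrid ground-truth items, a small sketch of getting labelled intervals (as one might read them off a Praat TextGrid tier) into RTTM, the usual ground-truth format that diarization scorers consume. File name and tier contents below are hypothetical:

```python
# Sketch: turn labelled (onset, offset, speaker) intervals, e.g. read from a
# Praat TextGrid tier, into RTTM lines for diarization scoring.
# RTTM SPEAKER fields: type file channel onset duration <NA> <NA> speaker <NA> <NA>

def intervals_to_rttm(file_id, intervals):
    lines = []
    for onset, offset, speaker in intervals:
        if not speaker:          # skip empty (silence) intervals
            continue
        lines.append(
            f"SPEAKER {file_id} 1 {onset:.3f} {offset - onset:.3f} "
            f"<NA> <NA> {speaker} <NA> <NA>"
        )
    return "\n".join(lines)

# hypothetical tier contents
tier = [(0.00, 1.25, "adult"), (1.25, 1.80, ""), (1.80, 3.10, "child")]
print(intervals_to_rttm("rec001", tier))
```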

Meeting 4

30 July

To-Do:

  • Finalize the ground truths

Meeting 5

6 Aug

To-Do:

  • “Final” result (with all metrics) for a first model (on dev)
  • Fine-tune on our data and check
  • Train adult/child discriminator on our data
  • Decide on metrics (after reading)
  • Agreement form

Meeting 6

13 Aug

To-Do:

  • Writing (including metrics & figures)
  • Figure out DER hyperparameters in Pyannote
  • kNN and LogReg for adult/child classification
  • PC

Key takeaway from this meeting: how to build the adult/child classifier using speaker embeddings.
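The takeaway above can be sketched end-to-end: one speaker embedding per segment (in practice from an embedding model such as ECAPA-TDNN; here random Gaussian clusters stand in for real embeddings), fed to a logistic-regression binary classifier. All data below is synthetic:

```python
# Sketch of the adult/child classifier: per-segment speaker embeddings
# (synthetic stand-ins here) -> logistic-regression binary classifier.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
dim = 192                                    # ECAPA-TDNN embedding size
adult = rng.normal(loc=-0.5, scale=1.0, size=(200, dim))
child = rng.normal(loc=+0.5, scale=1.0, size=(200, dim))
X = np.vstack([adult, child])
y = np.array([0] * 200 + [1] * 200)          # 0 = adult, 1 = child

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print(f"held-out accuracy: {clf.score(X_te, y_te):.2%}")
```

The same embeddings could be fed to kNN instead by swapping in `KNeighborsClassifier`.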

Meeting 7

20 Aug

(I think Herman was away at Interspeech.)

Meeting 8

27 Aug

To-Do:

  • NeMo
  • Read up on speechbrain embeddings (x-vector, but there are others)
  • Logistic-regression binary classifier (adult vs child)
  • Fine-tuning
  • Writing

Meeting 9

3 Sep

To-Do:

  • MAC address of dongle (used the Citrix VPN to connect to JABA in the end)
  • NeMo diarization extraction using JABA
  • Read up on speaker embeddings (SpeechBrain ECAPA-TDNN (spkrec-ecapa-voxceleb) & x-vector TDNN (spkrec-xvect-voxceleb))
  • Fine-tuning
  • Writing report
  • If time allows: other classifiers (SVM, AHC, etc.) and other embedding types

Meeting 10

10 Sep

No meeting (holiday).

Worked on writing and gathered more research to include in the report.

Coding-wise, finished the NeMo and VBx diarizations, and experimented with combining different VADs and embedding models capable of diarization. Used k-means for clustering (got decent performance), but other clustering methods are still worth trying.
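The clustering step mentioned above, as a minimal sketch: given one embedding per VAD segment, k-means groups segments into speakers. Synthetic two-speaker embeddings stand in for real ones:

```python
# Sketch of the clustering step in the diarization pipeline:
# one embedding per VAD segment -> k-means -> speaker cluster per segment.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
spk_a = rng.normal(-1.0, 0.3, size=(10, 16))   # synthetic speaker-A segments
spk_b = rng.normal(+1.0, 0.3, size=(10, 16))   # synthetic speaker-B segments
segments = np.vstack([spk_a, spk_b])

labels = KMeans(n_clusters=2, n_init=10, random_state=1).fit_predict(segments)
print(labels)   # segments from the same speaker share a cluster ID
```

AHC (`sklearn.cluster.AgglomerativeClustering`) would drop in at the same point, which is the "other clustering methods" option.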

NeMo results on afr training set:

  • [Train] Average DER (collar=0.10s): 44.62%
  • [Train] Average DER (collar=0.25s): 39.00%
  • [Train] Average DER (collar=0.50s): 31.50%
  • [Train] Average JER: 35.79%

VBx results on afr training set:

  • [Train] Evaluated 221 files.
  • [Train] Average DER (collar=0.10s): 81.96%
  • [Train] Average DER (collar=0.25s): 80.94%
  • [Train] Average DER (collar=0.50s): 82.80%
  • [Train] Average JER: 37.50%

VBx is punished harshly by DER because of its inability to handle overlapping speech; its JER, however, is actually on par with NeMo's.

To-Do:

  • Fine-tuning
  • Writing