Meetings
Meeting 1
10 June
To-Do:
- Open-source diarisation models
- How to evaluate
- Writing up the evaluation
- Abstract
- Learn about Praat and TextGrids (reading sketch after this list)
- Look at data
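A minimal sketch of reading a Praat TextGrid in Python, assuming the third-party `textgrid` package (praatio would work similarly); the file name is a placeholder.

```python
# Sketch: list the labelled intervals in a Praat TextGrid.
# Assumes the third-party `textgrid` package (pip install textgrid) and that
# the tiers are interval tiers; "example.TextGrid" is a placeholder file name.
import textgrid

tg = textgrid.TextGrid.fromFile("example.TextGrid")

for tier in tg:                      # one tier per annotation layer (e.g. per speaker)
    for interval in tier:            # labelled time intervals within the tier
        if interval.mark.strip():    # skip empty (silence) intervals
            print(f"{tier.name}: {interval.minTime:.2f}-{interval.maxTime:.2f} '{interval.mark}'")
```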
Meeting 2
27 June
To-Do:
- What is DER? ← write in report (formula sketched after this list)
- Other error metrics
- Label ground-truth adult speech
- Literature review of open-source models
- Get VTC running
- Evaluation code
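For the report, a one-line definition of DER: the durations of false-alarm speech, missed speech, and speaker confusion are summed over the file and divided by the total duration of reference speech (scoring tools usually also apply a forgiveness collar around segment boundaries and can optionally skip overlapped regions):

```latex
\mathrm{DER} = \frac{T_{\text{false alarm}} + T_{\text{missed speech}} + T_{\text{speaker confusion}}}{T_{\text{total reference speech}}}
```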
Meeting 3
23 July
To-Do:
- Send TextGrids for VAD (ground truths)
- Child vs adult discrimination
- Label child/adult speech segments
- Evaluation code - figure out DER (scoring sketch after this list)
- Train on our data
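A minimal sketch of scoring DER once the TextGrid ground truths are available, assuming pyannote.metrics; the intervals and speaker labels below are placeholders standing in for segments read out of the TextGrids.

```python
# Sketch: score a diarisation hypothesis against TextGrid-derived ground truth.
# Assumes pyannote.core / pyannote.metrics are installed; the segments below
# are hypothetical stand-ins for intervals read out of TextGrids.
from pyannote.core import Annotation, Segment
from pyannote.metrics.diarization import DiarizationErrorRate

# Reference (ground truth): label each speech interval with its speaker.
reference = Annotation(uri="example_recording")
reference[Segment(0.0, 2.5)] = "adult"
reference[Segment(2.5, 4.0)] = "child"

# Hypothesis: the diariser's output in the same format.
hypothesis = Annotation(uri="example_recording")
hypothesis[Segment(0.1, 2.4)] = "spk0"
hypothesis[Segment(2.4, 4.2)] = "spk1"

# collar forgives small boundary errors; skip_overlap ignores overlapped regions.
metric = DiarizationErrorRate(collar=0.25, skip_overlap=False)
print(f"DER = {metric(reference, hypothesis):.3f}")
```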
Meeting 4
30 July
To-Do:
- Finalize the ground truths
Meeting 5
6 Aug
To-Do:
- “Final” result (with all metrics) for a first model (on dev)
- Fine-tune on our data and check
- Train adult/child discriminator on our data
- Decide on metrics (after reading)
- Agreement form
Meeting 6
13 Aug
To-Do:
- Writing (including metrics & figures)
- Figure out DER hyperparameters in Pyannote
- kNN and LogReg for adult/child classification
- PC
Key takeaway from this meeting: how to build the adult/child classifier on top of speaker embeddings (sketched below).
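A minimal sketch of the kNN vs. logistic-regression comparison, assuming the speaker embeddings (e.g. ECAPA vectors) and binary adult/child labels have already been extracted; the arrays, dimensions, and hyperparameters below are illustrative placeholders, not the project's actual setup.

```python
# Sketch: compare kNN and logistic regression for adult/child classification
# on precomputed speaker embeddings. X and y are hypothetical placeholders:
# X = (n_segments, embedding_dim) float array, y = 0/1 adult/child labels.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 192))      # stand-in for real embeddings
y = rng.integers(0, 2, size=200)     # stand-in for adult/child labels

models = {
    "logreg": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "knn": make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5)),
}

for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    print(f"{name}: mean accuracy {scores.mean():.3f} (+/- {scores.std():.3f})")
```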
Meeting 7
20 Aug
(I think Herman was away at Interspeech.)
Meeting 8
27 Aug
To-Do:
- NeMo
- Read up on SpeechBrain embeddings (x-vector, but there are others)
- Logistic regression binary classifier (adult vs child)
- Fine-tuning
- Writing
Meeting 9
3 Sep
To-Do:
- MAC address of the dongle (used the Citrix VPN to connect to JABA in the end)
- NeMo diarisation extraction using JABA
- Read up on speaker embeddings (SpeechBrain ECAPA-TDNN (spkrec-ecapa-voxceleb) & x-vector TDNN (spkrec-xvect-voxceleb)) - extraction sketch after this list
- Fine-tuning
- Writing report
- If time: other classifiers (SVM, AHC, etc.) and other embedding types
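A minimal sketch of extracting one ECAPA-TDNN embedding with SpeechBrain, assuming a standard SpeechBrain install (in releases >= 1.0 the import path is speechbrain.inference.speaker rather than speechbrain.pretrained); the wav path is a placeholder.

```python
# Sketch: extract a speaker embedding with SpeechBrain's pretrained ECAPA-TDNN.
# Assumes speechbrain + torchaudio are installed; "example.wav" is a placeholder.
import torchaudio
from speechbrain.pretrained import EncoderClassifier

encoder = EncoderClassifier.from_hparams(
    source="speechbrain/spkrec-ecapa-voxceleb",   # swap for spkrec-xvect-voxceleb to get x-vectors
    savedir="pretrained_models/spkrec-ecapa-voxceleb",
)

signal, sample_rate = torchaudio.load("example.wav")   # model expects 16 kHz mono audio
embedding = encoder.encode_batch(signal)               # shape: (batch, 1, embedding_dim)
print(embedding.squeeze().shape)                       # 192-dim for ECAPA, 512-dim for x-vector
```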
Meeting 10
10 Sep
No meeting (holiday week).
Worked on writing and gathered more research to include in the report.
Coding-wise, finished the NeMo and VBx diarisations and experimented with combining different VADs with embedding models capable of diarisation. Used k-means to cluster the embeddings (decent performance; sketch below), but other clustering methods are still worth trying.
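A minimal sketch of that k-means step, assuming per-segment speaker embeddings have already been produced by a VAD + embedding pipeline; the segment times, array shapes, and choice of two clusters are illustrative placeholders.

```python
# Sketch: cluster per-segment speaker embeddings with k-means to assign speaker
# labels. `segments` and `embeddings` are hypothetical placeholders for the
# output of a VAD + embedding-extraction pipeline.
import numpy as np
from sklearn.cluster import KMeans

segments = [(0.0, 1.2), (1.4, 2.8), (3.0, 4.1)]               # (start, end) times from the VAD
embeddings = np.random.default_rng(0).normal(size=(3, 192))   # one embedding per segment

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)      # e.g. two speakers
labels = kmeans.fit_predict(embeddings)

for (start, end), label in zip(segments, labels):
    print(f"{start:.2f}-{end:.2f}s -> speaker_{label}")
```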
NeMo results on afr training set:
- [Train] Average DER (collar=0.10s): 44.62%
- [Train] Average DER (collar=0.25s): 39.00%
- [Train] Average DER (collar=0.50s): 31.50%
- [Train] Average JER: 35.79%
VBx results on afr training set:
- [Train] Evaluated 221 files.
- [Train] Average DER (collar=0.10s): 81.96%
- [Train] Average DER (collar=0.25s): 80.94%
- [Train] Average DER (collar=0.50s): 82.80%
- [Train] Average JER: 37.50%
VBx is punished harshly by DER because it cannot handle overlapping speech; its JER, however, is on par with NeMo's (JER definition sketched below).
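For context on the JER numbers, the Jaccard error rate as defined for the DIHARD evaluations (the dscore toolkit) averages a per-speaker Jaccard-style error over the reference speakers after an optimal one-to-one speaker mapping; unmapped reference speakers score 100%:

```latex
\mathrm{JER} = \frac{1}{N_{\text{ref}}} \sum_{i=1}^{N_{\text{ref}}}
\left( 1 - \frac{\lvert \mathrm{ref}_i \cap \mathrm{sys}_{m(i)} \rvert}{\lvert \mathrm{ref}_i \cup \mathrm{sys}_{m(i)} \rvert} \right)
```

JER weights every reference speaker equally rather than by speech duration, which is why it can diverge sharply from DER.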
To-Do:
- Fine-tuning
- Writing