Ground Truth

TextGrids

A TextGrid object consists of a number of tiers. There are two kinds of tiers: an interval tier is a connected sequence of labelled intervals with boundaries in between, and a point tier is a sequence of labelled points.

Essentially, a TextGrid gives intervals of text that align with the audio.

This is handy because it will be used as the ground truth against which the different diarization models are assessed.

I coded a function that takes a TextGrid and converts it to RTTM format.
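For reference, here is a minimal sketch of what such a conversion can look like. It is not the actual function, just an illustration. It assumes the interval tier has already been read into a list of (start, end, label) tuples in seconds, and the names intervals_to_rttm, file_id and speaker are made up for this example. Each output line follows the standard 10-field RTTM SPEAKER layout.

```python
def intervals_to_rttm(intervals, file_id, speaker="child", channel=1):
    """Turn (start, end, label) intervals into RTTM SPEAKER lines.

    Assumption: `intervals` comes from one interval tier of a TextGrid,
    with times in seconds. Unlabelled intervals are treated as silence.
    """
    lines = []
    for start, end, label in intervals:
        # Empty labels mark gaps between utterances, so skip them.
        if not label.strip():
            continue
        duration = end - start
        # Ignore zero-length or inverted intervals.
        if duration <= 0:
            continue
        # Fields: type, file, channel, onset, duration, <NA>, <NA>, speaker, <NA>, <NA>
        lines.append(
            f"SPEAKER {file_id} {channel} {start:.3f} {duration:.3f} "
            f"<NA> <NA> {speaker} <NA> <NA>"
        )
    return lines


# Hypothetical example: one silent interval followed by one utterance.
example = intervals_to_rttm([(0.0, 1.5, ""), (1.5, 3.0, "hello")], "id_1")
print("\n".join(example))
```

Every labelled interval is given the same speaker name here because the TextGrids only contain child speech.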

Note: The TextGrids cover only child speech, whereas the diarization models identify all speech. This should not be an issue, just a slight inconvenience when getting the diarization and ground-truth RTTMs into a similar format.

Using the RTTM format

Some issues with the ground truths:

The Pyannote API detected 1 speaker for some audio.

The reasons I noticed:

  • The child is talking very quietly (id_145_main_pre_test.wav)
  • The child's pitch is too high, possibly because of whispering. The child is also just not talking at times (id_133_main_pre_test.wav)

The Pyannote API detected 3 speakers for some audio.

The reasons I noticed:

  • Background noises (other kids & teachers, birds). Good examples: Background noise in id_18_main_post_test.wav. A bird was detected in id_141_main_post_test.wav.
  • Different amplitude levels (the kid whispers and then talks normally), which mislead the diarization
  • What’s going on with id_200_main_post_test.wav?? More than one child
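
Files like the 1-speaker and 3-speaker cases above can be flagged automatically by counting the distinct speaker labels in each diarization RTTM. The snippet below is a rough sketch of that check. The function names are made up, and the default of 2 expected speakers is my assumption about the recordings, so adjust it as needed.

```python
def count_speakers(rttm_lines):
    """Count distinct speaker names in a list of RTTM lines."""
    speakers = set()
    for line in rttm_lines:
        fields = line.split()
        # Only SPEAKER records carry a speaker name (8th field).
        if len(fields) < 8 or fields[0] != "SPEAKER":
            continue
        speakers.add(fields[7])
    return len(speakers)


def flag_unexpected(rttm_by_file, expected=2):
    """Return {file name: speaker count} for files that deviate from `expected`.

    Assumption: `expected=2` is a guess, not something fixed by the data.
    """
    flagged = {}
    for name, lines in rttm_by_file.items():
        n = count_speakers(lines)
        if n != expected:
            flagged[name] = n
    return flagged


# Hypothetical diarization output for two files.
sample = {
    "a.wav": [
        "SPEAKER a 1 0.000 1.000 <NA> <NA> SPEAKER_00 <NA> <NA>",
        "SPEAKER a 1 1.200 0.800 <NA> <NA> SPEAKER_01 <NA> <NA>",
    ],
    "b.wav": [
        "SPEAKER b 1 0.000 2.000 <NA> <NA> SPEAKER_00 <NA> <NA>",
    ],
}
print(flag_unexpected(sample))
```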

Also, the diarization is definitely going to struggle with these cases:

  • Continuous talking (id_234_main_pre_test.wav)
  • What’s going on with id_281_main_pre_test.wav?? The recording has a bunch of noise and low volume.