Speaker Diarization
Transcription: Batch, Real-Time. Deployments: All
Overview
Speaker Diarization aggregates all audio channels into a single stream for processing, and picks out different speakers based on acoustic matching.
The feature is disabled by default. To enable Speaker Diarization, set the following in the config object:
{
"type": "transcription",
"transcription_config": {
"language": "en",
"diarization": "speaker"
}
}
When enabled, every word and punctuation object in the output results will be given a "speaker" property, which is a label indicating who said that word. There are two kinds of labels you will see:
S# - S stands for speaker and the # will be an incrementing integer identifying an individual speaker. S1 will appear first in the results, followed by S2, S3, etc.
UU - Diarization is disabled or individual speakers cannot be identified. UU can appear, for example, if some background noise is transcribed as speech but the diarization system does not recognise it as a speaker.
Note: Enabling diarization increases the time taken to transcribe an audio file. In general, we expect the use of diarization to increase the overall processing time by 10-50%.
The example below shows relevant parts of a transcript with 3 speakers. The output shows the configuration information passed in the config.json object and relevant segments with the different speakers in the JSON output. Only part of the transcript is shown here to highlight how different speakers are displayed in the output.
"format": "2.8",
"metadata": {
"created_at": "2020-07-01T13:26:48.467Z",
"type": "transcription",
"language_pack_info": {
"adapted": false,
"itn": true,
"language_description": "English",
"word_delimiter": " ",
"writing_direction": "left-to-right"
},
"transcription_config": {
"language": "en",
"diarization": "speaker"
}
},
"results": [
{
"alternatives": [
{
"confidence": 0.93,
"content": "hello",
"language": "en",
"speaker": "S1"
}
],
"end_time": 0.51,
"start_time": 0.36,
"type": "word"
},
{
"alternatives": [
{
"confidence": 1.0,
"content": "hi",
"language": "en",
"speaker": "S2"
}
],
"end_time": 12.6,
"start_time": 12.27,
"type": "word"
},
{
"alternatives": [
{
"confidence": 1.0,
"content": "good",
"language": "en",
"speaker": "S3"
}
],
"end_time": 80.63,
"start_time": 80.48,
"type": "word"
}
In our JSON output, start_time identifies when a person starts speaking each utterance and end_time identifies when they finish speaking.
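As a rough illustration of how the "speaker" labels can be consumed, the sketch below collapses consecutive results that share a label into speaker turns. It assumes the result shape shown in the example above (a "results" list whose items carry "alternatives", "start_time" and "end_time"). The function name `speaker_turns` is hypothetical, and joining content with spaces is a simplification that does not handle punctuation spacing.

```python
from itertools import groupby


def speaker_turns(results):
    """Collapse consecutive results with the same speaker label into turns."""

    def label(item):
        # The label sits on the first alternative; treat a missing label as "UU".
        return item["alternatives"][0].get("speaker", "UU")

    turns = []
    for speaker, group in groupby(results, key=label):
        items = list(group)
        turns.append({
            "speaker": speaker,
            "start_time": items[0]["start_time"],
            "end_time": items[-1]["end_time"],
            # Simplification: tokens are joined with single spaces.
            "text": " ".join(i["alternatives"][0]["content"] for i in items),
        })
    return turns
```

Calling `speaker_turns(transcript["results"])` on a parsed transcript returns one dictionary per uninterrupted run of a single speaker label.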
Speaker Diarization tuning
Transcription: Batch. Deployments: All
The sensitivity of the speaker detection is set to a sensible default that gives the optimum performance under most circumstances. However, you can change this value based on your specific requirements by using the `speaker_sensitivity` setting in the `speaker_diarization_config` section of the job config object, which takes a value between 0 and 1 (the default is 0.5).
A higher sensitivity increases the likelihood of more unique speakers being returned. For example, if you see fewer speakers returned than expected, you can try increasing the sensitivity value; if too many speakers are returned, try reducing it. The number of speakers is not guaranteed to change, since several factors can affect how many speakers are detected. Here's an example of how to set the value:
{
"type": "transcription",
"transcription_config": {
"language": "en",
"diarization": "speaker",
"speaker_diarization_config": {
"speaker_sensitivity": 0.6
}
}
}
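If the job config is generated programmatically, a small helper can catch out-of-range values before the job is submitted. This is only a sketch: the function name `batch_diarization_config` is hypothetical, and it assumes that the 0 to 1 range is inclusive at both ends, which the text above does not state explicitly.

```python
import json


def batch_diarization_config(language="en", speaker_sensitivity=None):
    """Build a batch job config with speaker diarization enabled."""
    transcription_config = {"language": language, "diarization": "speaker"}
    if speaker_sensitivity is not None:
        # Assumption: both 0 and 1 are accepted values.
        if not 0 <= speaker_sensitivity <= 1:
            raise ValueError("speaker_sensitivity must be between 0 and 1")
        transcription_config["speaker_diarization_config"] = {
            "speaker_sensitivity": speaker_sensitivity
        }
    return {"type": "transcription", "transcription_config": transcription_config}


if __name__ == "__main__":
    print(json.dumps(batch_diarization_config(speaker_sensitivity=0.6), indent=2))
```

Leaving `speaker_sensitivity` unset omits the `speaker_diarization_config` section, so the default sensitivity applies.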
Max speakers
Transcription: Real-Time. Deployments: All
The maximum number of speakers can be limited to prevent too many speakers being predicted. This is set using the `max_speakers` parameter in the `speaker_diarization_config` (which is sent as part of the `StartRecognition` message). The default value is 20, but it can take any integer value between 2 and 20 inclusive. Here is an example `StartRecognition` message with the `max_speakers` parameter set:
{
"message": "StartRecognition",
"audio_format": {
"type": "raw",
"encoding": "pcm_f32le",
"sample_rate": 48000
},
"transcription_config": {
"language": "en",
"operating_point": "enhanced",
"diarization": "speaker",
"speaker_diarization_config": {
"max_speakers": 10
}
}
}
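The same message can be assembled in code with the documented range check applied first. The sketch below is illustrative only: the function name `start_recognition_message` is hypothetical, the audio format values are copied from the example above, and sending the message over the real-time connection is not shown.

```python
import json


def start_recognition_message(max_speakers=None, language="en"):
    """Return a StartRecognition message as a JSON string."""
    transcription_config = {
        "language": language,
        "operating_point": "enhanced",
        "diarization": "speaker",
    }
    if max_speakers is not None:
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(max_speakers, bool) or not isinstance(max_speakers, int):
            raise TypeError("max_speakers must be an integer")
        if not 2 <= max_speakers <= 20:
            raise ValueError("max_speakers must be between 2 and 20 inclusive")
        transcription_config["speaker_diarization_config"] = {
            "max_speakers": max_speakers
        }
    return json.dumps({
        "message": "StartRecognition",
        "audio_format": {
            "type": "raw",
            "encoding": "pcm_f32le",
            "sample_rate": 48000,
        },
        "transcription_config": transcription_config,
    })
```

For example, `start_recognition_message(max_speakers=10)` produces the message shown above, while a value such as 1 or 25 raises a `ValueError` before anything is sent.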
Speaker Diarization and Punctuation
To enhance the accuracy of our Speaker Diarization, we make small corrections to the speaker labels based on the punctuation in the transcript. For example, if our system originally thought that 9 words in a sentence were spoken by speaker S1 and only 1 word by speaker S2, we will correct the incongruous S2 label to be S1. This only works if punctuation is enabled in the transcript.
If you disable punctuation by removing end-of-sentence punctuation through permitted_marks in the punctuation_overrides section, then diarization will not work correctly.
Changing the punctuation sensitivity will also affect the accuracy of Speaker Diarization.
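To make the idea of a sentence-level correction concrete, the sketch below relabels every token in a sentence with that sentence's most common speaker. This is not the algorithm used by the product, whose details are not described here. It is a simple majority vote written for illustration, and the function name, the (content, speaker) tuple format and the set of sentence-ending marks are all assumptions.

```python
from collections import Counter

# Assumption: these marks end a sentence.
SENTENCE_END = {".", "?", "!"}


def smooth_speaker_labels(tokens):
    """Relabel each sentence with its majority speaker.

    tokens is a list of (content, speaker) tuples in transcript order.
    """
    smoothed = []
    sentence = []

    def flush():
        if sentence:
            majority = Counter(s for _, s in sentence).most_common(1)[0][0]
            smoothed.extend((content, majority) for content, _ in sentence)
            sentence.clear()

    for content, speaker in tokens:
        sentence.append((content, speaker))
        if content in SENTENCE_END:
            flush()
    flush()  # Handle a trailing sentence with no closing punctuation.
    return smoothed
```

The sketch also shows why end-of-sentence punctuation matters: without it, the whole transcript would be treated as a single sentence and there would be no boundaries to correct within.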
Speaker Diarization timeout
Speaker Diarization will time out if it takes too long to run for a particular audio file. Currently, the timeout is set to 5 minutes or 0.5 * the audio duration, whichever is longer. For example, with a 2-hour audio file the timeout is 1 hour. If a timeout occurs, the transcript will still be returned, but all speaker labels in the output will be set to UU.
Under normal operation we do not expect diarization to time out, but diarization can be affected by a number of factors including audio quality and the number of speakers. If you encounter timeouts frequently, please contact Speechmatics support.
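The timeout rule described in this section can be expressed as a short calculation, along with a check for transcripts in which every label is UU. Both function names are hypothetical, and the result shape is assumed to match the JSON example earlier on this page. Note that an all-UU transcript does not prove a timeout occurred, since UU is also used when speakers cannot be identified.

```python
def diarization_timeout_seconds(audio_duration_seconds):
    """Return the longer of 5 minutes and half the audio duration, in seconds."""
    return max(5 * 60, 0.5 * audio_duration_seconds)


def all_labels_unknown(results):
    """Return True if there are results and every speaker label is "UU"."""
    labels = [
        alternative.get("speaker")
        for item in results
        for alternative in item.get("alternatives", [])
    ]
    return bool(labels) and all(label == "UU" for label in labels)
```

For a 2-hour file, `diarization_timeout_seconds(2 * 60 * 60)` gives 3600 seconds (1 hour). For a 1-minute file the 5-minute floor applies, giving 300 seconds.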