Speaker change detection

Transcription: Batch, Real-Time | Deployments: All

Note: We recommend using speaker diarization instead of speaker change due to improvements in speaker detection accuracy.

This feature adds markers to the transcript that indicate where a speaker change has been detected in the audio. The markers appear in the JSON transcript only. For example, if the audio contains two people speaking to each other and you want the transcript to show each change of speaker, specify speaker_change as the diarization setting:

{
  "type": "transcription",
  "transcription_config": {
    "language": "en",
    "diarization": "speaker_change"
  }
}

The transcript will contain a special JSON element in the results array between two words wherever a different person starts talking. For example, if one person says "Hello James" and the other responds with "Hi", there will be a speaker_change JSON element between "James" and "Hi":

"results": [
  {
    "start_time": 0.1,
    "end_time": 0.22,
    "type": "word",
    "alternatives": [
      {
        "confidence": 0.71,
        "content": "Hello",
        "language": "en",
        "speaker": "UU"
      }
    ]
  },
  {
    "start_time": 0.22,
    "end_time": 0.55,
    "type": "word",
    "alternatives": [
      {
        "confidence": 0.71,
        "content": "James",
        "language": "en",
        "speaker": "UU"
      }
    ]
  },
  {
    "start_time": 0.55,
    "end_time": 0.55,
    "type": "speaker_change",
    "alternatives": []
  },
  {
    "start_time": 0.56,
    "end_time": 0.61,
    "type": "word",
    "alternatives": [
      {
        "confidence": 0.71,
        "content": "Hi",
        "language": "en",
        "speaker": "UU"
      }
    ]
  }
]
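
As an illustration of how these markers might be consumed, the following Python sketch groups the words of a results array into segments, starting a new segment at each speaker_change element. The function name split_on_speaker_change and the trimmed sample data are assumptions made for this example; they are not part of the API. Items of any type other than word and speaker_change are ignored.

```python
from typing import Any, Dict, List


def split_on_speaker_change(results: List[Dict[str, Any]]) -> List[str]:
    """Group word content into segments, breaking at each speaker_change item.

    Hypothetical helper: it assumes the results layout shown in the example above.
    """
    segments: List[List[str]] = [[]]
    for item in results:
        item_type = item.get("type")
        if item_type == "speaker_change":
            # Open a new segment only if the current one already holds words,
            # so consecutive markers do not create empty segments.
            if segments[-1]:
                segments.append([])
        elif item_type == "word":
            alternatives = item.get("alternatives") or []
            if alternatives:
                # Take the first alternative's content as the word text.
                segments[-1].append(alternatives[0]["content"])
    return [" ".join(words) for words in segments if words]


# Trimmed copy of the example results above (only the fields the helper reads).
sample_results = [
    {"type": "word", "alternatives": [{"content": "Hello"}]},
    {"type": "word", "alternatives": [{"content": "James"}]},
    {"type": "speaker_change", "alternatives": []},
    {"type": "word", "alternatives": [{"content": "Hi"}]},
]

print(split_on_speaker_change(sample_results))  # ['Hello James', 'Hi']
```

Note that the markers say only that the speaker changed, not who is speaking, so the segments are returned without speaker labels.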

The sensitivity of speaker change detection is set to a default that gives the best performance under most circumstances. You can, however, change this if you wish by using the speaker_change_sensitivity setting, which takes a value between 0 and 1 (the default is 0.4). The higher the sensitivity, the more likely a speaker change is to be indicated. In our own experiments, values below 0.3 produced too few speaker change events, and values above 0.6 produced too many false positives. Here's an example of how to set the value:

{
  "type": "transcription",
  "transcription_config": {
    "language": "en",
    "diarization": "speaker_change",
    "speaker_change_sensitivity": 0.55
  }
}
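
If you generate this configuration programmatically, a small helper can check the sensitivity value before the request is sent. The sketch below is one possible approach. The function name build_speaker_change_config is hypothetical, and treating the bounds 0 and 1 as inclusive is an assumption, since the text above says only "between 0 and 1".

```python
import json
from typing import Any, Dict, Optional


def build_speaker_change_config(
    language: str = "en", sensitivity: Optional[float] = None
) -> Dict[str, Any]:
    """Build a config dict that enables speaker change detection.

    Hypothetical helper; omits speaker_change_sensitivity when no value is
    given, so the service default (0.4) applies.
    """
    transcription_config: Dict[str, Any] = {
        "language": language,
        "diarization": "speaker_change",
    }
    if sensitivity is not None:
        # Assumption: the documented range 0 to 1 includes both endpoints.
        if not 0.0 <= sensitivity <= 1.0:
            raise ValueError("speaker_change_sensitivity must be between 0 and 1")
        transcription_config["speaker_change_sensitivity"] = sensitivity
    return {"type": "transcription", "transcription_config": transcription_config}


# Reproduces the configuration shown above.
print(json.dumps(build_speaker_change_config(sensitivity=0.55), indent=2))
```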