Speaker diarization
Split a transcript into turns by speaker. Enable it with one flag and get a speaker label on every segment.
What it does
When speaker recognition is on, the API analyzes the audio and assigns each segment to a distinct voice. You get the same transcript text, plus a speaker label on every segment in result.segments[] so you can reconstruct who said what.
Enabling it
Pass speaker_recognition: true in the request body. It defaults to false, so you only pay the extra processing time when you actually need it.
curl -X POST https://api.quillhub.ai/v1/transcriptions \
-H "Authorization: Bearer $QAI_KEY" \
-H "Content-Type: application/json" \
-d '{
"url": "https://example.com/interview.mp3",
"speaker_recognition": true
}'
Response shape
With diarization enabled, each entry in result.segments[] gains a speaker field alongside start, end, and text. The top-level result.text is unchanged — it's still the plain concatenated transcript — so group by speaker using result.segments.
{
"id": "trs_01HZX9K7Q2M4YV8BTA6JRN3PDE",
"status": "completed",
"result": {
"text": "Welcome to the show. Thanks for having me. So let's start with your background. Sure, I grew up in Boston and studied computer science.",
"segments": [
{ "start": 0.00, "end": 2.41, "text": "Welcome to the show.", "speaker": "Speaker 1" },
{ "start": 2.48, "end": 4.12, "text": "Thanks for having me.", "speaker": "Speaker 2" },
{ "start": 4.20, "end": 7.05, "text": "So let's start with your background.", "speaker": "Speaker 1" },
{ "start": 7.15, "end": 9.88, "text": "Sure, I grew up in Boston", "speaker": "Speaker 2" },
{ "start": 9.92, "end": 12.60, "text": "and studied computer science.", "speaker": "Speaker 2" }
]
}
}
Grouping by speaker
Adjacent segments from the same speaker usually form a single turn. A small client-side helper collapses them into a clean list of turns for rendering:
// segments: [{ start, end, text, speaker }, ...]
function groupBySpeaker(segments) {
const groups = [];
for (const seg of segments) {
const last = groups[groups.length - 1];
if (last && last.speaker === seg.speaker) {
last.lines.push(seg.text);
} else {
groups.push({ speaker: seg.speaker, lines: [seg.text] });
}
}
return groups;
}
const turns = groupBySpeaker(result.segments);
Accuracy and limits
- Works best on clear audio with distinct voices — interviews, podcasts, meetings with separate mics.
- Accuracy drops on heavy overlap, cross-talk, background noise, or low-bitrate audio.
- Labels are generic placeholders ("Speaker 1", "Speaker 2", …) — you map them to real names yourself.
- Labels are consistent within one transcription but not across different ones. Speaker 1 in file A is unrelated to Speaker 1 in file B.
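Since the API only returns generic labels, mapping them to real names is a client-side step. A minimal sketch, assuming you already know who each speaker is (say, from interview metadata) — the name map and `relabel` helper are hypothetical, not part of the API:

```javascript
// Hypothetical mapping from the API's generic labels to known names.
const names = { "Speaker 1": "Host", "Speaker 2": "Guest" };

// Relabel each segment, falling back to the generic label for any
// speaker missing from the map.
function relabel(segments, names) {
  return segments.map((seg) => ({
    ...seg,
    speaker: names[seg.speaker] ?? seg.speaker,
  }));
}

// Example: the first two segments from the response above.
const segments = [
  { start: 0.0, end: 2.41, text: "Welcome to the show.", speaker: "Speaker 1" },
  { start: 2.48, end: 4.12, text: "Thanks for having me.", speaker: "Speaker 2" },
];
const named = relabel(segments, names);
// named[0].speaker === "Host", named[1].speaker === "Guest"
```

Because labels are not stable across transcriptions, build a fresh map per file rather than reusing one globally.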
Cost
Diarization adds roughly 10–15% processing time and is not billed separately — the transcription cost is the same whether the flag is on or off.