Speaker diarization
Split a transcript into turns by speaker. Enable it with one flag and get a speaker label on every segment.
What it does
When speaker recognition is on, the API analyzes the audio and assigns each segment to a distinct voice. You get the same transcript text, plus a speaker label on every segment in result.segments[] so you can reconstruct who said what.
Enabling it
Pass speaker_recognition: true in the request body. It defaults to false, so you only pay the extra processing time when you actually need it.
curl -X POST https://api.quillhub.ai/v1/transcriptions \
-H "Authorization: Bearer $QAI_KEY" \
-H "Content-Type: application/json" \
-d '{
"url": "https://example.com/interview.mp3",
"speaker_recognition": true
}'
Response shape
With diarization enabled, each entry in result.segments[] gains a speaker field alongside start, end, and text. The top-level result.text is unchanged — it's still the plain concatenated transcript — so group by speaker using result.segments.
{
"id": "trs_01HZX9K7Q2M4YV8BTA6JRN3PDE",
"status": "completed",
"result": {
"text": "Welcome to the show. Thanks for having me. So let's start with your background. Sure, I grew up in Boston and studied computer science.",
"segments": [
{ "start": 0.00, "end": 2.41, "text": "Welcome to the show.", "speaker": "Speaker 1" },
{ "start": 2.48, "end": 4.12, "text": "Thanks for having me.", "speaker": "Speaker 2" },
{ "start": 4.20, "end": 7.05, "text": "So let's start with your background.", "speaker": "Speaker 1" },
{ "start": 7.15, "end": 9.88, "text": "Sure, I grew up in Boston", "speaker": "Speaker 2" },
{ "start": 9.92, "end": 12.60, "text": "and studied computer science.", "speaker": "Speaker 2" }
]
}
}
Grouping by speaker
Adjacent segments from the same speaker usually form a single turn. A small client-side helper collapses them into a clean list of turns for rendering:
// segments: [{ start, end, text, speaker }, ...]
function groupBySpeaker(segments) {
const groups = [];
for (const seg of segments) {
const last = groups[groups.length - 1];
if (last && last.speaker === seg.speaker) {
last.lines.push(seg.text);
} else {
groups.push({ speaker: seg.speaker, lines: [seg.text] });
}
}
return groups;
}
const turns = groupBySpeaker(result.segments);
Accuracy and limits
- Works best on clear audio with distinct voices — interviews, podcasts, meetings with separate mics.
- Accuracy drops on heavy overlap, cross-talk, background noise, or low-bitrate audio.
- Labels are generic placeholders ("Speaker 1", "Speaker 2", …) — you map them to real names yourself.
- Labels are consistent within one transcription but not across different ones. Speaker 1 in file A is unrelated to Speaker 1 in file B.
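Since the API only returns generic labels, mapping them to real names is a client-side step. A minimal sketch, assuming you already know who each speaker is (say, from interview metadata) — the name map and `relabel` helper are hypothetical, not part of the API:

```javascript
// Hypothetical mapping from the API's generic labels to known names.
const names = { "Speaker 1": "Host", "Speaker 2": "Guest" };

// Relabel each segment, falling back to the generic label for any
// speaker missing from the map.
function relabel(segments, names) {
  return segments.map((seg) => ({
    ...seg,
    speaker: names[seg.speaker] ?? seg.speaker,
  }));
}

// Example: the first two segments from the response above.
const segments = [
  { start: 0.0, end: 2.41, text: "Welcome to the show.", speaker: "Speaker 1" },
  { start: 2.48, end: 4.12, text: "Thanks for having me.", speaker: "Speaker 2" },
];
const named = relabel(segments, names);
// named[0].speaker === "Host", named[1].speaker === "Guest"
```

Because labels are not stable across transcriptions, build a fresh map per file rather than reusing one globally.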
Cost
Diarization adds roughly 10–15% processing time and is not billed separately — the transcription cost is the same whether the flag is on or off.