Timestamps
Every transcript is anchored in time. Here's how QuillAI exposes those anchors so you can jump, highlight, caption, and sync with the original audio or video.
Segments
The primary unit is result.segments — an array of short, phrase-level chunks (typically 2–15 seconds each). Every segment carries a start, an end, and its text. When speaker recognition is enabled, a speaker label is included too.
{
  "segments": [
    { "start": 0.0, "end": 3.84, "text": "Welcome back to the channel." },
    { "start": 3.84, "end": 7.12, "text": "Today we're talking about timestamps." },
    { "start": 7.12, "end": 12.48, "text": "They're the backbone of every transcript.", "speaker": "Speaker 1" },
    { "start": 12.48, "end": 18.02, "text": "Let's dig into how they work.", "speaker": "Speaker 1" }
  ]
}
Units and precision
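The sample values in this section (3.84, 7.12, and so on) are floating-point offsets in seconds. As a minimal sketch, the helper below turns such an offset into a clock-style HH:MM:SS.mmm label, the same shape used in WebVTT cue timings. The name format_timestamp is illustrative and not part of the QuillAI API; rounding to whole milliseconds is an assumption made here, not a documented precision guarantee.

```python
def format_timestamp(seconds: float) -> str:
    """Render an offset in seconds as HH:MM:SS.mmm."""
    # Convert to whole milliseconds first so float noise cannot leak into the label.
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{ms:03d}"


# Example: label each segment with its start time.
segments = [
    {"start": 0.0, "end": 3.84, "text": "Welcome back to the channel."},
    {"start": 3.84, "end": 7.12, "text": "Today we're talking about timestamps."},
]
labels = [f"[{format_timestamp(s['start'])}] {s['text']}" for s in segments]
```

Here labels[1] would read "[00:00:03.840] Today we're talking about timestamps.", which is handy for transcript views that show a clickable time next to each line.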
Paragraphs
When you pass structure: true, QuillAI also groups segments into readable paragraphs under result.structured.paragraphs. Each paragraph spans multiple segments and has its own start / end boundaries — useful for a chaptered view, a readable article export, or anchoring summaries.
{
  "structured": {
    "paragraphs": [
      {
        "start": 0.0,
        "end": 42.7,
        "text": "Welcome back to the channel. Today we're talking about timestamps..."
      }
    ]
  }
}
Subtitles
result.subtitles.vtt and result.subtitles.srt are presigned URLs to the generated caption files. Plug them straight into a <track> element, a video player, or an editor — no need to base64-decode or reformat.
WEBVTT

00:00:00.000 --> 00:00:03.840
Welcome back to the channel.

00:00:03.840 --> 00:00:07.120
Today we're talking about timestamps.

00:00:07.120 --> 00:00:12.480
They're the backbone of every transcript.
Seeking to a moment
To jump to a specific point, just use the start of the segment you care about. Since it's already in seconds, it plugs directly into HTML5 media and most embeds.
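As a rough sketch of both directions (from a segment to a seek target, and from a playback time back to a segment), the snippet below uses two illustrative helpers that are not part of the QuillAI API. seek_url appends a temporal media fragment (#t=<seconds>, from the W3C Media Fragments URI syntax) to a media URL; whether a given player or embed honors it is something to verify. segment_at assumes the segments are sorted by start and do not overlap, as in the sample above.

```python
from bisect import bisect_right

segments = [
    {"start": 0.0, "end": 3.84, "text": "Welcome back to the channel."},
    {"start": 3.84, "end": 7.12, "text": "Today we're talking about timestamps."},
    {"start": 7.12, "end": 12.48, "text": "They're the backbone of every transcript."},
    {"start": 12.48, "end": 18.02, "text": "Let's dig into how they work."},
]


def seek_url(media_url: str, segment: dict) -> str:
    """Build a link that starts playback at the segment's start (in seconds)."""
    return f"{media_url}#t={segment['start']}"


def segment_at(segments: list, t: float):
    """Return the segment whose [start, end) range contains t, or None.

    Assumes segments are sorted by start and non-overlapping.
    """
    starts = [s["start"] for s in segments]
    i = bisect_right(starts, t) - 1  # last segment starting at or before t
    if i >= 0 and t < segments[i]["end"]:
        return segments[i]
    return None


link = seek_url("https://example.com/video.mp4", segments[1])
current = segment_at(segments, 5.0)  # the segment playing at 5.0 s
```

In a browser, the same start value can be assigned directly to a media element's currentTime; segment_at is the inverse lookup you would run on each time update to highlight the active line.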
Word-level timestamps
If you need a rough approximation of word-level timing, you can split a segment's text proportionally by character count across its [start, end] range: each word gets a share of the segment's duration proportional to its length. It's not perfectly accurate, but it's good enough for karaoke-style highlighting or word-by-word scroll.
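A minimal sketch of that proportional split follows. The function name approximate_word_times and the output shape are illustrative choices, not part of the QuillAI API; whitespace is ignored when counting characters, and punctuation counts toward the word it is attached to.

```python
def approximate_word_times(segment: dict) -> list:
    """Estimate per-word start/end times from a segment's text and time range.

    Each word receives a share of the segment duration proportional to its
    character count. This is an approximation, not real word alignment.
    """
    words = segment["text"].split()
    total_chars = sum(len(w) for w in words)
    if total_chars == 0:
        return []

    duration = segment["end"] - segment["start"]
    timed = []
    consumed = 0  # characters assigned to earlier words
    for word in words:
        word_start = segment["start"] + duration * consumed / total_chars
        consumed += len(word)
        word_end = segment["start"] + duration * consumed / total_chars
        timed.append({
            "word": word,
            "start": round(word_start, 3),
            "end": round(word_end, 3),
        })
    return timed


words = approximate_word_times(
    {"start": 0.0, "end": 3.84, "text": "Welcome back to the channel."}
)
```

For this segment the 24 non-space characters share 3.84 seconds, so each character is worth 0.16 s and "Welcome" (7 characters) spans roughly 0.0 to 1.12. Because each word ends where the next begins, the estimates tile the segment without gaps, which keeps highlighting continuous even when the individual boundaries are off.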