Languages

QuillAI transcribes ~98 languages and auto-detects by default. Pass a language hint only when you need to override the detector — for short clips, noisy audio, or edge cases where it guesses wrong.

How language detection works

Omit the language field and the model identifies the spoken language from the audio itself, then returns the ISO-639-1 code it settled on in the language field of the Transcription object. No extra call, no latency penalty — detection runs inline.

Short clips are the weak spot. Auto-detection needs roughly 15 seconds of speech to lock onto a language reliably. For anything shorter — voicemails, jingles, one-line prompts — pass language explicitly to avoid misdetection.

Forcing a language

Pass an ISO-639-1 code (two letters, lowercase) in the language field of your POST /v1/transcriptions body. The model skips detection and transcribes under the language you specified.

force-language.sh
curl -X POST https://api.quillhub.ai/v1/transcriptions \
  -H "Authorization: Bearer $QAI_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://youtu.be/dQw4w9WgXcQ",
    "language": "en"
  }'

The response echoes your value back in language, so downstream code can treat the field the same way whether detection ran or not.
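Because the field is populated either way, downstream code needs only one path. A minimal sketch — the response dicts here are illustrative stand-ins, not real API output:

```python
def route_by_language(transcription: dict) -> str:
    """Return the ISO-639-1 code to key downstream processing on.
    Works identically whether detection ran or a language was forced."""
    return transcription["language"]


# Illustrative Transcription objects (shape assumed for this sketch):
detected = {"id": "tr_1", "text": "hola", "language": "es"}   # detection ran
forced   = {"id": "tr_2", "text": "hello", "language": "en"}  # language was forced
```

`route_by_language(detected)` and `route_by_language(forced)` both return a two-letter code, so routing, indexing, or analytics code never needs to know which mode produced it.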

Supported languages

Around 98 languages are supported end-to-end. Quality varies by tier — top-tier languages are production-ready; long-tail languages work but may need light post-editing. A representative sample is below; see the API reference for the full list.

Code   Name                 Tier
en     English              Top
es     Spanish              Top
fr     French               Top
de     German               Top
ru     Russian              Top
pt     Portuguese           Top
it     Italian              Top
zh     Chinese (Mandarin)   Top
ja     Japanese             Top
ko     Korean               Top
nl     Dutch                Standard
pl     Polish               Standard
tr     Turkish              Standard
hi     Hindi                Standard
vi     Vietnamese           Long-tail

Most European and common South/Southeast Asian languages fall into the Standard tier. Long-tail coverage extends to regional languages with lower training volume — transcription works, but expect occasional accuracy drops.
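It can be worth validating a language hint client-side before spending a request on it. A sketch, assuming only the sample table above — the `SAMPLE_SUPPORTED` set is deliberately incomplete, and the full list lives in the API reference:

```python
import re

# Codes from the sample table above only; see the API reference for all ~98.
SAMPLE_SUPPORTED = {
    "en", "es", "fr", "de", "ru", "pt", "it", "zh", "ja", "ko",
    "nl", "pl", "tr", "hi", "vi",
}


def validate_language(code: str) -> str:
    """Return a normalized two-letter ISO-639-1 code, or raise ValueError."""
    normalized = code.strip().lower()
    if not re.fullmatch(r"[a-z]{2}", normalized):
        raise ValueError(f"expected a two-letter ISO-639-1 code, got {code!r}")
    if normalized not in SAMPLE_SUPPORTED:
        raise ValueError(f"{normalized!r} is not in the sample supported set")
    return normalized
```

This catches the two common mistakes early: passing a full language name ("english") instead of a code, and passing uppercase or padded input.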

Mixed-language audio

If a file contains more than one language, auto-detection picks the dominant one and transcribes the entire file under that single language. There is no per-segment language switching — words in the minority language will be transliterated or approximated by the dominant model.

If you need clean output for each language, split the audio on silence or speaker boundaries and submit each segment as a separate transcription with an explicit language.
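The split-and-resubmit approach reduces to fanning out one request per segment. A sketch under stated assumptions: the segment URLs are hypothetical, and the actual splitting (on silence or speaker turns) happens upstream with your own audio tooling.

```python
def jobs_for_segments(segments: list[tuple[str, str]]) -> list[dict]:
    """Turn (segment_url, language) pairs into one request body each,
    with the language forced so no per-segment detection is needed."""
    return [{"url": url, "language": lang} for url, lang in segments]


# Hypothetical pre-split call recording: English first half, French second.
segments = [
    ("https://example.com/call-part1.wav", "en"),
    ("https://example.com/call-part2.wav", "fr"),
]
```

Each body then goes to POST /v1/transcriptions as usual, and you reassemble the transcripts in segment order on your side.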

Accuracy tips

  • Force language on clips shorter than ~15 seconds — auto-detect does not have enough signal.
  • Clean input helps more than any parameter. Mono voice tracks at 16 kHz+ with minimal background music consistently outperform noisy stereo mixes.
  • Brand names, product names, and uncommon acronyms sometimes land phonetically. Plan a post-edit pass or a find-and-replace step for domain-specific vocabulary.
  • Speaker diarization and language forcing combine without issue — turn both on in the same request when you need labelled speakers in a known language.
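For the last tip, the combined request is just two fields in one body. A sketch only — the diarization flag name (`diarize`) is an assumption here, so check the API reference for the real parameter:

```python
def labelled_speakers_request(url: str, language: str) -> dict:
    """Request body combining a forced language with diarization.
    The "diarize" key is assumed for illustration, not confirmed."""
    return {"url": url, "language": language, "diarize": True}
```

The two options are independent: forcing the language constrains transcription, while diarization only adds speaker labels, so enabling both in one request changes neither behavior.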