Auto Captions API

POST /captions/auto

About this tool

Accessibility, SEO, and international distribution all depend on timed text, so `/captions/auto` combines speech-recognition models with FFmpeg packaging to generate subtitles directly from your `video_url`. One JSON field is enough to start: we fetch the media, decode the audio, infer a transcript, and return caption files through the standard async task channel. Multipart uploads are not supported; your subscription key authorizes every request, and source media must be reachable over public HTTPS. Document your audience's reading speed alongside the integration, because caption pacing (karaoke-style display in particular) still needs human adjustment even when the timestamps are mechanically sound.

Caption quality tracks audio clarity more than video resolution. Noisy stadium footage needs upstream denoising or manual transcription; otherwise word error rates climb even with strong models. Legal teams should treat auto captions as drafts for high-stakes content and have a human review them before publication. The FFmpeg muxing modes (sidecar versus embedded) are documented under the matching anchor in `/api-docs`.

Downstream pipelines often chain `/captions/auto` immediately after `/cut` or `/redub` so that caption timings reference the final audio. Observability dashboards track caption queue depth separately from mux-only jobs because speech models consume GPU batches. Burst responsibly: batch overnight when launching entire course catalogs.

Compared with SaaS editors built around manual timelines, API automation trades UI polish for repeatable JSON contracts. Attach correlation IDs in your logging wrappers so customer support can trace a learner complaint back to a single task ID, and annotate player manifests with FFmpeg export tags so CDN edge rules pick the subtitle variant that matches the caption job's UTC timestamp.
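One way to carry correlation IDs through a logging wrapper is sketched here in Python. The logger name `captions`, the `TaskLogAdapter` class, and the `task_logger` helper are illustrative assumptions and not part of the API; only the idea of tying every log line to a single task ID comes from the paragraph above.

```python
import logging
import uuid
from typing import Optional


class TaskLogAdapter(logging.LoggerAdapter):
    """Prefixes every message with the correlation ID and the caption task ID."""

    def process(self, msg, kwargs):
        prefix = f"[corr={self.extra['correlation_id']} task={self.extra['task_id']}]"
        return f"{prefix} {msg}", kwargs


def task_logger(task_id: str, correlation_id: Optional[str] = None) -> TaskLogAdapter:
    # Hypothetical helper: generate a correlation ID when the caller has none.
    extra = {
        "task_id": task_id,
        "correlation_id": correlation_id or uuid.uuid4().hex,
    }
    return TaskLogAdapter(logging.getLogger("captions"), extra)


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    log = task_logger("task-123", "corr-abc")
    log.info("caption job queued")
```

Keeping the IDs in the adapter rather than in each message means support staff can search logs by task ID without every call site having to remember to include it.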

Security posture mirrors the other tools: never embed API keys in browser extensions without tight scoping, rotate secrets quarterly, and redact captions before sharing diagnostics externally, since they may contain PII. Higher-education LMS integrators routinely split hour-long seminars into queued `/cut` chunks before captions run, which shortens per-task run time and allows cheap partial retries when one segment fails. Cover locale and jargon vocabularies in onboarding docs; even strong acoustic models stumble on acronym-heavy engineering lectures.
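The chunk-and-retry pattern can be planned client-side before any request is sent. The sketch below only computes segment boundaries and picks which segments to resubmit; it does not call `/cut`, whose parameters are not described here. The function names and the `"completed"` state label are assumptions made for illustration.

```python
from typing import Dict, List, Tuple

Chunk = Tuple[int, int]  # (start_second, end_second)


def plan_chunks(total_s: int, chunk_s: int) -> List[Chunk]:
    """Return (start, end) offsets in seconds that cover the whole recording."""
    if total_s <= 0 or chunk_s <= 0:
        raise ValueError("durations must be positive")
    return [
        (start, min(start + chunk_s, total_s))
        for start in range(0, total_s, chunk_s)
    ]


def chunks_to_retry(chunks: List[Chunk], states: Dict[Chunk, str]) -> List[Chunk]:
    """Keep only chunks whose caption task did not finish (assumed state name)."""
    return [c for c in chunks if states.get(c) != "completed"]


if __name__ == "__main__":
    seminar = plan_chunks(total_s=3600, chunk_s=900)  # one hour, 15-minute chunks
    states = {seminar[0]: "completed", seminar[1]: "failed"}
    print(chunks_to_retry(seminar, states))
```

Because each chunk carries its start offset, caption timestamps from a retried segment can be shifted back onto the full seminar timeline without touching the segments that already succeeded.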

How it works

  1. Verify the `video_url`

    Confirm that the host allows TLS fetches without cookies and that the file contains an audio track; videos without audio fail validation quickly.

  2. POST `/captions/auto`

    Submit JSON with your API key. The response includes a `status_url`; the request body stays small because no binary uploads occur.

  3. Poll the model and mux stages

    Track the asynchronous task states until FFmpeg finishes packaging the captions in the format specified in the documentation.

  4. Deliver captions to players

    Download sidecars or embedded outputs, register them in your CMS, and optionally translate before publication.
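The four steps above can be sketched as a small client. Only `video_url`, `/captions/auto`, and `status_url` come from this page; the base URL, the `X-API-Key` header, the `state` field, and the `completed` and `failed` values are assumptions for illustration, so check `/api-docs` for the real names. The HTTP call is injected as `fetch_status` so the sketch stays independent of any particular HTTP library.

```python
import json
import time
from typing import Any, Callable, Dict

BASE_URL = "https://api.example.com"  # hypothetical base URL


def build_caption_request(video_url: str, api_key: str) -> Dict[str, Any]:
    """Steps 1 and 2: validate the source URL and describe the JSON POST."""
    if not video_url.startswith("https://"):
        raise ValueError("video_url must be a public HTTPS URL")
    return {
        "method": "POST",
        "url": f"{BASE_URL}/captions/auto",
        "headers": {
            "Content-Type": "application/json",
            "X-API-Key": api_key,  # assumed header name
        },
        "body": json.dumps({"video_url": video_url}),
    }


def wait_for_captions(
    status_url: str,
    fetch_status: Callable[[str], Dict[str, Any]],
    interval_s: float = 5.0,
    max_polls: int = 120,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, Any]:
    """Step 3: poll status_url until the task reaches a terminal state."""
    for _ in range(max_polls):
        status = fetch_status(status_url)
        state = status.get("state")  # assumed field name
        if state == "completed":     # assumed terminal state
            return status            # step 4: caller downloads the outputs
        if state == "failed":        # assumed terminal state
            raise RuntimeError(f"caption task failed: {status}")
        sleep(interval_s)
    raise TimeoutError(f"no terminal state after {max_polls} polls")
```

In a real integration, `fetch_status` would issue an authenticated GET against the `status_url` returned by the POST and parse the JSON response.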

Frequently asked questions

Which caption formats return?

Consult the parameter matrix under this endpoint's anchor in the documentation; common exports include WebVTT and SRT variants. Set the format flag explicitly when more than one format is allowed.

Can I pass separate audio?

This route is built around `video_url` inputs. If you have only isolated audio, wrap it in a video container or use another documented endpoint.

How accurate are transcripts?

Accuracy depends on language, noise, and accent diversity. Always budget human QC for compliance-sensitive programming.

Why async only?

Speech inference and FFmpeg muxing often run longer than comfortable HTTP timeouts. Async tasks let you scale polling clients independently.
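Polling clients usually space their requests out so that long-running tasks are not hit with constant traffic. The sketch below computes a capped exponential backoff schedule; the default values are illustrative assumptions, not limits published by the API.

```python
from typing import List


def backoff_delays(
    base_s: float = 2.0,
    factor: float = 2.0,
    cap_s: float = 30.0,
    attempts: int = 6,
) -> List[float]:
    """Delay before each poll: grows geometrically, never exceeds cap_s."""
    return [min(base_s * factor ** i, cap_s) for i in range(attempts)]


if __name__ == "__main__":
    print(backoff_delays())  # [2.0, 4.0, 8.0, 16.0, 30.0, 30.0]
```

With these defaults a client waits at most 90 seconds across six polls, and the cap keeps later polls from drifting too far apart on long tasks.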

Do subscriptions cap minutes?

Yes. Tiers meter processed seconds, so check your dashboard before running an entire broadcast archive in one day.
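Because usage is metered in processed seconds, a large archive can be split into batches that each stay under a budget you choose. This is a minimal sketch under assumed numbers: the daily budget is a figure you set yourself, not a documented quota, and `plan_daily_batches` is a hypothetical helper.

```python
from typing import List


def plan_daily_batches(durations_s: List[int], daily_budget_s: int) -> List[List[int]]:
    """Greedily group video durations so each batch fits the daily budget."""
    batches: List[List[int]] = []
    current: List[int] = []
    used = 0
    for duration in durations_s:
        if duration > daily_budget_s:
            raise ValueError(f"a {duration}s video exceeds the daily budget")
        if current and used + duration > daily_budget_s:
            batches.append(current)  # close today's batch, start tomorrow's
            current, used = [], 0
        current.append(duration)
        used += duration
    if current:
        batches.append(current)
    return batches


if __name__ == "__main__":
    archive = [3600, 1800, 2700, 900]  # seconds per video
    print(plan_daily_batches(archive, daily_budget_s=5400))
    # [[3600, 1800], [2700, 900]]
```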

Is multipart supported for audio uploads?

No. Host files at HTTPS URLs and reference them in JSON, as with every Droid Apps FFmpeg tool.