Teams are shipping more video than ever, and manual subtitling does not scale. In a typical week, you might edit a batch of clips, export deliverables for multiple platforms, and then scramble to produce subtitles and captions in several languages. This is a perfect use case for an automated pipeline.
I’m building a workflow for my team and want to minimize manual work on accessibility. How to automate video subtitles and captions? Ideally I want to generate text from audio, create SRT and VTT, optionally translate, and then either ship sidecar tracks or embed them in the video. I’m looking for a vendor-neutral approach using common tools like FFmpeg, plus tips for hosting and delivery. Any examples for batch processing would be great.
You can automate captions using a three-stage pipeline: transcribe, format, and deliver. Start with speech-to-text, convert results to subtitle formats, then attach them to your videos for players and platforms that support closed captions. Here is a battle-tested approach.
Use an automatic speech recognition engine to generate time-coded text. Many teams use local or cloud ASR. A lightweight example with the open source Whisper CLI:
# 1. Install Whisper and FFmpeg (example: pip install openai-whisper; ffmpeg from your package manager)
# 2. Generate subtitles for a single video
#    (--output_format accepts one of txt, vtt, srt, tsv, json, or all;
#     use "all" to get both SRT and VTT in one pass)
whisper input.mp4 --task transcribe --language en --model small --output_format all --output_dir subs

# Batch process all MP4s in a folder
for f in *.mp4; do
  whisper "$f" --task transcribe --language en --model small --output_format all --output_dir subs
done
Tips:
- Preprocess audio for accuracy: downmix to mono, normalize levels, and trim long silences.
- Keep lines under 42 characters and target a reading speed of about 15 characters per second (cps).
- Use captions (which include non-speech cues such as [music] or [applause]) for accessibility; use subtitles when you only need dialog.
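The line-length and reading-speed limits above are easy to enforce mechanically. A minimal Python sketch of a per-cue check (the function name and cue representation are illustrative, not from any particular library):

```python
def check_cue(text, start_s, end_s, max_chars=42, max_cps=15.0):
    """Flag subtitle cues that break line-length or reading-speed limits."""
    problems = []
    for line in text.splitlines():
        if len(line) > max_chars:
            problems.append(f"line too long ({len(line)} > {max_chars} chars)")
    duration = end_s - start_s
    cps = len(text.replace("\n", "")) / duration if duration > 0 else float("inf")
    if cps > max_cps:
        problems.append(f"reading speed too high ({cps:.1f} > {max_cps} cps)")
    return problems

# A 66-character cue displayed for 3 seconds: flags both line length and reading speed
print(check_cue("This caption line is deliberately much too long to fit comfortably", 0.0, 3.0))
```

Run a check like this over every cue in your batch job and fail the build (or just log warnings) when limits are exceeded.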
You will primarily use SRT and WebVTT. SRT is widely supported, while VTT is the native format for HTML5 tracks. If your ASR outputs only one format, convert as needed. You can use FFmpeg and text tools in your scripts.
# Normalize encoding: strip the BOM (if any) and convert to Unix line endings
dos2unix --remove-bom subs/input.en.srt
# Drop any bytes that are not valid UTF-8
iconv -f utf-8 -t utf-8 -c subs/input.en.srt > subs/input.en.clean.srt

# Optional: translate subtitles in your pipeline, then output per-locale files
# subs/input.es.vtt, subs/input.fr.vtt, etc.
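If your ASR emits only SRT, the conversion to VTT is mechanical: prepend a `WEBVTT` header and switch the timestamp decimal separator from a comma to a dot. A minimal Python sketch (FFmpeg does the same with `ffmpeg -i subs/input.en.srt subs/input.en.vtt` if you prefer to stay in shell):

```python
import re

def srt_to_vtt(srt_text: str) -> str:
    """Convert SRT to WebVTT: prepend the header, use '.' in timestamps."""
    ts = re.compile(r"(\d{2}:\d{2}:\d{2}),(\d{3})")
    body = ts.sub(r"\1.\2", srt_text)  # only touches HH:MM:SS,mmm patterns
    return "WEBVTT\n\n" + body

srt = """1
00:00:01,000 --> 00:00:03,500
Hello, world.
"""
print(srt_to_vtt(srt))
```

Cue numbers are legal in WebVTT (they become cue identifiers), so they can be left in place.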
Most distribution stacks prefer sidecar VTT because it enables on-off toggling and keeps the video master clean. For web playback, use HTML5 tracks:
<video controls crossorigin="anonymous" width="800">
  <source src="videos/intro.mp4" type="video/mp4">
  <track src="subs/intro.en.vtt" kind="subtitles" srclang="en" label="English" default>
  <track src="subs/intro.es.vtt" kind="subtitles" srclang="es" label="Español">
</video>
If you must embed captions in MP4, use FFmpeg with mov_text:
# Embed SRT as a mov_text subtitle track in MP4, keeping video and audio as-is
ffmpeg -i input.mp4 -i subs/input.en.srt -map 0:v -map 0:a -map 1:s \
  -c:v copy -c:a copy -c:s mov_text -metadata:s:s:0 language=eng output-with-cc.mp4
Embedded MP4 captions are widely supported across devices. If you are delivering HLS or DASH, keep captions as WebVTT sidecars, in line with streaming best practices.
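For reference, HLS declares sidecar captions in the master playlist with an `EXT-X-MEDIA` entry that each variant stream references by group ID. A sketch, with placeholder URIs and bandwidth values:

```
#EXTM3U
#EXT-X-MEDIA:TYPE=SUBTITLES,GROUP-ID="subs",NAME="English",LANGUAGE="en",DEFAULT=YES,AUTOSELECT=YES,URI="subs/en/index.m3u8"
#EXT-X-STREAM-INF:BANDWIDTH=2000000,RESOLUTION=1280x720,SUBTITLES="subs"
video/720p/index.m3u8
```

The subtitle rendition itself is a small media playlist whose segments are the WebVTT files, so the same sidecar VTTs feed both web playback and streaming delivery.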
Before publishing, run a quick QA pass:
- Lint for SRT numbering errors and overlapping cues.
- Normalize punctuation, fix capitalization, and label speakers where needed.
- Store per-locale files using a consistent pattern.
- Treat subtitle files as build artifacts and publish them alongside the video.
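The numbering and overlap checks in the list above can be scripted. A minimal Python linter sketch (assumes well-formed `HH:MM:SS,mmm` SRT timestamps and blank-line-separated cue blocks):

```python
import re

TS = re.compile(r"(\d{2}):(\d{2}):(\d{2}),(\d{3}) --> (\d{2}):(\d{2}):(\d{2}),(\d{3})")

def to_ms(h, m, s, ms):
    return ((int(h) * 60 + int(m)) * 60 + int(s)) * 1000 + int(ms)

def lint_srt(text):
    """Return a list of issues: bad cue numbering or overlapping cues."""
    issues, expected, prev_end = [], 1, -1
    for block in text.strip().split("\n\n"):
        lines = block.splitlines()
        if not lines[0].strip().isdigit() or int(lines[0]) != expected:
            issues.append(f"cue {expected}: bad or missing index '{lines[0]}'")
        m = TS.match(lines[1]) if len(lines) > 1 else None
        if m:
            start, end = to_ms(*m.groups()[:4]), to_ms(*m.groups()[4:])
            if start < prev_end:
                issues.append(f"cue {expected}: overlaps previous cue")
            prev_end = end
        else:
            issues.append(f"cue {expected}: missing timestamp line")
        expected += 1
    return issues

good = "1\n00:00:01,000 --> 00:00:02,000\nHi\n\n2\n00:00:02,500 --> 00:00:04,000\nThere\n"
print(lint_srt(good))  # → []
```

Wire this into CI so a bad subtitle file fails the build the same way a broken asset would.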
After you generate SRT and VTT, you can upload and deliver everything from a single place. For example, upload your video and subtitles, then play them with standard HTML5 tracks using Cloudinary delivery URLs.
// Node.js example (ESM; top-level await requires "type": "module")
import { v2 as cloudinary } from "cloudinary";

cloudinary.config({
  cloud_name: "<cloud_name>",
  api_key: "<api_key>",
  api_secret: "<api_secret>" // in production, read credentials from environment variables
});

// Upload the video
await cloudinary.uploader.upload("input.mp4", {
  resource_type: "video",
  public_id: "demos/intro"
});

// Upload the VTT as a raw asset (for raw files, the extension is part of the public_id)
await cloudinary.uploader.upload("subs/input.en.vtt", {
  resource_type: "raw",
  public_id: "demos/intro.en.vtt"
});
<video controls crossorigin="anonymous" width="800">
  <source src="https://res.cloudinary.com/<cloud_name>/video/upload/demos/intro.mp4" type="video/mp4">
  <track src="https://res.cloudinary.com/<cloud_name>/raw/upload/demos/intro.en.vtt"
         kind="subtitles" srclang="en" label="English" default>
</video>
If you need format conversions in your toolchain, Cloudinary’s tools can speed up quick tests and one-off conversions while your pipeline handles batch work.
Troubleshooting common issues:
- No captions in MP4 after embedding: ensure the subtitle stream is encoded with -c:s mov_text.
- Subtitles out of sync: align the ASR timestamps or re-time the cues with an offset.
- Tracks not showing in HTML5: confirm the srclang attribute is set and the file encoding is UTF-8.
Automate captions with an ASR step, output SRT and VTT, then deliver as sidecar tracks for HTML5 or embed them with FFmpeg where required. Validate cues, store per-locale files consistently, and host both video and subtitles from a single origin for reliable delivery.
Ready to streamline captioning and delivery across your stack? Sign up for Cloudinary free and start managing your videos and subtitle tracks in one place.