Google AI Video Transcription

Last updated: Apr-18-2024

Cloudinary is a cloud-based service that provides an end-to-end image and video management solution including uploads, storage, transformations, optimizations and delivery. Cloudinary's video solution includes a rich set of video transformation capabilities, including cropping, overlays, optimizations, and a large variety of special effects.

With the Google AI Video Transcription add-on, you can automatically generate speech-to-text transcripts of videos that you or your users upload to your product environment. The add-on applies powerful neural network models to your videos using Google's Cloud Speech API to get the best possible speech recognition results. The add-on supports transcribing videos in almost any language.

You can parse the contents of the returned transcript file to display the transcript of your video on your page, making your content more skimmable, accessible, and SEO-friendly.

When you deliver the videos, it only takes a single URL parameter to automatically insert the generated transcript into your video in the form of subtitles, which are exactly aligned to the timing of each spoken word. Alternatively, you can specify the (optionally) returned vtt or srt file as a video track so that users can toggle the subtitles on or off.

[Figure: video with automatic transcript and subtitles]

Getting started

Before you can use the Google AI Video Transcription add-on:

  • You must have a Cloudinary account. If you don't already have one, you can sign up for a free account.

  • Register for the add-on: make sure you're logged in to your account and then go to the Add-ons page. For more information about add-on registrations, see Registering for add-ons.

  • Keep in mind that many of the examples on this page use our SDKs. For SDK installation and configuration details, see the relevant SDK guide.

  • If you are new to Cloudinary, you may want to take a look at How to integrate Cloudinary in your app for a walkthrough of the basics: creating and setting up your account, working with SDKs, and then uploading, transforming, and delivering assets.

Requesting video transcription

To request a transcript for a video or audio file (in the default US English language), include the raw_convert parameter with the value google_speech in your upload or update call. (For other languages, see transcription languages below.)

For example:
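A minimal Python sketch of such an upload call. It assumes the Cloudinary Python SDK (the `cloudinary` package) is installed and configured with your credentials; the helper names and the file name `my_video.mp4` are illustrative and do not come from the SDK.

```python
def transcription_options(notification_url=None):
    """Upload options that request a transcript in the default US English."""
    options = {
        "resource_type": "video",        # video and audio files are uploaded as "video" assets
        "raw_convert": "google_speech",  # asks the add-on to transcribe the file
    }
    if notification_url:
        options["notification_url"] = notification_url
    return options


def upload_with_transcript(path, **extra):
    """Upload `path` and request a transcript. Requires the `cloudinary` package."""
    # Imported lazily so that transcription_options() works without the SDK installed.
    import cloudinary.uploader

    options = {**transcription_options(), **extra}
    return cloudinary.uploader.upload(path, **options)


# Example call (needs valid credentials, so it is left commented out):
# response = upload_with_transcript("my_video.mp4")
```

Keeping the options in a separate function makes it easy to reuse the same settings in an update call or an upload preset.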

You can use upload presets to centrally define a set of upload options including add-on operations to apply, instead of specifying them in each upload call. You can define multiple upload presets, and apply different presets in different upload scenarios. You can create new upload presets in the Upload page of the Console Settings or using the upload_presets Admin API method. From the Upload page of the Console Settings, you can also select default upload presets to use for image, video, and raw API uploads (respectively) as well as default presets for image, video, and raw uploads performed via the Media Library UI.

Learn more: Upload presets

The google_speech parameter value activates a call to Google's Cloud Speech API, which is performed asynchronously after your original method call is completed. Thus your original method call response displays a pending status:
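As a rough illustration, the status can be read from the parsed response. The nested `info` → `raw_convert` → `google_speech` → `status` layout used here is an assumption about the response shape and should be checked against an actual response from your account; the sample dictionary is hypothetical.

```python
def transcription_status(response):
    """Return the add-on status from an upload/update response, or None if absent."""
    return (
        response.get("info", {})
        .get("raw_convert", {})
        .get("google_speech", {})
        .get("status")
    )


# Hypothetical response fragment, trimmed to the relevant keys.
sample_response = {
    "public_id": "my_video",
    "info": {"raw_convert": {"google_speech": {"status": "pending"}}},
}

print(transcription_status(sample_response))  # pending
```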

When the google_speech request is complete (which may take several seconds or minutes, depending on the length of the video), a new raw file is created in your product environment with the same public ID as your video or audio file and with the .transcript file extension. You can additionally request a standard subtitle format such as vtt or srt.

If you also provided a notification_url in your method call, the specified URL then receives a notification when the process completes:

Transcription languages

If your video/audio file is in a language other than US English, you can request transcription in the relevant language and (optionally) region/dialect.

For example, to request a video transcript in Canadian French when uploading the video abt_cloudinary_french.mp4:
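A sketch of the options for this request in Python, assuming they are passed to the Python SDK's `cloudinary.uploader.upload` together with the file path. The helper name is illustrative.

```python
def google_speech_value(language=None):
    """Build the raw_convert value, optionally with a language or language-region code."""
    return "google_speech" if language is None else f"google_speech:{language}"


# Options for uploading abt_cloudinary_french.mp4 with a Canadian French transcript.
options = {
    "resource_type": "video",
    "raw_convert": google_speech_value("fr-CA"),
}

# With the SDK installed and configured, these could be passed as:
#   cloudinary.uploader.upload("abt_cloudinary_french.mp4", **options)
print(options["raw_convert"])  # google_speech:fr-CA
```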

You can specify either just the two-character language code or the full language and region code. For a full list of supported language and region codes, see the Google Cloud speech-to-text language support list.

Cloudinary transcript files

The created .transcript file includes details of the audio transcription, for example:

Each excerpt of text has a confidence value, and is followed by a breakdown of individual words and their specific start and end times.
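A minimal sketch of reading such a file in Python. It assumes the .transcript file is a JSON array of excerpts, each with `transcript`, `confidence`, and a `words` list whose items carry `word`, `start_time`, and `end_time` (in seconds). These key names and the sample data are assumptions to verify against a file from your own environment.

```python
import json


def excerpt_lines(transcript_json):
    """Return one '[start-end] text' line per excerpt in the transcript."""
    lines = []
    for excerpt in json.loads(transcript_json):
        words = excerpt.get("words", [])
        if not words:
            continue  # no timing information for this excerpt
        start = words[0]["start_time"]
        end = words[-1]["end_time"]
        lines.append(f"[{start:.1f}s-{end:.1f}s] {excerpt['transcript']}")
    return lines


# Hypothetical transcript content in the assumed layout.
sample = json.dumps([
    {
        "transcript": "Hello world",
        "confidence": 0.93,
        "words": [
            {"word": "Hello", "start_time": 0.0, "end_time": 0.4},
            {"word": "world", "start_time": 0.4, "end_time": 0.9},
        ],
    }
])

print(excerpt_lines(sample))  # ['[0.0s-0.9s] Hello world']
```

Lines like these can be rendered next to the player to make the video content skimmable and searchable.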

Subtitle length and confidence levels

Google returns transcript excerpts of varying lengths. When displaying subtitles, long excerpts are automatically divided into 20-word entities and displayed on two lines.

You can also optionally set a minimum confidence level for your subtitles, for example: l_subtitles:my-video-id.transcript:90. In this case, any excerpt that Google returns with a lower confidence value will be omitted from the subtitles. Keep in mind that in some cases, this may exclude several sentences at once.
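The splitting and filtering described above can be approximated offline, for instance when rendering the transcript on a page. This sketch is not Cloudinary's implementation: it assumes excerpts are dictionaries with a `confidence` between 0 and 1 and a `words` list, and it treats the threshold as a percentage like the `90` in the URL example.

```python
def subtitle_entries(excerpts, min_confidence=0, words_per_entry=20):
    """Drop low-confidence excerpts, then split the rest into entries of at most N words."""
    entries = []
    for excerpt in excerpts:
        if excerpt["confidence"] < min_confidence / 100:
            continue  # the whole excerpt is omitted, which may drop several sentences
        words = [w["word"] for w in excerpt["words"]]
        for i in range(0, len(words), words_per_entry):
            entries.append(" ".join(words[i:i + words_per_entry]))
    return entries


# Hypothetical data: one long, confident excerpt and one short, weak one.
long_excerpt = {"confidence": 0.95, "words": [{"word": f"w{i}"} for i in range(45)]}
weak_excerpt = {"confidence": 0.5, "words": [{"word": "noise"}]}

entries = subtitle_entries([long_excerpt, weak_excerpt], min_confidence=90)
print([len(e.split()) for e in entries])  # [20, 20, 5]
```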

Generating standard subtitle formats

If you want to include the transcript as a separate track for a video player, you can also request that Cloudinary create an SRT and/or WebVTT raw file by including the srt and/or vtt qualifiers (separated by a colon) with the google_speech value. For example, to upload a video and also request both srt and vtt files with the transcript:
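A sketch of building the combined value in Python. The helper name is illustrative; the string it returns would be passed as `raw_convert` in the upload or update call. If a language code is also supplied, the helper places it after the format qualifiers.

```python
ALLOWED_FORMATS = ("srt", "vtt")


def raw_convert_value(formats=(), language=None):
    """Join google_speech, the requested subtitle formats, and an optional language code."""
    parts = ["google_speech"]
    for fmt in formats:
        if fmt not in ALLOWED_FORMATS:
            raise ValueError(f"unsupported subtitle format: {fmt!r}")
        if fmt not in parts:  # ignore repeated formats
            parts.append(fmt)
    if language:
        parts.append(language)  # the language always comes last
    return ":".join(parts)


print(raw_convert_value(["srt", "vtt"]))           # google_speech:srt:vtt
print(raw_convert_value(["srt", "vtt"], "ar-SA"))  # google_speech:srt:vtt:ar-SA
```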

When the request completes, there will be four files associated with the uploaded video in your product environment: the video itself, the .transcript file, and the .srt and .vtt subtitle files.

  • If you also specify a language in the google_speech transcript request:
    - the format qualifiers must be given before the language code (e.g., google_speech:srt:vtt:ar-SA)
    - the transcript files will include the language and region code in the generated filename (e.g.,
  • While Google's speech recognition algorithm is very powerful, no speech recognition tool is 100% accurate. If exact accuracy is important for your video, you can download the generated .transcript, .srt, or .vtt files, edit them manually, and overwrite the original files.

    Important: Depending on your product environment setup, overwriting an asset may clear the tags, contextual, and structured metadata values for that asset. If you have a Master admin role, you can change this behavior for your product environment in the Media Library Preferences pane, so that these field values are retained when new version assets overwrite older ones (unless you specify different values for the tags, context, or metadata parameters as part of your upload).

Displaying transcripts as subtitle overlays

Cloudinary can automatically generate subtitles from the returned transcripts. To automatically embed subtitles in your video, add the subtitles property of the overlay parameter (l_subtitles in URLs), followed by the public ID of the raw transcript file (including the extension).

For example, the following URL delivers the public domain video of Lincoln's Gettysburg Address with automatically generated subtitles:
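A sketch of composing such a URL in Python. It assumes the usual `https://res.cloudinary.com/<cloud_name>/video/upload/...` delivery URL layout; the cloud name `demo`, the public ID `lincoln`, and the helper name are illustrative assumptions. The optional last argument is the minimum confidence level described earlier.

```python
def subtitled_video_url(cloud_name, public_id, video_format="mp4", min_confidence=None):
    """Build a delivery URL that overlays the generated transcript as subtitles."""
    # Public IDs that include folders are not handled by this sketch.
    overlay = f"l_subtitles:{public_id}.transcript"
    if min_confidence is not None:
        overlay += f":{min_confidence}"
    return (
        f"https://res.cloudinary.com/{cloud_name}/video/upload/"
        f"{overlay}/{public_id}.{video_format}"
    )


print(subtitled_video_url("demo", "lincoln"))
# https://res.cloudinary.com/demo/video/upload/l_subtitles:lincoln.transcript/lincoln.mp4
```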

Formatting subtitle overlays

As with any subtitle overlay, you can use transformation parameters to make a variety of formatting adjustments when you overlay an automatically generated transcript file, including choice of font, font size, fill and outline color, and gravity.

For example, these subtitles are displayed using the Impact font, size 15, in a khaki color with a dark brown background, and located on the bottom left (south_west) instead of the default centered alignment:
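A sketch of how the transformation component for such styling might be assembled in Python. It assumes the `l_subtitles:<font>_<size>:<public_id>` overlay form combined with the `co_` (color), `b_` (background), and `g_` (gravity) qualifiers in the same component. The public ID `lincoln` and the hex value chosen for dark brown are illustrative, not taken from the original example.

```python
def styled_subtitles_component(public_id, font, size, color=None, background=None, gravity=None):
    """Return a comma-separated transformation component for styled subtitles."""
    qualifiers = []
    if color:
        qualifiers.append(f"co_{color}")
    if background:
        qualifiers.append(f"b_{background}")
    if gravity:
        qualifiers.append(f"g_{gravity}")
    qualifiers.append(f"l_subtitles:{font}_{size}:{public_id}.transcript")
    return ",".join(qualifiers)


component = styled_subtitles_component(
    "lincoln", "impact", 15,
    color="khaki", background="rgb:331a00", gravity="south_west",
)
print(component)
# co_khaki,b_rgb:331a00,g_south_west,l_subtitles:impact_15:lincoln.transcript
```

The resulting string takes the place of the plain `l_subtitles` component in the delivery URL.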

Displaying transcripts as a separate track

Instead of embedding a transcript in your video as an overlay, you can add the returned vtt or srt transcript files as a separate track for a video player. This way, the subtitles can be controlled (toggled on/off) separately from the video itself. For example, to add the video and transcript sources for an HTML5 video player:
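A sketch of that markup, rendered from a small Python helper. The URLs assume the standard delivery layout, with video assets under `video/upload` and raw subtitle files under `raw/upload`; the cloud name `demo` and public ID `lincoln` are illustrative. The native `<track>` element expects WebVTT, so the sketch references the .vtt file.

```python
from html import escape


def video_with_track(cloud_name, public_id, label="English", srclang="en"):
    """Render an HTML5 <video> element with the transcript as a subtitle track."""
    base = f"https://res.cloudinary.com/{cloud_name}"
    video_src = escape(f"{base}/video/upload/{public_id}.mp4", quote=True)
    track_src = escape(f"{base}/raw/upload/{public_id}.vtt", quote=True)
    return (
        # crossorigin lets the browser load the track from a different origin
        '<video crossorigin="anonymous" controls>\n'
        f'  <source src="{video_src}" type="video/mp4">\n'
        f'  <track kind="subtitles" label="{escape(label, quote=True)}" '
        f'srclang="{escape(srclang, quote=True)}" src="{track_src}" default>\n'
        '</video>'
    )


markup = video_with_track("demo", "lincoln")
print(markup)
```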

If you're using the Cloudinary video player, you can add subtitles and captions as a separate text track by using the textTracks parameter.
