Google AI Video Transcription

Cloudinary is a cloud-based service that provides an end-to-end image and video management solution including uploads, storage, manipulations, optimizations and delivery. Cloudinary's video solution includes a rich set of video manipulation capabilities, including cropping, overlays, optimizations, and a large variety of special effects.

With the Google AI Video Transcription add-on, you can automatically generate speech-to-text transcripts of videos that you or your users upload to your account. The add-on applies powerful neural network models to your videos using Google's Cloud Speech API to get the best possible speech recognition results. The add-on supports transcribing videos in a wide variety of languages and dialects.

You can parse the contents of the returned transcript file to display the transcript of your video on your page, making your content more skimmable, accessible, and SEO-friendly.

When you deliver the videos, a single URL parameter is all it takes to automatically embed the generated transcript in your video as subtitles, precisely aligned to the timing of each spoken word. Alternatively, you can specify an optionally generated vtt or srt file as a video track, so that users can toggle the subtitles on or off.

video with automatic transcript and subtitles

Requesting video transcription

To request a transcript for a video or audio file (in the default US English language), include the raw_convert parameter with the value google_speech in your upload or update call. (For other languages, see transcription languages below.)

For example:

Ruby:
Cloudinary::Uploader.upload("lincoln.mp4", 
  :resource_type => :video, 
  :raw_convert => "google_speech")
PHP:
\Cloudinary\Uploader::upload("lincoln.mp4", 
  array(
    "resource_type" => "video", 
    "raw_convert" => "google_speech"));
Python:
cloudinary.uploader.upload("lincoln.mp4",
  resource_type = "video", 
  raw_convert = "google_speech")
Node.js:
cloudinary.v2.uploader.upload("lincoln.mp4", 
  { resource_type: "video", 
    raw_convert: "google_speech" },
  function(error, result) { console.log(result, error); });
Java:
cloudinary.uploader().upload("lincoln.mp4", 
  ObjectUtils.asMap(
    "resource_type", "video", 
    "raw_convert", "google_speech"));
.Net:
var uploadParams = new VideoUploadParams()
{
  File = new FileDescription(@"lincoln.mp4"),
  RawConvert = "google_speech"
};
var uploadResult = cloudinary.Upload(uploadParams);

The google_speech parameter value activates a call to Google's Cloud Speech API, which runs asynchronously after your original method call completes. The response to your original call therefore shows a pending status:

...
"info": {   
   "raw_convert": {
      "google_speech": {
        "status": "pending"
      }
    }
 }
...

When the google_speech request completes (this may take from several seconds to several minutes, depending on the length of the video), a new raw file is created in your account with the same public ID as your video or audio file and a .transcript file extension. You can additionally request a standard subtitle format such as vtt or srt.

If you also provided a notification_url in your method call, the specified URL then receives a notification when the process completes:

{
  "info_kind":"google_speech",
  "info_status":"complete",
  "public_id":"lincoln",
  ...
}

Transcription languages

If your video/audio file is in a language other than US English, you can request transcription in the relevant language and (optionally) region/dialect.

For example, to request a video transcript in Canadian French when uploading the video abt_cloudinary_french.mp4:

Ruby:
Cloudinary::Uploader.upload("abt_cloudinary_french.mp4", 
  :resource_type => :video, 
  :raw_convert => "google_speech:fr-CA")
PHP:
\Cloudinary\Uploader::upload("abt_cloudinary_french.mp4", 
  array(
    "resource_type" => "video", 
    "raw_convert" => "google_speech:fr-CA"));
Python:
cloudinary.uploader.upload("abt_cloudinary_french.mp4",
  resource_type = "video", 
  raw_convert = "google_speech:fr-CA")
Node.js:
cloudinary.v2.uploader.upload("abt_cloudinary_french.mp4", 
  { resource_type: "video", 
    raw_convert: "google_speech:fr-CA" },
  function(error, result) { console.log(result, error); });
Java:
cloudinary.uploader().upload("abt_cloudinary_french.mp4", 
  ObjectUtils.asMap(
    "resource_type", "video", 
    "raw_convert", "google_speech:fr-CA"));
.Net:
var uploadParams = new VideoUploadParams()
{
  File = new FileDescription(@"abt_cloudinary_french.mp4"),
  RawConvert = "google_speech:fr-CA"
};
var uploadResult = cloudinary.Upload(uploadParams);

You can specify either the two-character language code on its own or the full language-region code. For a full list of supported language and region codes, see the Google Cloud speech-to-text language support list.

Cloudinary transcript files

The created .transcript file includes details of the audio transcription, for example:

{
  "transcript": "four score and seven years ago",
  "confidence": 0.940843403339386,
  "words": [
    { "word": "four", "start_time": 1.6, "end_time": 2.1 },
    { "word": "score", "start_time": 2.1, "end_time": 2.6 },
    { "word": "and", "start_time": 2.6, "end_time": 2.7 },
    { "word": "seven", "start_time": 2.7, "end_time": 3.1 },
    { "word": "years", "start_time": 3.1, "end_time": 3.4 },
    { "word": "ago", "start_time": 3.4, "end_time": 3.7 }     
  ],
},
{
  "transcript": "our forefathers",
  "confidence": 0.933131217956543,
  "words": [

    { "word": "our", "start_time": 4.9, "end_time": 5.2 },
    { "word": "forefathers", "start_time": 5.2, "end_time": 6.0 }
  ],
},
{
  "transcript": .....

Each excerpt of text has a confidence value, and is followed by a breakdown of individual words and their specific start and end times.
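
A minimal sketch of parsing such a file, assuming it is delivered as a JSON array of excerpt objects like the example above (the file content is inlined here for illustration; in practice you would first download the raw .transcript file from your account):

```python
import json

# Inlined sample content in the shape shown above: an array of excerpts,
# each with "transcript", "confidence", and per-word timings.
transcript_json = """
[
  {"transcript": "four score and seven years ago",
   "confidence": 0.94,
   "words": [{"word": "four", "start_time": 1.6, "end_time": 2.1}]},
  {"transcript": "our forefathers",
   "confidence": 0.93,
   "words": [{"word": "our", "start_time": 4.9, "end_time": 5.2}]}
]
"""

excerpts = json.loads(transcript_json)

# Reconstruct the full spoken text by joining the excerpt strings,
# e.g. for an on-page, SEO-friendly transcript.
full_text = " ".join(e["transcript"] for e in excerpts)
print(full_text)  # four score and seven years ago our forefathers
```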

Subtitle length and confidence levels

Google returns transcript excerpts of varying lengths. When displaying subtitles, long excerpts are automatically divided into 20-word segments and displayed on two lines.

You can also optionally set a minimum confidence level for your subtitles, for example: l_subtitles:my-video-id.transcript:90. In this case, any excerpt that Google returns with a lower confidence value will be omitted from the subtitles. Keep in mind that in some cases, this may exclude several sentences at once.
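
The filtering that the confidence qualifier applies can be sketched locally as follows. This is an illustrative reimplementation of the described server-side behavior, not Cloudinary code; the threshold is given as a percentage, matching the URL syntax:

```python
# Sketch of the confidence filter described above: excerpts whose
# confidence falls below the threshold are omitted from the subtitles.
def filter_excerpts(excerpts, min_confidence_pct):
    threshold = min_confidence_pct / 100.0
    return [e for e in excerpts if e["confidence"] >= threshold]

excerpts = [
    {"transcript": "four score and seven years ago", "confidence": 0.94},
    {"transcript": "our forefathers", "confidence": 0.84},
]

# With a 90% threshold, the second excerpt (0.84) is dropped entirely,
# illustrating how a low-confidence sentence disappears from the subtitles.
kept = filter_excerpts(excerpts, 90)
print([e["transcript"] for e in kept])  # ['four score and seven years ago']
```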

Generating standard subtitle formats

If you want to include the transcript as a separate track for a video player, you can also request that Cloudinary create an SRT and/or WebVTT raw file by including the srt and/or vtt qualifiers (separated by a colon) with the google_speech value. For example, to upload a video and also request both srt and vtt files with the transcript:

Ruby:
Cloudinary::Uploader.upload("lincoln.mp4", 
  :resource_type => :video, 
  :raw_convert => "google_speech:srt:vtt")
PHP:
\Cloudinary\Uploader::upload("lincoln.mp4", 
  array(
    "resource_type" => "video", 
    "raw_convert" => "google_speech:srt:vtt"));
Python:
cloudinary.uploader.upload("lincoln.mp4",
  resource_type = "video", 
  raw_convert = "google_speech:srt:vtt")
Node.js:
cloudinary.v2.uploader.upload("lincoln.mp4",
  { resource_type: "video", 
    raw_convert: "google_speech:srt:vtt" },
  function(error, result) { console.log(result, error); });
Java:
cloudinary.uploader().upload("lincoln.mp4", 
  ObjectUtils.asMap(
    "resource_type", "video", 
    "raw_convert", "google_speech:srt:vtt"));
.Net:
var uploadParams = new VideoUploadParams()
{
  File = new FileDescription(@"lincoln.mp4"),
  RawConvert = "google_speech:srt:vtt"
};
var uploadResult = cloudinary.Upload(uploadParams);

When the request completes, four files are associated with the uploaded video in your account:

.../video/upload/lincoln.mp4    // the source video
.../raw/upload/lincoln.transcript
.../raw/upload/lincoln.srt
.../raw/upload/lincoln.vtt

Notes

  • If you also specified a language in the google_speech transcript request, then the vtt and/or srt files you requested include the language and region code in the generated filename. For example: lincoln.fr-FR.vtt.
  • While Google's speech recognition artificial intelligence is very powerful, no speech recognition tool is 100% accurate. If exact accuracy is important for your video, you can download the generated .transcript, .srt, or .vtt files, edit them manually, and overwrite the originals.
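
If you prefer to build or hand-correct a subtitle file yourself before re-uploading it, the .transcript data contains everything needed. The sketch below generates WebVTT cues from excerpt objects, taking each cue's timing from the first and last word of the excerpt; it assumes the array-of-excerpts shape shown earlier and is an illustration, not the exact conversion Cloudinary performs:

```python
# Format seconds as a WebVTT timestamp (HH:MM:SS.mmm).
def to_timestamp(seconds):
    h, rem = divmod(seconds, 3600)
    m, s = divmod(rem, 60)
    return "%02d:%02d:%06.3f" % (h, m, s)

# Build a WebVTT document from transcript excerpts, one cue per excerpt.
def transcript_to_vtt(excerpts):
    lines = ["WEBVTT", ""]
    for e in excerpts:
        start = e["words"][0]["start_time"]   # first spoken word
        end = e["words"][-1]["end_time"]      # last spoken word
        lines.append("%s --> %s" % (to_timestamp(start), to_timestamp(end)))
        lines.append(e["transcript"])
        lines.append("")
    return "\n".join(lines)

excerpts = [
    {"transcript": "four score and seven years ago",
     "words": [{"word": "four", "start_time": 1.6, "end_time": 2.1},
               {"word": "ago", "start_time": 3.4, "end_time": 3.7}]},
]
vtt = transcript_to_vtt(excerpts)
print(vtt)
```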

Displaying transcripts as subtitle overlays

Cloudinary can automatically generate subtitles from the returned transcripts. To automatically embed subtitles in your video, add the subtitles property of the overlay parameter (l_subtitles in URLs), followed by the public ID of the raw transcript file (including the .transcript extension).

For example, the following code delivers the public domain video of Lincoln's Gettysburg Address with automatically generated subtitles:

Ruby:
cl_video_tag("lincoln", :overlay=>{:public_id=>"lincoln.transcript"})
PHP:
cl_video_tag("lincoln", array("overlay"=>array("public_id"=>"lincoln.transcript")))
Python:
CloudinaryVideo("lincoln").video(overlay={'public_id': "lincoln.transcript"})
Node.js:
cloudinary.video("lincoln", {overlay: {public_id: "lincoln.transcript"}})
Java:
cloudinary.url().transformation(new Transformation().overlay(new SubtitlesLayer().publicId("lincoln.transcript"))).videoTag("lincoln");
JS:
cloudinary.videoTag('lincoln', {overlay: new cloudinary.SubtitlesLayer().publicId("lincoln.transcript")}).toHtml();
jQuery:
$.cloudinary.video("lincoln", {overlay: new cloudinary.SubtitlesLayer().publicId("lincoln.transcript")})
React:
<Video publicId="lincoln" >
  <Transformation overlay={{publicId: "lincoln.transcript"}} />
</Video>
Angular:
<cl-video public-id="lincoln" >
  <cl-transformation overlay="subtitles:lincoln.transcript">
  </cl-transformation>
</cl-video>
.Net:
cloudinary.Api.UrlVideoUp.Transform(new Transformation().Overlay(new SubtitlesLayer().PublicId("lincoln.transcript"))).BuildVideoTag("lincoln")
Android:
MediaManager.get().url().transformation(new Transformation().overlay(new SubtitlesLayer().publicId("lincoln.transcript"))).resourceType("video").generate("lincoln.mp4");
iOS:
cloudinary.createUrl().setResourceType("video").setTransformation(CLDTransformation().setOverlay("subtitles:lincoln.transcript")).generate("lincoln.mp4")

Formatting subtitle overlays

As with any subtitle overlay, you can use transformation parameters to make a variety of formatting adjustments when you overlay an automatically generated transcript file, including choice of font, font size, fill and outline color, and gravity.

For example, these subtitles are displayed using the Impact font, size 15, in a khaki color with a dark brown background, and located on the bottom left (south_west) instead of the default centered alignment:

Ruby:
cl_video_tag("lincoln", :overlay=>{:font_family=>"impact", :font_size=>15, :public_id=>"lincoln.transcript"}, :color=>"khaki", :background=>"#331a00", :gravity=>"south_west")
PHP:
cl_video_tag("lincoln", array("overlay"=>array("font_family"=>"impact", "font_size"=>15, "public_id"=>"lincoln.transcript"), "color"=>"khaki", "background"=>"#331a00", "gravity"=>"south_west"))
Python:
CloudinaryVideo("lincoln").video(overlay={'font_family': "impact", 'font_size': 15, 'public_id': "lincoln.transcript"}, color="khaki", background="#331a00", gravity="south_west")
Node.js:
cloudinary.video("lincoln", {overlay: {font_family: "impact", font_size: 15, public_id: "lincoln.transcript"}, color: "khaki", background: "#331a00", gravity: "south_west"})
Java:
cloudinary.url().transformation(new Transformation().overlay(new SubtitlesLayer().fontFamily("impact").fontSize(15).publicId("lincoln.transcript")).color("khaki").background("#331a00").gravity("south_west")).videoTag("lincoln");
JS:
cloudinary.videoTag('lincoln', {overlay: new cloudinary.SubtitlesLayer().fontFamily("impact").fontSize(15).publicId("lincoln.transcript"), color: "khaki", background: "#331a00", gravity: "south_west"}).toHtml();
jQuery:
$.cloudinary.video("lincoln", {overlay: new cloudinary.SubtitlesLayer().fontFamily("impact").fontSize(15).publicId("lincoln.transcript"), color: "khaki", background: "#331a00", gravity: "south_west"})
React:
<Video publicId="lincoln" >
  <Transformation overlay={{fontFamily: "impact", fontSize: 15, publicId: "lincoln.transcript"}} color="khaki" background="#331a00" gravity="south_west" />
</Video>
Angular:
<cl-video public-id="lincoln" >
  <cl-transformation overlay="subtitles:impact_15:lincoln.transcript" color="khaki" background="#331a00" gravity="south_west">
  </cl-transformation>
</cl-video>
.Net:
cloudinary.Api.UrlVideoUp.Transform(new Transformation().Overlay(new SubtitlesLayer().FontFamily("impact").FontSize(15).PublicId("lincoln.transcript")).Color("khaki").Background("#331a00").Gravity("south_west")).BuildVideoTag("lincoln")
Android:
MediaManager.get().url().transformation(new Transformation().overlay(new SubtitlesLayer().fontFamily("impact").fontSize(15).publicId("lincoln.transcript")).color("khaki").background("#331a00").gravity("south_west")).resourceType("video").generate("lincoln.mp4");
iOS:
cloudinary.createUrl().setResourceType("video").setTransformation(CLDTransformation().setOverlay("subtitles:impact_15:lincoln.transcript").setColor("khaki").setBackground("#331a00").setGravity("south_west")).generate("lincoln.mp4")

Displaying transcripts as a separate track

Instead of embedding the transcript in your video as an overlay, you can alternatively add the returned vtt or srt transcript files as a separate track for a video player. This way, users can toggle the subtitles on or off independently of the video itself. For example, to add the video and transcript sources for an HTML5 video player:

<video crossorigin preload="auto" controls autoplay>
  <source id="mp4" src="https://res.cloudinary.com/demo/video/upload/lincoln_speech.mp4" type="video/mp4">
     <track label="English" kind="subtitles" srclang="en" src="https://res.cloudinary.com/demo/raw/upload/lincoln_speech.vtt" default>
</video>