Google AI Video Transcription

Overview

Cloudinary is a cloud-based service that provides an end-to-end image and video management solution including uploads, storage, manipulations, optimizations and delivery. Cloudinary's video solution includes a rich set of video manipulation capabilities, including cropping, overlays, optimizations, and a large variety of special effects.

With the Google AI Video Transcription add-on, you can automatically generate speech-to-text transcripts of videos that you or your users upload to your account. The add-on applies powerful neural network models to your videos using Google's Cloud Speech API to get the best possible speech recognition results.

You can parse the contents of the returned transcript JSON file to display the transcript of your video on your page, making your content more skimmable, accessible, and SEO-friendly.
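
For example, the following Python sketch fetches the generated .transcript file (a raw asset whose structure is shown later on this page) and joins the excerpt texts into a single block of text for display. The cloud name "demo" and public ID "lincoln" are placeholders, and the sketch assumes the file contains a JSON array of excerpt objects:

# A minimal sketch: fetch the generated .transcript file and join the
# excerpt texts into a plain-text transcript for display on a page.
# "demo" (cloud name) and "lincoln" (public ID) are placeholders.
import json
import urllib.request

url = "https://res.cloudinary.com/demo/raw/upload/lincoln.transcript"
with urllib.request.urlopen(url) as response:
    excerpts = json.loads(response.read())  # assumed to be a list of excerpt objects

page_text = " ".join(excerpt["transcript"] for excerpt in excerpts)
print(page_text)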

When you deliver the videos, a single URL parameter is all it takes to automatically insert the generated transcript into your video in the form of subtitles, exactly aligned to the timing of each spoken word.

video with automatic transcript and subtitles

Requesting video transcription

To request a transcript of your video (or audio file), include the raw_convert parameter with the value google_speech in your upload, explicit, or update call.

For example:

Ruby:
Cloudinary::Uploader.upload("lincoln.mp4", 
  :resource_type => :video, :raw_convert => "google_speech")
PHP:
\Cloudinary\Uploader::upload("lincoln.mp4", 
  array("resource_type" => "video", "raw_convert" => "google_speech"));
Python:
cloudinary.uploader.upload("lincoln.mp4",
  resource_type = "video", raw_convert = "google_speech")
Node.js:
cloudinary.uploader.upload("lincoln.mp4", 
  function(result) { console.log(result); }, 
  { resource_type: "video", raw_convert: "google_speech" });
Java:
cloudinary.uploader().upload("lincoln.mp4", ObjectUtils.asMap(
  "resource_type", "video", "raw_convert", "google_speech"));

Notes:

  • The Video Transcription add-on currently supports only English-language audio input.
  • You can include the notification_url parameter in your request to receive a notification at the specified URL when the transcript file is ready (see the example following these notes).
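
For example, a minimal Python sketch that requests transcription together with a completion notification (assuming your Cloudinary credentials are already configured; the notification URL is a placeholder for your own endpoint):

# A minimal sketch: request transcription and ask Cloudinary to notify
# your own endpoint when the transcript is ready.
# "https://mysite.example/cloudinary_webhook" is a placeholder URL.
import cloudinary.uploader

cloudinary.uploader.upload(
    "lincoln.mp4",
    resource_type="video",
    raw_convert="google_speech",
    notification_url="https://mysite.example/cloudinary_webhook",
)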

The google_speech parameter value activates a call to Google's Cloud Speech API, which is performed asynchronously after your original method call completes. The response to your original method call therefore shows a pending status:

...
"info": {
  "raw_convert": {
    "google_speech": {
      "status": "pending"
    }
  }
}
...
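
For example, a minimal Python sketch that reads this status from the upload response (result is assumed to be the dictionary returned by the Python upload call shown above):

# A minimal sketch: check the add-on status in the upload response.
# `result` is assumed to be the dict returned by cloudinary.uploader.upload(...).
status = result["info"]["raw_convert"]["google_speech"]["status"]
if status == "pending":
    print("Transcript not ready yet; wait for the notification or check again later.")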

When the google_speech request is complete (this may take anywhere from several seconds to several minutes, depending on the length of the video), a new raw file is created in your account with the same public ID as your video or audio file and a .transcript extension.

If you also provided a notification_url in your method call, the specified URL then receives a notification when the process completes:

{"info_kind":"google_speech","info_status":"complete","public_id":"lincoln",.....}

The created .transcript file includes details of the audio transcription, for example:

{
  "transcript": "four score and seven years ago",
  "confidence": 0.940843403339386,
  "words": [
    { "word": "four", "start_time": 1.6, "end_time": 2.1 },
    { "word": "score", "start_time": 2.1, "end_time": 2.6 },
    { "word": "and", "start_time": 2.6, "end_time": 2.7 },
    { "word": "seven", "start_time": 2.7, "end_time": 3.1 },
    { "word": "years", "start_time": 3.1, "end_time": 3.4 },
    { "word": "ago", "start_time": 3.4, "end_time": 3.7 }
  ],
  "alternatives": [ ]
},
{
  "transcript": "our forefathers",
  "confidence": 0.933131217956543,
  "words": [
    { "word": "our", "start_time": 4.9, "end_time": 5.2 },
    { "word": "forefathers", "start_time": 5.2, "end_time": 6.0 }
  ],
  "alternatives": [ ]
},
{
  "transcript": .....

Each excerpt of text has a confidence value and is followed by a breakdown of the individual words with their specific start and end times.

Note: While Google's speech recognition AI is very powerful, no speech recognition tool is 100% accurate. If exact accuracy is important for your video, you can download the generated .transcript file, edit it manually (make sure to edit the relevant word(s) both in the transcript excerpt and in the corresponding word entries), and overwrite the original file. You can also add punctuation if desired.
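
For example, a minimal Python sketch of that workflow, assuming you have already downloaded the transcript and saved your corrected copy locally as lincoln.transcript (the filename and public ID are placeholders):

# A minimal sketch: overwrite the original raw .transcript file with your
# manually corrected copy so that generated subtitles use the edits.
# "lincoln.transcript" (local file and raw public ID) is a placeholder.
import cloudinary.uploader

cloudinary.uploader.upload(
    "lincoln.transcript",            # your locally edited copy
    resource_type="raw",
    public_id="lincoln.transcript",  # same public ID as the generated file
    overwrite=True,
    invalidate=True,                 # also refresh any cached CDN copies
)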

Displaying subtitles

The Google AI Video Transcription add-on can automatically generate subtitles from the returned transcripts. To include automatic subtitles with your video, add the subtitles property of the overlay parameter (l_subtitles in URLs), followed by the .transcript file public ID (including the extension).

For example, the following code delivers the public domain video of Lincoln's Gettysburg Address with automatically generated subtitles:

Ruby:
cl_video_tag("lincoln", :overlay=>"subtitles:lincoln.transcript")
PHP:
cl_video_tag("lincoln", array("overlay"=>"subtitles:lincoln.transcript"))
Python:
CloudinaryVideo("lincoln").video(overlay="subtitles:lincoln.transcript")
Node.js:
cloudinary.video("lincoln", {overlay: "subtitles:lincoln.transcript"})
Java:
cloudinary.url().transformation(new Transformation().overlay("subtitles:lincoln.transcript")).videoTag("lincoln")
JS:
cl.videoTag('lincoln', {overlay: "subtitles:lincoln.transcript"}).toHtml();
jQuery:
$.cloudinary.video("lincoln", {overlay: "subtitles:lincoln.transcript"})
React:
<Video publicId="lincoln" >
  <Transformation overlay="subtitles:lincoln.transcript" />
</Video>
Angular:
<cl-video public-id="lincoln" >
  <cl-transformation overlay="subtitles:lincoln.transcript">
  </cl-transformation>
</cl-video>
.Net:
cloudinary.Api.UrlVideoUp.Transform(new Transformation().Overlay("subtitles:lincoln.transcript")).BuildVideoTag("lincoln")
Android:
MediaManager.get().url().transformation(new Transformation().overlay("subtitles:lincoln.transcript")).resourceType("video").generate("lincoln.mp4")

Subtitle length and confidence levels

Google returns transcript excerpts of varying lengths. When displaying subtitles, long excerpts are automatically divided into 20-word entities and displayed on two lines.

You can optionally set a minimum confidence level for your subtitles, for example: l_subtitles:my-video-id.transcript:90. In this case, any excerpt that Google returns with a lower confidence value is omitted from the subtitles. Keep in mind that in some cases this may exclude several sentences at once.
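
For example, a minimal Python sketch that applies a 90% minimum confidence threshold using the same overlay syntax (the public ID lincoln is the same placeholder used above):

# A minimal sketch: deliver the video with subtitles limited to excerpts
# whose confidence value is at least 90.
from cloudinary import CloudinaryVideo

video_tag = CloudinaryVideo("lincoln").video(overlay="subtitles:lincoln.transcript:90")
print(video_tag)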

Formatting subtitles

As with SRT subtitle overlays, you can use transformation parameters to make a variety of adjustments, including choice of font, font size, fill and outline color, and gravity.

For example, the following displays the subtitles using the Impact font at size 15, in a khaki color with a dark brown background, and positions them at the bottom left (south_west) instead of the default centered alignment:

Ruby:
cl_video_tag("lincoln", :overlay=>"subtitles:impact_15:lincoln.transcript", :color=>"khaki", :background=>"#331a00", :gravity=>"south_west")
PHP:
cl_video_tag("lincoln", array("overlay"=>"subtitles:impact_15:lincoln.transcript", "color"=>"khaki", "background"=>"#331a00", "gravity"=>"south_west"))
Python:
CloudinaryVideo("lincoln").video(overlay="subtitles:impact_15:lincoln.transcript", color="khaki", background="#331a00", gravity="south_west")
Node.js:
cloudinary.video("lincoln", {overlay: "subtitles:impact_15:lincoln.transcript", color: "khaki", background: "#331a00", gravity: "south_west"})
Java:
cloudinary.url().transformation(new Transformation().overlay("subtitles:impact_15:lincoln.transcript").color("khaki").background("#331a00").gravity("south_west")).videoTag("lincoln")
JS:
cl.videoTag('lincoln', {overlay: "subtitles:impact_15:lincoln.transcript", color: "khaki", background: "#331a00", gravity: "south_west"}).toHtml();
jQuery:
$.cloudinary.video("lincoln", {overlay: "subtitles:impact_15:lincoln.transcript", color: "khaki", background: "#331a00", gravity: "south_west"})
React:
<Video publicId="lincoln" >
  <Transformation overlay="subtitles:impact_15:lincoln.transcript" color="khaki" background="#331a00" gravity="south_west" />
</Video>
Angular:
<cl-video public-id="lincoln" >
  <cl-transformation overlay="subtitles:impact_15:lincoln.transcript" color="khaki" background="#331a00" gravity="south_west">
  </cl-transformation>
</cl-video>
.Net:
cloudinary.Api.UrlVideoUp.Transform(new Transformation().Overlay("subtitles:impact_15:lincoln.transcript").Color("khaki").Background("#331a00").Gravity("south_west")).BuildVideoTag("lincoln")
Android:
MediaManager.get().url().transformation(new Transformation().overlay("subtitles:impact_15:lincoln.transcript").color("khaki").background("#331a00").gravity("south_west")).resourceType("video").generate("lincoln.mp4")