Accessibility is a core part of software development, and for websites, video transcriptions improve the user experience. They’re instrumental in many situations, such as for viewers with hearing impairments, people in noisy environments, language learners following along with lectures, and anyone who prefers to consume information by reading.
In this blog post, we’ll learn how to generate video transcriptions in our Next.js applications using Cloudinary’s AI features.
To follow along, you’ll need a basic understanding of TypeScript or JavaScript and a Cloudinary account; you can create one for free. The completed app can be found in this GitHub Repository.
Run the command below to set up a Next.js app. You can give it any name you wish:
$ npx create-next-app@latest <YOUR-APPLICATION-NAME>
Select the options below during setup, or choose whatever settings work for you.
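For reference, here is one set of answers that matches the structure used in the rest of this post (TypeScript, the src/ directory, the App Router, and the default @/* import alias). The exact prompts may vary slightly between create-next-app versions:

Would you like to use TypeScript? Yes
Would you like to use ESLint? Yes
Would you like to use Tailwind CSS? No
Would you like to use `src/` directory? Yes
Would you like to use App Router? Yes
Would you like to customize the default import alias (@/*)? No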
Install the Cloudinary package with the command below:
$ npm i cloudinary
Change into the application directory we created above and navigate to the page.module.css file. Replace the boilerplate code with the code below; these are just the basic CSS styles we’ll use in our app:
.main {
  display: flex;
  flex-direction: column;
  justify-content: space-evenly;
  align-items: center;
  padding: 6rem;
  height: 100vh;
}

.main form {
  background-color: white;
  width: 400px;
  border-radius: 5px;
  padding: 10px;
}

.main form label {
  color: black;
}

.main form input {
  padding: 0.5rem;
  border: 1px solid #ccc;
  border-radius: 4px;
  background: none;
  color: black;
  margin: 0 10px;
}

.main form button {
  height: 40px;
  width: 100%;
  margin: 20px 0;
  color: white;
  font-size: large;
  border-radius: 5px;
  border: none;
  background-color: rgb(28, 25, 25);
}

.main form button:hover {
  cursor: pointer;
  background-color: rgb(85, 82, 82);
}

.main form button:disabled {
  background-color: #ccc;
  color: #666;
  cursor: not-allowed;
  opacity: 0.6;
}

.video-transcription-section {
  display: flex;
  flex-direction: row;
  justify-content: space-evenly;
  align-items: center;
  padding: 3rem;
  width: 100vw;
  height: 350px;
}

.video-transcription-section video {
  height: inherit;
  border-radius: 5px;
  border: #ccc solid 1px;
}

.transcription {
  border-radius: 5px;
  border: #ccc solid 1px;
  height: inherit;
  width: 500px;
  padding: 5px;
  color: lightgreen;
}
Next, in the page.tsx file, replace the boilerplate code with the code below. This code snippet is the app’s foundation, which stores the required data and eventually renders a transcript:
'use client';

import { useState } from 'react';
import styles from './page.module.css';

export default function Home() {
  const [videoUrl, setVideoUrl] = useState<string>('');
  const [transcript, setTranscript] = useState<string>('');
  const [isUploading, setIsUploading] = useState<boolean>(false);

  const uploadVideo = async (event: React.FormEvent<HTMLFormElement>) => {};

  return (
    <main className={styles.main}>
      <form onSubmit={uploadVideo}>
        <label htmlFor='video_file'>Video file:</label>
        <input type='file' name='video_file' accept='video/*' required />
        <button type='submit' disabled={isUploading}>
          {isUploading ? 'Uploading video...' : 'Upload'}
        </button>
      </form>
      {videoUrl && (
        <div className={styles['video-transcription-section']}>
          <video crossOrigin='anonymous' controls muted>
            <source src={videoUrl} type='video/mp4' />
          </video>
          <div className={styles.transcription}>
            <p>
              {transcript ? transcript : 'Transcription is being processed...'}
            </p>
          </div>
        </div>
      )}
    </main>
  );
}
In the code above, we initialized the state variables we’d need:
- videoUrl. This variable keeps track of the uploaded video’s URL.
- transcript. This variable stores the transcribed text associated with the uploaded video.
- isUploading. This boolean flag indicates whether a video upload is in progress.
There is also an empty uploadVideo function that will contain our upload logic. We conditionally show the video and transcript based on their respective state variables.
In the project root directory, add a new file named .env.local and paste the code below into it. This file will house all the app keys we need to use Cloudinary in our project.
CLOUDINARY_CLOUD_NAME=<CLOUDINARY_CLOUD_NAME>
CLOUDINARY_API_KEY=<CLOUDINARY_API_KEY>
CLOUDINARY_API_SECRET=<CLOUDINARY_API_SECRET>
To get these details, navigate to the Cloudinary developer dashboard. You should see all the details under Product Environment Credentials.
We must provision our Cloudinary account with the Google AI Transcription Add-on to use the video transcription feature. Navigate to the Add-ons page on the Cloudinary console. We need to be logged in to our Cloudinary account first to view the page.
Click the Google AI Video Transcription card to see the page that lists the available plans.
We can subscribe to any of them; the plan we subscribe to will show up on our dashboard, which signifies that our account now has the Cloudinary video transcription feature. For this article, we subscribed to the free plan.
Now that we’re all set up, let’s start building the application’s functionality. Let’s establish a few utilities we’ll need later on in the app.
First is the Cloudinary library. To do this:
- Add a new folder called lib in the src folder.
- In that folder, add a file and call it cloudinary.ts.
- In the cloudinary.ts file, paste the code below:
import { v2 as cloudinary } from 'cloudinary';

cloudinary.config({
  cloud_name: process.env.CLOUDINARY_CLOUD_NAME,
  api_key: process.env.CLOUDINARY_API_KEY,
  api_secret: process.env.CLOUDINARY_API_SECRET,
});

export default cloudinary;
In the code above, we import the Cloudinary library, using the v2 API aliased as cloudinary. We then call the cloudinary.config() method to configure the Cloudinary client with the account details we retrieved from our dashboard in the last section.
We then export the configured Cloudinary object as this module’s default export. This setup allows other parts of our application to import and use the configured Cloudinary client without configuring it again.
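As a quick illustration (a hypothetical snippet, not part of the app we’re building), any server-side module can now reuse the client without repeating the credentials:

import cloudinary from '@/lib/cloudinary';

// The client is already configured, so we can call the upload API directly.
export async function uploadSampleVideo() {
  const result = await cloudinary.uploader.upload('path/to/sample.mp4', {
    resource_type: 'video',
  });
  return result.secure_url;
}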
Next is the transcript parser. This utility function takes in the transcript data and parses it into a paragraph of text that we can display.
- In the lib folder, add a file and call it transcript.ts.
- In the transcript.ts file, paste the code below:
import { TranscriptData } from '@/types/transcript-data.type';

export const parseTranscriptData = async (
  data: TranscriptData[]
): Promise<string> => {
  let transcript: string = '';
  data.forEach(
    (line: TranscriptData) => (transcript = transcript + ` ${line.transcript}`)
  );
  return transcript;
};
The TranscriptData interface defines the structure of the transcription file we get from Cloudinary. Our app doesn’t currently have the TranscriptData interface defined. To define it, we need to create a new folder in the src directory and name it types. In the types folder, add a new file named transcript-data.type.ts.
Add the code below to the transcript-data.type.ts file:
export interface Word {
  word: string;
  start_time: number;
  end_time: number;
}

export interface TranscriptData {
  transcript: string;
  confidence: number;
  words: Word[];
}
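To make the shape concrete, here’s a small made-up sample in the same structure, along with how parseTranscriptData flattens it into a single string (the sentences, confidences, and timings below are purely illustrative):

import { parseTranscriptData } from '@/lib/transcript';
import { TranscriptData } from '@/types/transcript-data.type';

// Hypothetical data mimicking the shape of a generated .transcript file.
const sample: TranscriptData[] = [
  {
    transcript: 'Welcome to the demo.',
    confidence: 0.92,
    words: [{ word: 'Welcome', start_time: 0, end_time: 0.4 }],
  },
  {
    transcript: 'Thanks for watching.',
    confidence: 0.89,
    words: [{ word: 'Thanks', start_time: 1.1, end_time: 1.5 }],
  },
];

// Logs: " Welcome to the demo. Thanks for watching."
parseTranscriptData(sample).then((text) => console.log(text));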
In the uploadVideo function in the page.tsx file, we want to set up the logic to upload our video to Cloudinary.
const uploadVideo = async (event: React.FormEvent<HTMLFormElement>) => {
  event.preventDefault();
  setIsUploading(true);
  const formData = new FormData(event.currentTarget);

  try {
    const response = await fetch('/api/upload', {
      method: 'POST',
      body: formData,
    });
    if (!response.ok) {
      throw new Error('Failed to upload video');
    }
    const data = await response.json();
    setVideoUrl(data.videoUrl);
    setTranscript('');
  } catch (error: any) {
    console.error(error);
  } finally {
    setIsUploading(false);
  }
};
In this function, we get the data from our form and make a POST request to the upload API endpoint, a serverless function which we’ll create in a moment. Then, we handle the response by throwing an error if the upload failed and setting the videoUrl and transcript state variables otherwise.
To set up the upload endpoint, follow these steps:
- Create a new folder in the app directory called api.
- In the api folder, create a new folder called upload.
- In the upload folder, add a file called route.ts.
- In route.ts, paste the code below:
import { UploadApiResponse } from 'cloudinary';
import cloudinary from '@/lib/cloudinary';
import { NextResponse } from 'next/server';

export async function POST(req: Request) {
  try {
    const formData = await req.formData();
    const file = formData.get('video_file') as File;
    const buffer: Buffer = Buffer.from(await file.arrayBuffer());
    const cloud_name: string | undefined = process.env.CLOUDINARY_CLOUD_NAME;
    const base64Image: string = `data:${file.type};base64,${buffer.toString(
      'base64'
    )}`;
    const uploadResult: UploadApiResponse = await cloudinary.uploader.upload(
      base64Image,
      {
        resource_type: 'video',
        public_id: `videos/${Date.now()}`,
        raw_convert: 'google_speech',
      }
    );
    const videoUrl = uploadResult.secure_url;
    const transcriptionFileUrl = `https://res.cloudinary.com/${cloud_name}/raw/upload/v${uploadResult.version + 1}/${uploadResult.public_id}.transcript`;
    return NextResponse.json(
      { videoUrl, transcriptionFileUrl },
      { status: 200 }
    );
  } catch (error: any) {
    throw new Error(error);
  }
}
In this file, we retrieve the video file from the form data, cast it as a File object, and read its contents as an ArrayBuffer. We then convert that ArrayBuffer into a binary Buffer using Buffer.from(), which lets us handle the video’s binary data efficiently on the server.
Next, we convert the Buffer to a Base64-encoded string with buffer.toString('base64') and prefix it with the data URI scheme, using the file’s MIME type, to create a valid data URI that Cloudinary can upload.
Next, we upload the Base64-encoded video to Cloudinary using the upload method of the Cloudinary client we configured earlier (cloudinary.uploader.upload). This method takes two main parameters: the Base64 string of the video and an options object. Within the options object, resource_type is set to video to indicate that the file being uploaded is a video.
The raw_convert option tells Cloudinary to generate a transcript file (.transcript) for the uploaded video using Google Speech-to-Text. Depending on the length of the video, the transcript file may take several seconds or even minutes to generate. To handle this, we’ll implement a technique called polling later in this article.
Finally, we construct the transcription file URL by combining the cloud_name, version, and public_id. This URL points to the transcription file hosted on Cloudinary. Notice that we add one to the version number when building the URL (uploadResult.version + 1). This is because the transcript file takes some time to process after the video has been uploaded; when it finishes, Cloudinary increments the file version by one, indicating that the file is ready. This is the URL we’ll use when polling for the transcript file.
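For example, if our cloud name were demo-cloud and the upload response contained version 1712345678 and public_id videos/1712345600000, the constructed URL would look like this (all values here are made up):

https://res.cloudinary.com/demo-cloud/raw/upload/v1712345679/videos/1712345600000.transcript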
Currently, we can upload our video to Cloudinary and display it in our application, but we still can’t view the transcript of the uploaded video. This is because, after our video uploads, the google_speech parameter value asynchronously triggers a call to Google’s Cloud Speech API, which starts in a pending state. The time it takes for the transcript file to finish generating depends on the length of the uploaded video.
To handle this, we will use the polling method to periodically check Cloudinary for the status of the transcript file generation. Polling is a method used in system design to continuously check the status or retrieve data from a source at predefined intervals.
We need to add a new function called checkTranscriptionStatus to our page.tsx file. We’ll add it just after the state variables:
...
const POLLING_INTERVAL = 5000;

export default function Home() {
  ...
  const checkTranscriptionStatus = async (url: string) => {
    try {
      const response = await fetch(
        `/api/transcript?url=${encodeURIComponent(url)}`
      );
      const data = await response.json();
      if (data.available) {
        setTranscript(data.transcript);
      } else {
        setTimeout(() => checkTranscriptionStatus(url), POLLING_INTERVAL);
      }
    } catch (error: any) {
      console.error('Error checking transcription status:', error);
    }
  };
  ...
  return (
    ...
At the top, we’ll set the polling interval to 5000 milliseconds. In the checkTranscriptionStatus function, we’ll make a GET request to the transcript API endpoint (we’ll create this endpoint in a moment) and send the transcript URL as a query param.
We’ll then check whether the transcript is available using the available field in the endpoint’s response data and, if it is, set the transcript data in state. If it isn’t yet available, we’ll schedule a setTimeout to call the API again in five seconds (5000 milliseconds). This keeps checking Cloudinary until the transcript data is retrieved.
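Note that nothing in our app calls checkTranscriptionStatus yet. A minimal way to wire it up (assuming the transcriptionFileUrl field returned by our /api/upload route) is to kick off polling inside uploadVideo right after a successful upload:

// Inside uploadVideo, after the upload request succeeds:
const data = await response.json();
setVideoUrl(data.videoUrl);
setTranscript('');
// Start polling Cloudinary for the generated .transcript file.
checkTranscriptionStatus(data.transcriptionFileUrl);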
To set up the transcript endpoint, we’ll follow these steps:
- In the api folder, create a new folder called transcript.
- In the transcript folder, add a file called route.ts.
- In the route.ts file, paste the code below:
import { parseTranscriptData } from '@/lib/transcript';
import { TranscriptData } from '@/types/transcript-data.type';
import { NextResponse, type NextRequest } from 'next/server';

export async function GET(req: NextRequest) {
  const searchParams = req.nextUrl.searchParams;
  const url: string | null = searchParams.get('url');
  try {
    const response = await fetch(url as string);
    if (response.ok) {
      const transcriptData: TranscriptData[] = await response.json();
      const transcript: string = await parseTranscriptData(transcriptData);
      return NextResponse.json(
        { available: true, transcript },
        { status: 200 }
      );
    } else {
      return NextResponse.json({ available: false }, { status: 200 });
    }
  } catch (error: any) {
    throw new Error(error);
  }
}
In the code above, we import the parseTranscriptData method and the TranscriptData interface we created in the utilities section earlier. In the function, we extract the URL parameter from the query string of the incoming request and attempt to fetch data from that URL. The response from the fetch is checked for a successful status (response.ok). If successful, we parse the JSON response into the transcriptData array, process it into a string with the parseTranscriptData function, and return the resulting transcript with the available status set to true.
If the fetch operation does not return a successful response, the function returns a JSON response with available set to false. Our app will keep polling the /api/transcript endpoint until the transcript file has been generated.
When we run the app, the video and transcript sections won’t show up until the video has successfully uploaded. At first, the transcript section will show the message “Transcription is being processed…”, which means our app is still trying to fetch the transcript from Cloudinary; once the transcript is successfully generated, it replaces the default message.
In this blog post, we discussed how to use the Google AI Transcription Add-on provided by Cloudinary to generate video transcriptions in a Next.js application. Try adding more features to the app like displaying the timestamp for each transcribed paragraph. To learn more about Cloudinary AI, contact us today.
If you found this blog post helpful and want to discuss it in more detail, join the Cloudinary Community forum and its associated Discord.