Imagine managing your entire media library not by clicking through folders and forms, but by having a conversation. Instead of manually searching, selecting, and tagging assets, you could simply type: “Find all images from the ‘summer-sale’ folder and tag them with ‘archive-2025’.” This is the future of digital asset management: an intelligent, conversational interface that works like a co-pilot for your creative workflow.
This platform provides a powerful chat interface that understands your commands, manages your media, and even handles new uploads. It’s all powered by a cutting-edge, full-stack setup: a Next.js frontend, OpenAI for natural language understanding, and the Cloudinary Model Context Protocol (MCP) Server acting as the bridge between them.
In this tutorial, you’ll build a complete AI-powered media assistant that can:
- Launch a local Cloudinary MCP gateway that exposes powerful asset management tools in a way that AI models can understand.
- Provide a polished Next.js chat interface for uploading files and interacting with your media library.
- Process natural language commands to list, rename, move, tag, and delete assets in your Cloudinary account.
- Integrate with OpenAI to intelligently interpret user requests and call the appropriate Cloudinary tools.
Let’s dive in!
Before we dive into the code, we need to get the project running on your local machine. This involves cloning the starter repository, installing the necessary packages, and configuring your environment with the required API keys for Cloudinary and OpenAI.
We’ll start by cloning the complete project from GitHub. This gives us the full application structure right away.
git clone https://github.com/musebe/cloudinary-mcp-media-assistant.git
cd cloudinary-mcp-media-assistant
Next, install all the required Node.js packages using npm.
npm install
This will install Next.js, React, the Cloudinary and OpenAI SDKs, and other utilities defined in the package.json file.
The assistant needs to connect to your specific Cloudinary and OpenAI accounts. We’ll store these secret keys in a local environment file that should never be committed to version control.
Create a new file named .env.local in the root of your project and add the following content:
# Get this from your Cloudinary Dashboard homepage
# Format: cloudinary://API_KEY:API_SECRET@CLOUD_NAME
CLOUDINARY_URL="your_cloudinary_url"
# Get this from platform.openai.com/api-keys
OPENAI_API_KEY="sk-..."
# Port for the local MCP server (optional, defaults to 8787)
MCP_PORT=8787
- CLOUDINARY_URL. This single URL contains your Cloud Name, API Key, and API Secret. You can find it on your main Cloudinary Dashboard.
- OPENAI_API_KEY. This is required to give your assistant its “brain.” You can generate a new key from your OpenAI API Keys page. The AI features are optional, but this key is needed to run the app as-is.
Important: Your .env.local file contains sensitive credentials. The project’s .gitignore file is already configured to exclude it, but always ensure you don’t accidentally expose your keys.
With the setup complete, we’re ready to start the engine.
At the heart of our application is the Cloudinary MCP Gateway. Before we can build a chat interface, we need to run this local server. It acts as a crucial translator, converting Cloudinary’s powerful Asset Management API into a standardized format that AI models can understand and interact with. This format is called the Model Context Protocol (MCP).
Instead of a traditional REST API, the MCP server exposes functions like list-images or delete-asset as “tools” that an AI can be instructed to use.
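To make that concrete, here’s a minimal sketch of what calling such a tool looks like in TypeScript. It assumes the official @modelcontextprotocol/sdk client package and the gateway’s default SSE endpoint; it illustrates the protocol rather than the project’s exact connection code.
// Hypothetical MCP client sketch — not the project's exact code
import { Client } from "@modelcontextprotocol/sdk/client/index.js";
import { SSEClientTransport } from "@modelcontextprotocol/sdk/client/sse.js";

const client = new Client({ name: "media-assistant", version: "1.0.0" });
await client.connect(new SSEClientTransport(new URL("http://localhost:8787/sse")));

// Ask the gateway which tools it exposes (list-images, upload-asset, asset-rename, ...)
const { tools } = await client.listTools();
console.log(tools.map((t) => t.name));

// Invoke a tool by name — the same kind of call the chat backend makes later in this tutorial
const result = await client.callTool({ name: "list-images", arguments: {} });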
The project includes a custom script to make starting this server simple. It handles configuration, starts the process, and checks that it’s running correctly before finishing.
Open a new, dedicated terminal window and run the following command:
npm run dev:mcp
After a few moments, you should see output confirming that the server is active and listening for connections, ending with a success message.
🔹 Starting Cloudinary MCP Gateway...
Using CLOUDINARY_URL: //***:***@your_cloud_name
Spawning process...
🩺 Checking gateway health at http://localhost:8787/sse...
✅ Gateway is healthy and running!
Keep this terminal window open. This server must be running for the chat application to function.
This isn’t just a simple server command. It’s a robust launcher. Let’s look at two important parts of the scripts/start-mcp-asset.ts file.
First, the core command uses npx to run two packages together.
// scripts/start-mcp-asset.ts (snippet)
const cmd = "npx";
const args = [
"-y",
"supergateway",
// ... port and path flags ...
"--stdio",
"npx -y --package @cloudinary/asset-management -- mcp start",
];
const child = spawn(cmd, args, {
/* ... */
});
Breakdown:
- supergateway. A utility that creates an MCP-compliant server from another process.
- --stdio '...'. Wraps the standard output of another command.
- npx ... mcp start. Runs the official Cloudinary asset management tools. By wrapping it, the gateway discovers all available Cloudinary functions (list-images, upload-asset, asset-rename, etc.) and exposes them as MCP tools.
Next, the script includes a health check to ensure the server is actually ready before our app connects.
// scripts/start-mcp-asset.ts (snippet)
async function checkHealth(): Promise<boolean> {
const startTime = Date.now();
while (Date.now() - startTime < HEALTH_CHECK_TIMEOUT) {
try {
// The /sse path is the Server-Sent Events endpoint
const response = await fetch(HEALTH_CHECK_URL, { method: "GET" });
if (response.ok) {
return true; // Success!
}
} catch {
// Ignore errors, server is still starting
}
await new Promise((resolve) => setTimeout(resolve, HEALTH_CHECK_INTERVAL));
}
return false; // Timeout
}
This loop prevents race conditions and ensures a smooth developer experience.
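The constants the loop relies on aren’t shown in the snippet; values along these lines would work, though the exact numbers in the repository may differ:
// scripts/start-mcp-asset.ts (illustrative constants — the repository's values may differ)
const MCP_PORT = Number(process.env.MCP_PORT ?? 8787);
const HEALTH_CHECK_URL = `http://localhost:${MCP_PORT}/sse`; // the SSE endpoint polled above
const HEALTH_CHECK_TIMEOUT = 30_000; // give up after 30 seconds
const HEALTH_CHECK_INTERVAL = 1_000; // poll once per second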
For a deeper dive into how the protocol works, check out the official Cloudinary MCP Documentation.
With our engine running, it’s time to build the cockpit.
With our MCP server running, we need an interface for our conversation. The “cockpit” of our application is a clean chat UI built with Next.js, responsible for displaying messages and capturing user input.
This component orchestrates the UI, primarily by managing and displaying the list of messages. Its main role is to map over the message state and render the appropriate components for each one.
// src/components/chat/chat-container.tsx (Conceptual Snippet)
// The container's core job is to render the list of messages.
<ScrollArea>
{optimisticMessages.map((m) => (
<MessageBubble key={m.id} role={m.role}>
{/* Renders text, AssetLists, etc. inside */}
</MessageBubble>
))}
</ScrollArea>;
See the full component with state management on GitHub.
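The optimisticMessages list in that snippet hints at React’s useOptimistic hook, which lets the UI echo the user’s message before the server confirms it. Here is a minimal sketch, assuming the confirmed messages array comes from the Server Action covered in the next sections; the repository’s exact implementation may differ.
// src/components/chat/chat-container.tsx (hypothetical sketch)
import { useOptimistic } from "react";
import type { ChatMessage } from "@/types";

// Inside the ChatContainer component body, where `messages` is the confirmed
// list returned by the Server Action:
const [optimisticMessages, addOptimisticMessage] = useOptimistic(
  messages,
  (current: ChatMessage[], userMessage: ChatMessage) => [...current, userMessage]
);
// Calling addOptimisticMessage(...) when the user hits send shows their message
// immediately; the list reconciles once the action returns the real state.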
When the assistant returns media assets, this component renders them in a rich list. Its most important job is displaying a thumbnail and key information for each asset.
// src/components/chat/asset-list.tsx (Snippet)
// A simplified view of how an asset is displayed
<div className="flex items-start gap-3">
<Image src={item.thumbUrl} alt="..." width={56} height={56} />
<div className="font-medium" title={item.id}>
{item.id}
</div>
</div>;
See the full component on GitHub.
This form handles both text and file inputs. The key is a visually hidden input type="file" that is triggered by a button click, providing a clean UI for two types of actions.
// src/components/chat/chat-input.tsx (Snippet)
// The mechanism for dual text/file input
<input ref={fileInputRef} type="file" className="hidden" />
<Button onClick={() => fileInputRef.current?.click()}>
<Paperclip />
</Button>
<Input placeholder="Type a message or upload..." />
See the full component on GitHub.
With the user interface in place, we now need to wire it up to our backend logic.
Now that we have a UI and a server, we need to connect them. Instead of building traditional API routes, we’ll use a modern Next.js feature: Server Actions. These are special functions that run on the server but can be called directly from our client components, making form submissions and data mutations simple and secure.
In our ChatContainer.tsx component, we use the useActionState hook from React. This hook is designed to work seamlessly with Server Actions.
// src/components/chat/chat-container.tsx (Snippet)
"use client";
import { useActionState } from "react";
import { sendMessageAction } from "@/app/(chat)/actions";
// ...
export function ChatContainer() {
const [messages, formAction, isPending] = useActionState(
sendMessageAction, // Our Server Action
[] // The initial state (an empty message list)
);
// ...
}
This hook call gives us three things:
- messages. The updated list of chat messages returned by the action.
- formAction. A function that triggers the action, which we pass to ChatInput.
- isPending. A boolean loading state, used to show a “typing” bubble while the server processes the request.
See the full implementation on GitHub.
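Wiring those three values into the UI might look roughly like the sketch below; the exact props on ChatInput are assumptions, not the repository’s definitive API.
// src/components/chat/chat-container.tsx (hypothetical wiring sketch)
return (
  <>
    <ScrollArea>
      {optimisticMessages.map((m) => (
        <MessageBubble key={m.id} role={m.role}>
          {m.text}
        </MessageBubble>
      ))}
      {/* While the Server Action runs, show a lightweight "typing" bubble */}
      {isPending && <MessageBubble role="assistant">…</MessageBubble>}
    </ScrollArea>
    {/* ChatInput submits its FormData through the action returned by the hook */}
    <ChatInput formAction={formAction} disabled={isPending} />
  </>
);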
The sendMessageAction function lives in src/app/(chat)/actions.ts. The file starts with a 'use server'; directive, which enables the feature. The function accepts form data and the previous state from the hook.
// src/app/(chat)/actions.ts (Snippet)
'use server';
import type { ChatMessage } from '@/types';
export async function sendMessageAction(
previousState: ChatMessage[] | null,
formData: FormData
): Promise<ChatMessage[]> {
// 1. Get user input from formData.
const text = formData.get('text') as string;
const file = formData.get('file');
// 2. Connect to the MCP server.
// 3. Call the correct tool based on the input.
// 4. Return the new message list.
// ... The full implementation follows ...
}
This function is the central hub of the application’s logic. It reads the FormData object from the client, determines whether the user uploaded a file or typed a command, and then calls the MCP server.
See the full action file on GitHub.
Now that the communication channel is open, we can implement the logic for handling specific user commands.
Our Server Action is the central command hub. Now we’ll implement the logic that interprets user commands and translates them into actions for our MCP server to execute. This involves a three-step process: parse intent, call an operation, and execute the MCP tool.
Inside sendMessageAction, we use regular expressions to understand the user’s text. This is a fast and effective way to handle command-line-style instructions.
// src/app/(chat)/actions.ts (Snippet)
// A few examples of the intent-matching regex
const wantList = /^(list|show)\s+images/i.test(text);
const renameMatch = text.match(/^rename\s+(.+?)\s+to\s+(.+)$/i);
const deleteMatch = text.match(/^delete\s+(.+)$/i);
const tagMatch = text.match(/^tag\s+(.+?)\s+with\s+(.+)$/i);
Based on which expression matches, we enter a specific block of logic to handle that command.
See the full list of matchers in the action file on GitHub.
Here’s the handler for the list images command. It calls a dedicated operation function (listImages) and then uses the result to build the assistant’s response.
// src/app/(chat)/actions.ts (Snippet)
// This block runs if the `wantList` regex matches
if (wantList) {
// 2a. Call the dedicated operation function
const assets = await listImages(client);
// 2b. Build the assistant's reply object
assistantMsg = {
id: crypto.randomUUID(),
role: "assistant",
text: assets.length ? "Here are your latest images:" : "No images found.",
assets: assets || undefined, // This data is sent to the AssetList component
};
}
If assets are returned, they’re attached to the assets property. The frontend AssetList component is designed to automatically render this data.
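For reference, the shapes flowing through that property could look roughly like this; only the fields used in the snippets are shown, and the repository’s real types may carry more.
// src/types.ts (hypothetical shape — check the repository for the real definitions)
export type AssetItem = {
  id: string;       // Cloudinary public ID, shown next to the thumbnail in AssetList
  thumbUrl: string; // thumbnail URL rendered by the Image component
};

export type ChatMessage = {
  id: string;
  role: "user" | "assistant";
  text: string;
  assets?: AssetItem[]; // when present, AssetList renders these under the bubble
};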
The final step happens inside the operation functions in src/lib/mcp-ops.ts. These functions handle direct communication with the MCP server. The listImages function looks like this:
// src/lib/mcp-ops.ts (Snippet)
export async function listImages(client: MCPClient): Promise<AssetItem[]> {
// 3a. Call the specific tool by name
const res = await client.callTool({ name: "list-images", arguments: {} });
if (res.isError) {
throw new Error("list-images failed");
}
// 3b. Parse the raw JSON into our standardized AssetItem type
return toAssetsFromContent(res?.content || []) || [];
}
Here, client.callTool({ name: 'list-images' }) sends the command to the MCP Gateway, which executes the corresponding Cloudinary function. The raw JSON response is then parsed by toAssetsFromContent into a clean format for the UI.
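The parsing helper isn’t shown above, so here is a minimal sketch of what toAssetsFromContent might do, assuming the tool returns its results as a JSON text block; the exact response shape from the Cloudinary tools may differ.
// src/lib/mcp-ops.ts (hypothetical sketch of the parsing helper)
// MCP tools return an array of content blocks; the asset data arrives as JSON text.
function toAssetsFromContent(
  content: Array<{ type: string; text?: string }>
): AssetItem[] {
  const block = content.find((c) => c.type === "text" && c.text);
  if (!block?.text) return [];
  try {
    const parsed = JSON.parse(block.text);
    const resources = Array.isArray(parsed) ? parsed : (parsed.resources ?? []);
    // public_id / secure_url are Cloudinary's standard field names (assumed here)
    return resources.map((r: { public_id: string; secure_url: string }) => ({
      id: r.public_id,
      thumbUrl: r.secure_url, // a real implementation might request a small thumbnail variant
    }));
  } catch {
    return []; // not JSON — nothing to render as assets
  }
}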
Other commands like rename, tag, and delete follow the same pattern, keeping intent parsing and tool execution separate.
See all operation functions on GitHub.
Handling file uploads in a chat interface requires a smooth flow from the browser to the cloud. Our assistant doesn’t just upload a file; it processes it via the MCP server and immediately responds with the resulting asset, creating an interactive experience.
Let’s follow the journey of a file from the user’s click to the final response.
It starts in the ChatInput.tsx component. When the user selects a file, the input’s onChange event fires, which immediately calls the onSend function from the parent container.
// src/components/chat/chat-input.tsx (Snippet)
function handleFileChange(event: React.ChangeEvent<HTMLInputElement>) {
const file = event.target.files?.[0];
if (file) {
onSend({ file }); // Kicks off the entire upload process
}
}
This action packages the File object and sends it straight to our Server Action.
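On the container side, packaging that handoff could look like the sketch below; the onSend shape and helper are assumptions, but the FormData keys mirror what the action reads.
// src/components/chat/chat-container.tsx (hypothetical onSend handler)
import { startTransition } from "react";

function onSend({ text, file }: { text?: string; file?: File }) {
  const formData = new FormData();
  if (text) formData.append("text", text); // matches formData.get('text') in the action
  if (file) formData.append("file", file); // matches formData.get('file') in the action

  // Dispatch the Server Action returned by useActionState inside a transition
  startTransition(() => formAction(formData));
}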
See the full component on GitHub.
Our sendMessageAction checks for a file before looking for text commands. This gives uploads priority.
// src/app/(chat)/actions.ts (Snippet)
// This block is at the top of our action's logic
if (file instanceof File) {
// Call our dedicated upload operation
const uploaded = await uploadFileToFolder(client, file, "chat_uploads");
// Build a response message containing the new asset
assistantMsg = {
id: crypto.randomUUID(),
role: "assistant",
text: "Image uploaded successfully.",
assets: uploaded ? [uploaded] : undefined, // Send asset data back to the UI
};
return [...currentState, userMessage, assistantMsg];
}
After calling uploadFileToFolder, the function returns an assistantMsg that includes the uploaded asset. This lets the UI show the result instantly.
The last step is in src/lib/mcp-ops.ts. The uploadFileToFolder function prepares the file and calls the correct MCP tool. Files can’t be sent directly as JSON, so we first encode them as base64 data URIs.
// src/lib/mcp-ops.ts (Snippet)
export async function uploadFileToFolder(
client: MCPClient,
file: File,
folder: string
): Promise<AssetItem | null> {
// 1. Convert the file into a text-based data URI
const buffer = Buffer.from(await file.arrayBuffer());
const dataUri = `data:${file.type};base64,${buffer.toString("base64")}`;
// 2. Call the 'upload-asset' tool with the data URI
const res = await client.callTool({
name: "upload-asset",
arguments: {
uploadRequest: {
file: dataUri,
fileName: file.name,
folder, // e.g. "chat_uploads", passed in from the Server Action
},
},
});
// 3. Parse the JSON response from Cloudinary
return parseUploadResult(res?.content) || null;
}
client.callTool({ name: 'upload-asset' }) instructs the MCP gateway to upload the file. The gateway handles Cloudinary communication and returns the new asset’s details, which we parse and send back to the user.
See the full upload operation on GitHub.
Our regex-based command handler is fast and effective for specific commands, but it’s rigid. If a user types “show me my pictures” instead of “list images,” our current logic fails. To make the assistant truly smart, we can integrate an OpenAI model to understand natural language and decide which MCP tool to use.
This transforms the application from a command-line interface into a conversational assistant.
The difference lies in intent detection:
- Regex looks for an exact pattern.
- AI understands meaning. For example, “can you get rid of the picture named ‘test’?” can be mapped to the delete-asset tool, something regex cannot reliably do.
The logic is in src/lib/ai-router.ts. The idea is to present the entire MCP server to OpenAI as a single tool that the model can use.
The OpenAI SDK supports type: 'mcp' for this purpose.
// src/lib/ai-router.ts (Snippet)
// Describe our MCP server to the OpenAI client
const mcpTool = {
type: 'mcp',
server: { type: 'sse', url: 'http://localhost:8787/sse' },
} as ResponsesTool;
// Make the API call
const resp = await openai.responses.create({
model: 'gpt-4o', // Or your preferred model
input: [
{ role: 'system', content: 'You are a Cloudinary asset assistant. Prefer calling MCP tools...' },
{ role: 'user', content: userText },
],
tools: [mcpTool],
tool_choice: 'auto',
});
What happens here:
- Define mcpTool, pointing the SDK to the MCP gateway.
- Call openai.responses.create with a system prompt, the user’s message, and the MCP tool definition.
- The model analyzes the text and, if needed, calls the correct MCP tool.
- The API response contains both a text reply and any data from the tool call.
See the full AI router implementation on GitHub: src/lib/ai-router.ts
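Put together, the router that the action will call could be shaped roughly like this. output_text is the SDK’s convenience accessor for the model’s text reply, but the output-item handling (the "mcp_call" type and its output field) is an assumption about the Responses API’s shape, so verify it against the SDK version you’re using.
// src/lib/ai-router.ts (hypothetical sketch — verify field names against the OpenAI SDK)
import OpenAI from "openai";
import type { AssetItem } from "@/types";

const openai = new OpenAI();

export async function askOpenAIWithMCP(
  userText: string
): Promise<{ text: string; assets?: AssetItem[] }> {
  const resp = await openai.responses.create({
    model: "gpt-4o",
    input: [
      { role: "system", content: "You are a Cloudinary asset assistant. Prefer calling MCP tools..." },
      { role: "user", content: userText },
    ],
    tools: [mcpTool], // the MCP tool definition shown above
    tool_choice: "auto",
  });

  const text = resp.output_text || "Done.";

  // If the model invoked an MCP tool, try to surface its result as assets for the UI.
  const toolCall = (resp.output as any[] | undefined)?.find((i) => i.type === "mcp_call");
  let assets: AssetItem[] | undefined;
  if (toolCall?.output) {
    try {
      const parsed = JSON.parse(toolCall.output);
      const resources = Array.isArray(parsed) ? parsed : (parsed.resources ?? []);
      assets = resources.map((r: any) => ({ id: r.public_id, thumbUrl: r.secure_url }));
    } catch {
      // The tool output wasn't JSON — return the text reply only.
    }
  }

  return { text, assets };
}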
To switch from regex to AI, modify the else block in src/app/(chat)/actions.ts. Instead of returning a help message, let the AI handle text intent.
// src/app/(chat)/actions.ts (Conceptual Upgrade)
import { askOpenAIWithMCP } from "@/lib/ai-router";
// ... inside sendMessageAction ...
if (file instanceof File) {
// Keep direct handling for file uploads
// ...
} else {
// Let the AI router handle ALL text-based intents
const { text: replyText, assets } = await askOpenAIWithMCP(text);
assistantMsg = {
id: crypto.randomUUID(),
role: "assistant",
text: replyText,
assets: assets,
};
}
With this change, the assistant gains far greater flexibility and intelligence.
Functionality is key, but the user experience is what makes an application feel great. Regex-based commands return accurate but robotic responses like “Deleted my-image-id.” We can improve this by adding a second, lightweight AI call whose job is to make the assistant sound more human.
This is a powerful pattern: use one system (regex or a large AI model) to decide what to do, and a second, faster AI model to decide how to say it.
In actions.ts, the response text is constructed programmatically.
// A typical response from our regex-based system
assistantMsg = {
// ...
text: `Created folder “${folderPath}”.`,
};
This is clear, but not conversational.
The src/lib/ai-guide.ts file introduces a helper function, generateFriendlyReply. This function takes the hardcoded default text and uses a fast AI model (like gpt-4o-mini) to rewrite it in a warmer, more helpful way.
The improvement comes from carefully designed prompts.
System prompt sets the tone:
// src/lib/ai-guide.ts (System Prompt Snippet)
function buildSystemPrompt() {
return [
"You are a warm, concise product guide for a Cloudinary MCP chat.",
"Tone: friendly, encouraging, not robotic.",
"Keep answers short (1–3 sentences).",
"Never invent features.",
// ...
].join(" ");
}
User prompt gives context:
// src/lib/ai-guide.ts (User Prompt Snippet)
function buildUserPrompt(input: GuideInput) {
return [
`User said: "${input.userText}"`,
`Assistant’s default reply (must keep meaning): "${input.defaultText}"`,
"Rewrite the default reply to be more conversational and helpful...",
].join("\n");
}
This ensures the AI rephrases the default reply without losing accuracy.
See the full prompt design on GitHub: src/lib/ai-guide.ts
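The helper itself can stay small. Here’s a sketch of what generateFriendlyReply might look like; the model choice and the GuideInput fields are assumptions based on the call site shown below, and the function falls back to the accurate default text if anything goes wrong.
// src/lib/ai-guide.ts (hypothetical sketch — the repository's implementation may differ)
import OpenAI from "openai";

const openai = new OpenAI();

type GuideInput = {
  userText: string;    // what the user typed
  defaultText: string; // the accurate, programmatic reply
  intent: string;      // e.g. "create-folder", for extra context
};

export async function generateFriendlyReply(input: GuideInput): Promise<string> {
  try {
    const resp = await openai.responses.create({
      model: "gpt-4o-mini", // small, fast model — rewording doesn't need heavy reasoning
      input: [
        { role: "system", content: buildSystemPrompt() },
        { role: "user", content: buildUserPrompt(input) },
      ],
    });
    return resp.output_text?.trim() || input.defaultText;
  } catch {
    // Never let a cosmetic rewrite break the actual operation.
    return input.defaultText;
  }
}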
To activate this feature, wrap static text assignments in actions.ts with a call to the helper.
Before: Hardcoded response.
// src/app/(chat)/actions.ts (Original)
if (createFolderMatch) {
const folderPath = createFolderMatch[2].trim();
const ok = await createFolderOp(client, folderPath);
assistantMsg = {
// ...
text: ok
? `Created folder “${folderPath}”.`
: `I couldn’t create “${folderPath}”.`,
};
}
After: Conversational response.
// src/app/(chat)/actions.ts (With AI Guide)
import { generateFriendlyReply } from '@/lib/ai-guide';
if (createFolderMatch) {
const folderPath = createFolderMatch[2].trim();
const ok = await createFolderOp(client, folderPath);
const defaultText = ok ? `Created folder “${folderPath}”.` : `I couldn’t create “${folderPath}”.`;
const friendlyText = await generateFriendlyReply({
userText: text,
defaultText: defaultText,
intent: 'create-folder',
});
assistantMsg = { /* id, role as before */ text: friendlyText };
}
With this change, a message like “Created folder “marketing”.” becomes “All set! The “marketing” folder has been created for you.” This small shift makes the assistant feel much more human.
You’ve successfully built a sophisticated AI-powered media assistant. By combining a reactive Next.js frontend with the powerful tooling of a Cloudinary MCP Server, you’ve created a conversational interface that can intelligently manage digital assets. This architecture is more than just a proof-of-concept; it’s a blueprint for the future of media management tools.
You now have a robust system that can be extended in many exciting ways.
Here are a few ideas to take this project to the next level:
- Expand the Toolset with More MCPs. The Cloudinary Asset Management MCP is just the beginning. You could integrate other MCPs to unlock new capabilities, such as performing complex transformations or analyzing video content, all from the same chat interface.
- Integrate AI-powered visual search. Allow users to upload an image and ask, “Find more assets like this.” This would involve creating or using an MCP tool that leverages Cloudinary’s advanced AI features for visual similarity search.
- Resource: Cloudinary AI Content Analysis Features
- Add user accounts and a database. To turn this into a true multi-tenant service, integrate an authentication provider like NextAuth.js or Clerk. You could then store chat history, user-specific configurations, or API keys in a database like Vercel Postgres or Supabase.
- Create custom tools. The Model Context Protocol is an open standard. You can create your own MCP server for any tool you can imagine, from triggering a GitHub Actions workflow to posting a message in Slack, and add it to your AI assistant’s list of capabilities.
- Resource: Model Context Protocol (MCP) on GitHub
View the final code on GitHub.