
Step-by-Step Guide: Build an AI-Powered Image App With Cloudinary, Next.js, and Anthropic

Why It Matters

  • This post offers a clear blueprint to build AI-powered applications using a modern tech stack.
  • Turn text prompts into production-ready images in seconds with AI agents and complex Cloudinary transformations.
  • Generate various image formats, sizes, and compositions on the fly from a single base image.

The modern marketing asset is dynamic. The same image needs to be a square 1080×1080 Instagram tile, a transparent PNG for a slide deck, and a TikTok thumbnail. Opening a design tool every time an asset needs repurposing, across hundreds of use cases, is no longer realistic.

AI agents solve that problem by turning a chat prompt into a finished image:

  1. You describe the change in plain language.
  2. The agent chooses the correct Cloudinary transformation chain.
  3. A ready-to-use URL comes back in seconds and the entire exchange is stored for later reuse.

That’s exactly what Cloudi-Agent provides. With Anthropic Claude handling language, Cloudinary delivering real-time media transformations, and Convex storing per-user threads, a simple request like “Remove the background, make it square, then return a WebP” produces an optimized Cloudinary URL almost instantly without design tools or command-line scripts. Just chat.

The rest of this post shows the key pieces that power the agent, points to the full repository for deeper exploration, and explains how to fork the project to build your own image-savvy assistant.

Explore the live Cloudi-Agent in action or dive into the codebase to build your own.

Live Demo Video

Upload an image, describe what you want changed, and the AI agent will handle the whole process from optimization to visual transformation.

This app combines natural language chat with real-time image transformation, powered by Claude’s tool-use capabilities and built with a modern web stack. Before jumping into code, here’s how the system fits together and what you’ll need to follow along.

  • Claude (Anthropic). Processes natural language prompts and decides which transformation tool to invoke, using structured tool definitions and thread context.

  • Cloudinary (Tools). Each transformation (e.g., resize, background removal, or generative fill) is implemented as a callable function. Tool calls dynamically generate transformation URLs.

  • Convex. Handles backend logic and thread state. Stores image metadata and tool results using a serverless architecture and real-time data sync.

  • Next.js Frontend. Delivers a responsive chat interface with support for the View Transitions API. Includes message bubbles, thumbnails, tool results, and prompt buttons.

Make sure the following services and tools are set up:

  • Node.js 18+ (recommended: Node 20)

Use nvm to manage versions:

nvm install 20
nvm use 20

  • Cloudinary Account. For storing, delivering, and transforming uploaded images. cloudinary.com

  • Anthropic API Key. Needed to access Claude’s tool-use API. anthropic.com

  • Convex Account. Provides the backend infrastructure and real-time thread storage. convex.dev

Let’s get the project running locally. This section covers bootstrapping the app and installing the required dependencies.

Start by creating a fresh Next.js 15 app with the App Router and TypeScript enabled:

npx create-next-app@latest ai-agent --app --typescript --tailwind

cd ai-agent

This sets up the base project with:

  • Next.js 15 (App Router)
  • Tailwind CSS
  • TypeScript

Next, install all necessary libraries:

npm install @cloudinary/url-gen convex @anthropic-ai/sdk @ai-sdk/anthropic motion lucide-react

Then install UI components using shadcn/ui:

npx shadcn-ui@latest init

When prompted:

  • Choose Tailwind CSS for styling.
  • Select App Router.
  • Accept the default paths.

Add the components you’ll be using:

npx shadcn-ui@latest add button textarea card dialog

Before diving into code, let’s break down the key concepts that make this system work:

At the heart of this application is an AI-driven agent powered by Anthropic’s Claude, capable of tool-calling, which means it can decide when to invoke predefined functions based on the user’s prompt.

Each tool maps to a Cloudinary transformation, and these are invoked with structured input passed automatically by Claude.

All of this is coordinated and persisted through Convex threads, allowing stateful interactions, reusable context, and clean serverless function orchestration.

These three systems work together like this:

  1. User Prompt. “Remove the background and convert to WebP.”
  2. Claude interprets the request. Decides which tool(s) to call.
  3. Tool(s) execute Cloudinary URL logic. Generates new transformed image URL.
  4. Convex thread stores the result. For later reuse or contextual reference.
  5. Frontend displays it. With an image bubble and explanation.

Let’s now walk through each of these concepts step by step.

Transform images on the fly using Cloudinary’s URL parameters; no image editing software needed.

Each Claude tool outputs a specific Cloudinary URL based on structured input like:

{
  "width": 800,
  "height": 600,
  "format": "webp"
}

Which becomes:

https://res.cloudinary.com/your-cloud/image/upload/c_fill,w_800,h_600,f_webp/your-image-id

This resizes, crops, and converts the image automatically.

Each tool maps inputs to a URL like this:

export function makeResizeUrl(id, { width, height, format }) {
  return `https://res.cloudinary.com/.../c_fill,w_${width},h_${height},f_${format}/${id}`;
}

No backend rendering, just smart URL generation.
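
Since @cloudinary/url-gen is already installed, the same URL can be built with the SDK instead of template strings. A minimal sketch, with "your-cloud" and "your-image-id" as placeholders:

import { Cloudinary } from "@cloudinary/url-gen";
import { fill } from "@cloudinary/url-gen/actions/resize";
import { format } from "@cloudinary/url-gen/actions/delivery";

const cld = new Cloudinary({ cloud: { cloudName: "your-cloud" } });

// Equivalent to c_fill,w_800,h_600,f_webp in the raw URL.
const url = cld
  .image("your-image-id")
  .resize(fill().width(800).height(600))
  .delivery(format("webp"))
  .toURL();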

Next, we’ll look at how Claude decides which tool to call in the first place.

Claude doesn’t just respond with plain text; it can call tools based on your prompt. This is powered by Anthropic’s Tool-Use API, which lets you define functions Claude can invoke with structured input.

If you say, “Make this image 800×600 in webp format”, Claude might respond with:

{
  "tool_calls": [
    {
      "name": "resize",
      "input": {
        "width": 800,
        "height": 600,
        "format": "webp"
      }
    }
  ]
}

Your backend sees this, runs the resize tool, and returns the result (a Cloudinary URL) back into the chat.

  • Claude chooses the right tool
  • You keep control of how the tool works
  • Each result is structured, traceable, and reusable

You define tools like this in the backend:

const tools = [
  {
    name: 'resize',
    description: 'Resize an image to given dimensions and format.',
    parameters: z.object({
      width: z.number(),
      height: z.number(),
      format: z.enum(['webp', 'png', 'jpg']),
    }),
  },
  ...
];
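
For reference, here’s a hedged sketch of the same round trip against the raw @anthropic-ai/sdk, which takes JSON Schema rather than zod (the prompt and tool definition are illustrative; the app routes this through the Convex agent instead):

import Anthropic from "@anthropic-ai/sdk";

const anthropic = new Anthropic({ apiKey: process.env.ANTHROPIC_API_KEY });

const response = await anthropic.messages.create({
  model: "claude-3-opus-20240229",
  max_tokens: 1024,
  tools: [
    {
      name: "resize",
      description: "Resize an image to given dimensions and format.",
      input_schema: {
        type: "object",
        properties: {
          width: { type: "number" },
          height: { type: "number" },
          format: { type: "string", enum: ["webp", "png", "jpg"] },
        },
        required: ["width", "height", "format"],
      },
    },
  ],
  messages: [{ role: "user", content: "Make this image 800x600 in webp format" }],
});

// Tool calls come back as tool_use blocks in the response content.
const toolUse = response.content.find((block) => block.type === "tool_use");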

To handle ongoing conversations with Claude and allow follow-ups like “Actually, make it square,” we’ll use Convex agent threads.

Each thread tracks:

  • The original image
  • The user prompt history
  • Any tool results (like transformed URLs or tags)
  • The thread ID, which ties each message back to context

Convex is a natural fit for this:

  • It’s serverless, so there’s no database to manage.
  • Real-time updates are ideal for chat UIs.
  • It has built-in support for functions, storage, and scheduling.

When a user sends a message, the backend:

  1. Passes the prompt (plus image context) to Claude.
  2. Saves any tool calls or responses to a thread.
  3. Streams the reply back to the UI.

const data = await createThread({
  prompt: contextPrompt,
  threadId: existingThreadId,
});

And when Claude returns tool calls:

if (toolCall) {
  const url = makeCloudinaryUrl(toolCall.input);
  saveToThread({ threadId, result: url });
}

Threads make it easy to follow up, rerun, or explain past results, which are essential for conversational UX.
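
If you were modeling threads by hand rather than letting the agent manage them, a Convex schema might look like the sketch below. Table and field names here are illustrative assumptions, not the repo’s actual schema:

// convex/schema.ts (illustrative)
import { defineSchema, defineTable } from "convex/server";
import { v } from "convex/values";

export default defineSchema({
  threads: defineTable({
    title: v.optional(v.string()),
  }),
  messages: defineTable({
    threadId: v.id("threads"),
    role: v.string(), // "user" | "assistant" | "tool"
    content: v.string(),
    toolResult: v.optional(v.string()), // e.g., a transformed Cloudinary URL
  }).index("by_thread", ["threadId"]),
});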

The intelligence behind the chat experience is powered by a Convex Agent, leveraging Anthropic’s Tool-Use API and real-time Cloudinary transformations. This setup allows Claude to interpret prompts, call registered tools, and return useful results.

Each image transformation is implemented as a tool Claude can call with structured inputs. Tools are defined using zod schemas and registered inside the agent logic.

Example: A Cloudinary makeCloudinaryUrl tool:

const makeCloudinaryUrl: Tool = {
  name: "makeCloudinaryUrl",
  description: "Build a Cloudinary image URL with optional transformations.",
  parameters: z.object({
    publicId: z.string(),
    transformation: z.string(),
    width: z.number().optional(),
    height: z.number().optional(),
  }),
  execute: async ({ publicId, transformation }) => {
    const url = `https://res.cloudinary.com/.../upload/${transformation}/${publicId}`;
    return url;
  },
};


Other tools are defined similarly to handle:

  • Resizing
  • Background removal
  • Generative fill
  • Format optimization
  • Recoloring
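
For example, a background-removal tool could follow the same shape. The sketch below is an assumption, not the repo’s actual definition; it maps to Cloudinary’s e_background_removal transformation (which requires the AI Background Removal add-on):

// Hedged sketch: a background-removal tool in the same style as above.
const removeBackground: Tool = {
  name: "removeBackground",
  description: "Remove the image background, returning a transparent PNG.",
  parameters: z.object({
    publicId: z.string(),
  }),
  execute: async ({ publicId }) => {
    // e_background_removal strips the background; f_png keeps transparency.
    return `https://res.cloudinary.com/.../upload/e_background_removal,f_png/${publicId}`;
  },
};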

View the full definition here.

The actual execution happens inside a Convex internal action, which sends the prompt to Claude and lets it decide whether to respond directly or call a tool.

Here’s a simplified version of the agent handler:

export const createAgentAssistantThread = internalAction({
  args: { prompt: v.string(), threadId: v.optional(v.string()) },
  handler: async (ctx, args) => {
    const tools = [makeCloudinaryUrl, ...];

    const agent = createAgent({
      tools,
      model: 'claude-3-opus-20240229',
      apiKey: process.env.ANTHROPIC_API_KEY,
    });

    return await agent.respond({ ...args });
  },
});

This is triggered on every user message to analyze the prompt and invoke tools if needed.

View the Convex agent code here.

Your backend needs several sensitive environment variables:

# Anthropic
ANTHROPIC_API_KEY=sk-ant-...

# Cloudinary
NEXT_PUBLIC_CLOUDINARY_CLOUD_NAME=your-cloud-name
CLOUDINARY_API_KEY=your-api-key
CLOUDINARY_API_SECRET=your-api-secret

  • ANTHROPIC_API_KEY is used server-side only to call Claude.
  • CLOUDINARY_API_SECRET should never be exposed to the client.
  • NEXT_PUBLIC_CLOUDINARY_CLOUD_NAME can be used safely on the frontend for transformations.

Keep all secrets in .env.local and load them securely into Convex and API routes.
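
Note that Convex functions read environment variables from your Convex deployment rather than from .env.local, so set the server-side secrets there with the Convex CLI:

npx convex env set ANTHROPIC_API_KEY sk-ant-...
npx convex env set CLOUDINARY_API_SECRET your-api-secret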

To let Claude interact with Cloudinary and return intelligent results, we’ll expose a small set of server-side API routes. These handle:

  • Running transformations.
  • Returning structured responses.
  • Enriching results with metadata (e.g., tags).

This route receives tool calls from Claude (via the Convex agent) and executes them by generating Cloudinary transformation URLs.

When Claude chooses to invoke a tool like makeCloudinaryUrl, it sends the parameters to this route.

In this route, we’ll:

  • Parse the incoming tool call.
  • Call the appropriate transformation.
  • Return a structured result with the new image URL and explanation.

Example (simplified):

export async function POST(req: Request) {
  const body = await req.json();

  const tool = body.tool_call;

  if (tool.name === "makeCloudinaryUrl") {
    const { publicId, transformation } = tool.arguments;

    const url = `https://res.cloudinary.com/.../upload/${transformation}/${publicId}`;

    return Response.json({
      type: "cloudinaryUrl",
      url,
    });
  }

  return Response.json({ error: "Unknown tool" }, { status: 400 });
}

View full route → src/app/api/analyze/route.ts

We’ll use this endpoint to add auto-generated tags to uploaded images using AWS Rekognition. These tags are merged with any tool-generated metadata and shown in the chat bubble for clarity.

In this route, we’ll:

  • Accept a public Cloudinary image URL.
  • Call AWS Rekognition for tag detection.
  • Return an array of descriptive labels.

Example:

export async function POST(req: Request) {
  const { imageUrl } = await req.json();

  const tags = await detectLabelsUsingRekognition(imageUrl); // AWS SDK

  return Response.json({ tags });
}
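
The detectLabelsUsingRekognition helper isn’t shown above. A minimal sketch with the AWS SDK v3 might look like this; note that Rekognition accepts raw bytes or S3 objects rather than arbitrary URLs, so the image is fetched first:

import { RekognitionClient, DetectLabelsCommand } from "@aws-sdk/client-rekognition";

const rekognition = new RekognitionClient({ region: process.env.AWS_REGION });

async function detectLabelsUsingRekognition(imageUrl) {
  // Rekognition needs the image bytes, so download from Cloudinary first.
  const res = await fetch(imageUrl);
  const bytes = new Uint8Array(await res.arrayBuffer());

  const { Labels } = await rekognition.send(
    new DetectLabelsCommand({
      Image: { Bytes: bytes },
      MaxLabels: 10,
      MinConfidence: 75,
    })
  );

  return (Labels ?? []).map((label) => label.Name).filter(Boolean);
}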

This enhances Claude’s response with useful visual descriptors like:

  • “Person”
  • “Laptop”
  • “Urban”
  • “Food”
  • And more!

View the full route.

The frontend of this project is a minimal, chat-like interface where users upload an image and describe the transformation they’d like. Claude handles the rest.

Here’s how the core experience is wired:

The chat input accepts image uploads (Cloudinary unsigned preset) and passes the image metadata into the conversation thread.

// src/components/chat-input.tsx

const handleUpload = (info) => {
  setMessages([
    ...messages,
    { role: "user", content: { type: "image", ...info } },
  ]);
  setLatestImage(info); // keep the latest image available as Claude context
};

<UploadButton onUpload={handleUpload} />

See the full logic here.
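
For reference, an unsigned upload against Cloudinary’s REST API needs only the file and a preset name. A minimal sketch; "ai-agent-unsigned" is a placeholder for whatever unsigned upload preset you configure in the Cloudinary console:

async function uploadUnsigned(file) {
  const form = new FormData();
  form.append("file", file);
  form.append("upload_preset", "ai-agent-unsigned"); // hypothetical preset name

  const res = await fetch(
    `https://api.cloudinary.com/v1_1/${process.env.NEXT_PUBLIC_CLOUDINARY_CLOUD_NAME}/image/upload`,
    { method: "POST", body: form }
  );

  const { public_id, secure_url } = await res.json();
  return { publicId: public_id, url: secure_url };
}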

Prompts are passed to a Convex action that sends them to the AI agent along with the latest image context.

// src/components/Chat.tsx

const contextPrompt = latestImage
  ? `${message}\n(use publicId ${latestImage.publicId})`
  : message;

const data = await createThread({ prompt: contextPrompt, threadId });

View the full agent setup here.

Messages are rendered conditionally based on type: text, image, or tool result.

// src/components/ChatMessages.tsx

{typeof msg.content === "string" && <p>{msg.content}</p>}

{msg.content.type === "image" && <ImageMessage data={msg.content} />}

{msg.content.toolResults && (
  <RenderToolResult result={msg.content.toolResults} />
)}

Includes support for:

  • Claude responses (text or structured).
  • Tags, analysis, rewritten versions.
  • View transitions for smooth UI flow.

The full rendering logic is here.

Prebuilt prompts simulate user input and invoke Claude with a single click.

// src/components/DemoPrompts.tsx

["Make it brighter", "Fix grammar", "Auto-tag this"].map((text) => (
  <button key={text} onClick={() => sendPrompt(text)}>
    {text}
  </button>
));

The UI logic is minimal. The heavy lifting happens behind the scenes through Claude’s structured tool calls. View transitions and clean design keep the experience fast and focused.

Explore the full chat flow.

View a complete exchange.

Let’s trace how a complete interaction flows through the system:

  1. The user uploads an image. → Sent directly to Cloudinary via an unsigned upload preset. → Metadata is saved locally and in Convex.

  2. The user enters a prompt. → Combined with the image’s publicId. → Sent to a Convex action calling the Claude agent.

  3. Claude analyzes the prompt. → Picks the right tool (e.g., resize, recolor, generativeFill). → Sends structured input like { width: 800, format: 'webp' }.

  4. Tool executes the transformation. → makeCloudinaryUrl() builds a dynamic transformation URL. → The final Cloudinary image is returned.

  5. Result is shown in the chat. → Rendered as an image bubble or analysis card. → Optionally includes tags, rewritten versions, or scores.

The tools themselves are stateless and composable; each message in the thread can independently produce new, visual output using Claude + Cloudinary.

Dive into the full repo.

Conversational interfaces are reshaping how we edit and deliver visual media. By combining Anthropic’s tool-aware language model, Cloudinary’s on-the-fly transformations, Convex’s real-time backend logic, and the modern Next.js stack, you can turn a simple chat prompt into production-ready imagery in seconds.

The architecture is deliberately small:

  • A handful of clearly defined tools.
  • One Convex agent to mediate between Claude and Cloudinary.
  • A minimal chat UI that renders whatever the agent returns.

From here you can extend the project with new tools, such as watermarking, face blurring, or video transcoding, without changing the frontend. Swap Claude for another model, add authentication, or plug the agent into an existing CMS. The pattern stays the same: prompt in, structured tool call out, transformed asset back.

Feel free to explore the repository, deploy your own instance, and adapt the code to your workflow. Conversation is now an API surface, so use it to build the next generation of your visual media tools. Sign up for a free Cloudinary account to get started.
