Cloudinary AI Vision: Bringing GenAI to Media Management

Introducing AI Vision, Cloudinary’s latest feature. It integrates generative AI directly into media management to classify, moderate, and describe visual content at scale. AI Vision streamlines media management, making it more precise and practical while improving your team’s productivity.

AI Vision is Cloudinary’s latest generative AI solution for key media management workflows.

AI Vision uses a generative multimodal LLM enhanced with specialized models, algorithms, and tailored prompts that address existing LLM blind spots. This allows it to interpret visual content and respond to queries about it. Essential media management tasks such as content classification, image moderation, and custom descriptions can therefore be automated with AI Vision.

Standard AI models often require complex and extensive training to meet different brands’ needs. AI Vision eliminates the intricacies of model training by offering a flexible, ready-to-use solution that integrates effortlessly with Cloudinary’s Digital Asset Management (DAM) platform. 

  • Custom taxonomy and image classification. AI Vision supports a team’s unique taxonomy without requiring you to train or fine-tune tagging models. Each tag can be paired with a custom, specific description so that images are categorized according to the team’s branding and organizational needs. This customization allows for accurate tagging based on detailed criteria such as visual product attributes, background color, subject orientation, or demographic details, whatever your organization’s taxonomy requires.
  • Content moderation and compliance. AI Vision streamlines brand safety and compliance checks with its automated moderation capabilities. It offers clear answers to compliance-related queries, allowing brands to quickly identify potentially sensitive content and uphold consistent standards across multiple platforms. Whether checking for public figures in images or ensuring that visuals do not contain violent or inappropriate content, AI Vision provides precise, automated moderation at scale.
  • General questions and tasks. AI Vision gives users detailed, contextually aware answers to questions about their images. By analyzing visual content, it identifies objects, scenes, and in-image text, making media assets more searchable and logically organized. For instance, users can ask AI Vision to describe the setting of an image or identify specific elements, such as the number of people or objects in a scene. AI Vision can also handle standard tasks such as suggesting image CTAs and writing captions. This capability allows for more efficient media management and retrieval.

AI Vision simplifies managing large media libraries by allowing you to create robust generative AI back-end processes. To get started, let’s look at a few examples of AI Vision’s features in action.

To use AI Vision, customers will need to subscribe to the Cloudinary AI Vision add-on and consume it via the Analyze API service. You can subscribe by logging into your Cloudinary account and navigating to your add-ons screen under settings. From there, you can activate the add-on and start using AI Vision.

For more information, see the documentation for Cloudinary’s AI Vision add-on and for the Analyze API service.

AI Vision works by submitting requests to the AI Vision API, which returns its response in JSON format. If you need to store the information AI Vision returns on the asset itself, you can use this response in your custom workflows, including Cloudinary MediaFlows. Let’s concentrate on the API responses for these use cases.
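As a rough illustration of that request and response cycle, here is a minimal Python sketch using only the standard library. The endpoint path follows the `POST /analysis/{cloud_name}/analyze/...` form used in this post’s examples; the base URL, the use of HTTP Basic authentication with an API key and secret, and the helper names `build_analyze_request` and `send` are assumptions for illustration, so check them against Cloudinary’s documentation before relying on them.

```python
import base64
import json
import urllib.request

# Assumed base URL for the Analyze API; confirm against Cloudinary's docs.
API_BASE = "https://api.cloudinary.com/v2"


def build_analyze_request(cloud_name, endpoint, body, api_key, api_secret):
    """Build (but do not send) a POST request for an Analyze API endpoint.

    endpoint is the last path segment, e.g. "ai_vision_tagging".
    """
    url = f"{API_BASE}/analysis/{cloud_name}/analyze/{endpoint}"
    # Assumption: HTTP Basic auth with the account's API key and secret.
    token = base64.b64encode(f"{api_key}:{api_secret}".encode()).decode()
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Basic {token}",
        },
        method="POST",
    )


def send(request):
    """Send the request and decode the JSON response body into a dict."""
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))
```

Keeping request construction separate from sending makes it easy to inspect or log the payload before anything goes over the network.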

Users can automate the tagging of images based on specific organizational needs. In this use case, users define their own tags, each with a custom definition. For example, they can create a taxonomy that identifies whether an image includes a human model and whether certain clothing accessories are present. They can then build an automated flow that organizes the images based on the response, which contains only the tags that match the image. Let’s see this in action below. In the request, we’ve defined four tags, and AI Vision has responded with only the two it detected, “model” and “dress”:

POST /analysis/{cloud_name}/analyze/ai_vision_tagging

{
  "source": {
    "uri": "https://res.cloudinary.com/siedner/image/upload/v1725388251/cldrop/pexels-sedat-yetis-248508609-19985033_e5mhxk.jpg"
  },
  "tag_definitions": [
    {"name": "model", "description": "Does the image contain a person?"},
    {"name": "back-facing", "description": "Does the image show someone who is back facing?"},
    {"name": "dress", "description": "Does the image show someone wearing a dress?"},
    {"name": "bag", "description": "Does the image show someone holding a handbag?"}
  ]
}

{
  "limits": {
    "usage": { "type": "ai_vision", "count": 1925 }
  },
  "request_id": "5dda92ecfc6925279689b1c840e13745",
  "data": {
    "entity": "https://res.cloudinary.com/siedner/image/upload/v1725388251/cldrop/pexels-sedat-yetis-248508609-19985033_e5mhxk.jpg",
    "analysis": {
      "tags": [
        { "name": "model" },
        { "name": "dress" }
      ],
      "model_version": 1
    }
  }
}
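To show how a workflow might consume this response, here is a small Python sketch that builds the tagging request body and pulls the matched tag names out of the response. The field names (`source`, `tag_definitions`, `data.analysis.tags`) come from the example above; the helper names `build_tagging_body`, `matched_tags`, and `unmatched_tags` are hypothetical and not part of any Cloudinary SDK.

```python
def build_tagging_body(image_uri, tag_definitions):
    """Build the JSON body for the ai_vision_tagging endpoint.

    tag_definitions maps each tag name to the question that defines it.
    """
    return {
        "source": {"uri": image_uri},
        "tag_definitions": [
            {"name": name, "description": description}
            for name, description in tag_definitions.items()
        ],
    }


def matched_tags(response):
    """Return the tag names AI Vision reported for the image."""
    return [tag["name"] for tag in response["data"]["analysis"]["tags"]]


def unmatched_tags(tag_definitions, response):
    """Return the defined tags that were not reported, in definition order."""
    found = set(matched_tags(response))
    return [name for name in tag_definitions if name not in found]


# Tag definitions taken from the example request.
TAG_DEFINITIONS = {
    "model": "Does the image contain a person?",
    "back-facing": "Does the image show someone who is back facing?",
    "dress": "Does the image show someone wearing a dress?",
    "bag": "Does the image show someone holding a handbag?",
}

# Example response, abbreviated to the fields this sketch reads.
SAMPLE_RESPONSE = {
    "data": {
        "analysis": {
            "tags": [{"name": "model"}, {"name": "dress"}],
            "model_version": 1,
        }
    }
}

if __name__ == "__main__":
    print(matched_tags(SAMPLE_RESPONSE))
    print(unmatched_tags(TAG_DEFINITIONS, SAMPLE_RESPONSE))
```

From here, an automated flow could route or label each asset based on the list that `matched_tags` returns.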

AI Vision can quickly address questions like, “Is there anything in the image that could be considered violent or disturbing?” It provides an automated response to help brands meet compliance standards efficiently, especially when dealing with content like UGC and third-party uploads. In this example, we’ve asked two simple questions regarding the image:

Does it clearly show any logos or other IP?

Does it contain any offensive or NSFW elements?

POST /analysis/{cloud_name}/analyze/ai_vision_moderation

{
  "source": {
    "uri": "https://res.cloudinary.com/siedner/image/upload/v1725646190/cldrop/pexels-mlkbnl-8633368_y8qplj.jpg"
  },
  "rejection_questions": [
    "Does it clearly show any logos or other IP?",
    "Does it contain any offensive or NSFW elements?"
  ]
}

{
  "limits": {
    "usage": { "type": "ai_vision", "count": 2542 }
  },
  "request_id": "1204425015234600630037349fca1ff6",
  "data": {
    "entity": "https://res.cloudinary.com/siedner/image/upload/v1725387515/cldrop/pexels-monurblc-27124723_rghu5p.jpg",
    "analysis": {
      "responses": [
        { "prompt": "Does it clearly show any logos or other IP?", "value": "no" },
        { "prompt": "Does it contain any offensive or NSFW elements?", "value": "no" }
      ],
      "model_version": 1
    }
  }
}
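A moderation response like this can drive an approve-or-review decision. The Python sketch below is one conservative way to do that: it treats any answer other than "no" as a reason to flag the image. The example response only shows "no" values, so the full set of possible answers is an assumption here, and the helper names `flagged_questions` and `is_approved` are hypothetical.

```python
def build_moderation_body(image_uri, rejection_questions):
    """Build the JSON body for the ai_vision_moderation endpoint."""
    return {
        "source": {"uri": image_uri},
        "rejection_questions": list(rejection_questions),
    }


def flagged_questions(response):
    """Return the questions whose answer was anything other than "no".

    Assumption: a clean result is reported as "no", as in the example
    response. Any other value is flagged for review.
    """
    flagged = []
    for item in response["data"]["analysis"]["responses"]:
        answer = str(item.get("value", "")).strip().lower()
        if answer != "no":
            flagged.append(item["prompt"])
    return flagged


def is_approved(response):
    """Approve only if at least one question was answered and none flagged."""
    responses = response["data"]["analysis"]["responses"]
    return bool(responses) and not flagged_questions(response)


# Example response, abbreviated to the fields this sketch reads.
SAMPLE_RESPONSE = {
    "data": {
        "analysis": {
            "responses": [
                {"prompt": "Does it clearly show any logos or other IP?", "value": "no"},
                {"prompt": "Does it contain any offensive or NSFW elements?", "value": "no"},
            ],
            "model_version": 1,
        }
    }
}

if __name__ == "__main__":
    print(is_approved(SAMPLE_RESPONSE))
    print(flagged_questions(SAMPLE_RESPONSE))
```

Flagging on anything other than "no" means an unexpected or ambiguous answer sends the image to a human reviewer instead of letting it through.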

For this example, we asked AI Vision to provide alt text and a caption for the image, each with a length constraint. The alt text comes in under the requested 100 characters, while the caption runs slightly over the 25-word target (29 words). Responses like these are useful for creating SEO-friendly, accessible text.

POST /analysis/{cloud_name}/analyze/ai_vision_general

{
  "source": {
    "uri": "https://res.cloudinary.com/demo/image/upload/sample.jpg"
  },
  "prompts": [
    "provide a seo friendly description suitable for an alt tag in under 100 characters",
    "please provide a 25 word caption for this image"
  ]
}

{
  "limits": {
    "usage": { "type": "ai_vision", "count": 1386 }
  },
  "request_id": "e7aa53f6d1aa8a3ecafb91215b573c9b",
  "data": {
    "entity": "https://res.cloudinary.com/siedner/image/upload/v1725386162/cldrop/pexels-heyho-7031705_cu9bjc.jpg",
    "analysis": {
      "responses": [
        { "value": "Modern gym with treadmills, equipment, wood floors, and intricate ceiling design by large windows" },
        { "value": "State-of-the-art fitness center boasts sleek equipment, hardwood floors, and an eye-catching geometric ceiling. Bathed in natural light from floor-to-ceiling windows, it offers a luxurious workout experience with panoramic views." }
      ],
      "model_version": 1
    }
  }
}
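Because the general endpoint’s responses carry only a `value`, a workflow needs some way to match answers back to prompts. The Python sketch below assumes the responses come back in the same order as the prompts, which the example suggests but does not confirm, and then checks the returned text against the requested lengths. The helper name `answers_by_prompt` is hypothetical.

```python
def build_general_body(image_uri, prompts):
    """Build the JSON body for the ai_vision_general endpoint."""
    return {"source": {"uri": image_uri}, "prompts": list(prompts)}


def answers_by_prompt(prompts, response):
    """Pair each prompt with its answer.

    Assumption: responses are returned in the same order as the prompts.
    """
    values = [item["value"] for item in response["data"]["analysis"]["responses"]]
    if len(values) != len(prompts):
        raise ValueError("number of responses does not match number of prompts")
    return dict(zip(prompts, values))


ALT_PROMPT = "provide a seo friendly description suitable for an alt tag in under 100 characters"
CAPTION_PROMPT = "please provide a 25 word caption for this image"

# Example response, abbreviated to the fields this sketch reads.
SAMPLE_RESPONSE = {
    "data": {
        "analysis": {
            "responses": [
                {"value": "Modern gym with treadmills, equipment, wood floors, and intricate ceiling design by large windows"},
                {"value": "State-of-the-art fitness center boasts sleek equipment, hardwood floors, and an eye-catching geometric ceiling. Bathed in natural light from floor-to-ceiling windows, it offers a luxurious workout experience with panoramic views."},
            ],
            "model_version": 1,
        }
    }
}

if __name__ == "__main__":
    answers = answers_by_prompt([ALT_PROMPT, CAPTION_PROMPT], SAMPLE_RESPONSE)
    alt_text = answers[ALT_PROMPT]
    caption = answers[CAPTION_PROMPT]
    # Length constraints in a prompt are requests, not guarantees, so verify them.
    print("alt text under 100 characters:", len(alt_text) < 100)
    print("caption word count:", len(caption.split()))
```

A check like this lets a workflow retry or trim generated text that misses its length target before it is published.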

AI Vision stands apart from traditional AI tools by combining visual and textual data for a more comprehensive understanding of content. This intelligence enables businesses to build tailored workflows for media management that will align with unique brand and customer expectations. Unlike standard LLM wrappers, AI Vision integrates a foundational model enhanced with specialized algorithms, prompt engineering, and fine-tuning that address specific industry needs and common LLM blind spots. This tailored approach ensures more accurate, brand-specific outcomes right out of the box.

Additionally, AI Vision democratizes access to advanced AI capabilities, allowing brands to leverage powerful media workflows without needing extensive AI budgets or specialized teams. AI Vision is an out-of-the-box solution that eliminates the complexities of building, training, and hosting custom models, making high-quality AI-driven media management accessible to all teams regardless of size.

AI Vision by Cloudinary is a powerful, user-friendly solution for modern media management needs. Businesses can classify, moderate, and describe images more efficiently than ever by leveraging generative AI. Whether automating compliance checks, enhancing image searchability, or developing custom tagging workflows, AI Vision delivers the precision necessary to handle large-scale digital asset management. Contact us today to learn more.
