Cloudinary Blog

How to automatically identify similar images using pHash

Today, photos are easily edited by resizing, cropping, adjusting the contrast, or changing the image’s format. Each edit creates a new image that is similar to the original. Websites, web applications, and mobile apps that accept user-generated content uploads can benefit from identifying such similar images.

Image de-duplication

If your site allows users to upload images, they can also upload various processed or manipulated versions of the same image. As described above, while the versions are not exactly identical, they are quite similar.

Obviously, it’s good practice to show several different images on a single page and avoid displaying similar images. For example, travel sites might want to show different images of a hotel room, but avoid having similar images of the room on the same page.

Additionally, if your web application deals with many uploaded images, you may want to automatically recognize whether newly uploaded images are similar to previously uploaded ones. Recognizing similar images can prevent duplicates from being used once they are uploaded, allowing you to better organize your site’s content. The better your web application is at identifying similar images upon upload, the fewer duplicated images end up on your site.

Image similarity identification

Cloudinary uses a perceptual hash (pHash), which acts as an image fingerprint. This mathematical algorithm analyzes an image's content and represents it as a 64-bit fingerprint. Two images’ pHash values are "close" to one another if the images’ content features are similar. By comparing two image fingerprints, you can tell whether the images are similar.

You can request the pHash value of an image from Cloudinary either while uploading it with Cloudinary's upload API or, for an image that is already in your media library, with our admin API. In both cases, set the phash parameter to true and the response includes the image's pHash value.

Take the following image as an example:

Original koala photo

Below is a code sample in Ruby that shows how to upload this image with a request for the pHash value:

Cloudinary::Uploader.upload("koala1.jpg", :public_id => "koala1", :phash => true)

The result below shows the returned response with the calculated pHash value:

    {
     "public_id": "koala1",
     "version": 1424266415,
     "width": 887,
     "height": 562,
     "format": "jpg",
     "etag": "6f821ea4478af3e3a183721c0755cb1b",
    ...
     "phash": "ba19c8ab5fa05a59"
    }
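The pHash arrives as a 16-character hexadecimal string, so it can be handy to convert it to a 64-bit integer before storing or comparing it. Here is a minimal sketch, assuming the upload result is a Hash with string keys; the `result` literal below is a stand-in that mirrors only fields shown in the response above, and `phash_to_int` is a hypothetical helper name:

```ruby
# Stand-in for the Hash returned by the upload call with :phash => true.
# ASSUMPTION: string keys; only fields shown in the response above are included.
result = {
  "public_id" => "koala1",
  "format"    => "jpg",
  "phash"     => "ba19c8ab5fa05a59"
}

# Hypothetical helper: parse the 16-digit hex fingerprint into an Integer.
def phash_to_int(hex)
  unless hex.to_s.match?(/\A\h{16}\z/)
    raise ArgumentError, "expected 16 hex digits, got #{hex.inspect}"
  end
  hex.to_i(16)
end

fingerprint = phash_to_int(result["phash"])

# Print the fingerprint as its 64 individual bits.
puts fingerprint.to_s(2).rjust(64, "0")
```

Storing the value as an integer rather than a string makes bitwise comparison straightforward, both in Ruby and in databases that offer a bit-count function.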

The examples below show several similar images and their pHash values. Let's compare the pHash values and find the distance between each pair. If you XOR two pHash values and count the 1 bits in the result (the Hamming distance), you get a value between 0 and 64. The lower the value, the more similar the images are. If all 64 bits are the same, the distance is 0 and the photos are very similar.

The similarity score of the examples below expresses how similar each image is to the original image. The score is calculated as 1 - (phash_distance(phash1, phash2) / 64.0), which gives a result between 0 and 1. In practice, unrelated images tend to score around 0.5, because roughly half of their bits differ by chance. (phash_distance can be computed using bit_count(phash1 ^ phash2) in MySQL, for example.)
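The distance and score described above can be sketched in plain Ruby. This is an illustration rather than Cloudinary's own code: `phash_distance` follows the name used in the formula, while `phash_similarity` is a hypothetical helper name, and both assume the pHash values are passed as 16-digit hex strings.

```ruby
# Hamming distance between two pHash values given as 16-digit hex strings:
# XOR the two 64-bit integers and count the 1 bits in the result.
def phash_distance(phash1, phash2)
  (phash1.to_i(16) ^ phash2.to_i(16)).to_s(2).count("1")
end

# Similarity score as defined in the text: 1 - distance / 64.
def phash_similarity(phash1, phash2)
  1 - phash_distance(phash1, phash2) / 64.0
end

original  = "ba19c8ab5fa05a59"
grayscale = "ba19caab5f205a59"

puts phash_distance(original, grayscale)    # => 2
puts phash_similarity(original, grayscale)  # => 0.96875
```

The two fingerprints differ in only 2 of 64 bits, which is why the grayscale version scores so close to 1.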

Original koala thumbnail
887x562 JPEG, 180 KB
pHash: ba19c8ab5fa05a59

Grayscale koala
887x562 JPEG, 149 KB
Difference: grayscale.
pHash: ba19caab5f205a59
Similarity score: 0.96875

Cropped koala photo with increased saturation
797x562 JPEG, 179 KB
Difference: cropped, increased color saturation.
pHash: ba3dcfabbc004a49
Similarity score: 0.78125

Cropped koala photo with lower JPEG quality
887x509 JPEG, 30.6 KB
Difference: cropped, lower JPEG quality.
pHash: 1b39ccea7d304a59
Similarity score: 0.8125

Another koala photo
1000x667 JPEG, 608 KB
Difference: a different koala photo...
pHash: 3d419c23c42eb3db
Similarity score: 0.5625

Not a koala photo
1000x688 JPEG, 569 KB
Difference: not a koala...
pHash: f10773f1cd269246
Similarity score: 0.5

As the results above show, the three images that appear similar to the original received high scores (0.78 and above), while the two unrelated photos scored significantly lower (0.56 and 0.5).
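To separate the two groups automatically, one option is to compare each score against a cut-off. The sketch below does this for the koala examples. Note that the 0.75 threshold is purely an assumption chosen to fit these five images, not a value recommended by Cloudinary, and `phash_similarity` is a hypothetical helper name.

```ruby
# Similarity between two 16-digit hex pHash strings: 1 - (Hamming distance / 64).
def phash_similarity(phash1, phash2)
  distance = (phash1.to_i(16) ^ phash2.to_i(16)).to_s(2).count("1")
  1 - distance / 64.0
end

# ASSUMPTION: illustrative cut-off chosen to fit the examples in this post.
# Tune it against your own images.
SIMILARITY_THRESHOLD = 0.75

original = "ba19c8ab5fa05a59"
candidates = {
  "grayscale"               => "ba19caab5f205a59",
  "cropped, more saturated" => "ba3dcfabbc004a49",
  "cropped, lower quality"  => "1b39ccea7d304a59",
  "another koala"           => "3d419c23c42eb3db",
  "not a koala"             => "f10773f1cd269246"
}

# Split the candidates into those at or above the threshold and the rest.
similar, different = candidates.partition do |_label, phash|
  phash_similarity(original, phash) >= SIMILARITY_THRESHOLD
end

puts "Similar:   #{similar.map(&:first).join(', ')}"
puts "Different: #{different.map(&:first).join(', ')}"
```

A higher threshold catches only near-identical copies, while a lower one also flags heavier edits at the risk of false matches.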

By using Cloudinary to upload users’ photos to your site or application, you can request the pHash values of the uploaded images and store them on your servers. That allows you to identify which images are similar and decide what the next step should be, which makes building image-matching apps a lot easier. You may want to keep similar images, classify them in your database, filter them out, or interactively allow users to decide which images they want to keep.
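As a rough sketch of that workflow, the class below keeps stored fingerprints in memory and flags a new upload when a similar image is already known. Everything here is hypothetical: the `PhashIndex` class, its method names, and the 0.75 threshold are assumptions for illustration, and a real application would keep the pHash values in its own database instead of a Ruby Hash.

```ruby
# Hypothetical in-memory index of fingerprints, keyed by public_id.
class PhashIndex
  # ASSUMPTION: 0.75 is an illustrative threshold, not a recommended value.
  def initialize(threshold: 0.75)
    @threshold = threshold
    @entries = {} # public_id => pHash as Integer
  end

  # Returns the public_ids of stored images similar to the given hex pHash.
  def similar_to(phash_hex)
    value = phash_hex.to_i(16)
    @entries.select { |_id, stored| similarity(value, stored) >= @threshold }.keys
  end

  # Stores the image only if no similar image exists; returns the matches found.
  def add(public_id, phash_hex)
    matches = similar_to(phash_hex)
    @entries[public_id] = phash_hex.to_i(16) if matches.empty?
    matches
  end

  private

  # 1 - (Hamming distance / 64) for two 64-bit integers.
  def similarity(a, b)
    1 - (a ^ b).to_s(2).count("1") / 64.0
  end
end

index = PhashIndex.new
p index.add("koala1", "ba19c8ab5fa05a59")       # => [] (stored)
p index.add("koala1_gray", "ba19caab5f205a59")  # => ["koala1"] (flagged, not stored)
p index.add("other_koala", "3d419c23c42eb3db")  # => [] (stored)
```

Returning the matches instead of silently rejecting the upload leaves the decision (keep, filter out, or ask the user) to the calling code.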

Summary

This feature is available on any Cloudinary plan, including the free tier. As explained above, you can use Cloudinary’s API to get an image’s fingerprint and start checking for similarities. In addition, further enhancements to our similar image search and de-duplication capabilities are on our roadmap.

About Cloudinary

Cloudinary provides easy-to-use, cloud-based media management solutions for the world’s top brands. With offices in the US, UK and Israel, Cloudinary has quickly become the de facto solution used by developers and marketers at major companies around the world to streamline rich media management and deliver optimal end-user experiences.

For more information, visit www.cloudinary.com or follow us on Twitter.
