One of the major challenges in data management is eliminating redundancy. It’s not uncommon to find identical copies of digital assets, such as images and videos, scattered across databases and datasets. While various methods have been developed to address this issue, making them efficient and scalable remains difficult. This article focuses on duplicate image detection: the process of determining whether two or more images are identical.
In real-world scenarios, practical use cases of duplicate image detection include identifying duplicate images in datasets used for training machine learning models and detecting identical images in cloud storage services like Google Photos so redundant files can be removed. Both ultimately reduce storage costs and performance bottlenecks.
In this article, we’ll explore three techniques for identifying duplicate images, from simple pixel-by-pixel comparisons to using a ready-made solution like the Cloudinary Duplicate Image Detection add-on.
In this article:
- Why Should You Care About Duplicate Images?
- #1. Pixel-by-pixel Comparison
- #2. Hashing
- #3. Cloudinary’s Duplicate Image Detection Add-on
Why Should You Care About Duplicate Images?
Duplicate images may seem like a minor issue, but they can quickly become a hidden drag on your media management. Imagine dozens or even hundreds of copies of the same image scattered across your servers. Each one eats up precious storage, inflating costs unnecessarily and slowing down your system performance.
When duplicate images accumulate, they can also confuse workflows, making tracking the latest version or maintaining consistency across platforms more difficult. This can lead to wasted time and inefficiency as developers and teams scramble to identify the right files or push multiple versions of the same asset.
Beyond workflow disruptions, duplicate images can negatively impact your website’s performance. Larger image libraries mean slower load times and poor optimization, which may harm your SEO efforts.
Additionally, keeping unnecessary duplicates makes it harder to manage image transformations. If your tools are processing multiple identical files, the risk of generating redundant outputs increases, further straining both your server resources and your team’s productivity.
#1. Pixel-by-pixel Comparison
The pixel-by-pixel comparison method of detecting duplicate images is one of the most straightforward and flexible. It involves comparing each pixel’s color value in both images; if every pixel matches, the images are considered identical.
For example, in the Python code below, the two images are first converted to NumPy arrays, where each pixel of the image becomes an element in the array. The arrays are then compared to determine if they contain equal values:
from PIL import Image
import numpy as np

def are_images_identical(image_path1, image_path2):
    img1 = Image.open(image_path1)
    img2 = Image.open(image_path2)

    # Convert images to numpy arrays
    img1_np = np.array(img1)
    img2_np = np.array(img2)

    # Compare images pixel-by-pixel
    return np.array_equal(img1_np, img2_np)

img1 = 'sample-image.jpg'
img2 = 'stock-image.png'

print("The images are identical." if are_images_identical(img1, img2) else "The images are different.")
While this method is simple to implement, it has drawbacks that make it unsuitable for many scenarios. The biggest one is that it won’t produce accurate results for images that have been transformed or edited in any way: a resized or re-compressed copy of an image no longer matches the original pixel for pixel, even though the two look identical.
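To see this in practice, here’s a minimal sketch that reuses the are_images_identical function defined above. The resized copy and its filename are hypothetical, created only for illustration:

from PIL import Image

original = Image.open('sample-image.jpg')

# Save a visually identical copy at half the resolution (hypothetical file)
resized = original.resize((original.width // 2, original.height // 2))
resized.save('sample-image-small.jpg')

# The arrays now have different shapes, so the comparison reports the
# images as different even though they show the same content
print(are_images_identical('sample-image.jpg', 'sample-image-small.jpg'))  # False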
#2. Hashing
Image hashes (also known as image fingerprints) are short strings of letters and numbers generated by hashing algorithms, and comparing them tells you whether two images are likely the same. Unlike cryptographic hashing algorithms such as MD5 and SHA-1, where a tiny change to the image produces a completely different hash, image hashing algorithms produce hashes that change only slightly when the image changes slightly.
There are different types of image hashing algorithms, including average, perceptual, difference, and wavelet hashing. These algorithms analyze an image’s structure based on its luminance, ignoring its color information.
For instance, to determine whether two images are identical, the hashes of the two images are generated, and the number of bit positions at which the two hashes differ is counted. This count is known as the Hamming distance. A Hamming distance of zero indicates that the images are almost certainly identical; a small distance (roughly 1-10) suggests only minor differences, while a larger distance indicates that the two images are entirely different.
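As a quick illustration of the idea, here’s a toy example using two made-up 8-bit values rather than real image hashes:

# Hypothetical 8-bit "hashes" used only to illustrate Hamming distance
a = 0b10110100
b = 0b10010110

# XOR keeps only the bits that differ; counting them gives the distance
hamming_distance = bin(a ^ b).count("1")
print(hamming_distance)  # 2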
In Python, this can be done using the ImageHash library to generate image hashes for comparison.
Let’s look at an example.
from PIL import Image
import imagehash

def compare_images(image_path1, image_path2):
    img1 = Image.open(image_path1)
    img2 = Image.open(image_path2)

    # Compute difference hashes (dhash) for both images
    hash1 = imagehash.dhash(img1)
    hash2 = imagehash.dhash(img2)

    # Calculate the Hamming distance between the two hashes
    hamming_distance = hash1 - hash2

    print(f"Hash 1: {hash1}")
    print(f"Hash 2: {hash2}")
    print(f"Hamming Distance: {hamming_distance}")

    if hamming_distance <= 5:
        print("The images are identical.")
    else:
        print("The images are different.")

img1 = 'sample-image.jpg'
img2 = 'dinning-table.jpg'

compare_images(img1, img2)
Here are the two images used in the example above:
And below is the output after running the code:
Hash 1: 1d2d6d2e8d193d5c
Hash 2: c7871f1e2ef6de76
Hamming Distance: 34
The images are different.
As expected, the Hamming distance between the two images is 34, which confirms that they are not identical.
Now, what happens if we run the code using the same image for both inputs? Let’s find out by modifying the function call to:
compare_images(img1, img1)
Which gives the following output:
Hash 1: 1d2d6d2e8d193d5c
Hash 2: 1d2d6d2e8d193d5c
Hamming Distance: 0
The images are identical.
As expected, the Hamming distance is 0, confirming the images are identical. Also, notice that the hash values remain the same as in the previous example.
Compared to the pixel-by-pixel comparison, the biggest advantage of this method is that it works well even when the images have undergone transformations such as scaling, re-compression, or minor color and brightness shifts.
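Here’s a minimal sketch demonstrating this, reusing sample-image.jpg from earlier; the resized copy is hypothetical:

from PIL import Image
import imagehash

original = Image.open('sample-image.jpg')

# A visually identical copy at half the resolution
resized = original.resize((original.width // 2, original.height // 2))

hash_original = imagehash.dhash(original)
hash_resized = imagehash.dhash(resized)

# The Hamming distance stays at or near zero, so the resized copy is
# still detected as a duplicate, unlike with pixel-by-pixel comparison
print(hash_original - hash_resized)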
#3. Cloudinary’s Duplicate Image Detection Add-on
Cloudinary add-ons offer additional transformation and analysis capabilities, some of which are powered by advanced Cloudinary AI functionality and media processing partners. The Cloudinary Duplicate Image Detection add-on allows you to check for duplicate images (and videos) in your Cloudinary product environment.
Cloudinary uses perceptual hashing algorithms to generate unique fingerprints for images to detect if they are the same. These fingerprints allow Cloudinary to determine if two images are duplicates, even if they differ slightly in compression, resolution, contrast, or brightness.
Using this add-on in your app involves setting a custom threshold that determines how close the fingerprints of two images need to be for a match. A threshold of 1.0 means the images must be exact duplicates, while lower values allow for more subtle differences. For example, two images with slight resolution, brightness, or contrast differences could still be considered duplicates depending on the threshold setting.
The duplicate image detection add-on can be used when uploading new images to Cloudinary or applied to images already stored in your Cloudinary product environment.
Note: The Cloudinary Duplicate Image Detection add-on is currently in beta, so you may need to contact support if you want to use it in your application to avoid breaking changes.
Detect Duplicates in Existing Images
To detect duplicates for images already uploaded to your product environment, you can use the explicit method, which allows actions to be applied to existing assets. Additionally, the moderation parameter can be used to set the duplicate detection threshold, as shown below:
import cloudinary
import cloudinary.uploader
import cloudinary.api

cloudinary.config(
    cloud_name='<YOUR_CLOUD_NAME>',
    api_key='<YOUR_API_KEY>',
    api_secret='<YOUR_API_SECRET>'
)

response = cloudinary.uploader.explicit(
    "<IMAGE_PUBLIC_ID>",
    type="upload",
    moderation="duplicate:0.8",
    notification_url="https://webhook.site/d7305f71-6087-4fe3-9d91-e52f09e1fa34"
)

print(response)
In the above example, only one image’s public_id is set. This means the moderation only applies to that image in your Cloudinary product environment.
In most cases, you’ll want most or all of the images in your storage to be included in the duplicate detection set. To do that, you can use the Admin API’s resources method to list the assets already in your product environment and then loop through them, calling explicit on each one to add it to the search, as sketched below.
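Here’s a minimal sketch, assuming the Cloudinary configuration from the previous example is already set. The page size and the 0.8 threshold are illustrative values:

import cloudinary
import cloudinary.api
import cloudinary.uploader

options = {"type": "upload", "max_results": 100}

while True:
    # List a page of existing assets in the product environment
    result = cloudinary.api.resources(**options)

    for resource in result.get("resources", []):
        # Add each existing image to the duplicate detection set
        cloudinary.uploader.explicit(
            resource["public_id"],
            type="upload",
            moderation="duplicate:0.8"
        )

    next_cursor = result.get("next_cursor")
    if not next_cursor:
        break
    options["next_cursor"] = next_cursor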
Add New Images to the Duplicate Image Search
If you want subsequent images that are uploaded to be added to the duplicate image search set, you can simply set the moderation parameter when uploading the image as follows:
response = cloudinary.uploader.upload(
    'new_image.jpg',
    moderation='duplicate:0.8'
)
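If you want to inspect the result programmatically, the upload response should include the moderation status for the asset; because duplicate detection runs asynchronously, the final result is typically delivered to your notification_url webhook. The snippet below assumes a moderation field in the response and simply prints whatever it contains:

# Print the moderation information returned with the upload response
# (assumed field name; the definitive result arrives via webhook)
for moderation_entry in response.get("moderation", []):
    print(moderation_entry)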
Compared to the other methods discussed above, Cloudinary’s Duplicate Image Detection add-on is a powerful tool for identifying duplicates across your stored media assets. It abstracts away the complexity of setting up your own database of hashes and writing comparison code from scratch.
To learn more about Cloudinary and the Duplicate Image Detection add-on, feel free to sign up for a free account today and also check out the Cloudinary docs for more information on the add-on.