One of the major challenges in data management is eliminating redundancy. It’s not uncommon to find identical copies of digital assets, such as images and videos, scattered across databases and datasets. While various methods have been developed to address this issue, making them efficient and scalable remains difficult. This article focuses on duplicate image detection: the process of determining whether two or more images are identical.
In real-world scenarios, practical use cases of duplicate image detection include identifying duplicate images in datasets used for training machine learning models and detecting identical images in cloud storage services like Google Photos so redundant files can be removed. Both ultimately reduce storage costs and performance bottlenecks.
In this article, we’ll explore three techniques for identifying duplicate images, from simple pixel-by-pixel comparisons to using a ready-made solution like the Cloudinary Duplicate Image Detection add-on.
In this article:
- Why Should You Care About Duplicate Images?
- #1. Pixel-by-pixel Comparison
- #2. Hashing
- #3. Cloudinary’s Duplicate Image Detection Add-on
Why Should You Care About Duplicate Images?
Duplicate images may seem like a minor issue, but they can quickly become a hidden drag on your media management. Imagine dozens or even hundreds of copies of the same image scattered across your servers. Each one eats up precious storage, inflating costs unnecessarily and slowing down your system performance.
When duplicate images accumulate, they can also confuse workflows, making tracking the latest version or maintaining consistency across platforms more difficult. This can lead to wasted time and inefficiency as developers and teams scramble to identify the right files or push multiple versions of the same asset.
Beyond workflow disruptions, duplicate images can negatively impact your website’s performance. Larger image libraries mean slower load times and poor optimization, which may harm your SEO efforts.
Additionally, keeping unnecessary duplicates makes it harder to manage image transformations. If your tools are processing multiple identical files, the risk of generating redundant outputs increases, further straining both your server resources and your team’s productivity.
#1. Pixel-by-pixel Comparison
The pixel-by-pixel comparison method of detecting duplicate images is one of the most straightforward and flexible. It involves comparing each pixel’s color value in both images; if every pixel matches, the images are considered identical.
For example, in the Python code below, the two images are first converted to NumPy arrays, where each pixel of the image becomes an element in the array. The arrays are then compared to determine if they contain equal values:
from PIL import Image
import numpy as np

def are_images_identical(image_path1, image_path2):
    img1 = Image.open(image_path1)
    img2 = Image.open(image_path2)

    # Convert images to numpy arrays
    img1_np = np.array(img1)
    img2_np = np.array(img2)

    # Compare images pixel-by-pixel
    return np.array_equal(img1_np, img2_np)

img1 = 'sample-image.jpg'
img2 = 'stock-image.png'

print("The images are identical." if are_images_identical(img1, img2) else "The images are different.")
While this method is simple to implement, it has drawbacks that make it unsuitable for many scenarios. The biggest one is that it won’t produce accurate results for images that have been transformed or edited in any way: a resized or re-compressed copy of an image no longer matches the original pixel for pixel, even though the two look identical.
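To see this in practice, here’s a minimal sketch that reuses the are_images_identical function defined above. The resized copy and its filename are hypothetical, created only for illustration:

from PIL import Image

original = Image.open('sample-image.jpg')

# Save a visually identical copy at half the resolution (hypothetical file)
resized = original.resize((original.width // 2, original.height // 2))
resized.save('sample-image-small.jpg')

# The arrays now have different shapes, so the comparison reports the
# images as different even though they show the same content
print(are_images_identical('sample-image.jpg', 'sample-image-small.jpg'))  # False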
#2. Hashing
Image hashes (also known as image fingerprints) are short strings of letters and numbers generated by hashing algorithms, and comparing them tells you whether two images are likely the same. Unlike cryptographic hashing algorithms such as MD5 and SHA-1, where a tiny change to the image produces a completely different hash, image hashing algorithms produce hashes that change only slightly when the image changes slightly.
There are different types of image hashing algorithms, including average, perceptual, difference, and wavelet hashing. These algorithms analyze an image’s structure based on its luminance, ignoring its color information.
For instance, to determine whether two images are identical, the hashes of the two images are generated, and the number of bit positions at which the two hashes differ is counted. This count is known as the Hamming distance. A Hamming distance of zero indicates that the images are almost certainly identical; a small distance (roughly 1-10) suggests only minor differences, while a larger distance indicates that the two images are entirely different.
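As a quick illustration of the idea, here’s a toy example using two made-up 8-bit values rather than real image hashes:

# Hypothetical 8-bit "hashes" used only to illustrate Hamming distance
a = 0b10110100
b = 0b10010110

# XOR keeps only the bits that differ; counting them gives the distance
hamming_distance = bin(a ^ b).count("1")
print(hamming_distance)  # 2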
In Python, this can be done using the ImageHash library to generate image hashes for comparison.
Let’s look at an example.
from PIL import Image
import imagehash

def compare_images(image_path1, image_path2):
    img1 = Image.open(image_path1)
    img2 = Image.open(image_path2)

    # Compute difference hashes (dhash) for both images
    hash1 = imagehash.dhash(img1)
    hash2 = imagehash.dhash(img2)

    # Calculate the Hamming distance between the two hashes
    hamming_distance = hash1 - hash2

    print(f"Hash 1: {hash1}")
    print(f"Hash 2: {hash2}")
    print(f"Hamming Distance: {hamming_distance}")

    if hamming_distance <= 5:
        print("The images are identical.")
    else:
        print("The images are different.")

img1 = 'sample-image.jpg'
img2 = 'dinning-table.jpg'

compare_images(img1, img2)
Here are the two images used in the example above:
And below is the output after running the code:
Hash 1: 1d2d6d2e8d193d5c
Hash 2: c7871f1e2ef6de76
Hamming Distance: 34
The images are different.
As expected, the Hamming distance between the two images is 34, which confirms that they are not identical.
Now, what happens if we run the code using the same image for both inputs? Let’s find out by modifying the function call to:
compare_images(img1, img1)
Which gives the following output:
Hash 1: 1d2d6d2e8d193d5c
Hash 2: 1d2d6d2e8d193d5c
Hamming Distance: 0
The images are identical.
As expected, the Hamming distance is 0, confirming the images are identical. Also, notice that the hash values remain the same as in the previous example.
Compared to the pixel-by-pixel comparison, the biggest advantage of this method is that it works well even when the images have undergone transformations such as scaling, re-compression, or minor color and brightness shifts.
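Here’s a minimal sketch demonstrating this, reusing sample-image.jpg from earlier; the resized copy is hypothetical:

from PIL import Image
import imagehash

original = Image.open('sample-image.jpg')

# A visually identical copy at half the resolution
resized = original.resize((original.width // 2, original.height // 2))

hash_original = imagehash.dhash(original)
hash_resized = imagehash.dhash(resized)

# The Hamming distance stays at or near zero, so the resized copy is
# still detected as a duplicate, unlike with pixel-by-pixel comparison
print(hash_original - hash_resized)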
#3. Cloudinary’s Duplicate Image Detection Add-on
Cloudinary add-ons offer additional transformation and analysis capabilities, some of which are powered by advanced Cloudinary AI functionality and media processing partners. The Cloudinary Duplicate Image Detection add-on allows you to check for duplicate images (and videos) in your Cloudinary product environment.
Cloudinary uses perceptual hashing algorithms to generate unique fingerprints for images to detect if they are the same. These fingerprints allow Cloudinary to determine if two images are duplicates, even if they differ slightly in compression, resolution, contrast, or brightness.
Using this add-on in your app involves setting a custom threshold that determines how close the fingerprints of two images need to be for a match. A threshold of 1.0 means the images must be exact duplicates, while lower values allow for more subtle differences. For example, two images with slight resolution, brightness, or contrast differences could still be considered duplicates depending on the threshold setting.
The duplicate image detection add-on can be used when uploading new images to Cloudinary or applied to images already stored in your Cloudinary product environment.
Note: The Cloudinary Duplicate Image Detection add-on is currently in beta, so you may need to contact support if you want to use it in your application to avoid breaking changes.
Detect Duplicates in Existing Images
To detect duplicates for images already uploaded to your product environment, you can use the explicit method, which allows actions to be applied to existing assets. Additionally, the moderation parameter can be used to set the duplicate detection threshold, as shown below:
import cloudinary
import cloudinary.uploader
import cloudinary.api

cloudinary.config(
    cloud_name='<YOUR_CLOUD_NAME>',
    api_key='<YOUR_API_KEY>',
    api_secret='<YOUR_API_SECRET>'
)

response = cloudinary.uploader.explicit(
    "<IMAGE_PUBLIC_ID>",
    type="upload",
    moderation="duplicate:0.8",
    notification_url="https://webhook.site/d7305f71-6087-4fe3-9d91-e52f09e1fa34"
)

print(response)
In the above example, only one image’s public_id is set. This means the moderation only applies to that image in your Cloudinary product environment.
In most cases, you’ll want most or all of the images in your storage to be included in the duplicate detection set. To do that, you can use the Admin API’s resources method to list the assets already in your product environment and then loop through them, calling explicit on each one to add it to the search, as sketched below.
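Here’s a minimal sketch, assuming the Cloudinary configuration from the previous example is already set. The page size and the 0.8 threshold are illustrative values:

import cloudinary
import cloudinary.api
import cloudinary.uploader

options = {"type": "upload", "max_results": 100}

while True:
    # List a page of existing assets in the product environment
    result = cloudinary.api.resources(**options)

    for resource in result.get("resources", []):
        # Add each existing image to the duplicate detection set
        cloudinary.uploader.explicit(
            resource["public_id"],
            type="upload",
            moderation="duplicate:0.8"
        )

    next_cursor = result.get("next_cursor")
    if not next_cursor:
        break
    options["next_cursor"] = next_cursor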
Add New Images to the Duplicate Image Search
If you want subsequent images that are uploaded to be added to the duplicate image search set, you can simply set the moderation parameter when uploading the image as follows:
response = cloudinary.uploader.upload(
    'new_image.jpg',
    moderation='duplicate:0.8'
)
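If you want to inspect the result programmatically, the upload response should include the moderation status for the asset; because duplicate detection runs asynchronously, the final result is typically delivered to your notification_url webhook. The snippet below assumes a moderation field in the response and simply prints whatever it contains:

# Print the moderation information returned with the upload response
# (assumed field name; the definitive result arrives via webhook)
for moderation_entry in response.get("moderation", []):
    print(moderation_entry)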
Compared to the other methods discussed above, Cloudinary’s Duplicate Image Detection add-on is a powerful tool for identifying duplicates across your stored media assets. It abstracts away the complexity of setting up your own database of hashes and writing comparison code from scratch.
To learn more about Cloudinary and the Duplicate Image Detection add-on, feel free to sign up for a free account today and also check out the Cloudinary docs for more information on the add-on.