How to check duplicate pictures in a very large dataset using Python?
I have a dataset containing millions of images, and I want to do something like Counter(&lt;list of images&gt;) to find duplicates and count them across the whole dataset. However, given the size of the images, it seems infeasible to load them all into memory. Is there a way to do this? Do I need to write my own hash function and a reverse dict?
Edit, regarding sha1:
I did something like
image = Image.open("x.jpg") # PIL library
hashlib.sha1(image)
and got an error like
TypeError: object supporting the buffer API required
What should I do now?
Solution 1:[1]
As suggested, you can use any hashing function and feed it the image file's raw bytes. Then save the digest in a dictionary and use that to count duplicates (or store more information if you wish).
At its most basic, for each image you would do something like:
import hashlib
filename = "x.jpg"
# Open in binary mode ("rb") -- hashlib needs bytes, not text
hashstr = hashlib.sha1(open(filename, "rb").read()).hexdigest()
That would return a hex string in hashstr, like 5fe54dee8f71c9f13579f44c01aef491e9d6e655
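To scale this to a large dataset, you can hash each file in fixed-size chunks (so no whole image ever sits in memory) and tally the digests with a Counter. A minimal stdlib-only sketch; the function names `sha1_of_file` and `count_duplicates` and the list of paths are illustrative, not from the original answer:

```python
import hashlib
from collections import Counter

def sha1_of_file(path, chunk_size=1 << 20):
    """Hash a file in 1 MiB chunks so large images never load fully into memory."""
    h = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def count_duplicates(paths):
    """Return {digest: count} for every digest shared by more than one file."""
    counts = Counter(sha1_of_file(p) for p in paths)
    return {digest: n for digest, n in counts.items() if n > 1}
```

For millions of files you would typically feed `count_duplicates` a generator of paths (e.g. from `pathlib.Path.rglob`) rather than a prebuilt list.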
As pointed out, this only works if the duplication is at the file level, byte for byte. If you want to weed out the same image at different resolutions or dimensions, the hashlib functions cannot help, and you would need a different way to determine equality.
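For that near-duplicate case, the usual approach is a perceptual hash (e.g. the third-party imagehash library together with Pillow). To show the idea without those dependencies, here is a stdlib-only sketch of "average hashing" operating on a plain 2D list of grayscale values; in practice you would decode the image with Pillow first. All names here (`average_hash`, `hamming`) are illustrative:

```python
def average_hash(pixels, hash_size=8):
    """Average hash of a 2D grayscale grid (assumed at least hash_size x hash_size).

    Downsample to hash_size x hash_size by block averaging, then emit one bit
    per cell: 1 if the cell is brighter than the overall mean, else 0.
    Similar-looking images yield bit strings with a small Hamming distance.
    """
    h, w = len(pixels), len(pixels[0])
    small = []
    for i in range(hash_size):
        row = []
        for j in range(hash_size):
            block = [pixels[y][x]
                     for y in range(i * h // hash_size, (i + 1) * h // hash_size)
                     for x in range(j * w // hash_size, (j + 1) * w // hash_size)]
            row.append(sum(block) / len(block))
        small.append(row)
    mean = sum(v for r in small for v in r) / hash_size ** 2
    bits = 0
    for r in small:
        for v in r:
            bits = (bits << 1) | (v > mean)  # pack comparison bits into an int
    return bits

def hamming(a, b):
    """Number of differing bits between two hashes."""
    return bin(a ^ b).count("1")
```

Two visually identical images at different resolutions should then land within a small Hamming distance of each other, whereas byte-level SHA-1 digests would differ completely.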
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | sal |
