Algorithm to detect if a file name list has matching image file name?
I have a lot of files. Some are archives that are either zip or rar files. Some of the archives have an image with a matching name.
For example:
d:\\data\\archive1.zip
d:\\data\\archive1.jpg
d:\\data\\archive2.rar
d:\\data\\archive2.png
d:\\data\\archive3.zip [This one doesn't have an image]
Image extensions are png, jpg, jpeg, webp, gif, etc.
I need a way to differentiate the files that have an image of the same name from those that don't. The file paths are in a list of strings. I tried the following, comparing only the path without the extension:
for i in range(len(file_path_list)):
    print(file_path_list[i])
    fname = file_path_list[i].rsplit('.',1)[0]
    tosearch_list = list(file_path_list)
    tosearch_list.pop(i)
    for x in tosearch_list:
        if x.rsplit('.',1)[0] == fname:
            print(f"{x} is a matching file")
So for each entry in the list I have to search the remaining entries, with their extensions stripped, to find the matches.
For example, archive1 has to be compared against all the remaining entries in the list to find the other files with the same name. The names are not in any particular order. Is this the fastest way, or are there better ways?
Solution 1:[1]
You can create one dict for the archive files and one for the image files, then get the archive names that have no image via a set operation. Here is an example:
import os

file_path_list = [
    "d:\\data\\archive1.zip",
    "d:\\data\\archive1.jpg",
    "d:\\data\\archive2.rar",
    "d:\\data\\archive2.png",
    "d:\\data\\archive3.zip",
    "d:\\data\\some.bad.filename.zip",
]

# Map each path without its extension to the full path.
zip_dict = {}
img_dict = {}
for path in file_path_list:
    fname, fext = os.path.splitext(path)
    if fext in (".zip", ".rar"):
        zip_dict[fname] = path
    elif fext in (".jpg", ".png"):  # add ".jpeg", ".webp", ".gif", ... as needed
        img_dict[fname] = path

# Archive names that have no image of the same name.
zip_no_img_names = set(zip_dict.keys()).difference(set(img_dict.keys()))
# sorted() because set iteration order is arbitrary.
soln_files = sorted(zip_dict[k] for k in zip_no_img_names)
# ['d:\\data\\archive3.zip', 'd:\\data\\some.bad.filename.zip']
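A possible variation on the snippet above, sketched under a few assumptions: extensions are compared case-insensitively, the image extensions are the ones listed in the question, and the helper name `split_archives_by_image` is hypothetical. It returns both the archives that have an image and those that don't.

```python
import os

# Assumed extension sets; the image list is taken from the question.
ARCHIVE_EXTS = {".zip", ".rar"}
IMAGE_EXTS = {".png", ".jpg", ".jpeg", ".webp", ".gif"}


def split_archives_by_image(paths):
    """Return (archives_with_image, archives_without_image).

    Hypothetical helper: one pass collects the image base names,
    then each archive is classified by a set membership test.
    """
    image_names = set()
    archives = []
    for path in paths:
        base, ext = os.path.splitext(path)
        ext = ext.lower()  # treat ".JPG" and ".jpg" alike
        if ext in IMAGE_EXTS:
            image_names.add(base)
        elif ext in ARCHIVE_EXTS:
            archives.append((base, path))

    with_image = [p for base, p in archives if base in image_names]
    without_image = [p for base, p in archives if base not in image_names]
    return with_image, without_image


paths = [
    "d:\\data\\archive1.zip",
    "d:\\data\\archive1.JPG",
    "d:\\data\\archive2.rar",
    "d:\\data\\archive2.png",
    "d:\\data\\archive3.zip",
]
with_image, without_image = split_archives_by_image(paths)
print(with_image)
print(without_image)
```

Both passes are linear in the number of paths, compared with the quadratic nested loop in the question. Note that the base names themselves are still compared case-sensitively here.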
Solution 2:[2]
- Cut off the extension part (i.e. `.png`). The simplest way is to move backward from the end towards the beginning until you see the first `.` char.
- Once cut, keep either a `dict` or a `set` where the key is the remaining `string` (the file name without its extension) to track repetitions efficiently.
- If such a repetition is found, then you've got a match. In case you care about extensions and matches that could come in different combinations, you might want to use a `dict` where the key is a `string` and the value is of type `list` of found extensions. For example, if you had both `my_file.png` and `my_file.txt`, `my_dict['my_file'] == {'.png', '.txt'}`.
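The steps above could be sketched as follows. This is an illustration rather than the answerer's own code: `group_extensions` is a hypothetical helper name, `str.rpartition` stands in for the manual backward scan for the last `.`, and a `set` is used for the extensions so the result matches the `my_dict` example.

```python
from collections import defaultdict


def group_extensions(paths):
    """Map each path-without-extension to the set of extensions seen for it."""
    found = defaultdict(set)
    for path in paths:
        # rpartition splits at the LAST ".", like scanning from the end.
        base, dot, ext = path.rpartition(".")
        if not dot:
            # No "." at all: the whole path is the base name.
            base, ext = path, ""
        found[base].add(dot + ext)
    return dict(found)


my_dict = group_extensions(["my_file.png", "my_file.txt", "other.zip"])
# Names that occur with more than one extension are the matches.
matched = [name for name, exts in my_dict.items() if len(exts) > 1]
print(matched)
```

One caveat of this simple split: a dot in a folder name combined with an extensionless file name would be cut in the wrong place, so `os.path.splitext` may be the safer choice for real paths.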
Solution 3:[3]
Here's how to do it by using the itertools.groupby() function in conjunction with pathlib.Path instances to hold the file paths (to make them easier to deal with).
The code first converts the file paths to Paths, then sorts them so that files differing only in their extension end up next to each other. Next it uses the groupby() function to group the sorted list into sub-lists of files sharing a common path. Once that's done, it prints out each group and an indicator of whether one of its files is an image file.
from itertools import groupby
from pathlib import Path
from pprint import pprint

image_extensions = {'.png', '.jpg', '.jpeg', '.webp', '.gif'}
archive_extensions = {'.zip', '.rar'}
allowed_extensions = image_extensions | archive_extensions

# Raw filepaths in random order.
filepaths = ['d:\\data\\archive1.zip',
             'd:\\data\\archive3.zip',
             'd:\\data\\archive2.rar',
             'd:\\data\\archive2.png',
             'd:\\data\\archive1.jpg']

filepaths = [Path(filepath) for filepath in filepaths]  # Convert to pathlib.Paths
filepaths = [filepath for filepath in filepaths if filepath.suffix in allowed_extensions]

def keyfunc(filepath):
    # Group key: parent folder plus file name without its extension.
    return ''.join(filepath.parts[:-1]), filepath.stem

def sort_key(filepath):
    # Sort on the group key first so groupby() sees each group as one
    # contiguous run; the suffix only orders files within a group.
    return keyfunc(filepath) + (filepath.suffix,)

filepaths.sort(key=sort_key)
#pprint(filepaths)

groups = []
for k, g in groupby(filepaths, keyfunc):
    groups.append(list(g))  # Append filepath group.

for group in groups:
    has_image = any((filepath.suffix in image_extensions) for filepath in group)
    print(group, 'Has image' if has_image else 'No image')
Sample output:
[WindowsPath('d:/data/archive1.jpg'), WindowsPath('d:/data/archive1.zip')] Has image
[WindowsPath('d:/data/archive2.png'), WindowsPath('d:/data/archive2.rar')] Has image
[WindowsPath('d:/data/archive3.zip')] No image
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | |
| Solution 2 | Zazeil |
| Solution 3 | |
