Find common files between two directories, excluding the file extension

I have two directories with files that end in two different extensions:

Folder A called profile (1204 FILES)

file.fasta.profile
file1.fasta.profile
file2.fasta.profile


Folder B called dssp (1348 FILES)


file.dssp
file1.dssp
file2.dssp
file3.dssp #<-- odd one out

Some files in folder B have no counterpart in folder A and should be removed. For example, file3.dssp would be deleted because folder A contains no file3.fasta.profile. I want to keep only the files whose names match once the extension is excluded, so that both folders end up with 1204 files. I have seen some Bash one-liners using diff, but they do not cover this case, where the files to remove are those with no counterpart in the other folder.



Solution 1:[1]

Python version:

EDIT: now supports multiple extensions

#!/usr/bin/python3

import glob, os

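def removeext(filename):
    # Keep everything before the first dot. A name without a dot is returned
    # unchanged (find() would return -1 and silently drop the last character).
    return filename.split(".", 1)[0]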

setA = set(map(removeext,os.listdir('A')))
print("Files in directory A: " + str(setA))

setB = set(map(removeext,os.listdir('B')))
print("Files in directory B: " + str(setB))

setDiff = setA.difference(setB)
print("Files only in directory A: " + str(setDiff))

for filename in setDiff:
    file_path = "A/" + filename + ".*"
    for file in glob.glob(file_path):
        print("file=" + file)
        os.remove(file)

This does much the same as my Bash version (Solution 4):

  • list files in A
  • list files in B
  • get the list of differences
  • delete the differences from A

Test output, done on Linux Mint, bash 4.4.20

mint:~/SO$ l
drwxr-xr-x 2 Nic3500 Nic3500 4096 May 10 10:36 A/
drwxr-xr-x 2 Nic3500 Nic3500 4096 May 10 10:36 B/

mint:~/SO$ l A
total 0
-rw-r--r-- 1 Nic3500 Nic3500 0 May 10 10:06 file1.fasta.profile
-rw-r--r-- 1 Nic3500 Nic3500 0 May 10 10:06 file2.fasta.profile
-rw-r--r-- 1 Nic3500 Nic3500 0 May 10 10:14 file3.fasta.profile
-rw-r--r-- 1 Nic3500 Nic3500 0 May 10 10:36 file4.fasta.profile
-rw-r--r-- 1 Nic3500 Nic3500 0 May 10 10:06 file.fasta.profile
mint:~/SO$ l B
total 0
-rw-r--r-- 1 Nic3500 Nic3500 0 May 10 10:05 file1.dssp
-rw-r--r-- 1 Nic3500 Nic3500 0 May 10 10:06 file2.dssp
-rw-r--r-- 1 Nic3500 Nic3500 0 May 10 10:06 file3.dssp
-rw-r--r-- 1 Nic3500 Nic3500 0 May 10 10:05 file.dssp


mint:~/SO$ ./so.py
Files in directory A: {'file1', 'file', 'file3', 'file2', 'file4'}
Files in directory B: {'file1', 'file', 'file3', 'file2'}
Files only in directory A: {'file4'}
file=A/file4.fasta.profile


mint:~/SO$ l A
total 0
-rw-r--r-- 1 Nic3500 Nic3500 0 May 10 10:06 file1.fasta.profile
-rw-r--r-- 1 Nic3500 Nic3500 0 May 10 10:06 file2.fasta.profile
-rw-r--r-- 1 Nic3500 Nic3500 0 May 10 10:14 file3.fasta.profile
-rw-r--r-- 1 Nic3500 Nic3500 0 May 10 10:06 file.fasta.profile
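
The script above prunes directory A only, while the question asks for both folders to end up with just the shared names. A minimal sketch of a symmetric variant follows. The names stem and prune_to_common and the dry_run flag are illustrative assumptions, not part of the original answer, and "name without extension" is assumed to mean everything before the first dot.

```python
import os


def stem(filename):
    """Return the part of a file name before its first dot."""
    return filename.split(".", 1)[0]


def list_files(directory):
    """Return the names of regular files directly inside `directory`."""
    return [
        name
        for name in os.listdir(directory)
        if os.path.isfile(os.path.join(directory, name))
    ]


def prune_to_common(dir_a, dir_b, dry_run=True):
    """Remove files whose stem is missing from the other directory.

    Returns the sorted paths that were removed, or that would be removed
    when dry_run is True (the default, so nothing is deleted by accident).
    """
    common = {stem(n) for n in list_files(dir_a)} & {stem(n) for n in list_files(dir_b)}

    removed = []
    for directory in (dir_a, dir_b):
        for name in list_files(directory):
            if stem(name) not in common:
                path = os.path.join(directory, name)
                removed.append(path)
                if not dry_run:
                    os.remove(path)
    return sorted(removed)
```

Calling prune_to_common("profile", "dssp") only reports the candidates; passing dry_run=False performs the deletion once the list looks right.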

Solution 2:[2]

Try this Shellcheck-clean Bash program:

#! /bin/bash -p

folder_a=PATH_TO_FOLDER_A
folder_b=PATH_TO_FOLDER_B

shopt -s nullglob
for ppath in "$folder_a"/*.profile; do
    pfile=${ppath##*/}
    dfile=${pfile%.fasta.profile}.dssp
    dpath=$folder_b/$dfile
    [[ -f $dpath ]] || echo rm -v -- "$ppath"
done

  • It currently just prints what it would do. Remove the echo once you are sure that it will do what you want.
  • shopt -s nullglob makes globs expand to nothing when nothing matches (otherwise they expand to the glob pattern itself, which is almost never useful in programs).
  • See Removing part of a string (BashFAQ/100 (How do I do string manipulation in bash?)) for information about the string manipulation mechanisms used (e.g. ${ppath##*/}).
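
For readers who find the parameter expansions hard to read, here is a rough Python rendering of the name mapping used in the loop. It is only a sketch: strip_suffix and dssp_counterpart are hypothetical helper names, and the default suffix .fasta.profile is an assumption taken from the file names in the question.

```python
import os


def strip_suffix(text, suffix):
    """Mimic ${var%suffix} for a literal suffix: drop it if present."""
    if suffix and text.endswith(suffix):
        return text[: -len(suffix)]
    return text


def dssp_counterpart(profile_path, profile_ext=".fasta.profile"):
    """Map a profile path to the .dssp file name expected in the other folder."""
    name = os.path.basename(profile_path)        # like ${ppath##*/}
    return strip_suffix(name, profile_ext) + ".dssp"


print(dssp_counterpart("profile/file1.fasta.profile"))   # prints file1.dssp
```

Note that stripping only .profile from file1.fasta.profile would give file1.fasta.dssp, which does not match the names shown in the question.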

Solution 3:[3]

With find:

find 'folder A' -type f -name '*.fasta.profile' -exec sh -c \
'! [ -f "folder B/$(basename -s .fasta.profile "$1").dssp" ]' _ {} \; -print

Replace -print with -delete once you are convinced that it does what you want.

Or, maybe a bit faster:

find 'folder A' -type f -name '*.fasta.profile' -exec sh -c \
'for f in "$@"; do [ -f "folder B/$(basename -s .fasta.profile "$f").dssp" ] || echo rm "$f"; done' _ {} +

Remove echo once you are convinced that it does what you want.
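
The same check can be sketched in Python with pathlib, which may be easier to adapt than the quoting inside sh -c. This is an approximation of the find commands above, not a drop-in replacement: the function name profiles_without_dssp is hypothetical, and it only reports files rather than deleting them.

```python
from pathlib import Path


def profiles_without_dssp(folder_a, folder_b, suffix=".fasta.profile"):
    """Return profile files under folder_a with no matching .dssp in folder_b."""
    folder_b = Path(folder_b)
    orphans = []
    # rglob descends into subdirectories, roughly like find does
    for path in Path(folder_a).rglob("*" + suffix):
        if not path.is_file():
            continue
        base = path.name[: -len(suffix)]          # like basename -s SUFFIX
        if not (folder_b / (base + ".dssp")).is_file():
            orphans.append(path)
    return sorted(orphans)
```

Each returned path can be inspected first and then removed with path.unlink() once the list looks correct.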

Solution 4:[4]

Here is a way to do it:

  • for both A and B directories, list the files under each directory, without the extension.
  • compare both lists, show only the file that does not appear in both.

Code:

#!/bin/bash

>a.list
>b.list

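for file in A/*
do
    name=$(basename "$file")
    printf '%s\n' "${name%%.*}" >>a.list
done

for file in B/*
do
    name=$(basename "$file")
    printf '%s\n' "${name%%.*}" >>b.list
done

comm -23 <(sort a.list) <(sort b.list) >delete.list

while IFS= read -r line; do
    rm -v A/"$line".*
done < "delete.list"

# cleanup
rm -f a.list b.list delete.list

  • basename removes the path
  • "${name%%.*}" removes everything from the first dot, so file.fasta.profile and file.dssp both reduce to file ("${file%.*}" would strip only the last extension and leave file.fasta, which never matches)
  • comm -23 ... shows only the lines that appear only in a.list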

EDIT May 10th: my initial code listed the file, but did not delete it. Now it does.
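
The comm -23 step in the script above suppresses the lines unique to the second file and the lines common to both, leaving only the lines unique to the first. On a list of unique names that is a set difference, which can be sketched in Python as follows (only_in_first is a hypothetical name, and the sample names are taken from the test run shown under Solution 1):

```python
def only_in_first(first, second):
    """Return the sorted names present in `first` but not in `second`."""
    return sorted(set(first) - set(second))


names_a = ["file", "file1", "file2", "file3", "file4"]
names_b = ["file", "file1", "file2", "file3"]
print(only_in_first(names_a, names_b))   # prints ['file4']
```

Unlike comm, this version does not need sorted input, and it collapses duplicate names, which is harmless here because file names within one directory are unique.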

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1
Solution 2 pjh
Solution 3
Solution 4