How to find all longest common substrings that exist in multiple documents?

I have many text documents that I want to compare to one another so I can remove all text that is exactly the same between them. The goal is to find boilerplate text that is consistent across documents so it can be removed before NLP.

The best way I have figured to do this is to find the longest common substrings that exist, or are mostly present, in all the documents. However, doing this has been incredibly slow.

Here is an example of what I am trying to accomplish:

DocA:

Title: To Kill a Mocking Bird
Author: Harper Lee
Published: July 11, 1960

DocB:

Title: 1984
Author: George Orwell
Published: June 1949

DocC:

Title: The Great Gatsby
Author: F. Scott Fitzgerald

The output would show something like:

{
    'Title': 3,
    'Author': 3,
    'Published': 2,
}

The results would then be used to strip out the commonalities between documents.
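For the stripping step itself, here is a minimal sketch of what I have in mind (the `strip_common` helper and its `min_docs` parameter are just illustrative names, not part of the code below): remove the counted substrings verbatim, longest first, so shorter fragments don't break up the longer matches they are part of.

def strip_common(doc, results, min_docs=2):
    # Remove longer substrings first so that shorter fragments
    # don't break up the longer matches they belong to.
    for phrase in sorted(results, key=len, reverse=True):
        if results[phrase] >= min_docs:
            doc = doc.replace(phrase, '')
    return doc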

Here is some code I have tested in Python. It's incredibly slow with any significant number of permutations:

import itertools
from difflib import SequenceMatcher

import pandas as pd

# Compare every ordered pair of documents.
file_perms = list(itertools.permutations(files, 2))
results = {}
for doc_a, doc_b in file_perms:
    while True:
        seq_match = SequenceMatcher(a=doc_a, b=doc_b)
        match = seq_match.find_longest_match(0, len(doc_a), 0, len(doc_b))

        # Only keep matches of at least 5 characters.
        if match.size >= 5:
            doc_a_start, doc_a_stop = match.a, match.a + match.size
            doc_b_start, doc_b_stop = match.b, match.b + match.size
            match_word = doc_a[doc_a_start:doc_a_stop]

            results[match_word] = results.get(match_word, 0) + 1

            # Cut the match out of both documents and look for the next one.
            doc_a = doc_a[:doc_a_start] + doc_a[doc_a_stop:]
            doc_b = doc_b[:doc_b_start] + doc_b[doc_b_stop:]
        else:
            break

df = pd.DataFrame(
    {
        'Value': list(results.keys()),
        'Count': list(results.values())
    }
)
print(df)
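One small saving, independent of the matching itself: `itertools.permutations(files, 2)` yields each pair of documents twice (once in each order), so every comparison above is done twice and every match is counted twice. Since the longest match between two documents doesn't depend on their order, `itertools.combinations` halves the work:

import itertools

# Each unordered pair appears exactly once, halving the number of comparisons
# (and the counts are no longer doubled).
file_pairs = list(itertools.combinations(files, 2))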


Solution 1:[1]

Create a set from each document and build a counter of how many documents each word appears in. Then iterate over every document: when you find a word that appears in 70%-90% of documents, append it and the word after it as a tuple to a new counter. Then repeat the process to grow longer phrases.

from collections import Counter

# Count, for each word, the number of documents it appears in
# (the set ensures each document counts a word at most once).
one_word = Counter()
for doc in docs:
    word_list = doc.split(" ")
    word_set = set(word_list)
    for word in word_set:
        one_word[word] += 1

two_word = Counter()
threshold = len(docs) * 0.7

# Extend frequent words into two-word phrases.
for doc in docs:
    word_list = doc.split(" ")
    for i in range(len(word_list) - 1):
        if one_word[word_list[i]] > threshold:
            key = (word_list[i], word_list[i + 1])
            two_word[key] += 1

You can play with the threshold and keep growing the phrases as long as the counter is not empty.
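As a minimal sketch of that iteration (assuming `docs` is a list of document strings; the `grow_phrases` helper is my own illustration, not part of the code below), each pass extends every surviving phrase by one word and keeps only the phrases that still clear the threshold, stopping when nothing survives:

from collections import Counter

def grow_phrases(docs, threshold):
    # Start with single words, counted once per document.
    counts = Counter()
    for doc in docs:
        for word in set(doc.split()):
            counts[word] += 1
    survivors = Counter({(w,): c for w, c in counts.items() if c > threshold})

    all_common = Counter(survivors)
    n = 1
    while survivors:
        n += 1
        longer = Counter()
        for doc in docs:
            words = doc.split()
            seen = set()
            for i in range(len(words) - n + 1):
                # Extend a surviving (n-1)-word phrase by one word.
                if tuple(words[i:i + n - 1]) in survivors:
                    seen.add(tuple(words[i:i + n]))
            for phrase in seen:
                longer[phrase] += 1
        survivors = Counter({p: c for p, c in longer.items() if c > threshold})
        all_common.update(survivors)
    return all_common

With `threshold = len(docs) * 0.7` this reproduces the 70% idea from above; raising the threshold keeps only phrases shared by more documents.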

The docs are lyrics of the songs "Believer", "By the Rivers of Babylon", "I Could Stay Awake", and "Rattlin' Bog":

from collections import Counter
import os
import glob

TR = 1  # threshold
dir = r"D:\docs"
path = os.path.join(dir, "*.txt")
files = glob.glob(path)

one_word = {}
all_docs = {}
for file in files:
    one_word[file] = set()
    all_docs[file] = []
    with open(file) as doc:
        for row in doc:
            for word in row.split():
                one_word[file].add(word)
                all_docs[file].append(word)
# now one_word is a dict where the key is the file name and the value is the set of words in it
# all_docs is a dict where the file name is the key and the value is the complete doc stored as a list, word by word

common_Frase = Counter()
for key in one_word:
    for word in one_word[key]:
        common_Frase[word] += 1
# common_Frase contains a count of each word's appearances across all files (every file can add a word once)

two_word = {}
for key in all_docs:
    two_word[key] = set()
    doc = all_docs[key]
    for index in range(len(doc) - 1):
        if common_Frase[doc[index]] > TR:
            val = (doc[index], doc[index + 1])
            two_word[key].add(val)

for key in two_word:
    for word in two_word[key]:
        common_Frase[word] += 1
# now common_Frase also contains a count of all two-word phrases

three_word = {}
for key in all_docs:
    three_word[key] = set()
    doc = all_docs[key]
    for index in range(len(doc) - 2):
        val2 = (doc[index], doc[index + 1])
        if common_Frase[val2] > TR:
            val3 = (doc[index], doc[index + 1], doc[index + 2])
            three_word[key].add(val3)

for key in three_word:
    for word in three_word[key]:
        common_Frase[word] += 1

for k in common_Frase:
    if common_Frase[k] > 1:
        print(k)

This is the output:

when like all Don't And one the my hear and feeling Then your of I'm in me The you away I never to be what a ever thing there from By down Now words that was ('all', 'the') ('And', 'the') ('the', 'words') ('By', 'the') ('and', 'the') ('in', 'the')

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

[1]: Solution 1