Find matching phrases of at least 5 words between two texts in R

I have two vectors of text that I want to compare, to identify any phrases of at least 5 words that are common to both.

Here is some example text, with a longer phrase copied and pasted from note1 into note2, and a shorter phrase copied from note2 into note1 that is too short to qualify:

#read in the two documents to compare
note1 <- c("Call me Ishmael. Some years ago--never mind how long precisely--having little or no money in my purse, It was the best")
note2 <- c("It was the best of times, it was the worst of times, it was the age of wisdom, Some years ago--never mind how long precisely")

What I want to identify is:

"Some years ago--never mind how long precisely"

I can convert the two texts to 5-grams (with something like tidytext or quanteda) and then find the 5-grams they have in common, but I want the longest common phrase that is at least 5 words long, rather than every overlapping 5-word phrase that makes up that longer phrase. I also don't want the comparison to ignore punctuation or capitalization.

Edit: I would prefer something mechanical and easily explainable. Something like this logic: if an ordered 5-word phrase from note1 also appears in note2, check whether the duplication extends beyond those 5 words, then return the longest duplicated string. Repeat for all possible 5+ word phrases. I can imagine a process where I create many n-grams (5-grams, 6-grams, 7-grams, etc.) and ask: do the 5-grams overlap? If so, check whether the 6-grams overlap, and repeat until the longest common n-gram is returned.
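That edit's logic can be sketched directly in base R (a sketch, not a tested package solution; `longest_common_phrase` is a name I made up): tokenize on whitespace so punctuation and capitalization stay part of each token, enumerate n-grams from the longest possible window down to 5 words, and stop at the first length with any overlap — which is, by construction, the longest shared phrase.

# Longest common run of words (at least `min_len` long), comparing
# tokens exactly, so punctuation and capitalization both matter.
longest_common_phrase <- function(a, b, min_len = 5) {
  wa <- strsplit(a, "\\s+")[[1]]   # split on whitespace only
  wb <- strsplit(b, "\\s+")[[1]]
  ngrams <- function(w, n) {
    vapply(seq_len(length(w) - n + 1),
           function(i) paste(w[i:(i + n - 1)], collapse = " "),
           character(1))
  }
  hi <- min(length(wa), length(wb))
  if (hi < min_len) return(character(0))
  # Try the longest possible window first and shrink toward min_len,
  # so the first hit is the longest shared phrase.
  for (n in hi:min_len) {
    common <- intersect(ngrams(wa, n), ngrams(wb, n))
    if (length(common) > 0) return(common)
  }
  character(0)
}

longest_common_phrase(note1, note2)
# "Some years ago--never mind how long"

Note that under whitespace tokenization, "precisely--having" in note1 and "precisely" in note2 are different tokens, so the match stops at "long" — which is also what the Smith-Waterman alignment finds.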



Solution 1:[1]

Maybe you are looking for the Smith-Waterman local alignment algorithm:

library(text.alignment)
note1 <- c("Call me Ishmael. Some years ago--never mind how long precisely--having little or no money in my purse, It was the best")
note2 <- c("It was the best of times, it was the worst of times, it was the age of wisdom, Some years ago--never mind how long precisely")
alignment <- smith_waterman(note1, note2, type = "words", tokenizer = function(x) unlist(strsplit(x, split = " ")))
alignment
#> Smith Waterman local alignment score: 12
#> ----------
#> Document a
#> ----------
#> Some years ago--never mind how long
#> ----------
#> Document b
#> ----------
#> Some years ago--never mind how long
as.data.frame(alignment)
#>                                                                                                                        a
#> 1 Call me Ishmael. Some years ago--never mind how long precisely--having little or no money in my purse, It was the best
#>                                                                                                                              b
#> 1 It was the best of times, it was the worst of times, it was the age of wisdom, Some years ago--never mind how long precisely
#>   sw similarity matches mismatches a_n                           a_aligned
#> 1 12  0.2857143       6          0  21 Some years ago--never mind how long
#>   a_similarity a_gaps a_from a_to     a_fromto a_misaligned b_n
#> 1    0.2857143      0      4    9 Some, ye.... c("Call"....  25
#>                             b_aligned b_similarity b_gaps b_from b_to
#> 1 Some years ago--never mind how long         0.24      0     19   24
#>       b_fromto b_misaligned
#> 1 Some, ye.... c("It", ....
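If you want the matched phrase as a plain string rather than reading it off the printed summary, it is available in the `a_aligned` / `b_aligned` columns of the data frame shown above:

as.data.frame(alignment)$a_aligned
#> [1] "Some years ago--never mind how long"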

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 user13818093