Find matching phrases of at least 5 words between two texts in R
I have two vectors of text that I want to compare to identify any phrases of at least 5 words that the two have in common.
Here is some example text with a longer phrase copied and pasted from note1 into note2, and a short phrase copied and pasted from note2 into note1 that is too short to qualify:
#read in the two documents to compare
note1 <- c("Call me Ishmael. Some years ago--never mind how long precisely--having little or no money in my purse, It was the best")
note2 <- c("It was the best of times, it was the worst of times, it was the age of wisdom, Some years ago--never mind how long precisely")
What I want to identify is:
"Some years ago--never mind how long precisely"
I can convert the two texts to 5-grams (with something like tidytext or quanteda) and then find the 5-grams that are in common. However, I want the longest common phrase that is at least 5 words long, rather than all of the overlapping 5-word phrases that make up that longer phrase. I also don't want it to ignore punctuation or capitalization.
Edit: I would prefer something mechanical and easily explainable, along the lines of: if an ordered 5-word phrase from note1 also appears in note2, check whether the duplication extends beyond those 5 words, then return the longest duplicated string; repeat for all possible 5+ word phrases. I can imagine a process where I create many n-grams (5-grams, 6-grams, 7-grams, etc.) and ask: do the 5-grams overlap? If so, check whether the 6-grams overlap, and repeat until the longest common n-gram is returned.
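That n-gram escalation logic can be sketched in base R. This is a minimal, hand-rolled illustration (the function name `longest_common_phrase` and the plain whitespace tokenizer are my own choices, not from any package); it returns all common n-grams at the largest n that still overlaps, so ties come back as a vector:

```r
# Find the longest common word sequence (>= min_words words) between two
# texts, comparing tokens exactly, so punctuation and capitalization count.
longest_common_phrase <- function(a, b, min_words = 5) {
  wa <- unlist(strsplit(a, " ", fixed = TRUE))
  wb <- unlist(strsplit(b, " ", fixed = TRUE))
  max_n <- min(length(wa), length(wb))
  if (max_n < min_words) return(character(0))

  # All n-grams of a word vector, pasted back into phrases
  grams <- function(w, n) {
    vapply(seq_len(length(w) - n + 1),
           function(i) paste(w[i:(i + n - 1)], collapse = " "),
           character(1))
  }

  best <- character(0)
  for (n in min_words:max_n) {
    common <- intersect(grams(wa, n), grams(wb, n))
    # If no n-gram overlaps, no longer n-gram can overlap either
    if (length(common) == 0) break
    best <- common
  }
  best
}

note1 <- c("Call me Ishmael. Some years ago--never mind how long precisely--having little or no money in my purse, It was the best")
note2 <- c("It was the best of times, it was the worst of times, it was the age of wisdom, Some years ago--never mind how long precisely")
longest_common_phrase(note1, note2)
#> [1] "Some years ago--never mind how long"
```

Note that with whitespace tokenization, "precisely--having" and "precisely" are different tokens, so the match stops at "long"; the too-short "It was the best" (4 words) is correctly excluded.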
Solution 1:[1]
Maybe you are looking for the Smith-Waterman local alignment algorithm, available in the text.alignment package:
library(text.alignment)
note1 <- c("Call me Ishmael. Some years ago--never mind how long precisely--having little or no money in my purse, It was the best")
note2 <- c("It was the best of times, it was the worst of times, it was the age of wisdom, Some years ago--never mind how long precisely")
alignment <- smith_waterman(note1, note2, type = "words", tokenizer = function(x) unlist(strsplit(x, split = " ")))
alignment
#> Swith Waterman local alignment score: 12
#> ----------
#> Document a
#> ----------
#> Some years ago--never mind how long
#> ----------
#> Document b
#> ----------
#> Some years ago--never mind how long
as.data.frame(alignment)
#> a
#> 1 Call me Ishmael. Some years ago--never mind how long precisely--having little or no money in my purse, It was the best
#> b
#> 1 It was the best of times, it was the worst of times, it was the age of wisdom, Some years ago--never mind how long precisely
#> sw similarity matches mismatches a_n a_aligned
#> 1 12 0.2857143 6 0 21 Some years ago--never mind how long
#> a_similarity a_gaps a_from a_to a_fromto a_misaligned b_n
#> 1 0.2857143 0 4 9 Some, ye.... c("Call".... 25
#> b_aligned b_similarity b_gaps b_from b_to
#> 1 Some years ago--never mind how long 0.24 0 19 24
#> b_fromto b_misaligned
#> 1 Some, ye.... c("It", ....
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | user13818093 |
