Matched substrings from a List of strings [closed]
I have a List of Strings like this
List<String> list = Arrays.asList("Birth City", "City of Birth", "Location", "Geo Address", "Last Known Address");
I want to extract a List of Strings that are similar (by similar I mean any substrings matching).
I want Birth City, City of Birth, Geo Address, and Last Known Address, because the first two have Birth as a match and the next two have Address as a match.
I'm trying this, but I know I'm missing something:
List<String> myList = Arrays.asList("Birth City", "City of Birth", "Location", "Geo Address", "Last Known Address");
myList.stream().forEach(line -> {
System.out.println(getMatchingSubstring(line,myList));
});
private static String getMatchingSubstring(String str, List<String> substrings) {
for (String substring : substrings) {
if (str.contains(substring)) {
return substring;
}
}
return null;
}
Solution 1:[1]
Your method getMatchingSubstring receives each element of myList together with myList itself, then checks whether that element contains any string in the list. Since the element is always in the list, it always contains itself, so every element is returned and printed, including Location.
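To see this concretely, here is a small sketch that reuses the question's method unchanged (the class name `SelfMatchDemo` is made up for illustration). Every element comes back as its own match:

```java
import java.util.Arrays;
import java.util.List;

class SelfMatchDemo {
    // Same logic as getMatchingSubstring in the question.
    static String getMatchingSubstring(String str, List<String> substrings) {
        for (String substring : substrings) {
            if (str.contains(substring)) {
                return substring;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        List<String> myList = Arrays.asList("Birth City", "City of Birth", "Location", "Geo Address", "Last Known Address");
        for (String line : myList) {
            // Each line contains itself, so the method returns the line it was given.
            System.out.println(getMatchingSubstring(line, myList));
        }
    }
}
```

Running it prints the five input strings in their original order, which is why nothing gets filtered out.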
Some suggestions:
- First, pass the list without the current element (line) to avoid comparing a string with itself.
- Instead of comparing whole strings, split each string on the space character and compare the tokens to find an intersection between strings. This is the similarity you're talking about.
- Print only the lines for which getMatchingSubstring did not return null.
It will look something like this:
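import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

public class Main {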
public static void main(String[] args) {
List<String> myList = Arrays.asList("Birth City", "City of Birth", "Location", "Geo Address", "Last Known Address");
myList.stream().forEach(line -> {
String result = getMatchingSubstring(line,
myList.stream().filter(element -> !element.equals(line)).collect(Collectors.toList()));
if (result != null){
System.out.println(result);
}
});
}
private static String getMatchingSubstring(String str, List<String> otherStrings) {
String[] strTokens = str.split(" ");
for (String substring : otherStrings) {
String[] subStringTokens = substring.split(" ");
if (intersection(strTokens, subStringTokens)) {
return substring;
}
}
return null;
}
private static boolean intersection(String[] a, String[] b) {
List<String> l1 = Arrays.asList(a);
List<String> l2 = Arrays.asList(b);
Set<String> result = l1.stream().distinct().filter(l2::contains).collect(Collectors.toSet());
return result.size() > 0;
}
}
Output:
City of Birth
Birth City
Last Known Address
Geo Address
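Note that the code above prints the partner each line matched, not the line itself. If you want the matching lines in input order, one possible variation is sketched below. The names `TokenOverlap`, `sharesToken` and `similar` are hypothetical, not part of the answer; `Collections.disjoint` is the JDK method that reports whether two collections have no element in common, so it can stand in for the intersection helper.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

class TokenOverlap {
    // True when the two strings share at least one space-separated word (case-sensitive).
    static boolean sharesToken(String a, String b) {
        List<String> tokensA = Arrays.asList(a.split(" "));
        List<String> tokensB = Arrays.asList(b.split(" "));
        return !Collections.disjoint(tokensA, tokensB);
    }

    // Keeps every string that shares a word with some other element of the list.
    static List<String> similar(List<String> input) {
        List<String> result = new ArrayList<>();
        for (int i = 0; i < input.size(); i++) {
            for (int j = 0; j < input.size(); j++) {
                // Compare by index so an element is never matched against itself.
                if (i != j && sharesToken(input.get(i), input.get(j))) {
                    result.add(input.get(i));
                    break; // one partner is enough
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> myList = Arrays.asList("Birth City", "City of Birth", "Location", "Geo Address", "Last Known Address");
        System.out.println(similar(myList));
    }
}
```

For the list in the question this yields Birth City, City of Birth, Geo Address and Last Known Address.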
Solution 2:[2]
To achieve your goal, you could first split each String of your main List into words with a regex that matches whitespace. Then traverse the main List by index with an IntStream and, for the array of words belonging to the i-th String, check with a nested stream whether any of its words appears in one of the other strings. Strings for which this holds pass the filter operation and are collected; the rest are discarded.
List<String> list = Arrays.asList("Birth City", "City of Birth", "Location", "Geo Address", "Last Known Address");
// Needs java.util.Arrays, java.util.List, java.util.stream.Collectors and java.util.stream.IntStream
// Split each String into words on whitespace
List<String[]> listSplitStr = list.stream()
    .map(str -> str.split("\\s+"))
    .collect(Collectors.toList());
// Traverse the list by index and keep the i-th string if any of its words
// is contained in a different string of the main list
List<String> listRes = IntStream.range(0, list.size())
    .filter(i -> Arrays.stream(listSplitStr.get(i)).anyMatch(word ->
        IntStream.range(0, list.size()).anyMatch(j ->
            !list.get(i).equals(list.get(j)) && list.get(j).contains(word))))
    .mapToObj(i -> list.get(i)) // Retrieve the i-th string from the main list
    .collect(Collectors.toList());
System.out.println(listRes);
This prints the output you expected:
[Birth City, City of Birth, Geo Address, Last Known Address]
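One caveat: `list.get(j).contains(word)` is a substring test, so a short word such as "of" would also match inside "Professor". If you want whole-word matching instead, a possible alternative is sketched below (the names `WordCountMatch` and `similar` are made up for this example). It counts in how many strings each word occurs and keeps the strings that contain a word seen in at least two of them.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;

class WordCountMatch {
    static List<String> similar(List<String> input) {
        // word -> number of strings that contain it as a whole word
        Map<String, Integer> counts = new HashMap<>();
        for (String s : input) {
            // Use a set so a word repeated inside one string is counted once.
            Set<String> words = new HashSet<>(Arrays.asList(s.split("\\s+")));
            for (String w : words) {
                counts.merge(w, 1, Integer::sum);
            }
        }
        // Keep a string when at least one of its words occurs in two or more strings.
        return input.stream()
            .filter(s -> Arrays.stream(s.split("\\s+")).anyMatch(w -> counts.get(w) > 1))
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> list = Arrays.asList("Birth City", "City of Birth", "Location", "Geo Address", "Last Known Address");
        System.out.println(similar(list));
    }
}
```

This makes a single pass to count and a single pass to filter, instead of comparing every pair of strings. Two identical strings in the input would count as a match for each other, which may or may not be what you want.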
Solution 3:[3]
if (str.contains(substring)) {
Here you are checking each whole string against every element of the input list. Since the string itself is in that list, the check succeeds for every element.
As a naive approach, you could split each string on whitespace and check the individual words:
private static List<String> getMatchingSubstring(List<String> substrings) {
List<String> result = new ArrayList<>();
for (int i = 0; i < substrings.size(); i++) {
String[] tokens = substrings.get(i).split("\\u0020");
for (String token : tokens) {
// Only look at later elements; earlier pairs were already checked
for (int j = i + 1; j < substrings.size(); j++) {
if (substrings.get(j).contains(token)) {
// Add each string at most once, even when it matches several others
if (!result.contains(substrings.get(i))) {
result.add(substrings.get(i));
}
if (!result.contains(substrings.get(j))) {
result.add(substrings.get(j));
}
}
}
}
}
return result;
}
For more complex matching scenarios, you can also look into fuzzy string matching algorithms.
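As one example of such an algorithm, Levenshtein edit distance counts the single-character insertions, deletions and substitutions needed to turn one string into another. The sketch below is a plain textbook implementation, not something taken from the answer. The class name `EditDistance` is a placeholder, and the distance that should count as "similar" is left for you to choose.

```java
class EditDistance {
    // Levenshtein distance using two rolling rows of the dynamic-programming table.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        // Turning the empty prefix of a into the first j characters of b takes j insertions.
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // deleting i characters of a gives the empty string
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        System.out.println(levenshtein("Address", "Adress"));
    }
}
```

With a measure like this, a misspelling such as "Adress" sits at distance 1 from "Address", whereas a plain contains check would treat the two as unrelated.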
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | |
| Solution 2 | |
| Solution 3 | Yuvaraj R |


