What is a more efficient way to find matches of strings in a NumPy array as substrings in another array?

I've got two NumPy arrays containing strings. For each string in the first array, I want to find all strings in the second array that contain it as a substring.

As a very easy example:

import numpy as np

result_list = []
array_1 = np.array(['ab', 'fo', 'ba'])
array_2 = np.array(['lab', 'abc', 'zwf', 'foo', 'bar'])

for word_to_search in array_1:
    for potential_word in array_2:
        if word_to_search in potential_word:
            result_list.append(potential_word)

# delete duplicates
result_list = list(set(result_list))

# result_list = ['lab', 'abc', 'foo', 'bar']

I tried it with basic Python lists and also with NumPy arrays. The latter perform much better, but I still think there must be a better solution.

Since my array_1 has about 11,000,000 entries and my array_2 has about 300,000 entries, I need a very efficient approach; my current solution is far too slow.



Solution 1:[1]

You can employ the existing short-circuit methods any and all; here, you need only any.

result_list = [word for word in array_2
               if any(digram in word for digram in array_1)]

Result:

['lab', 'abc', 'foo', 'bar']

You could also use filter:

result_list = list(filter(lambda word:
                          any(digram in word for digram in array_1),
                          array_2))

Solution 2:[2]

I don't know a slick trick to make it much faster, but I can offer a slight improvement. Using numpy arrays has no effect in this case because you're manually looping over the individual entries anyway; the arrays would only offer an advantage if some NumPy method could do the work for us. For now I would suggest removing entries from list_2 once they have been matched, and using a while loop that stops once list_2 is empty.

EDIT

Using set logic right from the beginning actually seems even a bit faster; I've added that method as well.

Here are some rudimentary speed comparisons and included @Prune's suggestions as well:

from time import time


def original(list_1, list_2, result_list):
    for word_to_search in list_1:
        for potential_word in list_2:
            if word_to_search in potential_word:
                result_list.append(potential_word)
    return list(set(result_list))


def while_loop(list_1, list_2, result_list):
    i = 0
    while i < len(list_1) and list_2:
        word_to_search = list_1[i]
        for potential_word in list_2:
            if word_to_search in potential_word:
                result_list.append(potential_word)
                list_2.remove(potential_word)
        i += 1

    return result_list


def set_(list_1, list_2, result_list):
    result_set = set()
    for word_to_search in list_1:
        for potential_word in list_2:
            if word_to_search in potential_word:
                result_set.add(potential_word)
    return list(result_set)


def any_(list_1, list_2, result_list):
    return [
        word for word in list_2 if any(digram in word for digram in list_1)
    ]


def filter_(list_1, list_2, result_list):
    return list(
        filter(lambda word: any(digram in word for digram in list_1), list_2)
    )


methods = {
    'original': original,
    'while_loop': while_loop,
    'set': set_,
    'any': any_,
    'filter': filter_,
}

result_list = []
list_1 = ['ab', 'fo', 'ba']
list_2 = ['lab', 'abc', 'zwf', 'foo', 'bar']
for name, method in methods.items():
    start = time()
    for i in range(10000):
        # pass copies so that while_loop's in-place removal and
        # original's appends don't leak into later runs
        method(list_1, list_2.copy(), [])
    stop = time()
    print(f'{name}: {stop - start}')

Returns:

original: 1.475600004196167
while_loop: 0.006448030471801758
set: 0.004959821701049805
any: 0.006943941116333008
filter: 0.008928060531616211

Depending on your actual data this may be completely off though.

EDIT

I feel like there may be some nifty way to use str(list_2) together with string methods, which would be very fast, but I just can't think of one at the moment. You could also use regex, but that would probably be way too slow.
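For what it's worth, the regex idea can be sketched like this: compile a single alternation pattern from all of the search strings and scan each candidate word once. Whether this beats the pure-Python loops depends entirely on your data; this is just a sketch of the approach, not a benchmarked recommendation.

```python
import re

array_1 = ['ab', 'fo', 'ba']
array_2 = ['lab', 'abc', 'zwf', 'foo', 'bar']

# Build one alternation pattern from all search strings; re.escape
# guards against characters that have special meaning in a regex.
pattern = re.compile('|'.join(re.escape(s) for s in array_1))

# Keep every word that contains at least one of the search strings.
result_list = [word for word in array_2 if pattern.search(word)]
print(result_list)  # ['lab', 'abc', 'foo', 'bar']
```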

Solution 3:[3]

After the step result_list.append(potential_word) you can break out of the inner for loop; that way you get no duplicates and definitely save time. You also need to swap the loop order for this: the outer loop iterates over array_2 and the inner one over array_1.

import numpy as np

result_list = []
array_1 = np.array(['ab', 'fo', 'ba'])
array_2 = np.array(['lab', 'abc', 'zwf', 'foo', 'bar'])

for potential_word in array_2:
    for word_to_search in array_1:
        if word_to_search in potential_word:
            result_list.append(potential_word)
            break

print(result_list)

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Prune
Solution 2
Solution 3 mkrieger1