Look up multiple words in a large string, with each word being at most k words apart from another word

Imagine you are given a long string of text, e.g. "The quick brown fox jumped over the lazy dog and lorem ipsum etc", and then you are given an array of words like ["quick", "brown"] and an int k. How do I write a query function that returns a range defined by start and end in the long string text, where text[start] and text[end] are both words in words, and within start and end in text all of the words in words occur, at most k words apart?

I expect to run this query function multiple times for a given long string text, so I'm hoping for an algorithm that is O(f + s + t + ...) where f is the number of occurrences of the first word in words in text, s is the number of occurrences of the second word in words in text and so forth.

What I have is the following -


from collections import defaultdict


def query(text: str, words: list, k: int):
    word2idx_list = get_word2idx_list(text)
    # ... this is where I get stuck


def get_word2idx_list(text: str):
    word2idx_list = defaultdict(list)
    for i, word in enumerate(text.split(" ")):
        word2idx_list[word].append(i)
    return dict(word2idx_list)

but after this I get stuck - do I use a sliding window?



Solution 1:[1]

algorithm sketch

  • inputs:
    • k = 1 # max separation between words
    • haystack = 'the quick ... lorem ipsum'.split()
    • needles = set('dog fox lazy'.split())
  • state:
    • candidates = set() # tuples of 1 or more word indexes

We win if we ever discover a tuple s.t. len(candidate) == len(needles).

1st invariant: For each index i stored within each of the candidates, it will be the case that haystack[i] in needles.

2nd invariant: When scanning a sorted(candidate), distance between adjacent indexes will be at most k.

preprocessing

To get the ball rolling, do a linear scan over the haystack. At each index we

  • ask whether haystack[i] in needles
  • if so, insert several 1-tuples:
    • for j in range(i - k, i + k + 1):
      • candidates.add((j,))
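For concreteness, here is that preprocessing pass as runnable Python, using a shortened version of the question's sentence as the haystack and the needles from the sketch above. Clamping the j-range to the haystack bounds is my addition (the handwaving note further down suggests either clamping or sentinel padding).

haystack = 'the quick brown fox jumped over the lazy dog'.split()
needles = set('dog fox lazy'.split())
k = 1

candidates = set()                  # 1-tuples of candidate word indexes
for i, word in enumerate(haystack):
    if word in needles:             # a needle occurs at index i
        # seed a 1-tuple for every index within k of this occurrence,
        # clamped to the haystack bounds
        for j in range(max(0, i - k), min(len(haystack), i + k + 1)):
            candidates.add((j,))

print(sorted(candidates))           # [(2,), (3,), (4,), (6,), (7,), (8,)]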

winnow candidates

Now expand those candidates.

  • while there are candidates of length less than len(needles):
    • choose an arbitrary candidate, removing it from the set
    • lo = min(candidate) - k
    • hi = max(candidate) + k
    • make a temporary ndls = needles.copy()
    • for i in candidate:
      • ndls.remove(haystack[i])
    • Is ndls empty? Oh, good, we're done! Report the match.
    • At this point, ndls holds the words we still seek within the lo .. hi range.
    • for i in range(lo, hi + 1):
      • if haystack[i] in ndls:
        • for j in range(i - k, i + k + 1):
          • candidates.add(candidate + (j,))

So each iteration removes one candidate and may add new candidates that are one index longer.

Handwaving: I have ignored running off the left / right sides of the haystack. Add a range() helper, or consider padding both sides with k sentinel words which will never appear as a needle.
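Putting the preprocessing and the winnowing loop together, here is one runnable reading of the sketch. It departs from the bullets above in a few labelled ways: ranges are clamped to the haystack bounds, candidates are kept as sorted de-duplicated tuples, set subtraction replaces the copy-and-remove step so that neighbour indexes landing on non-needle words are simply ignored, and the 2nd invariant is re-checked explicitly before a match is reported.

def find_match(haystack, needles, k):
    """Return a sorted tuple of indexes whose words cover every needle, or None."""
    needles = set(needles)
    n = len(haystack)

    def window(i):
        # indexes within k positions of i, clamped to the haystack
        return range(max(0, i - k), min(n, i + k + 1))

    # preprocessing: seed 1-tuples around every needle occurrence
    candidates = set()
    for i, word in enumerate(haystack):
        if word in needles:
            for j in window(i):
                candidates.add((j,))

    # winnowing: pop a candidate, see which needles it still misses, expand
    while candidates:
        candidate = candidates.pop()
        missing = needles - {haystack[i] for i in candidate}
        if not missing:
            # re-check the 2nd invariant before reporting
            if all(b - a <= k for a, b in zip(candidate, candidate[1:])):
                return candidate    # exit early on the first match
            continue
        if len(candidate) >= len(needles):
            continue                # matches the loop condition: only shorter candidates expand
        lo = max(0, min(candidate) - k)
        hi = min(n - 1, max(candidate) + k)
        for i in range(lo, hi + 1):
            if haystack[i] in missing:
                for j in window(i):
                    grown = tuple(sorted(set(candidate + (j,))))
                    if grown != candidate:
                        candidates.add(grown)
    return None


haystack = 'the quick brown fox jumped over the lazy dog'.split()
print(find_match(haystack, {'fox', 'lazy', 'dog'}, k=4))   # -> (3, 7, 8)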


It's easy to exit early if only the first match is needed.

To report all matches, keep looping until the set of candidates is empty. Its size will go to zero even if there are no valid matches.

If there are H haystack words and N needles, then it costs O(N) to construct the needle set and O(H) for the preprocessing scan. After that the match rate dominates, with a worst-case cost of O(k × N), which you might view as O(N) if k is a small constant. That worst case is a haystack largely composed of repeated copies of the needle set, but you wouldn't need a fancy algorithm for such an input. Realistic inputs will see a running time proportional to the number of candidates generated, which depends on both k and the match rate.


alternate algorithm

Here's a much simpler approach that doesn't follow the distance requirement you outlined, but might suit your needs.

  • needles = { word: -math.inf for word in 'dog fox lazy'.split() }
  • for i, word in enumerate(haystack):
    • if word in needles:
      • needles[word] = i
      • if min(needles.values()) >= i - k:
        • report a match!

Consider post-processing those reported matches if too-wide gaps between needles will pose a problem, perhaps by examining sorted(needles.values()) deltas.
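As a runnable illustration, here is that loop on the same shortened haystack as before (the value of k and the needle words are my example choices), with the suggested post-processing folded in as a final step that reports the deltas between adjacent needle positions.

import math

haystack = 'the quick brown fox jumped over the lazy dog'.split()
k = 5
needles = {word: -math.inf for word in 'dog fox lazy'.split()}

for i, word in enumerate(haystack):
    if word in needles:
        needles[word] = i                    # latest occurrence of this needle
        if min(needles.values()) >= i - k:   # every needle seen in the last k words
            positions = sorted(needles.values())
            # post-processing: inspect the deltas between adjacent needle positions
            deltas = [b - a for a, b in zip(positions, positions[1:])]
            print('match at', positions, 'deltas', deltas)   # match at [3, 7, 8] deltas [4, 1]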


Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1