'Reduce time in searching a list of strings with Regex

I have 130,000 records of strings. I want to find the number of occurance of each string element in those 130,000 records by doing the regex search for each record instead of simple search. Currently I am doing it with kind of simple linear search. Which is taking hours. What are some other approaches that can reduce this time?

The method I followed is: I took a string element from 130,000 records, and did regex search (linear) then I count the findings against that picked string element. And I did the same for each element.

So there are 130,000 iterations for each string element. Total iterations will be 130,000 * 130,000 = some huge number.

How can I reduce these iterations?



Solution 1:[1]

I'm not sure I understand the question completely. I'm confused about what the records are and what you mean by finding occurrences in the records.

If that is the case, and you do the simplest approach, maybe something like

records = [
    "foo", "bar", "foobar", "baz", "bar", "qux"
]

for rec in records:
    count = sum(x == rec for x in records)
    print(count)

it takes O(n²m) if n is the number of records and m the average length of the strings.

If you put the strings in a (hash) set, you can update it in time O(m) (to hash and compare) so something like

from collections import Counter
print(Counter(records))

only takes O(nm).

I suspect that the problem is harder than this, though.

Is the problem to find how often one record occurs as a substring of another?

That you can do with a (generalised) suffix tree or suffix array. Concatenate your records "foo[1]bar[2]foobar[3]baz[4]bar[5]qux[6]" with sentinel characters [i] between them. You need 130,000 sentinels which will be a slight problem, but if you remap the alphabet to put the real letters in the range 0..K you can use integers K.. as sentinels. You will want to a construction algorithm that is independent of alphabet size then.

Anyway, each record will be represented as a leaf that ends in a sentinel. That record occurs more than once if the edge going to that leaf only has the sentinel on it, and the number of occurrences is the size of the sub-tree that is rooted at the leaf's parent.

It takes O(nm) to build the suffix tree--this is optimal as that is the size of your data. If you traverse the tree and annotate each node with the size of the sub-tree, that will take time O(nm) as well. Then, in another O(nm) traversal you can identify all the relevant leaves and report the number of occurrences. All in all O(nm) but now reporting how often each record appears as a substring of another.

Update

I had a little time between meetings to write a rough muck-up of the idea. The complete code is below, and I use a library I use for teaching, pystr, to build the suffix tree. This isn't production code, so it probably isn't fast enough, but it should give a hint at the idea.

Anyway, I would take all the records and make a single string out of them, so I can build a suffix tree. I took records such as "foo", "bar", "foobar" and translated it into the string [1]foo[2]bar[3]foobar where [i] is the integer i.

BookKeeping = namedtuple("BookKeeping", "k x recs")


def remap(recs: list[str]) -> BookKeeping:
    """
    Map records to a single string.

    The strings arre mapped to an alphabet that
    starts at the integer k+1, where k is the number
    of records. Each record i has the unique letter i
    in front of it. I count records from 1 since 0
    is used as a sentinel in the ST construction.

    With this construction, a leaf with index i is a
    record if x[i-1] == <i> and we always have a branch
    just after a record.
    """
    k = len(recs)
    letters = set.union(*(set(rec) for rec in recs))
    alphabet = {a: i+k+1 for i, a in enumerate(letters)}
    # The recid is a hack to map indices to records. You can
    # do better than this.
    x: list[int] = []
    recid: list[Optional[int]] = []
    for i, rec in enumerate(recs):
        x.append(i+1)
        recid.append(None)
        x.extend(alphabet[a] for a in rec)
        recid.extend([i+1] * len(rec))
    return BookKeeping(k, x, recid)

I start at 1 because my library uses 0 as its terminal in the suffix tree, so that will not be a unique letter. All the others will be, and that is what I then exploit. If I have the record i, the string is [i]...[i+1] (or [i]...[0] for the final record), and in the suffix tree I am interested in the string ...[i+1]. Since [i+1] is a unique letter, there will always be a branch there, going down to a leaf. So I can check if I am looking at a leaf the represents a full record with

def is_rec_index(bk: BookKeeping, n: Leaf) -> bool:
    """
    Check if n's index is into a record.

    It is a record index if it is an index into one of the
    original records, and it is that if it isn't any of
    the record sentinel letters or the terminal sentinel.
    """
    return 0 < n.leaf_label < len(bk.x) and \
        bk.x[n.leaf_label] > bk.k


def is_record(bk: BookKeeping, n: Leaf) -> bool:
    """
    Check if index i is representing a record.

    We have a record if we start at an index just to
    the right of a record letter, i.e., x[i-1] <= k,
    and we are sitting on an edge that starts with
    the following record letter.
    """
    if not is_rec_index(bk, n):
        return False
    rec = bk.x[n.leaf_label - 1]
    if rec > bk.k:
        return False
    # We started at the first letter in a record,
    # so check if we branched at the next rec letter
    return n.edge_label[0] == rec + 1

The rest of the algorithm is just annotating the suffix tree with what information I need and then reporting for each record leaf. The parent of a record leaf is the subtree that starts with the record, and all records that contain it will be in this subtree.

I've implemented two algorithms. One is fast, and counts how many times a record appears outside itself, and the other is slow but reports which records it appears in. This is the annotation function:

def annotate(bk: BookKeeping, n: Node, depth: int = 0) -> None:
    """
    Annotates the tree's nodes with the info we need to
    collect the record information.
    """
    # FIXME: You should probably use an explicit stack here.
    # You can easily run out of recursion stack.
    n.depth = depth
    if isinstance(n, Leaf):
        n.is_record = is_record(bk, n)
        n.recs = get_record(bk, n)         # collect the records represented
        n.size = int(is_rec_index(bk, n))  # a real index counts as one
    if isinstance(n, Inner):
        # Recursively annotate
        for child in n.children.values():
            annotate(bk, child, depth + len(child.edge_label))

        # Compute the size. This will in total run in time the
        # size of the suffix tree, which is optimal
        n.size = 0
        for child in n.children.values():
            n.size += child.size

        # FIXME: You need something smarter here! This can take
        # time O(n) for each of O(nm) inner nodes, where n is the
        # number of records. A bit-vector can cut this to O(n/w) where
        # w is the word size, but it is still O(n²m/w) and n is large.
        n.recs = set.union(*(child.recs for child in n.children.values()))

If I only need to know how many leaves each node has, I can do it in constant time per node, but to collect the records that appear in subtrees I have to merge sets. You should be able to fix this if you do the collection and annotation in the same traversal, so you can drag a set along with you. If you add to the set as you traverse the tree, collect what you need, and then move up the recursion, I don't think you have to merge sets. But I don't have time to implement this right now.

Anyway, after the node annotation you can count or collect like this:


def collect(bk: BookKeeping,
            n: Node,
            res: None | dict[int, list[int]] = None
            ) -> dict[int, list[int]]:
    """Collect which records any given record appear as a sub-string in."""
    res = res if res is not None else {}

    # FIXME: You should probably use an explicit stack here.
    # You can easily run out of recursion stack.
    if isinstance(n, Leaf) and n.is_record:
        res[bk.x[n.leaf_label-1]] = n.parent.recs
    if isinstance(n, Inner):
        for child in n.children.values():
            collect(bk, child, res)
    return res  # we return this just for convinience


def count(bk: BookKeeping,
          n: Node,
          res: None | dict[int, int] = None
          ) -> dict[int, int]:
    """Count how many records contain each rec as a substring."""
    res = res if res is not None else {}

    # FIXME: You should probably use an explicit stack here.
    # You can easily run out of recursion stack.
    if isinstance(n, Leaf) and n.is_record:
        res[bk.x[n.leaf_label-1]] = n.parent.size - 1  # don't count itself
    if isinstance(n, Inner):
        for child in n.children.values():
            count(bk, child, res)
    return res  # we return this just for convinience

The complete code is below. I hope it makes sense; otherwise, feel free to ask.

"""Computing record summaries."""

from typing import Optional
from collections import namedtuple
from pystr.suffixtree import (
    SuffixTree, Node, Inner, Leaf,
    mccreight_st_construction
)

BookKeeping = namedtuple("BookKeeping", "k x recs")


def remap(recs: list[str]) -> BookKeeping:
    """
    Map records to a single string.

    The strings arre mapped to an alphabet that
    starts at the integer k+1, where k is the number
    of records. Each record i has the unique letter i
    in front of it. I count records from 1 since 0
    is used as a sentinel in the ST construction.

    With this construction, a leaf with index i is a
    record if x[i-1] == <i> and we always have a branch
    just after a record.
    """
    k = len(recs)
    letters = set.union(*(set(rec) for rec in recs))
    alphabet = {a: i+k+1 for i, a in enumerate(letters)}
    # The recid is a hack to map indices to records. You can
    # do better than this.
    x: list[int] = []
    recid: list[Optional[int]] = []
    for i, rec in enumerate(recs):
        x.append(i+1)
        recid.append(None)
        x.extend(alphabet[a] for a in rec)
        recid.extend([i+1] * len(rec))
    return BookKeeping(k, x, recid)


def build_st(x: list[int]) -> SuffixTree:
    """Build a suffix tree."""
    # Don't use McCreight, it doesn't handle very large
    # alphabets well... there are better algorithms, I just
    # don't have an implementation.
    return mccreight_st_construction(x)


def is_rec_index(bk: BookKeeping, n: Leaf) -> bool:
    """
    Check if n's index is into a record.

    It is a record index if it is an index into one of the
    original records, and it is that if it isn't any of
    the record sentinel letters or the terminal sentinel.
    """
    return 0 < n.leaf_label < len(bk.x) and \
        bk.x[n.leaf_label] > bk.k


def is_record(bk: BookKeeping, n: Leaf) -> bool:
    """
    Check if index i is representing a record.

    We have a record if we start at an index just to
    the right of a record letter, i.e., x[i-1] <= k,
    and we are sitting on an edge that starts with
    the following record letter.
    """
    if not is_rec_index(bk, n):
        return False
    rec = bk.x[n.leaf_label - 1]
    if rec > bk.k:
        return False
    # We started at the first letter in a record,
    # so check if we branched at the next rec letter
    return n.edge_label[0] == rec + 1


def get_record(bk: BookKeeping, n: Leaf) -> set[int]:
    """Get a set (singleton or empty) of the records in n."""
    if n.leaf_label < len(bk.x) and bk.recs[n.leaf_label] is not None:
        return {bk.recs[n.leaf_label]}
    return set()


def annotate(bk: BookKeeping, n: Node, depth: int = 0) -> None:
    """
    Annotates the tree's nodes with the info we need to
    collect the record information.
    """
    # FIXME: You should probably use an explicit stack here.
    # You can easily run out of recursion stack.
    n.depth = depth
    if isinstance(n, Leaf):
        n.is_record = is_record(bk, n)
        n.recs = get_record(bk, n)         # collect the records represented
        n.size = int(is_rec_index(bk, n))  # a real index counts as one
    if isinstance(n, Inner):
        # Recursively annotate
        for child in n.children.values():
            annotate(bk, child, depth + len(child.edge_label))

        # Compute the size. This will in total run in time the
        # size of the suffix tree, which is optimal
        n.size = 0
        for child in n.children.values():
            n.size += child.size

        # FIXME: You need something smarter here! This can take
        # time O(n) for each of O(nm) inner nodes, where n is the
        # number of records. A bit-vector can cut this to O(n/w) where
        # w is the word size, but it is still O(n²m/w) and n is large.
        n.recs = set.union(*(child.recs for child in n.children.values()))


def collect(bk: BookKeeping,
            n: Node,
            res: None | dict[int, list[int]] = None
            ) -> dict[int, list[int]]:
    """Collect which records any given record appear as a sub-string in."""
    res = res if res is not None else {}

    # FIXME: You should probably use an explicit stack here.
    # You can easily run out of recursion stack.
    if isinstance(n, Leaf) and n.is_record:
        res[bk.x[n.leaf_label-1]] = n.parent.recs
    if isinstance(n, Inner):
        for child in n.children.values():
            collect(bk, child, res)
    return res  # we return this just for convinience


def count(bk: BookKeeping,
          n: Node,
          res: None | dict[int, int] = None
          ) -> dict[int, int]:
    """Count how many records contain each rec as a substring."""
    res = res if res is not None else {}

    # FIXME: You should probably use an explicit stack here.
    # You can easily run out of recursion stack.
    if isinstance(n, Leaf) and n.is_record:
        res[bk.x[n.leaf_label-1]] = n.parent.size - 1  # don't count itself
    if isinstance(n, Inner):
        for child in n.children.values():
            count(bk, child, res)
    return res  # we return this just for convinience


def compute_rec_overlap(recs: list[str]) -> dict[int, list[int]]:
    """Compute the overlap summary for recs."""
    bk = remap(recs)
    st = build_st(bk.x)
    annotate(bk, st.root)
    summary = collect(bk, st.root)
    counts = count(bk, st.root)
    return summary, counts


recs = ["foo", "bar", "baz", "foobar", "fooqux", "foobarbaz"]
overlaps, counts = compute_rec_overlap(recs)
for rec, overlaps in overlaps.items():
    print(rec, overlaps)
for rec, count in counts.items():
    print(rec, count)

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1