RAM-efficient set of 1 billion short strings, and fast lookup [duplicate]
I want to be able to test, as fast as possible, whether a given string is a member of a set S of 1 billion strings:
- each string is of length 10 on average, so the whole data is ~10 GB; the max length is 100
- 99% of the strings use only a-z, A-Z, 0-9 and a few special chars (.!-;\/); in some rare cases there are other chars
- if necessary, the data could be sorted: aaaaa ... helloworld hellu help ... zzzzz
- we could maybe use a prefix tree / trie, but I'm not sure if it's the best solution here
Question: Is there a way to work with an in-memory compressed set with Python, possibly with the standard library, with fast lookup?
The goal is to test 'hello' in S as fast as possible.
I know how to do this with 10 GB of RAM, but out of curiosity, I'd like to know if this is possible with, say, only 1 or 2 GB of RAM:
S = set()
with open('db.txt', 'r') as f:
    for l in f:
        S.add(l.rstrip('\n'))  # strip the trailing newline, otherwise 'hello' would never match
print('hello' in S)
Solution 1:[1]
You should be able to compress your strings and work with the compressed version. However, just how much memory you save will depend on how compressible your dataset is.
I would suggest a custom compression system: build a single compression dictionary based on your entire string corpus, optimized to minimize the total size of dictionary + compressed string storage.
I would use a hash table to look up your compressed strings, not a tree. Each hash bucket can just be a bunch of compressed data, scanned sequentially from the beginning. For buckets that happen to have a lot of long strings, provide an overflow pointer to the rest of the data; the in-table bucket size is a performance parameter -- or you can slide that size all the way to 0 for a fully indirect table.
The sequential scan sounds problematic from a performance point of view, but both the initial hash lookup and the sequential scan are fast on current hardware. The number of hash buckets is what (statistically) limits the sequential scan length, so it is another performance parameter: for example, with roughly 16 million buckets, a billion entries average about 60 per bucket, so each lookup only scans a small, bounded run of compressed data.
To implement this in Python, I'd do the hash indexing manually, and store the hash table as an array of bytestrings. I think this will effectively be the indirect table mentioned above, which might not be as fast as possible but it should be pretty good.
# Building the table: train one shared compression dictionary over the corpus, then
# bucket the length-prefixed compressed strings by hash (zstandard is one option).
import io, struct, zstandard
dict_data = zstandard.train_dictionary(100_000, [s.encode() for s in strings])
compressor = zstandard.ZstdCompressor(dict_data=dict_data)
hash_table = [io.BytesIO() for _ in range(table_size)]
for s in strings:
    c = compressor.compress(s.encode())
    hash_table[hash(s) % table_size].write(struct.pack('>H', len(c)) + c)
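For the membership test itself ('hello' in S), the lookup hashes the query and walks that one bucket, entry by entry. Below is a minimal sketch of such a contains() helper; it is an assumption layered on the snippet above (it reuses dict_data, hash_table and table_size, and it relies on build and lookup running in the same process, since Python salts str hashes per process):
import struct, zstandard

# Hypothetical lookup matching the length-prefixed build loop above.
decompressor = zstandard.ZstdDecompressor(dict_data=dict_data)

def contains(query: str) -> bool:
    bucket = hash_table[hash(query) % table_size].getvalue()
    target, pos = query.encode(), 0
    while pos < len(bucket):
        (n,) = struct.unpack_from('>H', bucket, pos)  # read the 2-byte length prefix
        entry = bucket[pos + 2:pos + 2 + n]
        if decompressor.decompress(entry) == target:  # decompress the entry and compare
            return True
        pos += 2 + n
    return False

print(contains('hello'))
Whether this actually lands in the 1-2 GB budget depends on how much the shared dictionary saves against the per-entry framing overhead of such short strings, so it is worth measuring on a sample of the real data first.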
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | |
