'Bloom, Cuckoo filters are too big for numbers

I'm trying to use bloom/cuckoo filter to check if i already stored a tuple of 2 numbers f.e. (7,25), (47,1576), ... etc

In [12]: from cuckoo.filter import ScalableCuckooFilter                                                                                                                      

In [13]: c = ScalableCuckooFilter(initial_capacity=1000000, error_rate=0.0001)                                                                                               

In [14]: c.filters[0].buckets.buffer_info()                                                                                                                                  
Out[14]: (140042688086032, 8500000, 'big', 0, 8500000, 0, 0, 0)

In [15]: 8500000/(1024*1024)                                                                                                                                                 
Out[15]: 8.106 MB

to store the numbers i would need 2bytes per number, so 1mln * 4bytes = 3.8 Mb

So the filter is bigger than the data !

Is there a way or structure that is suitable for numbers with lower memory footprint ?



Solution 1:[1]

There is a lower bound on the size of approximate membership queries like cuckoo filters and Bloom filters. It is -log2(epsilon), which for you is 13.3 bits. You can get close to this by using bumped ribbon filters.

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 jbapple