Set of 10-char strings in Python is 10 times bigger in RAM than expected [duplicate]

See also: Memory usage of a list of millions of strings in Python and the duplicate.

I'm creating a Python set in RAM containing 10 million 10-char strings (each character fits in the 0-255 range; nothing more complicated, no multi-byte UTF-8 characters).

I thought it would use around 10M * 10 = 100 MB, maybe +50% for the data structure, i.e. 150 MB at most. But instead the process uses ... 993 MB!

(As a comparison, when I run the same script with 10 elements instead of 10*1000*1000, the process only uses 4 MB of RAM.)
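
For reference, sys.getsizeof already hints at where the memory goes; on a CPython 2.7 64-bit build a 10-char str reports 47 bytes rather than 10 (exact figures are build-dependent):

import sys

# On CPython 2.7 (64-bit), a str object carries roughly 37 bytes of fixed
# overhead, so a 10-character string occupies 47 bytes rather than 10.
print(sys.getsizeof(''))            # 37
print(sys.getsizeof('abcdefghij'))  # 47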

How to make a lighter-in-RAM set of 10-char strings?

import random

def randstr():  # random string of 10 lowercase ASCII letters
    return ''.join(random.choice('abcdefghijklmnopqrstuvwxyz') for _ in range(10))

s = set()

for i in xrange(10*1000*1000):  # xrange avoids building a 10M-element list under Python 2
    s.add(randstr())

Note: I would like to keep, if possible, a set() and not another database (like SQLite), because membership lookups need to be very fast.

Note: I'm using Python 2.7, 64-bit.

Note: In my real program it's actually 100-500 million items, which is why I want to save memory.



Solution 1:[1]

Ten-byte strings are very short, so the per-object overhead is large: each string takes 47 bytes. In addition, each set entry needs room for a hash value and a pointer, at least 24 bytes, plus the 30% to 50% of free entries a hash set keeps to stay efficient.

For your numbers:

  • 470 MB for the strings
  • 240 MB for the hash entries
  • 120 MB of free room

That is 830 MB at minimum, plus some working space. There is no way in Python to get these numbers down.
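
As a rough sanity check, here is the same back-of-the-envelope arithmetic written out. The 47-byte figure is what sys.getsizeof reports for a 10-char str on a CPython 2.7 64-bit build; the 24 bytes per entry and the 50% of spare slots are the estimates used above, not exact CPython internals:

import sys

n = 10 * 1000 * 1000                    # 10 million strings

per_string = sys.getsizeof('a' * 10)    # 47 bytes on CPython 2.7, 64-bit
per_entry = 24                          # hash value + pointer per set entry (estimate from above)
spare_ratio = 0.5                       # 30-50% free entries kept so the hash set stays efficient

strings_mb = n * per_string / 1e6
entries_mb = n * per_entry / 1e6
spare_mb = entries_mb * spare_ratio

print("strings: ~%.0f MB" % strings_mb)                            # ~470 MB
print("entries: ~%.0f MB" % entries_mb)                            # ~240 MB
print("spare:   ~%.0f MB" % spare_mb)                              # ~120 MB
print("total:   ~%.0f MB" % (strings_mb + entries_mb + spare_mb))  # ~830 MB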

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution 1: Daniel