'How does hash-table in set works in python?

As far as I know, set in python works via a hash-table to achieve O(1) look-up complexity. While it is hash-table, every entry in a set must be hashable (or immutable). So This peace of code raises exception:

>>> {dict()}
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: unhashable type: 'dict'

Because dict is not hashable. But we can create our own class inherited from dict and implement the __hash__ magic method. I created my own in this way:

>>> class D(dict):
...     def __hash__(self):
...             return 3
...

I know it should not work properly but I just wanted to experiment with it. So I checked that I can now use this type in set:

>>> {D()}
{{}}
>>> {D(name='ali')}
{{'name': 'ali'}}

So far so good, but I thought that this way of implementing the __hash__ magic method would screw up the look up in set. Because every object of the D has the same hash value.

>>> d1 = D(n=1)
>>> d2 = D(n=2)
>>> hash(d1), hash(d2)
(3, 3)
>>> 
>>> {d1, d2}
{{'n': 2}, {'n': 1}}

But the surprise for me was this:

>>> d3 = D()
>>> d3 in {d1, d2}
False

I expected the result to be True, because hash of d3 is 3 and currently there are values in our set with the same hash value. How does the set works internally?



Solution 1:[1]

To be usable in sets and dicts, a __hash__ method must guarantee that if x == y, then hash(x) == hash(y). But that's a one-sided implication. It's not at all required that if hash(x) == hash(h) then x == y must be true. Indeed, that's impossible to achieve in general (for example, there are an unbounded number of distinct Python ints, but only a finite number of hash codes - there must be distinct ints that have the same hash value).

That your hashes are all the same is fine. They only tell the set/dict where to start looking. All objects in the container with the same hash are then compared, one by one, for equality, until success, or until all such objects have been tried without success.

However, while making all hashes the same doesn't hurt correctness, it's a disaster for performance: it effectively turns the set/dict into an exceptionally slow way to do an O(n) linear search.

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Tim Peters