Is it hashable? Fun and games with hashing in Python

One of the basic data types that Python developers learn to use, and to appreciate, is the dictionary, or “dict.” This is the Python term for what other languages call hashes, associative arrays, hashmaps, or hash tables. Dictionaries are pervasive in Python, both in the programs that we write, and in the implementation of the language; behind every namespace or object, at least one dictionary is behind the scenes.

Dictionaries are fairly easy to use, once you get used to the rules of the road:

  1. A dictionary contains pairs, not individual elements. Each pair has two elements, a “key” and a “value.” So given a dictionary d, len(d) will return the number of pairs, not the number of individual elements.
  2. You can think of the key as a sort of index. Just as we use numeric indexes to retrieve elements of a string, list, or tuple, we use a dict’s keys to retrieve its values.
  3. The retrieval is one-way. You can get a value via its key, but you cannot get a key via its value.
  4. The retrieval takes constant time, aka O(1). You can use the “in” operator to find out if a key exists in a dictionary. If you retrieve a key that doesn’t exist, you’ll get a KeyError exception.
  5. The key must be hashable, and (if a container, such as a tuple) may only contain other hashable objects.
  6. The values may be any Python types or sizes. You can have a dict of strings, but also a dict of lists, tuples, dicts, modules, or any other objects.
  7. The keys of a dictionary are unique. If you assign d[‘a’]=1 to the dict “d”, the key “a” now exists, with a value of 1. If the key “a” already existed, then its previous value is lost.
  8. The key-value pairs in a dictionary are not ordered in any meaningful way. Do not depend on the order of the pairs in a dictionary.

To anyone familiar with dicts, or with hash tables in other languages, most of the above rules make a great deal of sense. Indeed, most of them follow naturally from the implementation of dicts: When you store d[‘a’] = 1, the dict “d” takes the key “a” and invokes the hash function on it. The result of the hash function is a number, which indicates where in the hash table the key-value pair should be stored. This is the key (no pun intended) advantage of a dictionary, and the secret of its lookup speed: The result of applying the “hash” function on our key determines where the key-value pair will be stored. Python can then jump to that location in memory, and retrieve the value associated with the key.

This also explains why you can use keys to retrieve values, but not the reverse: The location of a value in memory depends completely on its key. Moreover, while keys must be unique, values don’t have to be.

Furthermore, this explains why pairs in a dict don’t seem to be ordered in any predictable way; their order is determined by the hash function, which is deliberately designed to provide hard-to-predict results.

For example, I can create a simple dictionary:

>>> d = {'a':1, 'b':2, 'c':3}
 >>> d
 {'a': 1, 'c': 3, 'b': 2}

As you can see, the printed representation of our dictionary shows the keys in the order ‘a’, ‘c’, and ‘b’, rather than the order in which they were inserted. Assigning to the dictionary either replaces an existing pair (if I reuse a key) or adds a new pair:

>>> d['a'] = 100
 >>> d
 {'a': 100, 'c': 3, 'b': 2}
 >>> d['z'] = [1,2,3]
 >>> d
 {'a': 100, 'c': 3, 'b': 2, 'z': [1, 2, 3]}

Almost all of this matches the rules for dict-like structures in other languages — except for rule #5, the requirement that the keys be hashable. (Or if we’re dealing with container objects, that the contained elements be hashable.) It’s reasonable to ask why this is forbidden. There aren’t a lot of times when I would like to use a list, set, or dict as a dictionary key, but it does happen. Why does Python prevent me from doing so?

The answer has to do with predictability: If I could use a list as my dictionary key, then there would be the chance of the list changing after storing it. In such a case, the list’s current hash value will be different than its previous hash value — meaning that the list will be located somewhere other than where it should be. In such a case, the key-value pair would be “lost” inside of the dict.

This can actually happen in Ruby, which doesn’t restrict the data types which can be used as keys. For example:

myarray = [1,2,3]    # create a Ruby Array 
h = {myarray => 1}   # use the array as a key

h[myarray]           # What value is associated with myarray?
   => 1              # We get 1 back, as expected! Yay!

Ruby stores our name-value pair inside of the hash, its equivalent of a dict. However, I can modify the array that is being used as a hash key:

myarray << 4        # append 4 to myarray

    => [ 1, 2, 3, 4 ]

When we stored the name-value pair in myarray, the Array had three elements. Now it has four, thus giving it a new hash value. After modifying myarray, we can ask Ruby to retrieve the value associated with it in h:

h[myarray]         # Get the value for key "myarray"
     => nil        # nil means non-existent key

In other words, the hash still has the key “myarray”, but the key-value pair is stored in the location determined by myarray.hash when we first stored it, not in its current incarnation.

Ruby’s solution to this problem is to provide a “rehash” method, which tells a hash to go through its contents, and recalculate the keys’ hash values and locations. Once you do this, the data returns:

h.rehash  # recalculate positions
    =>  { [ 1, 2, 3, 4 ] => 1 }

h[myarray]
    => 1

We thus see that in Ruby, we’re allowed to use mutable data structures as keys. The advantage is that we’re not limited, but the disadvantage is that keys might get changed, and thus provide incorrect search results.

Python, many years ago, solved this problem a different way: Instead of allowing us complete flexibility in our hash keys, Python restricted us, to (largely) immutable ones. Thus, we can use None, True, False, integers, floats, strings, and tuples — although ints and strings are the most common, in my experience. If we try to store a key-value pair in a dictionary, Python checks to make sure that the key is a hashable type:

>>> mylist = [1,2,3]

>>> d = {mylist:1}
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: unhashable type: 'list'

The “unhashable” error message that we get isn’t from the assignment to d, but rather from the call to hash() that Python makes on “mylist”. We can see this if we try to invoke the hash function directly:

>>> hash(mylist)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: unhashable type: 'list'

It’s not really true, as we’ve seen, that lists are inherently unhashable. Rather, Python decided long ago that it would refuse to hash anything whose value might be subject to change, to avoid elements getting lost.

The hash function in Python has been described before, in more detail (and with greater knowledge) than I could provide. There are two aspects to Python’s hash function, which I’m not in a position to criticize, but which do seem strange to me:

  1. Hash functions, by their very nature, are supposed to be one-way, deterministic, but fairly unpredictable. That is, if I know the output of hash(‘a’), I shouldn’t be able to easily know what hash(‘b’) or hash(‘c’) is. But hash(‘a’) should always return the same value. In Python, this is the case for strings and tuples. But hash(), when handed an int, returns that int. Thus, hash(1) is 1, hash(100) is 100, and hash(255) is 255. This strikes me as a bit strange, and seems to violate one of the basic rules of hashing. I can only conclude that either I don’t know much about hash functions (which is quite possible), that Python doesn’t expect us to use many integers as dictionary keys, or that it just doesn’t matter that much.
  2. The hash function apparently returns -1 when it encounters an error. Thus, the hash values of both -1 and -2 are -2.

The result of a hash function doesn’t need to be unique, but it does need to evenly distribute the results, such that we’ll minimize collisions. That is, it’s possible that hash(‘a’) and hash(‘b’) will return the same value — but it should be hard to figure out which values will give us the same results. If, by some chance, all of your keys have the same hash value, then you end up with a “collision.” This is invisible to the user of the dict, except that the lookups suddenly become much slower. Imagine a dict in which 100 keys all have the same hash value; our lookup speed suddenly becomes O(n), like a list, rather than O(1), which is theoretically possible in a dict.

This apparently became an issue several years ago, when there were some attacks against Web sites running Python. Web applications often use dicts to pass incoming parameters, which means that if you choose your keys cleverly enough, you can cause a massive slowdown on a site, in a denial-of-service attack.

The solution is to add a random seed to the hashing algorithm. This isn’t implemented in Python by default, but can easily be added by invoking Python 2.7 with the -R command-line parameter, or Python 3.x with the PYTHONHASHSEED environment variable set.

Thus, in Python 2.7:

$ python -R
>>> hash('a')
-5027793331667802690
>>> hash('b')
-5027793332354350531
>>>

$ python -R
>>> hash('a')
-4154372447873558006
>>> hash('b')
-4154372448337085303
>>>

Notice how, thanks to the -R parameter, we force Python to re-seed its hash function, thus reducing the chance of a successful attack.

Python 3 took this a step further, by using an environment variable. If you set PYTHONHASHSEED  to “random”, then it behaves like Python 2, above. But if you set PYTHONHASHSEED to a numeric value, then the hash function is seeded with the number you provide. This makes it easier to test your code, but also to enjoy the extra security that the randomized hash keys provide.

Now, you would think that from everything I wrote above, that if I write my own class, it won’t be hashable. But it turns out that this is not the case:

>>> class Foo(object):
        pass

>>> f = Foo()
>>> hash(f)
273483861

According to the Python documentation, user-defined classes are hashable by default; the hash value of such an object depends on the object’s unique ID number, which we can get via the built-in “id” function. This means that the hash value of a user-defined object won’t change, regardless of any changes you might make to its attributes.

But let’s say that I want to have the hash reflect the attributes. According to the Python documentation, this means that I should define both the __hash__ method (which the built-in “hash” function will call on our object, and which must return an integer) and the __eq__ method, to check if two things are equal (since two equal objects should have equal hashes, too). I’m going to define a simple class, along with a __hash__ method:

>>> class Foo(object):
        def __init__(self, x):
            self.x = x
        def __hash__(self):
            return hash(self.x)

>>> f = Foo('a')
>>> hash(f)
12416037344
>>> hash('a')
12416037344

As the above code demonstrates, our object now returns the hash value of whatever is set on its “x” attribute. There is no difference between invoking hash(‘a’) and hash(f), assuming that f.x is ‘a’.

So now let’s put our object in a dictionary:

>>> d = {f:1}
>>> d[f]
1

So far so good, right? But now let’s be a bit evil, and change the value of f.x:

>>> f.x = 'abc'
>>> d[f]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
KeyError: <__main__.Foo object at 0x109987590>

What happened? Well, we forced Python to be like the Ruby example that I provided earlier: When we created d, and used f as a key, Python stored the pair {f:1} based on the value of hash(f). But then we changed the value of f.x, which means that the value of hash(f) has changed, Which means that now, when we invoke d[f], Python will complain that there is no such key. The only way to get this key back is to use d.keys() or d.items(), which will return the key. But our ability to retrieve our value via the key, or even to check if our key exists, is now gone:

>>> f in d
False

We can have even more fun, by ensuring that the value of __hash__ changes every time we invoke it:

>>> import random
>>> class LoseMe(object):
        def __hash__(self):
            return random.randint(1,1000)

>>> x = LoseMe()
>>> hash(x)
374
>>> hash(x)
50

Now, the odds are pretty good that when I stick this object into a dictionary, I won’t be able to get it back:

>>> x = LoseMe()
>>> d = {x: 1}
>>> x in d
False
>>> x in d
False

But of course:

>>> list(d.keys())
[<__main__.LoseMe object at 0x109997a10>]

Should you ever define __hash__ to return a random value? Almost certainly not. And yet, I’d like to think that knowing how to do such a thing is both interesting and provides insights into how Python implements one of its core features.

 

2 thoughts on “Is it hashable? Fun and games with hashing in Python”

  1. My own understanding is that the Python hash function is not intended for cryptographic uses.

    The “unpredictable” requirement stems from the need to authenticate an object: I don’t have to know your original password but only its hash value; if you enter something that hashes to the right value then you’ve probably got it right. Or if you modify a file with a virus, you will have a really hard time trying to also forge the same hash. You use a cryptographically secure hash function for this type of jobs.

    On the other hand, people often use a hash function purely to give uniform identifiers to stuff. You don’t need all that secrecy if your only purpose is to build a library index.

Leave a Reply

Your email address will not be published. Required fields are marked *

10 × one =