A hash table is a data structure that uses a hash function to reduce given keys (i.e. from key-value pairs) to indices in an array. The goal is an associative array: a way to look up keys and find their associated values. In a phone book, for example, the keys are names and the associated values are phone numbers; when you look up a name, you get the phone number associated with that name.
There is also the keys-only application, e.g. a spelling dictionary, where we only need to store keys and the keys themselves are the values we care about.
A hash function is used to reduce a key to a manageable number. The key is often a string, but it could also be video, audio, graphics or any other form of data. The challenge in reducing keys to a (small) number is pattern repetition: keys are often similar, but not identical, to one another. For instance, many English words end in ‘ation’, ‘ing’ or ‘ully’. A hash function must aim to reduce keys, even similar keys, to different numbers (indices), to avoid congestion at a few popular indices and gaps at others.
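As a quick illustration using Java's own String.hashCode() (one of the functions HashGraph demos), similar keys should still come out as different hashes. This is just a sketch; the class name and word choices are illustrative:

```java
// Sketch: similar keys (shared '-ation' endings) should still hash differently.
// Uses Java's built-in String.hashCode(), a 31-based polynomial hash:
// s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1]
public class SimilarKeys {
    public static void main(String[] args) {
        String[] words = { "creation", "station", "nation", "rotation" };
        for (String w : words) {
            // Despite the shared suffix, each word lands on a different 32-bit hash.
            System.out.println(w + " -> " + w.hashCode());
        }
    }
}
```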
A hash function’s output (often referred to as just the ‘hash’), usually a 32-bit integer, is too large for a practical application. Used directly as an array index, it would require an array of size 2^32 (~4 billion), and since each array position would be at least 4 bytes long (in the case of 32-bit pointers), that would mean 4 × 4 billion bytes = ~16 GB of memory just for the array’s addresses, before we have even started indexing data.
So, obviously, a hash function’s output needs to be reduced further to something smaller, e.g. a few million, a few hundred thousand or even less. Depending on the size of our input (the number of keys), we can choose a small or a large number of ‘buckets’ (the array’s size, often referred to as N) to fit our purpose.
The most common compression function is c = h mod N, which reduces a hash h to the range 0..N-1.
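A minimal sketch of that compression step (Math.floorMod is used because Java's % operator can return a negative result when the hash is negative, which String.hashCode() can be):

```java
// Compress a 32-bit hash h into a bucket index in the range 0..N-1.
public class ModCompression {
    static int compress(int h, int N) {
        // floorMod always returns a value in 0..N-1, unlike h % N,
        // which is negative for negative h.
        return Math.floorMod(h, N);
    }

    public static void main(String[] args) {
        int N = 101;                       // number of buckets (a prime, see below)
        int h = "compression".hashCode();  // may be negative
        System.out.println("bucket = " + compress(h, N));
    }
}
```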
One immediate improvement we can make on the application side, without touching the hash function, is to use a prime number of buckets, or, if the hash function is of a certain type and its constant is known, a number coprime to that constant.
Non-Prime N
If we don’t want to use a prime N, a similar improvement can be achieved by using a better compression function: c = ((a*h + b) mod p) mod N
where a, b, p are constants such that a and p are prime and p is much larger than a.
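This "multiply-add-divide" style compression can be sketched as follows. The specific constants here are illustrative choices, not from the post: p is a prime above 2^31 (so larger than any 32-bit hash), a is a much smaller prime, and b is an arbitrary offset:

```java
// Sketch of the compression c = ((a*h + b) mod p) mod N.
public class MadCompression {
    static final long P = 2_147_483_659L;  // a prime above 2^31 (illustrative)
    static final long A = 92_821;          // a prime much smaller than p (illustrative)
    static final long B = 12_345;          // arbitrary offset (illustrative)

    static int compress(int h, int N) {
        // Work in long to avoid 32-bit overflow in a*h;
        // floorMod keeps intermediate results non-negative.
        return (int) (Math.floorMod(A * Math.floorMod((long) h, P) + B, P) % N);
    }

    public static void main(String[] args) {
        System.out.println(compress("example".hashCode(), 1024));
    }
}
```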
Even with the best hash function, we will always get some keys reduced to the same index (a collision). This is guaranteed if our choice of N is smaller than the input; it is probably going to happen even if N is larger than the input (see the Birthday Paradox).
One way to handle collisions is to use a Chained Hash Table, where each array index points to a linked list (or any similarly appropriate data structure, e.g. a tree).
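A minimal chained hash table might be sketched like this. The class and method names are illustrative, and plain modulo compression is used for brevity:

```java
import java.util.LinkedList;

// Sketch of a chained hash table: each bucket holds a linked list of
// (key, value) entries whose keys compressed to the same index.
public class ChainedHashTable<K, V> {
    private static class Entry<K, V> {
        final K key;
        V value;
        Entry(K key, V value) { this.key = key; this.value = value; }
    }

    private final LinkedList<Entry<K, V>>[] buckets;

    @SuppressWarnings("unchecked")
    ChainedHashTable(int N) {
        buckets = new LinkedList[N];
        for (int i = 0; i < N; i++) buckets[i] = new LinkedList<>();
    }

    private int index(K key) {
        return Math.floorMod(key.hashCode(), buckets.length);
    }

    public void put(K key, V value) {
        LinkedList<Entry<K, V>> chain = buckets[index(key)];
        for (Entry<K, V> e : chain) {
            if (e.key.equals(key)) { e.value = value; return; }  // key exists: update
        }
        chain.add(new Entry<>(key, value));  // new key (or collision): append to chain
    }

    public V get(K key) {
        for (Entry<K, V> e : buckets[index(key)]) {  // walk the chain, comparing keys
            if (e.key.equals(key)) return e.value;
        }
        return null;  // key absent
    }

    public static void main(String[] args) {
        ChainedHashTable<String, String> phoneBook = new ChainedHashTable<>(101);
        phoneBook.put("Alice", "555-0100");
        System.out.println(phoneBook.get("Alice"));
    }
}
```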
Load Factor 0.75..1
Ideally we would use N = the size of the input and expect the keyset to spread evenly across the table, in such a way that each index location has only one entry associated with it. That would be an average load factor of 1.
In reality, if we are actually aiming for index locations with a high probability of holding one and only one entry (e.g. if comparing keys is very CPU-expensive), we’d need N to be larger than the input size by e.g. 20-25%. That, of course, still depends on having a very good hash function (and compression function), and even then it won’t be perfect. A ~20-25% larger N can give us an average load factor between 0.75 and 1, with (ideally) few long chains.
The drawback is apparent if we think of the memory used. For an input size of 1,000,000, N would be ~1,250,000, meaning 1,250,000 pointers (4 bytes each): ~4.76 MB spent on the hash table’s array alone.
The advantage is also apparent: fewer collisions (fewer, shorter chains) translate to fewer comparisons when checking whether a key exists, which is lighter on the CPU.
The opposite approach
A different approach is to aim for an average load factor of k, where k is the (desired) length of the linked list, or number of leaves in a tree, that is reasonable for your application, e.g. k = 10, k = 20, or k = 30.
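Under this approach, the table size follows from the target chain length: with chaining, the average load factor is n / N, so choosing N for a target k is just a ceiling division. A sketch (names are illustrative; rounding N up to the next prime is left out for brevity):

```java
// Sketch: choosing the bucket count N from a target average load factor k.
// With chaining, the average chain length is n / N, so N = ceil(n / k).
public class BucketSizing {
    static int bucketsFor(int n, int k) {
        return (n + k - 1) / k;  // ceiling division, so N is never undersized
    }

    public static void main(String[] args) {
        int n = 1_000_000;
        System.out.println("k=1  -> N = " + bucketsFor(n, 1));   // ~one entry per bucket
        System.out.println("k=10 -> N = " + bucketsFor(n, 10));  // chains of ~10 entries
    }
}
```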
However, depending on the application (the nature of the keys) and the implementation of the chaining, this may be unreasonably CPU-expensive, unless we devise a smart, non-trivial method for storing the chains and cheaply excluding large parts of them.
Hash functions used for hash table look-ups need to be fast and cheap, so every effort should be made to optimize them, e.g. replacing multiplications and divisions with bit shifts.
Hash functions used for encryption have the opposite goal: they need to be expensive, in such a way that a brute-force attack is expensive, so the operations used are intentionally many and should not be easily reducible to faster counterparts.
I wrote this program to visualize how string hash functions compare to one another. HashGraph demonstrates the different behaviors of hash functions using a graph of ‘hits per bucket’ across different buckets, under different bucket sizes and compression functions.
The hash functions demoed are:
- Java’s own String hashCode()
Developed/tested under Java 6.
The core app uses GCanvases from the ACM library to make things easier. The About dialogue was graphically designed in Netbeans. The wordlist used is from CS50 pset6.
I had seen applets launching application windows onto the desktop before, and I wanted to try that, so that’s all the applet does.
HashGraph.java (view | download)
AboutDialogue.java (view | download) — Shows how to launch a URL (‘click’ a link) from a dialogue, either via an applet context or via a desktop context.
LaunchHashGraph.java (view | download) — Shows how to launch a JFrame from within an applet.
or get HashGraph-src.zip
At the moment I’m following CS61B and I’ve reached the part where they talk about hash tables. This gave me an incentive to revisit this topic, again. This post and the HashGraph app should be helpful as a reference when doing Harvard’s CS50 pset6 or Berkeley’s CS61B hw6.