A hash function projects a value from a set with many (or even an infinite number of) members to a value from a set with a fixed number of (fewer) members. Hash functions are not reversible. A hash function might, for instance, be defined as , where , , and is the floor function.
Hash functions can be used to determine if two objects are equal (possibly with a fixed average number of mistakes). Other common uses of hash functions are checksums over a large amount of data (e.g., the cyclic redundancy check [CRC]) and finding an entry in a database by a key value. The UNIX c-shell (csh) uses a hash table to store the location of executable programs. As a result adding new executables in a user's search path requires regeneration of the hash table using the rehash command before these programs can be executed without specifying the complete path.
To illustrate the use of hash functions in database lookups, consider a database consisting of an array containing an index , a name, and a telephone number, with names listed in arbitrary order.
name | number | |
0 | Parker | 12345 |
1 | (empty) | |
2 | Davis | 43534 |
3 | Harris | 32452 |
4 | Corea | 46532 |
5 | Hancock | 96562 |
6 | Brecker | 37811 |
7 | (empty) | |
Marsalis | 54323 |
To look up Hancock from this array, you would start at the beginning of the array, compare the names, then try the next until the names match. This very simple algorithm finds any entry in 1 to steps, giving an average seek time of . The seek time is therefore proportional to . A much faster result can generally be achieved, if the database is sorted.
name | |
0 | Brecker |
1 | Corea |
2 | Davis |
3 | Hancock |
4 | Harris |
5 | Marsalis |
6 | Parker |
7 | (empty) |
(empty) |
An efficient algorithm on this sorted array first checks entry , and then recursively uses bisection to check entries in intervals or , depending wether the most recently looked-up name precedes or succeeds the name sought. The average seek time of this procedure this is proportional to .
The idea behind using a hash function here is that although the possible number of combinations of characters in a name is quite large, only a subset of them is usually found in practice (i.e., names such as "Kwqrst" are much less common than names like "Jones.") Therefore, when you insert an entry into the database at an index that can somehow be calculated using a key (which is also available at the time you search for it), you might be able to find it later at the first location you check.
Consider the following simple example in which the hash function is simply the sum of ASCII codes of characters in a name (considered to be all in lowercase) computed mod .
name | |
Brecker | 6 |
Corea | 2 |
Davis | 2 |
Hancock | 12 |
Harris | 12 |
Marsalis | 2 |
Parker | 8 |
The above example illustrates that the hash function can give the same results for different keys. This difficulty is typically circumvented by introducing a second hash function whose results are designed to be completely different from that of . For illustrative purposes, let be one plus the bitwise exclusive or of all codes in a name (again taken as all lowercase) mod . This gives the following table.
name | |
Brecker | 11 |
Corea | 3 |
Davis | 10 |
Hancock | 4 |
Harris | 8 |
Marsalis | 3 |
Parker | 8 |
A new index can then be calculated as the sum of the first index and (mod ) until an empty slot is found where new data can be stored. Note that when using as an offset to walk through the database, it is not, in general, guaranteed that any key will eventually reach any slot. However, for certain values of , namely a prime number, such behavior is guaranteed, so is always chosen to be prime. After computing with (a prime), the above phone list would look like this for names added in alphabetic order.
index | key | compares to find |
0 | (empty) | |
1 | (empty) | |
2 | Corea | 1 |
3 | Hancock | 2 |
4 | (empty) | |
5 | Marsalis | 2 |
6 | Brecker | 1 |
7 | Harris | 2 |
8 | Parker | 1 |
9 | (empty) | |
10 | (empty) | |
11 | (empty) | |
12 | Davis | 2 |
The average seek time for locating a name in this table depends on the kind of data, , and the quality of the hash functions used. However, for reasonable choices of hash functions, it will be much smaller than .