A hash function
projects a value from a set with many (or even an infinite number of) members to
a value from a set with a fixed number of (fewer) members. Hash functions are not
reversible. A hash function
might, for instance, be defined as
, where
,
, and
is the floor function.
Hash functions can be used to determine if two objects are equal (possibly with a fixed average number of mistakes). Other common uses of hash functions are checksums over a large amount of data (e.g., the cyclic redundancy check [CRC]) and finding an entry in a database by a key value. The UNIX c-shell (csh) uses a hash table to store the location of executable programs. As a result adding new executables in a user's search path requires regeneration of the hash table using the rehash command before these programs can be executed without specifying the complete path.
To illustrate the use of hash functions in database lookups, consider a database consisting of an array containing an index , a name, and a telephone number, with names listed in arbitrary
order.
name | number | |
0 | Parker | 12345 |
1 | (empty) | |
2 | Davis | 43534 |
3 | Harris | 32452 |
4 | Corea | 46532 |
5 | Hancock | 96562 |
6 | Brecker | 37811 |
7 | (empty) | |
Marsalis | 54323 |
To look up Hancock from this array, you would start at the beginning of the array, compare the names, then try the next until the names match. This very simple algorithm
finds any entry in 1 to
steps, giving an average seek time of
. The seek time is therefore proportional to
. A much faster result can generally be achieved, if the database
is sorted.
name | |
0 | Brecker |
1 | Corea |
2 | Davis |
3 | Hancock |
4 | Harris |
5 | Marsalis |
6 | Parker |
7 | (empty) |
(empty) |
An efficient algorithm on this sorted array first checks entry , and then recursively uses bisection to check entries in
intervals
or
, depending wether the most
recently looked-up name precedes or succeeds the name sought. The average seek time
of this procedure this is proportional to
.
The idea behind using a hash function here is that although the possible number of combinations of characters in a name is quite large, only a subset of them is usually found in practice (i.e., names such as "Kwqrst" are much less common than names like "Jones.") Therefore, when you insert an entry into the database at an index that can somehow be calculated using a key (which is also available at the time you search for it), you might be able to find it later at the first location you check.
Consider the following simple example in which the hash function is simply the sum of ASCII codes of characters in a name (considered
to be all in lowercase) computed mod
.
name | |
Brecker | 6 |
Corea | 2 |
Davis | 2 |
Hancock | 12 |
Harris | 12 |
Marsalis | 2 |
Parker | 8 |
The above example illustrates that the hash function can give the same results for different keys. This difficulty is typically circumvented by introducing a second
hash function
whose results are designed to be completely different from that of
. For illustrative purposes, let
be one plus the bitwise exclusive or of all codes in a name
(again taken as all lowercase) mod
. This gives the following table.
name | |
Brecker | 11 |
Corea | 3 |
Davis | 10 |
Hancock | 4 |
Harris | 8 |
Marsalis | 3 |
Parker | 8 |
A new index can then be calculated as the sum of the first index and (mod
) until an empty slot is found where new data can be stored.
Note that when using
as an offset to walk through the database, it is not, in general, guaranteed that
any key will eventually reach any slot. However, for certain values of
, namely
a prime number, such behavior
is guaranteed, so
is always chosen to be prime. After computing
with
(a prime), the above phone
list would look like this for names added in alphabetic order.
index | key | compares to find |
0 | (empty) | |
1 | (empty) | |
2 | Corea | 1 |
3 | Hancock | 2 |
4 | (empty) | |
5 | Marsalis | 2 |
6 | Brecker | 1 |
7 | Harris | 2 |
8 | Parker | 1 |
9 | (empty) | |
10 | (empty) | |
11 | (empty) | |
12 | Davis | 2 |
The average seek time for locating a name in this table depends on the kind of data, , and the quality of the hash functions
used. However, for reasonable choices of hash functions, it will be much smaller
than
.