Set: Hash Tables
Hash Table
For a hash table, we take some hash function \(hash: \mathcal U \rightarrow\{0, 1, ..., m - 1\}\), where \(\mathcal U\) is the universe from which keys are drawn, and \(m\) is the number of bins pre-allocated for the hash table.
Then, we use the hash function to determine which slot to insert the (key, value) pair into. Thus, when we want to retrieve it, we can find the slot directly by running the hash function again.
However, in most cases \(|\mathcal U| \gg m\), so by the pigeonhole principle there must exist distinct keys \(x_1, x_2, \ldots, x_k\) with \(hash(x_1) = \cdots = hash(x_k)\). We call this a collision. The main goals for a hash table are to resolve collisions and to choose a "good" hash function.
Hash Functions
Some good hash function examples for natural number keys \(k\) (note that any key is essentially bits, which can be seen as an unsigned integer) are listed below; a small C sketch of both methods follows the list.
- division method: \(hash(k) = k\mod p\), where \(p \geq m\) is a prime number (the table then effectively uses \(p\) bins, as in the implementation at the end of this section).
  - We take the remainder by a prime number so that the result is more likely to depend on all bits of the input key, instead of a fixed range of bits. For example, if we use \(k\mod 2^i\), the hash depends only on the last \(i\) bits of \(k\).
  - Since it takes only one division operation, it is quite fast.
- multiplication method: \(hash(k) = \lfloor m (k A - \lfloor kA\rfloor)\rfloor\), where \(0 < A < 1\).
  - This method plays well with binary arithmetic, especially when \(m\) is set to a power of 2: the whole hash reduces to one multiplication and one bit shift.
  - Often \(A\approx (\sqrt 5 - 1)/2 = 0.6180339887\ldots\) works reasonably well.
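As a quick sketch of both methods (assuming 64-bit unsigned keys; the bin counts and constants below are illustrative choices, not prescribed by the text):

```c
#include <stdint.h>

/* Division method: hash(k) = k mod p, with the bin count itself a prime
   (here 1031, the next prime after 1024). */
uint64_t hash_division(uint64_t k) {
    const uint64_t p = 1031; /* illustrative prime bin count */
    return k % p;
}

/* Multiplication method: hash(k) = floor(m * frac(k * A)) with m = 2^10.
   In 64-bit fixed point, frac(k * A) is the low 64 bits of k * A_FIXED,
   and multiplying by m = 2^10 is a right shift by 64 - 10. */
uint64_t hash_multiplication(uint64_t k) {
    const uint64_t A_FIXED = 11400714819323198485ULL; /* ~ 2^64 * (sqrt(5)-1)/2 */
    return (k * A_FIXED) >> (64 - 10); /* result in {0, ..., 2^10 - 1} */
}
```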
Closed Addressing (Chaining)
To resolve a collision, the simplest way is to implement each bin as a linked list (other list implementations work too, but the runtime analysis becomes harder); this is called chaining, or closed addressing.
Specifically,
- Object: fixed-size array of bins
- Operations (a linked-list sketch follows the list):
  - `search(H, x)`: return whether \(x\) is in `H[hash(x)]`
  - `insert(H, x)`: insert \(x\) into `H[hash(x)]`
  - `delete(H, x)`: delete \(x\) from `H[hash(x)]`
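A minimal linked-list sketch of these three operations (integer keys, a fixed prime bin count; all names here are illustrative and separate from the implementation at the end of this section):

```c
#include <stdbool.h>
#include <stdlib.h>

#define NBINS 97 /* illustrative prime bin count m */

typedef struct node_t {
    int key;
    struct node_t *next;
} node_t;

static unsigned int hash(int key) {
    return ((unsigned int) key) % NBINS;
}

// search(H, x): walk the chain in bin hash(x)
bool chain_search(node_t *bins[NBINS], int x) {
    for (node_t *n = bins[hash(x)]; n != NULL; n = n->next)
        if (n->key == x) return true;
    return false;
}

// insert(H, x): push onto the head of the chain (O(1) worst case)
bool chain_insert(node_t *bins[NBINS], int x) {
    node_t *n = (node_t *) malloc(sizeof(node_t));
    if (n == NULL) return false;
    n->key = x;
    n->next = bins[hash(x)];
    bins[hash(x)] = n;
    return true;
}

// delete(H, x): unlink the first node holding x, if any
void chain_delete(node_t *bins[NBINS], int x) {
    node_t **p = &bins[hash(x)];
    while (*p != NULL && (*p)->key != x)
        p = &(*p)->next;
    if (*p != NULL) {
        node_t *dead = *p;
        *p = dead->next;
        free(dead);
    }
}
```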
Assume simple uniform hashing, i.e. any given key is equally likely to hash into any of the \(m\) bins, so the expected size of each bin is \(a := E(b_i) = n/m\) (the load factor), where \(n\) is the number of elements stored. Further assume that the hash function takes \(O(1)\) time.
Runtime analysis
Claim 1.1 `insert` takes worst-case \(O(1)\) time.
proof. Insert the new node at the head of the bin's linked list (only 2 pointer updates).
Claim 1.2 Under the simple uniform hashing assumption, `search` takes average-case \(\Theta(1+a)\) time.
proof. For an unsuccessful search, the cost is one hash computation plus a traversal of the entire linked list in the bin, whose expected length is \(a\); this gives \(\Theta(1+a)\).
For a successful search, there are \(n\) cases for the input \(x\), i.e. the \(n\) keys in the table, and we assume each element in the table has probability \(1/n\) of being searched. Define \(I_{ij}\) to be the indicator random variable that the \(i\)th and \(j\)th inserted keys land in the same bin. By simple uniform hashing, \(P(I_{ij} = 1) = 1/m\), thus, as a Bernoulli random variable, \(E(I_{ij}) = 1/m\). Then, to find some node \(i\), we need to compare the searched key with all nodes before \(i\) in its chain, and by our head-inserting `insert`, those are exactly the nodes inserted after \(i\); and then with \(i\) itself. Therefore, for node \(i\), the number of elements compared is \(1 + \sum_{j=i+1}^n I_{ij}\). We can write the average-case time as
\[\frac{1}{n}\sum_{i=1}^n E\left(1 + \sum_{j=i+1}^n I_{ij}\right) = 1 + \frac{1}{n}\sum_{i=1}^n \sum_{j=i+1}^n \frac{1}{m} = 1 + \frac{n-1}{2m} = 1 + \frac{a}{2} - \frac{a}{2n} \in \Theta(1+a).\]
Claim 1.3 Under the simple uniform hashing assumption, `delete` takes average-case \(\Theta(1+a)\) time.
proof. Perform a `search`, then spend constant time updating pointers.
Note that if \(m\) is proportional to \(n\), then \(a = n/m \in O(1)\). Thus, all operations take expected constant time.
Open Addressing
For open addressing, we have
- Object: fixed-size array of (key, value) pairs, of size \(M\)
- Operations:
  - `search(H, x)`: examine `H[h(x, 0)]`, `H[h(x, 1)]`, ... until \(x\) is found or `None` is encountered
  - `insert(H, x)`: try `H[h(x, 0)]`, `H[h(x, 1)]`, ... until we find `None` or `DEL` at some index \(i\), then insert into `H[h(x, i)]`
  - `delete(H, x)`: search for \(x\), say it is found at `H[h(x, i)]`, then set `H[h(x, i)] = DEL`

Successively examining `H[h(x, 0)]`, `H[h(x, 1)]`, ... is called probing.
First, note that the array must have size \(M \geq n\); otherwise we cannot store all the desired (key, value) pairs, since the table can fill up.
Next, note that \(h(x, i)\) takes both a key \(x\) and a probe counter \(i\); we call this a parameterized hash function.
Consider the sequence \((h(x, i))_{i=0}^{M-1}\), called the probe sequence of key \(x\). It must be a permutation of \(\{0, 1, \cdots, M-1\}\), so that every slot is eventually examined as the table fills up. We would like to make the uniform hashing assumption, i.e. the probe sequence of each key is equally likely to be any of the \(M!\) permutations of \(\{0, ..., M-1\}\). Unfortunately, such an assumption is hard to implement.
The following techniques guarantee that the probe sequence for each key \(x\in\mathcal U\) is a valid permutation. However, the maximum number of distinct probe sequences they can generate is \(M^2\), so none of them satisfies the uniform hashing assumption.
All of these methods use an ordinary hash function \(hash\) as defined above, referred to as the auxiliary hash function. We know that \(hash\) alone will collide, so we use the additional counter \(i\) to step away from collisions.
Linear Probing
Linear probing uses \(h(k, i) = (hash(k) + c_0\, i) \mod M\) for a constant \(c_0\) relatively prime to \(M\). Consider the example of \(c_0=1\): the probe sequence is simply \((hash(k), hash(k) + 1, ..., M-1, 0, 1, ...)\).
However, linear probing suffers from primary clustering. Clusters arise because an empty slot preceded by \(i\) full slots is the next to be filled with probability \((i+1)/M\), so long runs of occupied slots tend to grow longer. Also, if \(hash(x) = hash(y)\), then the probe sequences of \(x\) and \(y\) are identical. A sketch of linear-probing search and insert follows.
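A minimal sketch with \(c_0 = 1\) (the `EMPTY`/`DEL` sentinels and the nonnegative-key assumption are simplifications for illustration):

```c
#include <stdbool.h>

#define TABLE_SIZE 13 /* illustrative M */
#define EMPTY (-1)    /* never-used slot; assumes stored keys are >= 0 */
#define DEL   (-2)    /* deleted slot */

static unsigned int aux_hash(int k) {
    return ((unsigned int) k) % TABLE_SIZE;
}

/* linear probing with c0 = 1: h(k, i) = (hash(k) + i) mod M */
static unsigned int probe(int k, unsigned int i) {
    return (aux_hash(k) + i) % TABLE_SIZE;
}

/* probe until the key or an EMPTY slot is found */
bool oa_search(const int table[TABLE_SIZE], int k) {
    for (unsigned int i = 0; i < TABLE_SIZE; i++) {
        int slot = table[probe(k, i)];
        if (slot == k) return true;
        if (slot == EMPTY) return false; /* k cannot appear further along */
    }
    return false;
}

/* insert into the first EMPTY or DEL slot; false if the table is full */
bool oa_insert(int table[TABLE_SIZE], int k) {
    for (unsigned int i = 0; i < TABLE_SIZE; i++) {
        unsigned int j = probe(k, i);
        if (table[j] == EMPTY || table[j] == DEL) {
            table[j] = k;
            return true;
        }
    }
    return false;
}
```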
Quadratic Probing
Quadratic probing uses \(h(k, i) = (hash(k) + c_1 i + c_2 i^2) \mod M\) for constants \(c_1, c_2\) (which must be chosen, together with \(M\), so that the probe sequence is a permutation). It has the same issue: if \(hash(x) = hash(y)\), then the probe sequences of \(x\) and \(y\) are identical. It also still suffers from clustering (secondary clustering), though milder than linear probing.
Double Hashing
It is "double" as we use two different hash functions. Thus, the chance that \(hash_1, hash_2\) both collide is much reduced. In practice, this tends to give the best results.
Universal Hashing
With a fixed hash function, a malicious adversary can choose \(n\) keys that all collide with each other. The only effective way to improve the situation is to randomly sample hash functions, in a way that is independent of the keys that are actually going to be stored.
This approach is called universal hashing: at hash table construction time, we randomly pick a hash function from a carefully designed class of functions. Therefore, given a fixed sequence of inputs and operations, each execution behaves differently, and we can guarantee good average-case time.
Let \(\mathcal H\) be a finite collection of hash functions \(h: \mathcal U \rightarrow \{0, 1,..., m-1\}\). \(\mathcal H\) is universal if for each pair of distinct keys \(k_1, k_2\in\mathcal U\), there are at most \(|\mathcal H|/m\) hash functions \(h\in\mathcal H\) s.t. \(h(k_1) = h(k_2)\).
In other words, with a hash function randomly chosen from \(\mathcal H\), for any pair of distinct keys \(k_1, k_2\), the chance of a collision is \(P(h(k_1) = h(k_2)) \leq 1/m\).
Universal Hashing Theorem
Claim Using universal hashing with chaining, the worst-case expected running time for `search` is \(\Theta(1 + \frac{n}{m})\).
proof. The setting is very similar to the average-case analysis for chaining, but note that we are now bounding the expected running time (over the random choice of \(h\)) instead of the average running time (over the choice of the searched key).
Let \(h\in\mathcal H\) be chosen uniformly at random, let \(n\) be the number of keys stored in the table \(T\), and define the load factor \(a = n/m\). Let \(I_{ij} =\mathbb I(h(k_i) = h(k_j))\). Since \(\mathcal H\) is universal, \(E(I_{ij}) = P(h(k_i) = h(k_j))\leq 1/m\).
Define the random variable \(Y_i\) to be the number of keys other than \(k_i\) that land in the same bin as \(k_i\). Then, by linearity of expectation,
\[E(Y_i) = E\Big(\sum_{k_j\in T,\ k_j\neq k_i} I_{ij}\Big) = \sum_{k_j\in T,\ k_j\neq k_i} E(I_{ij}) \leq \sum_{k_j\in T,\ k_j\neq k_i} \frac{1}{m}.\]
Thus, for an unsuccessful search (\(k_i\not\in T\)), \(E(Y_i) \leq \sum_{k_j\in T} \frac{1}{m} = \frac{n}{m}\);
for a successful search, \(E(Y_i) \leq \sum_{k_j\in T,\ j\neq i} \frac{1}{m} = \frac{n-1}{m}\).
Let the random variable \(n_{h(k)}\) be the number of linked-list nodes traversed in bin \(h(k)\):
- for an unsuccessful search, it traverses all nodes in the bin, so \(E(n_{h(k_i)}) = E(Y_i) \leq n/m\);
- for a successful search, it traverses at most the other nodes in the bin plus \(k_i\) itself, so \(E(n_{h(k_i)}) = E(Y_i) + 1\leq \frac{n-1}{m} +1 < \frac{n}{m}+1\).

Adding the \(O(1)\) cost of computing the hash, the expected running time is \(\Theta(1 + \frac{n}{m})\) in both cases.
Universal Class of Hash Functions
For \(\mathcal U = \{0, 1, ..., 2^{w} - 1\}\) and a hash table with \(m=2^M\) bins, the following family of hash functions is universal:
\[\mathcal H = \{h_{a,b} \mid a, b \in \mathcal U,\ a \text{ odd}\}, \qquad h_{a,b}(k) = ((ak+b) \bmod 2^w)\ //\ 2^{w-M}.\]
This family is very natural for binary computers. Note that \(\mathcal U\) represents the set of binary representations of unsigned integers with \(w\) bits; typically \(w = 16, 32, 64\). Then \(ak+b\) is ordinary unsigned integer arithmetic, \(\bmod\ 2^w\) simply keeps the last \(w\) bits, and \(//\ 2^{w-M}\) is a bit shift keeping the \(M\) leftmost bits.
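A sketch for \(w = 64\) and \(M = 10\) (so \(m = 1024\) bins); the parameters `a` (odd) and `b` would be drawn once at table-creation time, and the `rand()`-based draw below is only an illustrative stand-in for a real source of randomness:

```c
#include <stdint.h>
#include <stdlib.h>

#define M_BITS 10 /* m = 2^10 bins */

/* h_{a,b}(k) = ((a*k + b) mod 2^64) >> (64 - M_BITS); unsigned arithmetic
   wraps mod 2^64, and the shift keeps the M_BITS leftmost bits. */
uint64_t multiply_shift(uint64_t k, uint64_t a, uint64_t b) {
    return (a * k + b) >> (64 - M_BITS);
}

/* Draw a (odd) and b once per table. */
void multiply_shift_params(uint64_t *a, uint64_t *b) {
    *a = ((((uint64_t) rand()) << 32) ^ (uint64_t) rand()) | 1u; /* force odd */
    *b = (((uint64_t) rand()) << 32) ^ (uint64_t) rand();
}
```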
For a hash table with \(m\) bins, let \(p > m\) be a prime large enough that every key \(k\in\mathcal U\) satisfies \(k < p\), and define \(\mathbb Z_{p} = \{0, 1, ..., p-1\}, \mathbb Z_p^* = \{1, ..., p-1\}\). The following family of hash functions is universal:
\[\mathcal H_{p,m} = \{h_{a,b} \mid a\in\mathbb Z_p^*,\ b\in\mathbb Z_p\}, \qquad h_{a,b}(k) = ((ak+b) \bmod p) \bmod m.\]
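A sketch of drawing and evaluating one member of this family (assuming \(p < 2^{32}\) and keys below \(p\), so the products fit in 64 bits; the `rand()`-based draw is again only illustrative):

```c
#include <stdint.h>
#include <stdlib.h>

typedef struct {
    uint64_t a, b, p, m;
} prime_hash_t;

/* Pick h_{a,b}(k) = ((a*k + b) mod p) mod m with a in Z_p^*, b in Z_p. */
prime_hash_t prime_hash_pick(uint64_t p, uint64_t m) {
    prime_hash_t h;
    h.p = p;
    h.m = m;
    h.a = 1 + (uint64_t) rand() % (p - 1); /* from {1, ..., p-1} */
    h.b = (uint64_t) rand() % p;           /* from {0, ..., p-1} */
    return h;
}

uint64_t prime_hash_apply(const prime_hash_t *h, uint64_t k) {
    return ((h->a * k + h->b) % h->p) % h->m; /* needs a*k + b < 2^64 */
}
```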
Implementation
Header for hashset
```c
#include <assert.h>
#include <math.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>
#include <stdlib.h>
#include <stdio.h>

#define BUCKET_SIZE 10

// One bucket: a dynamic array of keys.
typedef struct bucket_t {
    unsigned int size;     // number of keys currently stored
    unsigned int max_size; // allocated capacity
    int *arr;
} bucket_t;

typedef struct hash_table_t {
    int size; // number of buckets; should be a prime number
    bucket_t *buckets;
} hash_table_t;
static bool is_prime(int n) {
    assert(n > 0);
    if (n < 2) // 1 is not prime
        return false;
    for (int i = 2; i <= sqrt(n); i++) {
        if (n % i == 0)
            return false;
    }
    return true;
}
// Get the smallest prime number that is not less than n (for hash table size
// computation)
int next_prime(int n) {
for (int i = n;; i++) {
if (is_prime(i))
return i;
}
assert(false);
return 0;
}
// Create a hash table with 'size' buckets; the storage is allocated dynamically
// using malloc(); returns NULL on error
hash_table_t *hash_create(int size) {
assert(size > 0);
int tsize = next_prime(size);
hash_table_t *table = (hash_table_t *) malloc(sizeof(hash_table_t));
if (table == NULL) {
return NULL;
}
table->size = tsize;
table->buckets = (bucket_t *) malloc(tsize * sizeof(bucket_t));
if (table->buckets == NULL) {
free(table);
return NULL;
}
    for (int i = 0; i < tsize; i++) {
        int *arr = (int *) malloc(BUCKET_SIZE * sizeof(int));
        if (arr == NULL) { // clean up everything allocated so far
            for (int j = 0; j < i; j++)
                free(table->buckets[j].arr);
            free(table->buckets);
            free(table);
            return NULL;
        }
        table->buckets[i].max_size = BUCKET_SIZE;
        table->buckets[i].size = 0;
        table->buckets[i].arr = arr;
    }
return table;
}
// Release all memory used by the hash table, its buckets and entries
void hash_destroy(hash_table_t *table) {
assert(table != NULL);
for (int i = 0; i < table->size; i++) {
free((table->buckets[i]).arr);
}
free(table->buckets);
free(table);
}
// Returns 0 if key is not found, 1 if it exists
int hash_get(hash_table_t *table, int key) {
    assert(table != NULL);
    // take a non-negative remainder so that negative keys stay in range
    int hash_value = ((key % table->size) + table->size) % table->size;
    bucket_t *bucket = &(table->buckets[hash_value]);
    for (unsigned int i = 0; i < bucket->size; i++) {
        if (bucket->arr[i] == key) return 1;
    }
    return 0;
}
// Returns 0 on success, -1 on failure
int hash_put(hash_table_t *table, int key) {
    assert(table != NULL);
    int hash_value = ((key % table->size) + table->size) % table->size;
    bucket_t *bucket = &(table->buckets[hash_value]);
    // if the key already exists, there is nothing to do
    for (unsigned int i = 0; i < bucket->size; i++) {
        if (bucket->arr[i] == key) return 0;
    }
    if (bucket->size == bucket->max_size) { // grow the bucket
        int *new_arr = (int *) realloc(bucket->arr, 2 * bucket->max_size * sizeof(int));
        if (new_arr == NULL) return -1;
        bucket->arr = new_arr;
        bucket->max_size *= 2;
    }
    bucket->arr[bucket->size++] = key;
    return 0;
}
```
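A minimal usage sketch (this `main` is hypothetical, not part of the header):

```c
int main(void) {
    hash_table_t *set = hash_create(100); /* size rounded up to the prime 101 */
    if (set == NULL) return 1;
    hash_put(set, 42);
    hash_put(set, 143); /* 143 mod 101 == 42, so it chains into the same bucket */
    printf("42 in set: %d\n", hash_get(set, 42));   /* 1 */
    printf("143 in set: %d\n", hash_get(set, 143)); /* 1 */
    printf("7 in set: %d\n", hash_get(set, 7));     /* 0 */
    hash_destroy(set);
    return 0;
}
```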