Practical Programming David Bouchet
Practical Work #7

Hash Tables

Submission

Due Date

By Friday 14 December 2018 23:42

Directory Hierarchy

Create your git repository (replace john.smith by your own login).

$ git clone git@git.cri.epita.net:p/2022-spe-tp/tp07-john.smith

It must contain the following files and directories:

The AUTHORS file must contain the following information.

AUTHORS
First Name
Family Name
Login
Email Address

The last character of your AUTHORS file must be a newline character.

For instance:

AUTHORS
$ cat AUTHORS
John
Smith
john.smith
john.smith@epita.fr
$ # Command prompt ready for the next command...

Be careful, if you do not follow all the given instructions, no point will be given to your answers.

Introduction

In this practical, you will implement a hash table. The keys and values will be strings of characters.

For the examples, we will use countries and their capital cities. The countries will be the keys and the capitals will be the values. For instance, we can have the following key-value pairs:

Our hash table will be made up of two structures.

The first structure is as follows:

struct htab
{
    size_t capacity;
    size_t size;
    struct pair *data;
};
capacity
The total number of elements that can be stored in the data array. In other words, it is the length of the data array.
size
The number of actual elements within the data array. In other words, it is the number of used cells within the array. Be careful, the used cells are not necessarily contiguous (e.g. only the first and last cells can be used).
data
An array of elements. In this specific case, it is an array of 'struct pair' values. Actually, each cell in this array will contain a linked list of 'struct pair' values. For instance, if the array's capacity is four, it will contain four linked lists. These lists will use sentinels. So, when the array is empty (size = 0), it contains only sentinels. Linked lists will be used in order to handle collisions between keys. See also Separate chaining with linked lists for further detail.

The second structure will contain a key-value pair, which is an element of the linked list.

struct pair
{
    uint32_t hkey;
    char *key;
    void *value;
    struct pair *next;
};
hkey
The hash value of the key.
key
The key.
value
The value associated with the key. We use the void* type so that we do not have to use strings of characters even if all the examples in this practical will use only strings of characters.
next
Points to the next key-value pair.

Provided Files

The htab.h File

htab.h
#ifndef _HTAB_H_
#define _HTAB_H_

#include <stdint.h>
#include <stdlib.h>

struct pair
{
    uint32_t hkey;
    char *key;
    void *value;
    struct pair *next;
};

struct htab
{
    size_t capacity;
    size_t size;
    struct pair *data;
};

// Return the hash value of the 'key' string.
// Use the Jenkins's one_at_a_time hash.
uint32_t hash(char *key);

// Create a new empty hash table.
// The initial capacity is 4.
// The initial size is 0.
// If there is not enough memory, the program prints
// "Not enough memory!" and exits with the error code 1.
// (Use the errx() function of the standard library.)
// Be careful, you have to allocate two memory spaces.
// - The memory space that holds the 'struct htab' variable.
// - The memory space that holds the data.
//   All cells of the 'data' array must be initialized to zero
//   (they contain the sentinels of the linked lists.)
struct htab *htab_new();

// Delete all the pairs of a hash table.
// Free the memory allocated by the pairs.
// The 'data' array is not freed.
// The table's capacity does not change.
// The table's size is set to zero.
// After this function, the hash table can still be used.
void htab_clear(struct htab *ht);

// Delete a hash table.
// Free all the allocated memory.
// After this function, the hash table can no longer be used.
void htab_free(struct htab *ht);

// Return a pair of the hash table from its key.
// (The pair is not removed from the hash table.)
// If the pair is not in the table, return NULL.
struct pair *htab_get(struct htab *ht, char *key);

// Insert a pair into the hash table.
// If the pair is already in the table, return 0.
// Otherwise:
// - Insert the pair in the hash table.
// - Increment the size if the pair has been placed into an empty cell.
// - If the ratio (size / capacity) is larger than 75%,
//   double the capacity of the hash table.
int htab_insert(struct htab *ht, char *key, void *value);

// Remove a pair from the table.
// The pair is also freed.
// After this function, the pair can no longer be used.
// The size is updated.
void htab_remove(struct htab *ht, char *key);

#endif

The htab.c File

htab.c
#include <err.h>
#include <string.h>

#include "htab.h"

uint32_t hash(char *key)
{
    // TODO
}

struct htab *htab_new()
{
    // TODO
}

void htab_clear(struct htab *ht)
{
    // TODO
}

void htab_free(struct htab *ht)
{
    // TODO
}

struct pair *htab_get(struct htab *ht, char *key)
{
    // TODO
}

int htab_insert(struct htab *ht, char *key, void *value)
{
    // TODO
}

void htab_remove(struct htab *ht, char *key)
{
    // TODO
}

Implementation

The Hash Function

The first thing we need is a hash function. You will use the one_at_a_time version of the Jenkins hash function. So, write the following function that returns the hash value of the key parameter. (The uint32_t type is defined in the <stdint.h> header file.)

uint32_t hash(char *key);

In order to test your function, here are some hash values:

hash("France")  -> 0xd1f87070
hash("Spain")   -> 0xa394a0be
hash("Jamaica") -> 0x82b25036
hash("Cuba")    -> 0x2238f8bb
hash("Turkey")  -> 0x06371821

Before writing the other functions, you have to understand the internal mechanism of the hash table.

Inserting Key-Value Pairs

Let us assume that we have a table with capacity 4 and size 0 (it is an empty hash table). This table has four unused cells, that is, each cell contains a sentinel. Let us print this table.

(capacity, size, ratio) = (4, 0, 0%)
00
01
02
03

Note: ratio = 100 * size / capacity.

Now, we want to insert the following key-value pair: (France, Paris). First, we need its hash value: 0xd1f87070.

Then, we have to determine the index of the cell where this pair should be stored. To do so, we divide the hash value by the capacity and get the remainder.

0xd1f87070 % 4 = 0

So, the index value will be 0.

Let us print this table after the pair has been inserted.

(capacity, size, ratio) = (4, 1, 25%)
00 -> (0xd1f87070, France, Paris)
01
02
03

Now, let us insert the following key-value pair: (Spain, Madrid). The hash value of "Spain" is 0xa394a0be and 0xa394a0be % 4 = 2. Therefore:

(capacity, size, ratio) = (4, 2, 50%)
00 -> (0xd1f87070, France, Paris)
01
02 -> (0xa394a0be, Spain, Madrid)
03

Now, let us insert the following key-value pair: (Kingston, London). The hash value of "Jamaica" is 0x82b25036 and 0x82b25036 % 4 = 2. Therefore:

(capacity, size, ratio) = (4, 2, 50%)
00 -> (0xd1f87070, France, Paris)
01
02 -> (0x82b25036, Jamaica, Kingston) -> (0xa394a0be, Spain, Madrid)
03

The index values of "Spain" and "Jamaica" are the same. So, they are stored in the same cell. That is the reason why we use linked lists.

It is worthy to note that:

Now, let us insert (Cuba, Havana). The hash value of "Cuba" is 0x2238f8bb and 0x2238f8bb % 4 = 3. Therefore:

(capacity, size, ratio) = (4, 3, 75%)
00 -> (0xd1f87070, France, Paris)
01
02 -> (0x82b25036, Jamaica, Kingston) -> (0xa394a0be, Spain, Madrid)
03 -> (0x2238f8bb, Cuba, Havana)

Now, let us insert (Iraq, Baghdad). The hash value of "Iraq" is 0x25172e4a and 0x25172e4a % 4 = 2. Therefore:

(capacity, size, ratio) = (4, 3, 75%)
00 -> (0xd1f87070, France, Paris)
01
02 -> (0x25172e4a, Iraq, Baghdad) -> (0x82b25036, Jamaica, Kingston) -> (0xa394a0be, Spain, Madrid)
03 -> (0x2238f8bb, Cuba, Havana)

Finally, let us insert (Turkey, Ankara). The hash value of "Turkey" is 0x06371821 and 0x06371821 % 4 = 1.

As we can see, if we insert this pair into the index-1 cell, 100% of the cells will be used. The problem is that it is not really efficient to overload the cells. So let us impose a new rule: if the ratio is greater than 75%, we will double the capacity of the array. If we insert (Turkey, Ankara), the new hash table should look like this:

(capacity, size, ratio) = (8, 5, 62%)
00 -> (0xd1f87070, France, Paris)
01 -> (0x06371821, Turkey, Ankara)
02 -> (0x25172e4a, Iraq, Baghdad)
03 -> (0x2238f8bb, Cuba, Havana)
04
05
06 -> (0xa394a0be, Spain, Madrid) -> (0x82b25036, Jamaica, Kingston)
07

Be careful, when the capacity is doubled, all the index values must be recalculated. That is the reason why (Spain, Madrid) and (Jamaica, Kingston) are now stored in the cell of index 6.

Searching Key-Value Pairs

To search a key-value pair:

  1. Find its index value in the array.
  2. Find the hash value of the key in the linked list.
  3. As collisions may happen, check that the keys are the same. To compare the keys, you can use the strcmp() function of the standard library.

Complete the htab.c file. Just replace the TODO comments by your own code. (This file must not contain a main function.)

You will submit only the htab.c and htab.h files.

You will test your functions on your own.