Practical Programming David Bouchet
Practical Work #3

Threads

Submission

Due Date

By Friday 22 February 2019 23:42

Directory Hierarchy

Create your git repository (replace john.smith by your own login).

$ git clone git@git.cri.epita.net:p/2022-s4-tp/tp03-john.smith

It must contain the following files and directories:

The AUTHORS file must contain the following information.

AUTHORS
First Name
Family Name
Login
Email Address

The last character of your AUTHORS file must be a newline character.

For instance:

AUTHORS
$ cat AUTHORS
John
Smith
john.smith
john.smith@epita.fr
$ # Command prompt ready for the next command...

Be careful, if you do not follow all the given instructions, no point will be given to your answers.

Basics

The Makefile

For this section, a Makefile is provided.

Makefile
# Makefile

CC = gcc
CPPFLAGS =
CFLAGS = -Wall -Wextra
LDFLAGS = -pthread
LDLIBS =

EXE = hello hello_id hello_parity

all: ${EXE}

hello:
hello_id:
hello_parity:

.PHONY: clean

clean:
	${RM} ${EXE}

# END

Hello From Threads

Provided Files

hello.c
// TODO: Insert the 'include' directives.

void* fn_thread(void* arg);

int main(int argc, char** argv)
{
    // Check the number of arguments.
    if (argc < 2)
        errx(EXIT_FAILURE, "The number of threads is missing.");
    else if (argc > 2)
        errx(EXIT_FAILURE, "Specify only the number of threads.");

    // TODO
    // - Convert the argument into a long integer.
    //   Use atol(3).
    //   This value represents the number of threads.
    // - If the argument is not valid (i.e. lower than or equal to zero),
    //   exit with an error message.
    // - Create and execute the threads.
    //   If an error occurs, exit with an error message.
    //   You can use err(3), but the 'errno' variable is not set automatically.
    //   You have to set it manually to the return value of pthread_create().
    // - Wait for the threads to end.
    // - Return from the function.
}

// Define the thread function.
void* fn_thread(void* arg __attribute__((unused)))
{
    // TODO
    // - Print a message.
    // - Return from the function.
}

Implementation

Here is a quote from GCC's documentation:

The keyword __attribute__ allows you to specify special attributes of variables or structure fields. This keyword is followed by an attribute specification inside double parentheses. [...]
unused
This attribute, attached to a variable, means that the variable is meant to be possibly unused. GCC will not produce a warning for this variable.

In other words, you do not have to use the arg parameter of the fn_thread() function.

Your program must execute the number of threads that was passed in. Each thread prints the same message in the terminal. Complete the hello.c file in order to have the same following behavior.

$ make hello
gcc -Wall -Wextra  -pthread  hello.c   -o hello
$ ./hello
hello: The number of threads is missing.
$ ./hello 3
Hello from thread!
Hello from thread!
Hello from thread!
$ ./hello 10
Hello from thread!
Hello from thread!
Hello from thread!
Hello from thread!
Hello from thread!
Hello from thread!
Hello from thread!
Hello from thread!
Hello from thread!
Hello from thread!
$ ./hello yes
hello: The number of threads is not valid.
$ ./hello -1
hello: The number of threads is not valid.
$ ./hello 1
Hello from thread!

Hello From Threads with ID

Copy hello.c to hello_id.c.

$ cp hello.c hello_id.c

In this section, you will modify hello_id.c so that each thread prints a message with its id number. You will generate the id number in the main() function. The type of the id number must be long (as the type of the number of threads).

You will pass the id numbers to the threads when you create them. So, you will no longer need the unused attribute. Note that you can explicitly cast a void* type into a long type and vice versa.

In order to simulate different execution times for each thread, you will call the sleep() function with a random parameter. For instance:

void* fn_thread(void* arg)
{
    sleep(rand() % 3);
    // TODO
}

The expected result is as follows (the id order can differ from one computer to another):

$ make hello_id
gcc -Wall -Wextra  -pthread  hello_id.c   -o hello_id
$ ./hello_id 1
Hello from thread 0!
$ ./hello_id 3
Hello from thread 2!
Hello from thread 0!
Hello from thread 1!
$ ./hello_id 8
Hello from thread 2!
Hello from thread 7!
Hello from thread 0!
Hello from thread 5!
Hello from thread 3!
Hello from thread 1!
Hello from thread 6!
Hello from thread 4!

You can check that all the threads printed a message by sorting the output.

$ ./hello_id 15 | sort -V
Hello from thread 0!
Hello from thread 1!
Hello from thread 2!
Hello from thread 3!
Hello from thread 4!
Hello from thread 5!
Hello from thread 6!
Hello from thread 7!
Hello from thread 8!
Hello from thread 9!
Hello from thread 10!
Hello from thread 11!
Hello from thread 12!
Hello from thread 13!
Hello from thread 14!

Or, for a large number of thread:

$ ./hello_id 1000 | wc -l
1000

Hello From Threads with Parity

Copy hello_id.c to hello_parity.c.

$ cp hello_id.c hello_parity.c

In this section, you will modify hello_parity.c so that each thread prints:

But you will not pass the id to the threads. You will only pass the message to print, which will not contain any newline characters. The line break will be done by the threads.

Keep the calls to the sleep() and rand() functions.

The expected result is as follows (the order can differ from one computer to another):

$ make hello_parity
gcc -Wall -Wextra  -pthread  hello_parity.c   -o hello_parity
$ ./hello_parity 3
Hello from an even thread.
Hello from an even thread.
Hello from an odd thread.
$ ./hello_parity 8
Hello from an even thread.
Hello from an odd thread.
Hello from an even thread.
Hello from an odd thread.
Hello from an odd thread.
Hello from an odd thread.
Hello from an even thread.
Hello from an even thread.
$ ./hello_parity 50 | grep even | wc -l
25

Parallel Sum Algorithms

Introduction

In this part, we want to write multithreaded programs that work out the sum of an array of bytes (we will consider that a byte is an 8-bit unsigned integer). In other words, we want to parallelize some sum algorithms.

The performance of such algorithms depends on the number of CPUs a computer has. For example, if your computer has four CPUs, you can use four threads at the most to improve the performance of an algorithm.

The most efficient strategy is to assign one thread per CPU, but some CPUs can already be used by any other processes currently running on the system. So, even if your computer has four CPUs, only two may be available for your program.

But for sure, if the number of threads is greater than the number of CPUs, there is no gain in parallelizing algorithms, and even worse, it can decrease performance in most cases.

To know the number of CPUs your computer has, you can type the following command:

$ lscpu
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              8
On-line CPU(s) list: 0-7
Thread(s) per core:  2
Core(s) per socket:  4
Socket(s):           1
NUMA node(s):        1
Vendor ID:           GenuineIntel
CPU family:          6
Model:               60
Model name:          Intel(R) Core(TM) i7-4790K CPU @ 4.00GHz
Stepping:            3
CPU MHz:             1028.660
CPU max MHz:         4400,0000
CPU min MHz:         800,0000
BogoMIPS:            7995.29
Virtualization:      VT-x
L1d cache:           32K
L1i cache:           32K
L2 cache:            256K
L3 cache:            8192K
# ...snip...

The above architecture will be used for all the examples of this section.

The Makefile

For this section, a Makefile is provided.

Makefile
# Makefile

CC = gcc
CPPFLAGS =
CFLAGS = -Wall -Wextra -O3
LDFLAGS = -pthread
LDLIBS =

EXE = split dnc

all: ${EXE}

split:
dnc:

.PHONY: clean

clean:
	${RM} ${EXE}

# END

Splitting Arrays into Chunks

Introduction

In a first approach to parallelizing a sum algorithm, we can split the array into N chunks and create N threads. Each thread will then work out the sum of one chunk. Finally, we can add up all the sums generated by the threads to get the sum of the array.

Provided Files

split.c
// TODO: Insert the 'include' directives.

#define INITIALIZE_ARRAY

struct thread_data
{
    long id;                    // Thread ID.
    pthread_t sys_id;           // Thread system ID.
    unsigned char *start;       // Points to the first element to add.
    long size;                  // Number of elements to add.
    long sum;                   // Sum of the elements.
};

// Return the sum of an array of bytes.
// 'start' points to the first element of an array.
// 'size' is the number of elements of the array.
unsigned long linear_sum(unsigned char *start, long size)
{
    // TODO
}

// Define the thread function.
void * worker(void *arg)
{
    // TODO
    // - Get the thread data passed as parameters.
    // - Call linear_sum().
    // - Store the result in the 'sum' field.
    // - Print the thread ID and the result.
    // - Return from the function.
}

int main(int argc, char **argv)
{
    // Get the arguments.
    // argv[1] = array_size = the size of the array of bytes (greater than 63).
    // argv[2] = thread_number = the number of threads (between 1 and 16).
    // -----------------------------------------------------------------------

    if (argc != 3)
        errx(EXIT_FAILURE, "Usage: array_size thread_number");

    long array_size = atol(argv[1]);
    if (array_size <= 63)
        errx(EXIT_FAILURE, "The size of the array is not valid (must be greater than 63).");

    long thread_number = atol(argv[2]);
    if (thread_number < 1 || thread_number > 16)
        errx(EXIT_FAILURE, "The number of threads is not valid (between 1 and 16).");

    // -----------------------------------------------------------------------

    // Data for the threads.
    struct thread_data data[thread_number];

    // Allocate the array of bytes.
    unsigned char *bytes = malloc(array_size);
    if (bytes == NULL)
        errx(EXIT_FAILURE, "Not enough memory!\nTry to reduce the size of the array.");

#ifdef INITIALIZE_ARRAY
    // Initialize the array.
    printf("Initializing array........");
    fflush(stdout);
    for (long i = 0; i < array_size; i++)
        bytes[i] = 1;
    printf("OK\n");
#endif

    // TODO
    // - Determine the size of a chunk.
    //   Be careful, the size of the last chunk may include some remaining bytes.
    // - For each chunk:
    //     - Set the thread_data structure of the thread.
    //       This structure is stored in the 'data' array.
    //       (The thread ID should be used as index.)
    //     - Execute the thread.
    // Wait for the threads and add up their sums.
    // Print the sum of the array.
    // Free the array.
    // Return from the function.
}

Implementation

The first step is to implement the linear_sum() function. This function adds all the elements of an array and returns the result.

You will use the worker() function for all of your threads. You will pass it a thread_data structure. To do so, you can use the fourth parameter of pthread_create().

The value of the id field of thread_data must be generated in your main() function. For example, you will assign 0 to id for the first thread you create. Then 1 for the second thread. And so on and so forth.

Complete the provided split.c file.

Testing

Your program will take two arguments:

First, you have to be sure that your code prints the right sum. To do this, you can set all the elements of your array to one. This way, the sum of the array will be equal to its size. Some code is already provided to initialize the array:

If you define INITIALIZE_ARRAY, the whole array will be initialized with ones.

#define INITIALIZE_ARRAY

For instance:

$ make split
gcc -Wall -Wextra -O3  -pthread  split.c   -o split
$ ./split 100 1
Initializing array........OK
Thread 00: 100
Sum......: 100
$ ./split 100 2
Initializing array........OK
Thread 00: 50
Thread 01: 50
Sum......: 100
$ ./split 100 3
Initializing array........OK
Thread 00: 33
Thread 01: 33
Thread 02: 34
Sum......: 100
$ ./split 100 8
Initializing array........OK
Thread 00: 12
Thread 01: 12
Thread 04: 12
Thread 02: 12
Thread 03: 12
Thread 06: 12
Thread 05: 12
Thread 07: 16
Sum......: 100

Once your code prints the right sum, comment out the line that defines INITIALIZE_ARRAY and recompile your code.

// #define INITIALIZE_ARRAY

Now, you can test the performance of your code with a large array. Try different numbers of threads and see the differences.

Let us see some examples executed on an 8-CPU computer.

$ time ./split 1000000000 1
Thread 00: 0
Sum......: 0

real	0m0,311s
user	0m0,223s
sys	0m0,088s
$ time ./split 1000000000 2
Thread 01: 0
Thread 00: 0
Sum......: 0

real	0m0,164s
user	0m0,220s
sys	0m0,104s
$ time ./split 1000000000 4
Thread 00: 0
Thread 01: 0
Thread 02: 0
Thread 03: 0
Sum......: 0

real	0m0,106s
user	0m0,297s
sys	0m0,080s
$ time ./split 1000000000 8
Thread 07: 0
Thread 01: 0
Thread 03: 0
Thread 00: 0
Thread 02: 0
Thread 05: 0
Thread 06: 0
Thread 04: 0
Sum......: 0

real	0m0,079s
user	0m0,492s
sys	0m0,103s
$ time ./split 1000000000 16
Thread 00: 0
Thread 08: 0
Thread 11: 0
Thread 10: 0
Thread 06: 0
Thread 03: 0
Thread 02: 0
Thread 07: 0
Thread 01: 0
Thread 14: 0
Thread 05: 0
Thread 09: 0
Thread 04: 0
Thread 13: 0
Thread 12: 0
Thread 15: 0
Sum......: 0

real	0m0,082s
user	0m0,500s
sys	0m0,106s

Look at the real time only. You can see that the performance increases up to 8 threads (not that much from 4 to 8 threads). Also, note that it is useless to use 16 threads.

Divide and Conquer

Introduction

In a second approach to parallelizing a sum algorithm, we can use a divide-and-conquer strategy.

Provided Files

dnc.c
// TODO: Insert the 'include' directives.

#define INITIALIZE_ARRAY

void * worker(void *arg);

struct thread_data
{
    unsigned char *start;       // Points to the first element to add.
    long size;                  // Number of elements to add.
    long threshold;             // Threshold.
    long sum;                   // Sum of the elements.
};

// Return the sum of an array of bytes.
// 'start' points to the first element of an array.
// 'size' is the number of elements of the array.
unsigned long linear_sum(unsigned char *start, long size)
{
    // TODO: same function as previously.
}

unsigned long dnc_sum(unsigned char *start, long size, long threshold)
{
    // TODO
    // Implement the divide-and-conquer algorithm.
}

// Counter of threads.
atomic_int thread_count = 1;

// The thread function.
void * worker(void *arg)
{
    // Increment the counter of threads.
    atomic_fetch_add(&thread_count, 1);

    // TODO
    // - Get the thread data passed as parameters.
    // - Call dnc_sum().
    //   (It may execute recursively another thread.)
    // - Store the result in the 'sum' field.
    // - Return from the function.
}

int main(int argc, char **argv)
{
    // Get the arguments.
    // argv[1] = array_size = the size of the array of bytes (greater than 63).
    // argv[2] = thread_number = the minimum number of threads (between 1 and 16).
    // -----------------------------------------------------------------------

    if (argc != 3)
        errx(EXIT_FAILURE, "Usage: array_size thread_number");

    long array_size = atol(argv[1]);
    if (array_size <= 63)
        errx(EXIT_FAILURE, "The size of the array is not valid (must be greater than 63).");

    long thread_number = atol(argv[2]);
    if (thread_number < 1 || thread_number > 16)
        errx(EXIT_FAILURE, "The number of threads is not valid (between 1 and 16).");

    // -----------------------------------------------------------------------

    // Allocate the array of bytes.
    unsigned char *bytes = malloc(array_size);
    if (bytes == NULL)
        errx(EXIT_FAILURE, "Not enough memory!\nTry to reduce the size of the array.");

#ifdef INITIALIZE_ARRAY
    // Initialize the array.
    printf("Initializing array.. ");
    fflush(stdout);
    for (long i = 0; i < array_size; i++)
        bytes[i] = 1;
    printf("OK\n");
#endif

    // Print the sum and the number of threads.
    printf("Sum................. %li\n", dnc_sum(bytes, array_size, 1 + array_size / thread_number));
    printf("Number of threads... %i\n", atomic_load(&thread_count));

    // Free the array and exit.
    free(bytes);
    return EXIT_SUCCESS;
}

Implementation

Do not pay particular attention to the atomic thread_count variable. It is just used to count and display the number of threads.

The divide-and-conquer algorithm is as follows:

if size <= threshold:
    return linear_sum(start, size)

size1 = size / 2
size2 = size - size1
mid = start + size1

s1 = dnd_sum(start, size1, threshold)  # Use a thread.
s2 = dnd_sum(mid, size2, threshold)

# Wait for s1.

return s1 + s2

For s1 you must use a thread. To do so, fill a thread_data structure and pass it to worker() via the fourth parameter of pthread_create().

It is useless to create a thread for s2 and wait for its result. The current thread can do it just as well.

Note that when the main() function calls dnc_sum(), the current thread is the main thread. It means that the main thread should be taken into account in the thread counter. That is why the thread_count variable is initialized to one and not zero.

Testing

Your program will take two arguments:

If your program implements the right algorithm, the maximum number of threads should be the next power of two greater than or equal to the argument. For instance:

In the same way as previously, you can define INITIALIZE_ARRAY to initialize the array.

For instance:

$ make dnc
gcc -Wall -Wextra -O3  -pthread  dnc.c   -o dnc
$ ./dcn 100 1
bash: ./dcn: No such file or directory
$ ./dnc 100 1
Initializing array.. OK
Sum................. 100
Number of threads... 1
$ ./dnc 100 2
Initializing array.. OK
Sum................. 100
Number of threads... 2
$ ./dnc 100 3
Initializing array.. OK
Sum................. 100
Number of threads... 4
$ ./dnc 100 8
Initializing array.. OK
Sum................. 100
Number of threads... 8

Once your code prints the right sum, comment out the line that defines INITIALIZE_ARRAY and recompile your code.

Now, you can test the performance of your code with a large array. Try different numbers of threads and see the differences.

Let us see some examples executed on an 8-CPU computer.

$ time ./dnc 1000000000 1
Sum................. 0
Number of threads... 1

real	0m0,310s
user	0m0,250s
sys	0m0,060s
$ time ./dnc 1000000000 2
Sum................. 0
Number of threads... 2

real	0m0,163s
user	0m0,232s
sys	0m0,092s
$ time ./dnc 1000000000 4
Sum................. 0
Number of threads... 4

real	0m0,089s
user	0m0,225s
sys	0m0,120s
$ time ./dnc 1000000000 8
Sum................. 0
Number of threads... 8

real	0m0,078s
user	0m0,521s
sys	0m0,065s
$ time ./dnc 1000000000 16
Sum................. 0
Number of threads... 16

real	0m0,085s
user	0m0,480s
sys	0m0,086s

The performance is similar to that of the previous algorithm.