Submission
Due Date
By Friday 22 February 2019 23:42
Directory Hierarchy
Create your git repository (replace john.smith by your own login).
$ git clone git@git.cri.epita.net:p/2022-s4-tp/tp03-john.smith
It must contain the following files and directories:
-
pw_03_threads/
- AUTHORS
basics/
- hello.c
- hello_id.c
- hello_parity.c
- Makefile
sums/
- split.c
- dnc.c
- Makefile
The AUTHORS
file
must contain the following information.
First Name
Family Name
Login
Email Address
The last character of your AUTHORS
file must be a newline character.
For instance:
$ cat AUTHORS
John
Smith
john.smith
john.smith@epita.fr
$ # Command prompt ready for the next command...
Be careful, if you do not follow all the given instructions, no point will be given to your answers.
Basics
The Makefile
For this section, a Makefile is provided.
# Makefile
CC = gcc
CPPFLAGS =
CFLAGS = -Wall -Wextra
LDFLAGS = -pthread
LDLIBS =
EXE = hello hello_id hello_parity
all: ${EXE}
hello:
hello_id:
hello_parity:
.PHONY: clean
clean:
${RM} ${EXE}
# END
Hello From Threads
Provided Files
// TODO: Insert the 'include' directives.
void* fn_thread(void* arg);
int main(int argc, char** argv)
{
// Check the number of arguments.
if (argc < 2)
errx(EXIT_FAILURE, "The number of threads is missing.");
else if (argc > 2)
errx(EXIT_FAILURE, "Specify only the number of threads.");
// TODO
// - Convert the argument into a long integer.
// Use atol(3).
// This value represents the number of threads.
// - If the argument is not valid (i.e. lower than or equal to zero),
// exit with an error message.
// - Create and execute the threads.
// If an error occurs, exit with an error message.
// You can use err(3), but the 'errno' variable is not set automatically.
// You have to set it manually to the return value of pthread_create().
// - Wait for the threads to end.
// - Return from the function.
}
// Define the thread function.
void* fn_thread(void* arg __attribute__((unused)))
{
// TODO
// - Print a message.
// - Return from the function.
}
Implementation
Here is a quote from GCC's documentation:
The keyword __attribute__ allows you to specify special attributes of variables or structure fields. This keyword is followed by an attribute specification inside double parentheses. [...]
- unused
- This attribute, attached to a variable, means that the variable is meant to be possibly unused. GCC will not produce a warning for this variable.
In other words, you do not have to use the arg parameter
of the fn_thread()
function.
Your program must execute the number of threads that was passed in.
Each thread prints the same message in the terminal.
Complete the hello.c
file in order to have the same following behavior.
$ make hello
gcc -Wall -Wextra -pthread hello.c -o hello
$ ./hello
hello: The number of threads is missing.
$ ./hello 3
Hello from thread!
Hello from thread!
Hello from thread!
$ ./hello 10
Hello from thread!
Hello from thread!
Hello from thread!
Hello from thread!
Hello from thread!
Hello from thread!
Hello from thread!
Hello from thread!
Hello from thread!
Hello from thread!
$ ./hello yes
hello: The number of threads is not valid.
$ ./hello -1
hello: The number of threads is not valid.
$ ./hello 1
Hello from thread!
Hello From Threads with ID
Copy hello.c
to hello_id.c
.
$ cp hello.c hello_id.c
In this section, you will modify hello_id.c
so that each thread prints a message with its id number.
You will generate the id number in the
main()
function.
The type of the id number must be long
(as the type of the number of threads).
You will pass the id numbers to the threads when you create them.
So, you will no longer need the unused attribute.
Note that you can explicitly cast a void*
type into a long
type and vice versa.
In order to simulate different execution times for each thread,
you will call the sleep()
function
with a random parameter.
For instance:
void* fn_thread(void* arg)
{
sleep(rand() % 3);
// TODO
}
The expected result is as follows (the id order can differ from one computer to another):
$ make hello_id
gcc -Wall -Wextra -pthread hello_id.c -o hello_id
$ ./hello_id 1
Hello from thread 0!
$ ./hello_id 3
Hello from thread 2!
Hello from thread 0!
Hello from thread 1!
$ ./hello_id 8
Hello from thread 2!
Hello from thread 7!
Hello from thread 0!
Hello from thread 5!
Hello from thread 3!
Hello from thread 1!
Hello from thread 6!
Hello from thread 4!
You can check that all the threads printed a message by sorting the output.
$ ./hello_id 15 | sort -V
Hello from thread 0!
Hello from thread 1!
Hello from thread 2!
Hello from thread 3!
Hello from thread 4!
Hello from thread 5!
Hello from thread 6!
Hello from thread 7!
Hello from thread 8!
Hello from thread 9!
Hello from thread 10!
Hello from thread 11!
Hello from thread 12!
Hello from thread 13!
Hello from thread 14!
Or, for a large number of thread:
$ ./hello_id 1000 | wc -l
1000
Hello From Threads with Parity
Copy hello_id.c
to hello_parity.c
.
$ cp hello_id.c hello_parity.c
In this section, you will modify hello_parity.c
so that each thread prints:
"Hello from an even thread."
if its id is even."Hello from an odd thread."
if its id is odd.
But you will not pass the id to the threads. You will only pass the message to print, which will not contain any newline characters. The line break will be done by the threads.
Keep the calls to the sleep()
and rand()
functions.
The expected result is as follows (the order can differ from one computer to another):
$ make hello_parity
gcc -Wall -Wextra -pthread hello_parity.c -o hello_parity
$ ./hello_parity 3
Hello from an even thread.
Hello from an even thread.
Hello from an odd thread.
$ ./hello_parity 8
Hello from an even thread.
Hello from an odd thread.
Hello from an even thread.
Hello from an odd thread.
Hello from an odd thread.
Hello from an odd thread.
Hello from an even thread.
Hello from an even thread.
$ ./hello_parity 50 | grep even | wc -l
25
Parallel Sum Algorithms
Introduction
In this part, we want to write multithreaded programs that work out the sum of an array of bytes (we will consider that a byte is an 8-bit unsigned integer). In other words, we want to parallelize some sum algorithms.
The performance of such algorithms depends on the number of CPUs a computer has. For example, if your computer has four CPUs, you can use four threads at the most to improve the performance of an algorithm.
The most efficient strategy is to assign one thread per CPU, but some CPUs can already be used by any other processes currently running on the system. So, even if your computer has four CPUs, only two may be available for your program.
But for sure, if the number of threads is greater than the number of CPUs, there is no gain in parallelizing algorithms, and even worse, it can decrease performance in most cases.
To know the number of CPUs your computer has, you can type the following command:
$ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 8
On-line CPU(s) list: 0-7
Thread(s) per core: 2
Core(s) per socket: 4
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 60
Model name: Intel(R) Core(TM) i7-4790K CPU @ 4.00GHz
Stepping: 3
CPU MHz: 1028.660
CPU max MHz: 4400,0000
CPU min MHz: 800,0000
BogoMIPS: 7995.29
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 8192K
# ...snip...
The above architecture will be used for all the examples of this section.
The Makefile
For this section, a Makefile is provided.
# Makefile
CC = gcc
CPPFLAGS =
CFLAGS = -Wall -Wextra -O3
LDFLAGS = -pthread
LDLIBS =
EXE = split dnc
all: ${EXE}
split:
dnc:
.PHONY: clean
clean:
${RM} ${EXE}
# END
Splitting Arrays into Chunks
Introduction
In a first approach to parallelizing a sum algorithm, we can split the array into N chunks and create N threads. Each thread will then work out the sum of one chunk. Finally, we can add up all the sums generated by the threads to get the sum of the array.
Provided Files
// TODO: Insert the 'include' directives.
#define INITIALIZE_ARRAY
struct thread_data
{
long id; // Thread ID.
pthread_t sys_id; // Thread system ID.
unsigned char *start; // Points to the first element to add.
long size; // Number of elements to add.
long sum; // Sum of the elements.
};
// Return the sum of an array of bytes.
// 'start' points to the first element of an array.
// 'size' is the number of elements of the array.
unsigned long linear_sum(unsigned char *start, long size)
{
// TODO
}
// Define the thread function.
void * worker(void *arg)
{
// TODO
// - Get the thread data passed as parameters.
// - Call linear_sum().
// - Store the result in the 'sum' field.
// - Print the thread ID and the result.
// - Return from the function.
}
int main(int argc, char **argv)
{
// Get the arguments.
// argv[1] = array_size = the size of the array of bytes (greater than 63).
// argv[2] = thread_number = the number of threads (between 1 and 16).
// -----------------------------------------------------------------------
if (argc != 3)
errx(EXIT_FAILURE, "Usage: array_size thread_number");
long array_size = atol(argv[1]);
if (array_size <= 63)
errx(EXIT_FAILURE, "The size of the array is not valid (must be greater than 63).");
long thread_number = atol(argv[2]);
if (thread_number < 1 || thread_number > 16)
errx(EXIT_FAILURE, "The number of threads is not valid (between 1 and 16).");
// -----------------------------------------------------------------------
// Data for the threads.
struct thread_data data[thread_number];
// Allocate the array of bytes.
unsigned char *bytes = malloc(array_size);
if (bytes == NULL)
errx(EXIT_FAILURE, "Not enough memory!\nTry to reduce the size of the array.");
#ifdef INITIALIZE_ARRAY
// Initialize the array.
printf("Initializing array........");
fflush(stdout);
for (long i = 0; i < array_size; i++)
bytes[i] = 1;
printf("OK\n");
#endif
// TODO
// - Determine the size of a chunk.
// Be careful, the size of the last chunk may include some remaining bytes.
// - For each chunk:
// - Set the thread_data structure of the thread.
// This structure is stored in the 'data' array.
// (The thread ID should be used as index.)
// - Execute the thread.
// Wait for the threads and add up their sums.
// Print the sum of the array.
// Free the array.
// Return from the function.
}
Implementation
The first step is to implement the linear_sum()
function.
This function adds all the elements of an array and returns the result.
You will use the worker()
function for all of your threads.
You will pass it a thread_data
structure.
To do so, you can use the fourth parameter of pthread_create()
.
The value of the id
field of thread_data
must be generated in your main()
function.
For example, you will assign 0 to id
for the first thread you create.
Then 1 for the second thread. And so on and so forth.
Complete the provided split.c
file.
Testing
Your program will take two arguments:
- The size of the array.
- The number of threads, which is limited to sixteen.
First, you have to be sure that your code prints the right sum. To do this, you can set all the elements of your array to one. This way, the sum of the array will be equal to its size. Some code is already provided to initialize the array:
If you define INITIALIZE_ARRAY
,
the whole array will be initialized with ones.
#define INITIALIZE_ARRAY
For instance:
$ make split
gcc -Wall -Wextra -O3 -pthread split.c -o split
$ ./split 100 1
Initializing array........OK
Thread 00: 100
Sum......: 100
$ ./split 100 2
Initializing array........OK
Thread 00: 50
Thread 01: 50
Sum......: 100
$ ./split 100 3
Initializing array........OK
Thread 00: 33
Thread 01: 33
Thread 02: 34
Sum......: 100
$ ./split 100 8
Initializing array........OK
Thread 00: 12
Thread 01: 12
Thread 04: 12
Thread 02: 12
Thread 03: 12
Thread 06: 12
Thread 05: 12
Thread 07: 16
Sum......: 100
Once your code prints the right sum, comment out the line
that defines INITIALIZE_ARRAY
and recompile your code.
// #define INITIALIZE_ARRAY
Now, you can test the performance of your code with a large array. Try different numbers of threads and see the differences.
Let us see some examples executed on an 8-CPU computer.
$ time ./split 1000000000 1
Thread 00: 0
Sum......: 0
real 0m0,311s
user 0m0,223s
sys 0m0,088s
$ time ./split 1000000000 2
Thread 01: 0
Thread 00: 0
Sum......: 0
real 0m0,164s
user 0m0,220s
sys 0m0,104s
$ time ./split 1000000000 4
Thread 00: 0
Thread 01: 0
Thread 02: 0
Thread 03: 0
Sum......: 0
real 0m0,106s
user 0m0,297s
sys 0m0,080s
$ time ./split 1000000000 8
Thread 07: 0
Thread 01: 0
Thread 03: 0
Thread 00: 0
Thread 02: 0
Thread 05: 0
Thread 06: 0
Thread 04: 0
Sum......: 0
real 0m0,079s
user 0m0,492s
sys 0m0,103s
$ time ./split 1000000000 16
Thread 00: 0
Thread 08: 0
Thread 11: 0
Thread 10: 0
Thread 06: 0
Thread 03: 0
Thread 02: 0
Thread 07: 0
Thread 01: 0
Thread 14: 0
Thread 05: 0
Thread 09: 0
Thread 04: 0
Thread 13: 0
Thread 12: 0
Thread 15: 0
Sum......: 0
real 0m0,082s
user 0m0,500s
sys 0m0,106s
Look at the real time only. You can see that the performance increases up to 8 threads (not that much from 4 to 8 threads). Also, note that it is useless to use 16 threads.
Divide and Conquer
Introduction
In a second approach to parallelizing a sum algorithm, we can use a divide-and-conquer strategy.
Provided Files
// TODO: Insert the 'include' directives.
#define INITIALIZE_ARRAY
void * worker(void *arg);
struct thread_data
{
unsigned char *start; // Points to the first element to add.
long size; // Number of elements to add.
long threshold; // Threshold.
long sum; // Sum of the elements.
};
// Return the sum of an array of bytes.
// 'start' points to the first element of an array.
// 'size' is the number of elements of the array.
unsigned long linear_sum(unsigned char *start, long size)
{
// TODO: same function as previously.
}
unsigned long dnc_sum(unsigned char *start, long size, long threshold)
{
// TODO
// Implement the divide-and-conquer algorithm.
}
// Counter of threads.
atomic_int thread_count = 1;
// The thread function.
void * worker(void *arg)
{
// Increment the counter of threads.
atomic_fetch_add(&thread_count, 1);
// TODO
// - Get the thread data passed as parameters.
// - Call dnc_sum().
// (It may execute recursively another thread.)
// - Store the result in the 'sum' field.
// - Return from the function.
}
int main(int argc, char **argv)
{
// Get the arguments.
// argv[1] = array_size = the size of the array of bytes (greater than 63).
// argv[2] = thread_number = the minimum number of threads (between 1 and 16).
// -----------------------------------------------------------------------
if (argc != 3)
errx(EXIT_FAILURE, "Usage: array_size thread_number");
long array_size = atol(argv[1]);
if (array_size <= 63)
errx(EXIT_FAILURE, "The size of the array is not valid (must be greater than 63).");
long thread_number = atol(argv[2]);
if (thread_number < 1 || thread_number > 16)
errx(EXIT_FAILURE, "The number of threads is not valid (between 1 and 16).");
// -----------------------------------------------------------------------
// Allocate the array of bytes.
unsigned char *bytes = malloc(array_size);
if (bytes == NULL)
errx(EXIT_FAILURE, "Not enough memory!\nTry to reduce the size of the array.");
#ifdef INITIALIZE_ARRAY
// Initialize the array.
printf("Initializing array.. ");
fflush(stdout);
for (long i = 0; i < array_size; i++)
bytes[i] = 1;
printf("OK\n");
#endif
// Print the sum and the number of threads.
printf("Sum................. %li\n", dnc_sum(bytes, array_size, 1 + array_size / thread_number));
printf("Number of threads... %i\n", atomic_load(&thread_count));
// Free the array and exit.
free(bytes);
return EXIT_SUCCESS;
}
Implementation
Do not pay particular attention to the atomic thread_count variable. It is just used to count and display the number of threads.
The divide-and-conquer algorithm is as follows:
if size <= threshold:
return linear_sum(start, size)
size1 = size / 2
size2 = size - size1
mid = start + size1
s1 = dnd_sum(start, size1, threshold) # Use a thread.
s2 = dnd_sum(mid, size2, threshold)
# Wait for s1.
return s1 + s2
For s1 you must use a thread.
To do so, fill a thread_data
structure
and pass it to worker()
via the fourth parameter of
pthread_create().
It is useless to create a thread for s2 and wait for its result. The current thread can do it just as well.
Note that when the main()
function
calls dnc_sum()
,
the current thread is the main thread.
It means that the main thread should be taken into account in the thread counter.
That is why the thread_count variable is initialized to one and not zero.
Testing
Your program will take two arguments:
- The size of the array.
- The minimum number of threads, which is limited to sixteen.
If your program implements the right algorithm, the maximum number of threads should be the next power of two greater than or equal to the argument. For instance:
- If the minimum number of threads is set to 8, the number of threads should be 8.
- If the minimum number of threads is set to 9, the number of threads should be between 9 and 16.
In the same way as previously,
you can define INITIALIZE_ARRAY
to initialize the array.
For instance:
$ make dnc
gcc -Wall -Wextra -O3 -pthread dnc.c -o dnc
$ ./dcn 100 1
bash: ./dcn: No such file or directory
$ ./dnc 100 1
Initializing array.. OK
Sum................. 100
Number of threads... 1
$ ./dnc 100 2
Initializing array.. OK
Sum................. 100
Number of threads... 2
$ ./dnc 100 3
Initializing array.. OK
Sum................. 100
Number of threads... 4
$ ./dnc 100 8
Initializing array.. OK
Sum................. 100
Number of threads... 8
Once your code prints the right sum, comment out the line
that defines INITIALIZE_ARRAY
and recompile your code.
Now, you can test the performance of your code with a large array. Try different numbers of threads and see the differences.
Let us see some examples executed on an 8-CPU computer.
$ time ./dnc 1000000000 1
Sum................. 0
Number of threads... 1
real 0m0,310s
user 0m0,250s
sys 0m0,060s
$ time ./dnc 1000000000 2
Sum................. 0
Number of threads... 2
real 0m0,163s
user 0m0,232s
sys 0m0,092s
$ time ./dnc 1000000000 4
Sum................. 0
Number of threads... 4
real 0m0,089s
user 0m0,225s
sys 0m0,120s
$ time ./dnc 1000000000 8
Sum................. 0
Number of threads... 8
real 0m0,078s
user 0m0,521s
sys 0m0,065s
$ time ./dnc 1000000000 16
Sum................. 0
Number of threads... 16
real 0m0,085s
user 0m0,480s
sys 0m0,086s
The performance is similar to that of the previous algorithm.