OpenMP - Introduction to Parallelism
PThreads are built directly on OS features, so thread creation has high overhead. The API is also quite low-level, which means more data-race bugs that are harder to find, and deadlocks are easier to introduce.
Introduction
As almost all machines today have multi-core CPUs, we want a simpler, "lighter" way to write multithreaded code. One alternative is OpenMP.
OpenMP is commonly implemented using PThreads under the hood; compile (using gcc) with -fopenmp.
Basic directives
OpenMP uses "hints" or compiler "directives" as to what intended to parallelize
Library functions
OpenMP also provides library functions, declared in omp.h, that report information about the currently running program:
/* meaningful within a parallel region */
int omp_get_num_threads();  // number of threads in the innermost enclosing parallel region
int omp_get_thread_num();   // this thread's id within the team (0 .. nthreads-1)
int omp_in_parallel();      // non-zero if called from within a parallel region

/* usable anywhere */
int omp_get_max_threads();        // max number of threads that can be created
int omp_get_num_procs();          // number of processors available
void omp_set_num_threads(int n);  // set number of threads for subsequent parallel sections
Basic Example
#include <omp.h>

int main() {
    omp_set_num_threads(8);
    #pragma omp parallel
    {
        omp_get_thread_num();  // 0 ... 7, no guarantee on order
        omp_get_num_threads(); // 8
    }
    omp_get_num_threads();     // 1, back outside the parallel region
    return 0;
}
Directive Clauses
// only parallelize if expr is non-zero
#pragma omp parallel if(expr)

// team size; overrides omp_set_num_threads and the env variable OMP_NUM_THREADS
#pragma omp parallel num_threads(8)
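A short sketch combining both clauses (the cutoff 1000 is an arbitrary value for illustration): small inputs run serially, larger ones on a team of 4.

#include <omp.h>
#include <stdio.h>

void work(int n) {
    // parallel only when n is large enough; team size capped at 4
    #pragma omp parallel if(n > 1000) num_threads(4)
    {
        printf("n=%d: thread %d of %d\n", n, omp_get_thread_num(), omp_get_num_threads());
    }
}

int main() {
    work(10);   // prints one line: thread 0 of 1
    work(5000); // prints four lines, one per thread
    return 0;
}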
Nested parallelism
OpenMP supports arbitrarily deep nesting of omp parallel, provided nested parallelism is enabled (e.g. via omp_set_nested(1) or OMP_NESTED=true) and enough threads are available. However, omp_get_num_threads and omp_get_thread_num only report the team of the innermost enclosing parallel region.
omp_set_nested(1); // enable nested parallelism; otherwise inner regions get 1 thread
#pragma omp parallel num_threads(2)
{
    omp_get_num_threads(); // 2
    #pragma omp parallel num_threads(2)
    {
        // 4 threads are running in total (2 outer x 2 inner)
        omp_get_num_threads(); // 2, only the innermost team is counted
    }
    omp_get_num_threads(); // 2
}
Variable Semantics
Data sharing / scope of variables
- private: each thread gets its own local (uninitialized) copy
- shared: all threads share the same copy
- firstprivate: like private, but initialized to the value the variable held before the parallel directive
#include <omp.h>
#include <stdio.h>

int main() {
    int A = 0, B = 1, C = 2, D = 3, E = 4;
    #pragma omp parallel private(A, B) shared(C) firstprivate(D)
    {
        printf("%d\n", A); // uninitialized: private copies start with an undefined value
        B = 5;
        printf("%d\n", B); // 5
        printf("%d\n", C); // 2
        C = -2;            // shared: all threads write the same copy (a race, for illustration)
        printf("%d\n", C); // -2
        printf("%d\n", D); // 3, firstprivate copies start at the pre-region value
        D = -3;
        printf("%d\n", D); // -3
        printf("%d\n", E); // 4, E defaults to shared
        E = -4;
        printf("%d\n", E); // -4
    }
    printf("%d\n", A); // 0, private copies are discarded
    printf("%d\n", B); // 1
    printf("%d\n", C); // -2, the shared change persists
    printf("%d\n", D); // 3, firstprivate changes are discarded too
    printf("%d\n", E); // -4
    return 0;
}
Default state
Use the default clause to set the default sharing attribute:
- shared: unless specified via private/firstprivate, all variables are shared (variables declared inside the parallel block are still private to each thread)
- none: every variable used in the parallel region must be listed explicitly in a clause, otherwise it is a compile error
- no default clause: variables declared outside the parallel block are shared (with some exceptions, such as loop counters); variables declared inside are implicitly private.
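A minimal sketch of default(none) (variable names are made up for illustration): any variable used inside without an explicit clause makes compilation fail.

#include <omp.h>
#include <stdio.h>

int main() {
    int n = 10, total = 0;
    // every variable used inside must appear in a clause, or the compiler rejects it
    #pragma omp parallel default(none) shared(total) firstprivate(n)
    {
        if (omp_get_thread_num() == 0)
            total = n; // only thread 0 writes, to avoid a race
    }
    printf("%d\n", total); // 10
    return 0;
}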
Reduction
Reduction refers to operations where multiple per-thread copies of a variable are combined into a single copy at the master thread when the parallel block ends.
Syntax
For example,
int s = 0;
#pragma omp parallel reduction(+: s)
{
    s += 10; // each thread updates its own private copy of s
}
// afterwards, s == 0 + 10 * (number of threads)
/* reduction can be thought of as the following,
   but it guarantees no sync bugs and no contention */
int s = 0;
#pragma omp parallel
{
    int s_local = 0; // private copy, initialized to the identity of + (not to s)
    s_local += 10;   // do the operations on the private copy
    #pragma omp atomic
    s += s_local;    // combine the copies with the specified reduction op
}
Supported operators: +, -, *, &, |, ^, &&, ||
Example: dot product
#include <omp.h>
#include <stddef.h>

// compute the dot product of vectors a and b
float dot(float *a, float *b, size_t n) {
    float s = 0; // must match the element type and start at 0
    #pragma omp parallel reduction(+: s)
    {
        size_t nthreads = omp_get_num_threads();
        size_t tid = omp_get_thread_num();
        size_t chunk = (n + nthreads - 1) / nthreads; // ceiling division: split n across threads
        for (size_t i = tid * chunk; i < (tid + 1) * chunk && i < n; i++) {
            s += a[i] * b[i];
        }
    }
    return s;
}
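For example (a usage sketch, assuming the function above is saved in dot.c), compile with -fopenmp and call it like any ordinary function:

// gcc -fopenmp dot.c -o dot
#include <stdio.h>

float dot(float *a, float *b, size_t n); // defined above

int main() {
    float a[4] = {1, 2, 3, 4}, b[4] = {4, 3, 2, 1};
    printf("%f\n", dot(a, b, 4)); // 1*4 + 2*3 + 3*2 + 4*1 = 20.000000
    return 0;
}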