(1)

Parallel Programming in C with MPI and OpenMP

Michael J. Quinn

(2)

Chapter 17

Shared-memory Programming

(3)

Outline

• OpenMP
• Shared-memory model
• Parallel for loops
• Declaring private variables
• Critical sections
• Reductions
• Performance improvements
• More general data parallelism
• Functional parallelism

(4)

OpenMP

• OpenMP: an application programming interface (API) for parallel programming on multiprocessors
  • Compiler directives
  • Library of support functions
• OpenMP works in conjunction with Fortran, C, or C++

(5)

What's OpenMP Good For?

• C + OpenMP sufficient to program multiprocessors
• C + MPI + OpenMP a good way to program multicomputers built out of multiprocessors
  • IBM RS/6000 SP
  • Fujitsu AP3000
  • Dell High Performance Computing Cluster

(6)

Shared-memory Model

[Figure: several processors connected to a single shared memory]

Processors interact and synchronize with each other through shared variables.

(7)

Fork/Join Parallelism

• Initially only master thread is active
• Master thread executes sequential code
• Fork: master thread creates or awakens additional threads to execute parallel code
• Join: at end of parallel code, created threads die or are suspended

(8)

Fork/Join Parallelism

[Figure: time line. The master thread runs alone, forks other threads for a parallel region, joins them, and repeats the fork/join cycle for the next parallel region.]

(9)

Shared-memory vs. Message-passing Model (#1)

• Shared-memory model
  • Number of active threads is 1 at start and finish of program, changes dynamically during execution
• Message-passing model
  • All processes active throughout execution of program

(10)

Incremental Parallelization

• Sequential program a special case of a shared-memory parallel program
• Parallel shared-memory programs may only have a single parallel loop
• Incremental parallelization: process of converting a sequential program to a parallel program a little bit at a time

(11)

Shared-memory vs. Message-passing Model (#2)

• Shared-memory model
  • Execute and profile sequential program
  • Incrementally make it parallel
  • Stop when further effort not warranted
• Message-passing model
  • Sequential-to-parallel transformation requires major effort
  • Transformation done in one giant step rather than many tiny steps

(12)

Parallel for Loops

• C programs often express data-parallel operations as for loops

for (i = first; i < size; i += prime)
   marked[i] = 1;

• OpenMP makes it easy to indicate when the iterations of a loop may execute in parallel
• Compiler takes care of generating code that forks/joins threads and allocates the iterations to threads

(13)

Pragmas

• Pragma: a compiler directive in C or C++
• Stands for "pragmatic information"
• A way for the programmer to communicate with the compiler
• Compiler free to ignore pragmas
• Syntax:

#pragma omp <rest of pragma>

(14)

Parallel for Pragma

• Format:

#pragma omp parallel for
for (i = 0; i < N; i++)
   a[i] = b[i] + c[i];

• Compiler must be able to verify the run-time system will have the information it needs to schedule loop iterations

(15)

Control Clause

• The control clause of the parallelized for loop must have a canonical form that the compiler can analyze:

for (index = start; index < end; index += inc)

  (the comparison and the increment may take any of the standard equivalent forms, e.g. index <= end, index++, index = index + inc)

(16)

Execution Context

• Every thread has its own execution context
• Execution context: address space containing all of the variables a thread may access
• Contents of execution context:
  • static variables
  • dynamically allocated data structures in the heap
  • variables on the run-time stack
  • additional run-time stack for functions invoked by the thread

(17)

Shared and Private Variables

• Shared variable: has same address in execution context of every thread
• Private variable: has different address in execution context of every thread
• A thread cannot access the private variables of another thread

(18)

Shared and Private Variables

int main (int argc, char *argv[])
{
   int b[3];
   char *cptr;
   int i;

   cptr = malloc(1);
   #pragma omp parallel for
   for (i = 0; i < 3; i++)
      b[i] = i;

[Figure: the master thread (thread 0) and thread 1 share the heap and the master's stack variables b and cptr; each thread has its own private copy of the loop index i]

(19)

Function omp_get_num_procs

• Returns number of physical processors available for use by the parallel program

int omp_get_num_procs (void)

(20)

Function omp_set_num_threads

• Uses the parameter value to set the number of threads to be active in parallel sections of code
• May be called at multiple points in a program

void omp_set_num_threads (int t)

(21)

Pop Quiz:

Write a C program segment that sets the number of threads equal to the number of processors that are available.
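One possible answer, sketched with the two library functions just introduced:

#include <omp.h>

int main (void)
{
   /* request as many threads as there are available processors */
   omp_set_num_threads (omp_get_num_procs ());
   return 0;
}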

(22)

Declaring Private Variables

for (i = 0; i < BLOCK_SIZE(id,p,n); i++)
   for (j = 0; j < n; j++)
      a[i][j] = MIN(a[i][j], a[i][k]+tmp);

• Either loop could be executed in parallel
• We prefer to make the outer loop parallel, to reduce the number of forks/joins
• We then must give each thread its own private copy of variable j

(23)

private Clause

• Clause: an optional, additional component to a pragma
• Private clause: directs compiler to make one or more variables private

private ( <variable list> )

(24)

Example Use of private Clause

#pragma omp parallel for private(j)
for (i = 0; i < BLOCK_SIZE(id,p,n); i++)
   for (j = 0; j < n; j++)
      a[i][j] = MIN(a[i][j], a[i][k]+tmp);

(25)

firstprivate Clause

• Used to create private variables having initial values identical to the variable controlled by the master thread as the loop is entered (see the sketch below)
• Variables are initialized once per thread, not once per loop iteration
• If a thread modifies a variable's value in an iteration, subsequent iterations will get the modified value

(26)

lastprivate Clause

• Sequentially last iteration: iteration that occurs last when the loop is executed sequentially
• lastprivate clause: used to copy back to the master thread's copy of a variable the private copy of the variable from the thread that executed the sequentially last iteration (see the sketch below)
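A small illustrative sketch (variable names hypothetical): after the loop, last_x holds the value computed in the sequentially last iteration, i = n-1.

#include <stdio.h>

int main (void)
{
   int i, n = 100;
   double last_x;

   #pragma omp parallel for lastprivate(last_x)
   for (i = 0; i < n; i++)
      last_x = (i + 0.5) / n;    /* each thread writes its private copy */

   printf ("%f\n", last_x);      /* prints the value from i = n-1 */
   return 0;
}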

(27)

Critical Sections

double area, pi, x;
int i, n;
...
area = 0.0;
for (i = 0; i < n; i++) {
   x = (i+0.5)/n;
   area += 4.0/(1.0 + x*x);
}
pi = area / n;

(28)

Race Condition

• Consider this C program segment to compute π using the rectangle rule:

double area, pi, x;
int i, n;
...
area = 0.0;
for (i = 0; i < n; i++) {
   x = (i+0.5)/n;
   area += 4.0/(1.0 + x*x);
}
pi = area / n;

(29)

Race Condition (cont.)

• If we simply parallelize the loop...

double area, pi, x;
int i, n;
...
area = 0.0;
#pragma omp parallel for private(x)
for (i = 0; i < n; i++) {
   x = (i+0.5)/n;
   area += 4.0/(1.0 + x*x);
}
pi = area / n;

(30)

Race Condition (cont.)

• ... we set up a race condition in which one process may "race ahead" of another and not see its change to shared variable area

[Figure: threads A and B both execute area += 4.0/(1.0 + x*x) starting from area = 11.667; thread A stores 15.432 and thread B then stores 15.230, overwriting A's update. The answer should be 18.995.]

(31)

Race Condition Time Line

[Figure: time line of the race. Both threads read area = 11.667. Thread A adds 3.765 and writes 15.432; thread B adds 3.563 to its stale value and writes 15.230, losing thread A's contribution.]

(32)

critical Pragma

• Critical section: a portion of code that only one thread at a time may execute
• We denote a critical section by putting the pragma

#pragma omp critical

  in front of a block of C code

(33)

Correct, But Inefficient, Code

double area, pi, x;
int i, n;
...
area = 0.0;
#pragma omp parallel for private(x)
for (i = 0; i < n; i++) {
   x = (i+0.5)/n;
#pragma omp critical
   area += 4.0/(1.0 + x*x);
}
pi = area / n;

(34)

Source of Inefficiency

• Update to area inside a critical section
• Only one thread at a time may execute the statement; i.e., it is sequential code
• Time to execute statement significant part of loop
• By Amdahl's Law we know speedup will be severely constrained (see the illustration below)
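A hedged illustration of the Amdahl constraint (the 50% figure is purely hypothetical, not from the text): with serialized fraction f of the work and p threads,

   speedup <= 1 / (f + (1 - f)/p)

so if the critical-section update accounted for half of each iteration's work (f = 0.5), the speedup could never exceed 2, no matter how many threads are used.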

(35)

Reductions

• Reductions are so common that OpenMP provides support for them
• May add reduction clause to parallel for pragma
• Specify reduction operation and reduction variable
• OpenMP takes care of storing partial results in private variables and combining partial results after the loop

(36)

reduction Clause

• The reduction clause has this syntax:

reduction ( <op> : <variable> )

• Operators
  • +    Sum
  • *    Product
  • &    Bitwise and
  • |    Bitwise or
  • ^    Bitwise exclusive or
  • &&   Logical and
  • ||   Logical or

(37)

π-finding Code with Reduction Clause

double area, pi, x;
int i, n;
...
area = 0.0;
#pragma omp parallel for \
   private(x) reduction(+:area)
for (i = 0; i < n; i++) {
   x = (i + 0.5)/n;
   area += 4.0/(1.0 + x*x);
}
pi = area / n;

(38)

Performance Improvement #1

• Too many fork/joins can lower performance
• Inverting loops (see the sketch below) may help performance if
  • Parallelism is in inner loop
  • After inversion, the outer loop can be made parallel
  • Inversion does not significantly lower cache hit rate
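A hedged sketch of loop inversion (the array a and the bounds M, N are hypothetical): the dependence runs along i, so only the j loop is parallel; parallelizing it as the inner loop would cost one fork/join per value of i, while inverting the loops leaves a single fork/join.

#define M 1000
#define N 1000

double a[M][N];

int main (void)
{
   int i, j;

   /* after inversion the parallel j loop is outermost: one fork/join */
   #pragma omp parallel for private(i)
   for (j = 0; j < N; j++)
      for (i = 1; i < M; i++)
         a[i][j] = 2.0 * a[i-1][j];

   return 0;
}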

(39)

Performance Improvement #2

• If loop has too few iterations, fork/join overhead is greater than time savings from parallel execution
• The if clause instructs compiler to insert code that determines at run-time whether loop should be executed in parallel; e.g.,

#pragma omp parallel for if(n > 5000)

(40)

Performance Improvement #3

• We can use schedule clause to specify how iterations of a loop should be allocated to threads
• Static schedule: all iterations allocated to threads before any iterations executed
• Dynamic schedule: only some iterations allocated to threads at beginning of loop's execution. Remaining iterations allocated to threads that complete their assigned iterations.

(41)

Static vs. Dynamic Scheduling

• Static scheduling
  • Low overhead
  • May exhibit high workload imbalance
• Dynamic scheduling
  • Higher overhead
  • Can reduce workload imbalance

(42)

Chunks

• A chunk is a contiguous range of iterations
• Increasing chunk size reduces overhead and may increase cache hit rate
• Decreasing chunk size allows finer balancing of workloads

(43)

schedule Clause

• Syntax of schedule clause

schedule ( <type> [, <chunk> ])

• Schedule type required, chunk size optional
• Allowable schedule types
  • static: static allocation
  • dynamic: dynamic allocation
  • guided: guided self-scheduling
  • runtime: type chosen at run-time based on value of environment variable OMP_SCHEDULE

(44)

Scheduling Options

• schedule(static): block allocation of about n/t contiguous iterations to each thread
• schedule(static,C): interleaved allocation of chunks of size C to threads
• schedule(dynamic): dynamic one-at-a-time allocation of iterations to threads
• schedule(dynamic,C): dynamic allocation of C iterations at a time to threads

(45)

Scheduling Options (cont.)

• schedule(guided,C): dynamic allocation of chunks to tasks using guided self-scheduling heuristic. Initial chunks are bigger, later chunks are smaller, minimum chunk size is C.
• schedule(guided): guided self-scheduling with minimum chunk size 1
• schedule(runtime): schedule chosen at run-time based on value of OMP_SCHEDULE; Unix example:

setenv OMP_SCHEDULE static,1
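A short sketch of attaching a schedule clause to the earlier π-finding loop (the chunk size 100 is an arbitrary assumption, not a recommendation):

#include <stdio.h>

int main (void)
{
   double area = 0.0, pi, x;
   int i, n = 1000000;

   /* dynamic scheduling hands out chunks of 100 iterations at a time;
      static would have lower overhead but a fixed assignment */
   #pragma omp parallel for private(x) reduction(+:area) schedule(dynamic,100)
   for (i = 0; i < n; i++) {
      x = (i + 0.5)/n;
      area += 4.0/(1.0 + x*x);
   }
   pi = area / n;
   printf ("pi is approximately %f\n", pi);
   return 0;
}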

(46)

More General Data Parallelism

• Our focus has been on the parallelization of for loops
• Other opportunities for data parallelism
  • processing items on a "to do" list
  • for loop + additional code outside of loop

(47)

Processing a "To Do" List

[Figure: the shared variable job_ptr points to a linked list of tasks in the heap; the master thread and thread 1 each have a private task_ptr for the task they are currently working on]

(48)

Sequential Code (1/2)

int main (int argc, char *argv[])
{
   struct job_struct *job_ptr;
   struct task_struct *task_ptr;
   ...
   task_ptr = get_next_task (&job_ptr);
   while (task_ptr != NULL) {
      complete_task (task_ptr);
      task_ptr = get_next_task (&job_ptr);
   }
   ...
}

(49)

Sequential Code (2/2)

struct task_struct *get_next_task(struct job_struct **job_ptr)
{
   struct task_struct *answer;

   if (*job_ptr == NULL) answer = NULL;
   else {
      answer = (*job_ptr)->task;
      *job_ptr = (*job_ptr)->next;
   }
   return answer;
}

(50)

Parallelization Strategy

• Every thread should repeatedly take the next task from the list and complete it, until there are no more tasks
• We must ensure no two threads take the same task from the list; i.e., must declare a critical section

(51)

parallel Pragma

• The parallel pragma precedes a block of code that should be executed by all of the threads
• Note: execution is replicated among all threads

(52)

Use of parallel Pragma

#pragma omp parallel private(task_ptr)
{
   task_ptr = get_next_task (&job_ptr);
   while (task_ptr != NULL) {
      complete_task (task_ptr);
      task_ptr = get_next_task (&job_ptr);
   }
}

(53)

Critical Section for get_next_task

struct task_struct *get_next_task(struct job_struct **job_ptr)
{
   struct task_struct *answer;

   #pragma omp critical
   {
      if (*job_ptr == NULL) answer = NULL;
      else {
         answer = (*job_ptr)->task;
         *job_ptr = (*job_ptr)->next;
      }
   }
   return answer;
}

(54)

Functions for SPMD-style Programming

• The parallel pragma allows us to write SPMD-style programs
• In these programs we often need to know the number of threads and the thread ID number
• OpenMP provides functions to retrieve this information

(55)

Function omp_get_thread_num

• This function returns the thread identification number
• If there are t threads, the ID numbers range from 0 to t-1
• The master thread has ID number 0

int omp_get_thread_num (void)

(56)

Function omp_get_num_threads

• Function omp_get_num_threads returns the number of active threads
• If we call this function from the sequential portion of the program, it will return 1

int omp_get_num_threads (void)
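A minimal sketch of the SPMD style these functions enable (the message text is just for illustration):

#include <stdio.h>
#include <omp.h>

int main (void)
{
   #pragma omp parallel
   {
      int id = omp_get_thread_num ();   /* 0 .. t-1 */
      int t  = omp_get_num_threads ();  /* number of active threads */
      printf ("Hello from thread %d of %d\n", id, t);
   }
   return 0;
}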

(57)

for Pragma

• The parallel pragma instructs every thread to execute all of the code inside the block
• If we encounter a for loop that we want to divide among threads, we use the for pragma

#pragma omp for

(58)

Example Use of for Pragma

#pragma omp parallel private(i,j)
for (i = 0; i < m; i++) {
   low = a[i];
   high = b[i];
   if (low > high) {
      printf ("Exiting (%d)\n", i);
      break;
   }
#pragma omp for
   for (j = low; j < high; j++)
      c[j] = (c[j] - a[i])/b[i];
}

(59)

single Pragma

• Suppose we only want to see the output once
• The single pragma directs compiler that only a single thread should execute the block of code the pragma precedes
• Syntax:

#pragma omp single

(60)

Use of single Pragma

#pragma omp parallel private(i,j)
for (i = 0; i < m; i++) {
   low = a[i];
   high = b[i];
   if (low > high) {
#pragma omp single
      printf ("Exiting (%d)\n", i);
      break;
   }
#pragma omp for
   for (j = low; j < high; j++)
      c[j] = (c[j] - a[i])/b[i];
}

(61)

nowait Clause

• Compiler puts a barrier synchronization at end of every parallel for statement
• In our example, this is necessary: if a thread leaves the loop and changes low or high, it may affect behavior of another thread
• If we make these private variables, then it would be okay to let threads move ahead, which could reduce execution time

(62)

Use of nowait Clause

#pragma omp parallel private(i,j,low,high)
for (i = 0; i < m; i++) {
   low = a[i];
   high = b[i];
   if (low > high) {
#pragma omp single
      printf ("Exiting (%d)\n", i);
      break;
   }
#pragma omp for nowait
   for (j = low; j < high; j++)
      c[j] = (c[j] - a[i])/b[i];
}

(63)

Functional Parallelism

• To this point all of our focus has been on exploiting data parallelism
• OpenMP allows us to assign different threads to different portions of code (functional parallelism)

(64)

Functional Parallelism Example

v = alpha();
w = beta();
x = gamma(v, w);
y = delta();
printf ("%6.2f\n", epsilon(x,y));

[Figure: dependence graph; gamma depends on alpha and beta, epsilon depends on gamma and delta]

May execute alpha, beta, and delta in parallel

(65)

parallel sections Pragma

• Precedes a block of k blocks of code that may be executed concurrently by k threads
• Syntax:

#pragma omp parallel sections

(66)

section Pragma

• Precedes each block of code within the encompassing block preceded by the parallel sections pragma
• May be omitted for first parallel section after the parallel sections pragma
• Syntax:

#pragma omp section

(67)

Example of parallel sections

#pragma omp parallel sections
{
#pragma omp section   /* Optional */
   v = alpha();
#pragma omp section
   w = beta();
#pragma omp section
   y = delta();
}
x = gamma(v, w);
printf ("%6.2f\n", epsilon(x,y));

(68)

Another Approach

[Figure: the same dependence graph of alpha, beta, gamma, delta, and epsilon]

Execute alpha and beta in parallel.
Execute gamma and delta in parallel.

(69)

sections Pragma

• Appears inside a parallel block of code
• Has same meaning as the parallel sections pragma
• If multiple sections pragmas appear inside one parallel block, may reduce fork/join costs

(70)

Use of sections Pragma

#pragma omp parallel
{
#pragma omp sections
   {
      v = alpha();
#pragma omp section
      w = beta();
   }
#pragma omp sections
   {
      x = gamma(v, w);
#pragma omp section
      y = delta();
   }
}
printf ("%6.2f\n", epsilon(x,y));

(71)

Summary (1/3)

• OpenMP an API for shared-memory parallel programming
• Shared-memory model based on fork/join parallelism
• Data parallelism
  • parallel for pragma
  • reduction clause

(72)

Summary (2/3)

• Functional parallelism (parallel sections pragma)
• SPMD-style programming (parallel pragma)
• Critical sections (critical pragma)
• Enhancing performance of parallel for loops
  • Inverting loops
  • Conditionally parallelizing loops
  • Changing loop scheduling

(73)

Summary (3/3)

Characteristic                         OpenMP   MPI
Suitable for multiprocessors           Yes      Yes
Suitable for multicomputers            No       Yes
Supports incremental parallelization   Yes      No
Minimal extra code                     Yes      No
Explicit control of memory hierarchy   No       Yes
