• 沒有找到結果。

A distributed memory system

N/A
N/A
Protected

Academic year: 2022

Share "A distributed memory system"

Copied!
62
0
0

加載中.... (立即查看全文)

全文

(1)

Copyright © 2010, Elsevier Inc. All rights Reserved 1

Chapter 3

Distributed Memory Programming with MPI Peter Pacheco

Roadmap

Writing your first MPI program.

Using the common MPI functions.

The Trapezoidal Rule in MPI.

Collective communication.

MPI derived datatypes.

Performance evaluation of MPI programs.

# Chapter Subtitle

(2)

3

A distributed memory system

Copyright © 2010, Elsevier Inc. All rights Reserved

A shared memory system

(3)

5

Hello World!

Copyright © 2010, Elsevier Inc. All rights Reserved

(a classic)

Identifying MPI processes

Common practice to identify processes by nonnegative integer ranks.

p processes are numbered 0, 1, 2, .. p-1

(4)

7

Our first MPI program

Copyright © 2010, Elsevier Inc. All rights Reserved

Compilation

mpicc -g -Wall -o mpi_hello mpi_hello.c

wrapper script to compile

turns on all warnings

source file

create this executable file name (as opposed to default a.out) produce

debugging information

(5)

9

Execution

Copyright © 2010, Elsevier Inc. All rights Reserved

mpiexec -n <number of processes> <executable>

mpiexec -n 1 ./mpi_hello

mpiexec -n 4 ./mpi_hello

run with 1 process

run with 4 processes

Execution

mpiexec -n 1 ./mpi_hello

mpiexec -n 4 ./mpi_hello

Greetings from process 0 of 1 !

(6)

11

MPI Programs

Written in C.

Has main.

Uses stdio.h, string.h, etc.

Need to add mpi.h header file.

Identifiers defined by MPI start with

“MPI_”.

First letter following underscore is uppercase.

For function names and MPI-defined types.

Helps to avoid confusion.

Copyright © 2010, Elsevier Inc. All rights Reserved

MPI Components

MPI_Init

Tells MPI to do all the necessary setup.

MPI_Finalize

Tells MPI we’re done, so clean up anything allocated for this program.

(7)

13

Basic Outline

Copyright © 2010, Elsevier Inc. All rights Reserved

Communicators

A collection of processes that can send messages to each other.

MPI_Init defines a communicator that

consists of all the processes created when the program is started.

(8)

15

Communicators

Copyright © 2010, Elsevier Inc. All rights Reserved

number of processes in the communicator

my rank

(the process making this call)

SPMD

Single-Program Multiple-Data

We compile one program.

Process 0 does something different.

Receives messages and prints them while the other processes do the work.

The if-else construct makes our program SPMD.

(9)

17

Communication

Copyright © 2010, Elsevier Inc. All rights Reserved

Data types

(10)

19

Communication

Copyright © 2010, Elsevier Inc. All rights Reserved

Message matching

MPI_Send src = q

MPI_Recv dest = r

r

q

(11)

21

Receiving messages

A receiver can get a message without knowing:

the amount of data in the message,

the sender of the message,

or the tag of the message.

Copyright © 2010, Elsevier Inc. All rights Reserved

status_p argument

MPI_Status*

(12)

23

How much data am I receiving?

Copyright © 2010, Elsevier Inc. All rights Reserved

Issues with send and receive

Exact behavior is determined by the MPI implementation.

MPI_Send may behave differently with regard to buffer size, cutoffs and blocking.

MPI_Recv always blocks until a matching message is received.

Know your implementation;

don’t make assumptions!

(13)

25

TRAPEZOIDAL RULE IN MPI

Copyright © 2010, Elsevier Inc. All rights Reserved

The Trapezoidal Rule

(14)

27

The Trapezoidal Rule

Copyright © 2010, Elsevier Inc. All rights Reserved

One trapezoid

(15)

29

Pseudo-code for a serial program

Copyright © 2010, Elsevier Inc. All rights Reserved

Parallelizing the Trapezoidal Rule

1. Partition problem solution into tasks.

2. Identify communication channels between tasks.

3. Aggregate tasks into composite tasks.

4. Map composite tasks to cores.

(16)

31

Parallel pseudo-code

Copyright © 2010, Elsevier Inc. All rights Reserved

Tasks and communications for

Trapezoidal Rule

(17)

33

First version (1)

Copyright © 2010, Elsevier Inc. All rights Reserved

First version (2)

(18)

35

First version (3)

Copyright © 2010, Elsevier Inc. All rights Reserved

Dealing with I/O

Each process just prints a message.

(19)

37

Running with 6 processes

Copyright © 2010, Elsevier Inc. All rights Reserved

unpredictable output

Input

Most MPI implementations only allow

process 0 in MPI_COMM_WORLD access to stdin.

Process 0 must read the data (scanf) and send to the other processes.

(20)

39

Function for reading user input

Copyright © 2010, Elsevier Inc. All rights Reserved

COLLECTIVE

COMMUNICATION

(21)

41

Tree-structured communication

1. In the first phase:

(a) Process 1 sends to 0, 3 sends to 2, 5 sends to 4, and 7 sends to 6.

(b) Processes 0, 2, 4, and 6 add in the received values.

(c) Processes 2 and 6 send their new values to processes 0 and 4, respectively.

(d) Processes 0 and 4 add the received values into their new values.

2. (a) Process 4 sends its newest value to process 0.

(b) Process 0 adds the received value to its newest value.

Copyright © 2010, Elsevier Inc. All rights Reserved

A tree-structured global sum

(22)

43

An alternative tree-structured global sum

Copyright © 2010, Elsevier Inc. All rights Reserved

MPI_Reduce

(23)

45

Predefined reduction operators in MPI

Copyright © 2010, Elsevier Inc. All rights Reserved

Collective vs. Point-to-Point Communications

All the processes in the communicator must call the same collective function.

For example, a program that attempts to match a call to MPI_Reduce on one

(24)

47

Collective vs. Point-to-Point Communications

The arguments passed by each process to an MPI collective communication must be

compatible.

For example, if one process passes in 0 as the dest_process and another passes in 1, then the outcome of a call to MPI_Reduce is erroneous, and, once again, the program is likely to hang or crash.

Copyright © 2010, Elsevier Inc. All rights Reserved

Collective vs. Point-to-Point Communications

The output_data_p argument is only used on dest_process.

However, all of the processes still need to pass in an actual argument corresponding to output_data_p, even if it’s just NULL.

(25)

49

Collective vs. Point-to-Point Communications

Point-to-point communications are matched on the basis of tags and communicators.

Collective communications don’t use tags.

They’re matched solely on the basis of the communicator and the order in which

they’re called.

Copyright © 2010, Elsevier Inc. All rights Reserved

Example (1)

Multiple calls to MPI_Reduce

(26)

51

Example (2)

Suppose that each process calls

MPI_Reduce with operator MPI_SUM, and destination process 0.

At first glance, it might seem that after the two calls to MPI_Reduce, the value of b will be 3, and the value of d will be 6.

Copyright © 2010, Elsevier Inc. All rights Reserved

Example (3)

However, the names of the memory

locations are irrelevant to the matching of the calls to MPI_Reduce.

The order of the calls will determine the matching so the value stored in b will be 1+2+1 = 4, and the value stored in d will be 2+1+2 = 5.

(27)

53

MPI_Allreduce

Useful in a situation in which all of the processes need the result of a global sum in order to complete some larger

computation.

Copyright © 2010, Elsevier Inc. All rights Reserved

A global sum followed by distribution of the result.

(28)

Copyright © 2010, Elsevier Inc. All rights Reserved 55

A butterfly-structured global sum.

Broadcast

Data belonging to a single process is sent to all of the processes in the communicator.

(29)

Copyright © 2010, Elsevier Inc. All rights Reserved 57

A tree-structured broadcast.

A version of Get_input that uses

MPI_Bcast

(30)

59

Data distributions

Copyright © 2010, Elsevier Inc. All rights Reserved

Compute a vector sum.

Serial implementation of vector

addition

(31)

61

Different partitions of a 12- component vector among 3 processes

Copyright © 2010, Elsevier Inc. All rights Reserved

Partitioning options

Block partitioning

Assign blocks of consecutive components to each process.

Cyclic partitioning

Assign components in a round robin fashion.

Block-cyclic partitioning

(32)

63

Parallel implementation of vector addition

Copyright © 2010, Elsevier Inc. All rights Reserved

Scatter

MPI_Scatter can be used in a function that reads in an entire vector on process 0 but only sends the needed components to each of the other processes.

(33)

65

Reading and distributing a vector

Copyright © 2010, Elsevier Inc. All rights Reserved

Gather

Collect all of the components of the vector onto process 0, and then process 0 can process all of the components.

(34)

67

Print a distributed vector (1)

Copyright © 2010, Elsevier Inc. All rights Reserved

Print a distributed vector (2)

(35)

69

Allgather

Concatenates the contents of each process’

send_buf_pand stores this in each process’

recv_buf_p.

As usual, recv_countis the amount of data being received from each process.

Copyright © 2010, Elsevier Inc. All rights Reserved

Matrix-vector multiplication

(36)

71

Matrix-vector multiplication

Copyright © 2010, Elsevier Inc. All rights Reserved

Multiply a matrix by a vector

Serial pseudo-code

(37)

73

C style arrays

Copyright © 2010, Elsevier Inc. All rights Reserved

stored as

Serial matrix-vector multiplication

(38)

75

An MPI matrix-vector

multiplication function (1)

Copyright © 2010, Elsevier Inc. All rights Reserved

An MPI matrix-vector

multiplication function (2)

(39)

77

MPI DERIVED DATATYPES

Copyright © 2010, Elsevier Inc. All rights Reserved

Derived datatypes

Used to represent any collection of data items in memory by storing both the types of the items and their relative locations in memory.

The idea is that if a function that sends data knows this information about a collection of data items, it can collect the items from memory before they are sent.

(40)

79

Derived datatypes

Formally, consists of a sequence of basic MPI data types together with a

displacement for each of the data types.

Trapezoidal Rule example:

Copyright © 2010, Elsevier Inc. All rights Reserved

MPI_Type create_struct

Builds a derived datatype that consists of individual elements that have different basic types.

(41)

81

MPI_Get_address

Returns the address of the memory location referenced by location_p.

The special type MPI_Aint is an integer type that is big enough to store an address on the system.

Copyright © 2010, Elsevier Inc. All rights Reserved

MPI_Type_commit

Allows the MPI implementation to optimize its internal representation of the datatype for use in communication functions.

(42)

83

MPI_Type_free

When we’re finished with our new type, this frees any additional storage used.

Copyright © 2010, Elsevier Inc. All rights Reserved

Get input function with a derived

datatype (1)

(43)

85

Get input function with a derived datatype (2)

Copyright © 2010, Elsevier Inc. All rights Reserved

Get input function with a derived

datatype (3)

(44)

87

PERFORMANCE EVALUATION

Copyright © 2010, Elsevier Inc. All rights Reserved

Elapsed parallel time

Returns the number of seconds that have elapsed since some time in the past.

(45)

89

Elapsed serial time

In this case, you don’t need to link in the MPI libraries.

Returns time in microseconds elapsed from some point in the past.

Copyright © 2010, Elsevier Inc. All rights Reserved

Elapsed serial time

(46)

91

MPI_Barrier

Ensures that no process will return from calling it until every process in the

communicator has started calling it.

Copyright © 2010, Elsevier Inc. All rights Reserved

MPI_Barrier

(47)

93

Run-times of serial and parallel matrix-vector multiplication

Copyright © 2010, Elsevier Inc. All rights Reserved

(Seconds)

Speedup

(48)

95

Efficiency

Copyright © 2010, Elsevier Inc. All rights Reserved

Speedups of Parallel Matrix-

Vector Multiplication

(49)

97

Efficiencies of Parallel Matrix- Vector Multiplication

Copyright © 2010, Elsevier Inc. All rights Reserved

Scalability

A program is scalable if the problem size can be increased at a rate so that the

efficiency doesn’t decrease as the number of processes increase.

(50)

99

Scalability

Programs that can maintain a constant efficiency without increasing the problem size are sometimes said to be strongly scalable.

Programs that can maintain a constant efficiency if the problem size increases at the same rate as the number of processes are sometimes said to be weakly scalable.

Copyright © 2010, Elsevier Inc. All rights Reserved

A PARALLEL SORTING

ALGORITHM

(51)

101

Sorting

n keys and p = comm sz processes.

n/p keys assigned to each process.

No restrictions on which keys are assigned to which processes.

When the algorithm terminates:

The keys assigned to each process should be sorted in (say) increasing order.

If 0 ≤ q < r < p, then each key assigned to

process qshould be less than or equal to every key assigned to process r.

Copyright © 2010, Elsevier Inc. All rights Reserved

Serial bubble sort

(52)

103

Odd-even transposition sort

A sequence of phases.

Even phases, compare swaps:

Odd phases, compare swaps:

Copyright © 2010, Elsevier Inc. All rights Reserved

Example

Start: 5, 9, 4, 3

Even phase: compare-swap (5,9) and (4,3) getting the list 5, 9, 3, 4

Odd phase: compare-swap (9,3) getting the list 5, 3, 9, 4

Even phase: compare-swap (5,3) and (9,4) getting the list 3, 5, 4, 9

Odd phase: compare-swap (5,4) getting the list 3, 4, 5, 9

(53)

105

Serial odd-even transposition sort

Copyright © 2010, Elsevier Inc. All rights Reserved

Communications among tasks in

odd-even sort

(54)

107

Parallel odd-even transposition sort

Copyright © 2010, Elsevier Inc. All rights Reserved

Pseudo-code

(55)

109

Compute_partner

Copyright © 2010, Elsevier Inc. All rights Reserved

Safety in MPI programs

The MPI standard allows MPI_Send to behave in two different ways:

it can simply copy the message into an MPI managed buffer and return,

or it can block until the matching call to MPI_Recv starts.

(56)

111

Safety in MPI programs

Many implementations of MPI set a threshold at which the system switches from buffering to blocking.

Relatively small messages will be buffered by MPI_Send.

Larger messages, will cause it to block.

Copyright © 2010, Elsevier Inc. All rights Reserved

Safety in MPI programs

If the MPI_Send executed by each process blocks, no process will be able to start

executing a call to MPI_Recv, and the program will hang or deadlock.

Each process is blocked waiting for an event that will never happen.

(see pseudo-code)

(57)

113

Safety in MPI programs

A program that relies on MPI provided buffering is said to be unsafe.

Such a program may run without problems for various sets of input, but it may hang or crash with other sets.

Copyright © 2010, Elsevier Inc. All rights Reserved

MPI_Ssend

An alternative to MPI_Send defined by the MPI standard.

The extra “s” stands for synchronous and MPI_Ssend is guaranteed to block until the matching receive starts.

(58)

115

Restructuring communication

Copyright © 2010, Elsevier Inc. All rights Reserved

MPI_Sendrecv

An alternative to scheduling the communications ourselves.

Carries out a blocking send and a receive in a single call.

The dest and the source can be the same or different.

Especially useful because MPI schedules the communications so that the program won’t hang or crash.

(59)

117

MPI_Sendrecv

Copyright © 2010, Elsevier Inc. All rights Reserved

Safe communication with five

processes

(60)

Copyright © 2010, Elsevier Inc. All rights Reserved 119

Run-times of parallel odd-even sort

(times are in milliseconds)

(61)

121

Concluding Remarks (1)

MPI or the Message-Passing Interface is a library of functions that can be called from C, C++, or Fortran programs.

A communicator is a collection of processes that can send messages to each other.

Many parallel programs use the single- program multiple data or SPMD approach.

Copyright © 2010, Elsevier Inc. All rights Reserved

Concluding Remarks (2)

Most serial programs are deterministic: if we run the same program with the same input we’ll get the same output.

Parallel programs often don’t possess this property.

Collective communications involve all the

(62)

123

Concluding Remarks (3)

When we time parallel programs, we’re usually interested in elapsed time or “wall clock time”.

Speedup is the ratio of the serial run-time to the parallel run-time.

Efficiency is the speedup divided by the number of parallel processes.

Copyright © 2010, Elsevier Inc. All rights Reserved

Concluding Remarks (4)

If it’s possible to increase the problem size (n) so that the efficiency doesn’t decrease as p is increased, a parallel program is said to be scalable.

An MPI program is unsafe if its correct behavior depends on the fact that

MPI_Send is buffering its input.

參考文獻

相關文件

 From a source vertex, systematically follow the edges of a graph to visit all reachable vertices of the graph.  Useful to discover the structure of

All rights reserved.... All

Given a shift κ, if we want to compute the eigenvalue λ of A which is closest to κ, then we need to compute the eigenvalue δ of (11) such that |δ| is the smallest value of all of

As soon as a crisis management plan is drawn up, the school principal should conduct a Staff Meeting (Annex 5) to inform all staff of the situation; to clarify the

This kind of algorithm has also been a powerful tool for solving many other optimization problems, including symmetric cone complementarity problems [15, 16, 20–22], symmetric

* All rights reserved, Tei-Wei Kuo, National Taiwan University, 2005..

The remaining positions contain //the rest of the original array elements //the rest of the original array elements.

Because the nodes represent a partition of the belief space and because all belief states within a particular region will map to a single node on the next level, the plan