Parallel Programming: for Multicore and Cluster Systems – P38


7.1 Gaussian Elimination

The LU decomposition can be used to solve several linear systems with the same matrix $A$ and different right-hand side vectors $b$ without repeating the elimination process.

7.1.1.2 Pivoting

Forward elimination and LU decomposition require a division by $a_{kk}^{(k)}$, so these methods can only be applied when $a_{kk}^{(k)} \neq 0$. That is, even if $\det A \neq 0$ and the system $Ax = y$ is solvable, a decomposition $A = LU$ need not exist when some $a_{kk}^{(k)}$ is a zero element. However, for a solvable linear system, there exists a matrix resulting from permutations of rows of $A$ for which an LU decomposition is possible, i.e., $BA = LU$ with a permutation matrix $B$ describing the permutation of rows of $A$. The permutation of rows of $A$, if necessary, is included in the elimination process. In each elimination step, a pivot element is determined to substitute $a_{kk}^{(k)}$. A pivot element is needed when $a_{kk}^{(k)} = 0$, and also when $a_{kk}^{(k)}$ is very small, since a small pivot induces a very large elimination factor, leading to imprecise computations. Pivoting strategies are used to find an appropriate pivot element. Typical strategies are column pivoting, row pivoting, and total pivoting.

Column pivoting considers the elements $a_{kk}^{(k)}, \ldots, a_{nk}^{(k)}$ of column $k$ and determines the element $a_{rk}^{(k)}$, $k \leq r \leq n$, with the maximum absolute value. If $r \neq k$, the rows $r$ and $k$ of matrix $A^{(k)}$ and the values $b_k^{(k)}$ and $b_r^{(k)}$ of the vector $b^{(k)}$ are exchanged. Row pivoting determines a pivot element $a_{kr}^{(k)}$, $k \leq r \leq n$, within the elements $a_{kk}^{(k)}, \ldots, a_{kn}^{(k)}$ of row $k$ of matrix $A^{(k)}$ with the maximum absolute value. If $r \neq k$, the columns $k$ and $r$ of $A^{(k)}$ are exchanged; this corresponds to an exchange of the enumeration of the unknowns $x_k$ and $x_r$ of vector $x$. Total pivoting determines the element with the maximum absolute value in the submatrix $\tilde{A}^{(k)} = (a_{ij}^{(k)})$, $k \leq i, j \leq n$, and exchanges rows and columns of $A^{(k)}$ when $i \neq k$ or $j \neq k$. In practice, row or column pivoting is used instead of total pivoting, since they have smaller computation time and total pivoting may also destroy special matrix structures like banded structures.

The implementation of pivoting avoids the actual exchange of rows or columns in memory and uses index vectors pointing to the current rows of the matrix. The indexed access to matrix elements is more expensive, but in total it is usually less expensive than moving entire rows in each elimination step. When supported by the programming language, a dynamic data storage in the form of separate vectors for the rows of the matrix, accessed through a vector of pointers to the rows, may lead to more efficient implementations. The advantage is that matrix elements can still be accessed with a two-dimensional index expression, while the exchange of rows corresponds to a simple exchange of pointers.
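The following minimal sketch illustrates one elimination step with column pivoting realized through such an index vector. All names and the 0-based indexing are assumptions for illustration; this is not the book's code. A row exchange reduces to swapping two entries of the vector perm.

```c
#include <math.h>

/* One elimination step k with column pivoting via an index vector:
 * rows are never moved in memory; perm[i] is the physical row
 * currently acting as logical row i. */
void elimination_step(double **a, double *b, int *perm, int n, int k)
{
  /* Column pivoting: find the logical row r, k <= r < n, whose entry
   * in column k has the maximum absolute value. */
  int r = k;
  for (int i = k + 1; i < n; i++)
    if (fabs(a[perm[i]][k]) > fabs(a[perm[r]][k]))
      r = i;

  /* "Exchange" rows k and r by swapping the two index entries. */
  int tmp = perm[k]; perm[k] = perm[r]; perm[r] = tmp;

  /* Eliminate column k below the (logical) diagonal, updating b. */
  for (int i = k + 1; i < n; i++) {
    double l = a[perm[i]][k] / a[perm[k]][k]; /* elimination factor */
    for (int j = k; j < n; j++)
      a[perm[i]][j] -= l * a[perm[k]][j];
    b[perm[i]] -= l * b[perm[k]];
  }
}
```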
7.1.2 Parallel Row-Cyclic Implementation

A parallel implementation of the Gaussian elimination is based on a data distribution of matrix $A$ and of the sequence of matrices $A^{(k)}$, $k = 2, \ldots, n$, which can be a row-oriented, a column-oriented, or a checkerboard distribution, see Sect. 3.4. In this section, we consider a row-oriented distribution.

From the structure of the matrices $A^{(k)}$ it can be seen that a blockwise row-oriented data distribution is not suitable because of load imbalances: For a blockwise row-oriented distribution, processor $P_q$, $1 \leq q \leq p$, owns the rows $(q-1) \cdot n/p + 1, \ldots, q \cdot n/p$, so that after the computation of $A^{(k)}$ with $k = q \cdot n/p + 1$ there is no computation left for this processor and it becomes idle. For a row-cyclic distribution, there is a better load balance, since processor $P_q$, $1 \leq q \leq p$, owns the rows $q, q+p, q+2p, \ldots$, i.e., it owns all rows $i$ with $1 \leq i \leq n$ and $q = ((i-1) \bmod p) + 1$. The processors begin to get idle only after the first $n - p$ stages, which is reasonable for $p \ll n$. Thus, we consider a parallel implementation of the Gaussian elimination with a row-cyclic distribution of matrix $A$ and a column-oriented pivoting. One step of the forward elimination computing $A^{(k+1)}$ and $b^{(k+1)}$ for given $A^{(k)}$ and $b^{(k)}$ performs the following computation and communication phases:

1. Determination of the local pivot element: Each processor considers its local elements of column $k$ in the rows $k, \ldots, n$ and determines the element (and its position) with the largest absolute value.

2. Determination of the global pivot element: The global pivot element is the local pivot element with the largest absolute value. A single-accumulation operation with the maximum operation as reduction determines this global pivot element. The root processor of this global communication operation sends the result to all other processors.

3. Exchange of the pivot row: If $k \neq r$ for a pivot element $a_{rk}^{(k)}$, the row $k$ owned by processor $P_q$ and the pivot row $r$ owned by processor $P_{q'}$ have to be exchanged. When $q = q'$, the exchange can be done locally by processor $P_q$. When $q \neq q'$, communication with single-transfer operations is required. The elements $b_k$ and $b_r$ are exchanged accordingly.

4. Distribution of the pivot row: Since the pivot row (now row $k$) is required by all processors for the local elimination operations, processor $P_q$ sends the elements $a_{kk}^{(k)}, \ldots, a_{kn}^{(k)}$ of row $k$ and the element $b_k^{(k)}$ to all other processors.

5. Computation of the elimination factors: Each processor locally computes the elimination factors $l_{ik}$ for the rows $i$ that it owns, according to Formula (7.2).

6. Computation of the matrix elements: Each processor locally computes the elements of $A^{(k+1)}$ and $b^{(k+1)}$ using its elements of $A^{(k)}$ and $b^{(k)}$ according to Formulas (7.3) and (7.4); a sketch of phases 5 and 6 is given after this list.
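The following sketch shows the local work of phases 5 and 6 on one process. The names, the 0-based indexing, and the layout of pivot_row are assumptions; it is not the code of Fig. 7.2. It assumes Formulas (7.2)–(7.4) denote the usual update $l_{ik} = a_{ik}^{(k)}/a_{kk}^{(k)}$, $a_{ij}^{(k+1)} = a_{ij}^{(k)} - l_{ik} a_{kj}^{(k)}$, and $b_i^{(k+1)} = b_i^{(k)} - l_{ik} b_k^{(k)}$.

```c
/* Phases 5 and 6 on one process: with the row-cyclic distribution,
 * process me of p owns exactly the rows i with i % p == me. pivot_row
 * holds the elements a[k][k..n-1] followed by b[k], as distributed to
 * all processes in phase 4. */
void eliminate_local(double **a, double *b, const double *pivot_row,
                     int n, int p, int me, int k)
{
  for (int i = k + 1; i < n; i++) {
    if (i % p != me) continue;          /* row i not owned locally */
    double l = a[i][k] / pivot_row[0];  /* Formula (7.2): l_ik     */
    for (int j = k; j < n; j++)         /* Formula (7.3)           */
      a[i][j] -= l * pivot_row[j - k];
    b[i] -= l * pivot_row[n - k];       /* Formula (7.4), b_k last */
  }
}
```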
The computation of the solution vector $x$ in the backward substitution is inherently sequential, since the values $x_k$, $k = n, \ldots, 1$, depend on each other and are computed one after another. In step $k$, processor $P_q$ owning row $k$ computes the value $x_k$ according to Formula (7.5) and sends the value to all other processors by a single-broadcast operation.

A program fragment implementing the computation phases 1–6 and the backward substitution is given in Fig. 7.2 (program fragment with C notation and MPI operations for the Gaussian elimination with row-cyclic distribution). The matrix $A$ and the vector $b$ are stored in a two-dimensional array a and a one-dimensional array b, respectively. Some of the local functions have already been introduced in the program in Fig. 7.1. The SPMD program uses the variable me to store the individual processor number. This processor number, the current value of k, and the pivot row are used to distinguish between the different computations of the single processors. The global variables n and p are the system size and the number of processors executing the parallel program. The parallel algorithm is implemented in the program in Fig. 7.2 in the following way:

1. Determination of the local pivot element: The function max_col_loc(a,k) determines the row index r of the element a[r][k] with the largest local absolute value in column k for the rows ≥ k. When a processor has no element of column k for rows ≥ k, the function returns −1.

2. Determination of the global pivot element: The global pivoting is performed by an MPI_Allreduce() operation, implementing a single-accumulation with a subsequent single-broadcast. The MPI reduction operation MPI_MAXLOC for the data type MPI_DOUBLE_INT, consisting of one double value and one integer value, is used. The MPI operations have been introduced in Sect. 5.2. The MPI_Allreduce() operation returns y with the pivot element in y.val and the processor owning the corresponding row in y.node. Thus, after this step all processors know the global pivot element and its owner for possible communication; a sketch of this step is given after this list.

3. Exchange of the pivot row: Two cases are considered:

• If the owner of the pivot row is also the processor owning row k (i.e., k%p == y.node), the rows k and r are exchanged locally by this processor for r ≠ k. Row k is now the pivot row. The function copy_row(a,b,k,buf) copies the pivot row into the buffer buf, which is used for further communication.

• If different processors own row k and the pivot row r, row k is sent to the processor y.node owning the pivot row with MPI_Send() and MPI_Recv() operations. Before the send operation, the function copy_row(a,b,k,buf) copies row k of array a and element k of array b into a common buffer buf so that only one communication operation needs to be applied. After the communication, the processor y.node finalizes its exchange with the pivot row. The function copy_exchange_row(a,b,r,buf,k) exchanges the row r (still the pivot row) and the buffer buf. The appropriate row index r is known from the former local determination of the pivot row. Now the former row k is the row r and the buffer buf contains the pivot row.

Thus, in both cases the pivot row is stored in the buffer buf.

4. Distribution of the pivot row: Processor y.node sends the buffer buf to all other processors by an MPI_Bcast() operation. If the pivot row is owned by a different processor than the owner of row k, the content of buf is copied into row k by this processor using copy_back_row().

5. and 6. Computation of the elimination factors and the matrix elements: The computation of the elimination factors and of the new arrays a and b is done in parallel. Processor $P_q$ starts this computation with the first row $i > k$ with $i \bmod p = q$.
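A minimal sketch of phases 1 and 2 as just described follows; the function and variable names are assumptions, since the actual code is part of Fig. 7.2. It relies only on the mechanism the text names: MPI_Allreduce() with MPI_MAXLOC on MPI_DOUBLE_INT keeps the maximum double value together with the attached integer, here the rank of the owning process.

```c
#include <math.h>
#include <mpi.h>

struct pivot { double val; int node; };  /* layout of MPI_DOUBLE_INT */

/* Each process searches its cyclically owned elements of column k for
 * the largest absolute value; the Allreduce combines the candidates. */
struct pivot global_pivot(double **a, int n, int p, int me, int k)
{
  struct pivot x = { -1.0, me }, y;      /* -1.0: no local element */
  for (int i = k; i < n; i++)            /* phase 1: local pivot   */
    if (i % p == me && fabs(a[i][k]) > x.val)
      x.val = fabs(a[i][k]);
  /* phase 2: afterwards every process knows the global pivot value
   * in y.val and the owning process in y.node. */
  MPI_Allreduce(&x, &y, 1, MPI_DOUBLE_INT, MPI_MAXLOC, MPI_COMM_WORLD);
  return y;
}
```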
For a row-cyclic implementation of the Gaussian elimination, an alternative way of storing array a and vector b can be used. The alternative data structure consists of a one-dimensional array of pointers and n one-dimensional arrays of length n+1, each containing one row of a and the corresponding element of b. The entries in the pointer array point to the row arrays. This storage scheme not only facilitates the exchange of rows but is also convenient for a distributed storage. For a distributed memory, each processor $P_q$ stores the entire array of pointers but only the rows $i$ with $i \bmod p = q$; all other pointers are NULL pointers. Figure 7.3 illustrates this storage scheme for n = 8 and p = 4, showing the rows stored by processor $P_1$; each row stores n+1 elements consisting of one row of the matrix a and the corresponding element of b. The advantage of storing an element of b together with a is that the copy operation into a common buffer can be avoided. Also, the computation of the new values for a and b now needs only one loop with n+1 iterations. This implementation variant is not shown in Fig. 7.2.
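A sketch of this storage scheme, with assumed names and 0-based ranks, could look as follows; a local exchange of two rows is then a simple pointer swap.

```c
#include <stdlib.h>

/* a is an array of n row pointers; process me of p allocates only its
 * cyclic rows, each of length n+1 so that b[i] is stored directly
 * behind row i of the matrix. All other pointers stay NULL. */
double **alloc_rows(int n, int p, int me)
{
  double **a = calloc(n, sizeof *a);      /* all pointers NULL    */
  for (int i = me; i < n; i += p)
    a[i] = malloc((n + 1) * sizeof **a);  /* row i of A plus b[i] */
  return a;
}

/* Exchange two locally stored rows without moving any elements. */
void swap_rows(double **a, int k, int r)
{
  double *tmp = a[k];
  a[k] = a[r];
  a[r] = tmp;
}
```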
7.1.3 Parallel Implementation with Checkerboard Distribution

A parallel implementation using a block-cyclic checkerboard distribution for matrix $A$ can be described with the parameterized data distribution introduced in Sect. 3.4. The parameterized data distribution is given by a distribution vector

$((p_1, b_1), (p_2, b_2))$   (7.7)

with a $p_1 \times p_2$ virtual processor mesh with $p_1$ rows, $p_2$ columns, and $p_1 \cdot p_2 = p$ processors. The numbers $b_1$ and $b_2$ are the sizes of a block of data with $b_1$ rows and $b_2$ columns. The function $G : P \to \mathbb{N}^2$ maps each processor to a unique position in the processor mesh. This leads to the definition of $p_1$ row groups $R_q = \{Q \in P \mid G(Q) = (q, \cdot)\}$ with $|R_q| = p_2$ for $1 \leq q \leq p_1$ and $p_2$ column groups $C_q = \{Q \in P \mid G(Q) = (\cdot, q)\}$ with $|C_q| = p_1$ for $1 \leq q \leq p_2$. The row groups as well as the column groups form a partition of the entire set of processors, i.e.,

$\bigcup_{q=1}^{p_1} R_q = \bigcup_{q=1}^{p_2} C_q = P$ and $R_q \cap R_{q'} = \emptyset = C_q \cap C_{q'}$ for $q \neq q'$.

Row $i$ of the matrix $A$ is distributed across the local memories of the processors of only one row group, denoted $Ro(i)$ in the following. This is the row group $R_k$ with $k = \lfloor (i-1)/b_1 \rfloor \bmod p_1 + 1$. Analogously, column $j$ is distributed within one column group, denoted as $Co(j)$, which is the column group $C_k$ with $k = \lfloor (j-1)/b_2 \rfloor \bmod p_2 + 1$.

Example: For a matrix of size 12 × 12 (i.e., n = 12), p = 4 processors $\{P_1, P_2, P_3, P_4\}$, and distribution vector $((p_1, b_1), (p_2, b_2)) = ((2, 2), (2, 3))$, the virtual processor mesh has size 2 × 2 and the data blocks have size 2 × 3. There are two row groups and two column groups:

$R_1 = \{Q \in P \mid G(Q) = (1, j),\ j = 1, 2\}$,
$R_2 = \{Q \in P \mid G(Q) = (2, j),\ j = 1, 2\}$,
$C_1 = \{Q \in P \mid G(Q) = (j, 1),\ j = 1, 2\}$,
$C_2 = \{Q \in P \mid G(Q) = (j, 2),\ j = 1, 2\}$.

The distribution of matrix A is shown in Fig. 7.4 (illustration of a checkerboard distribution for a 12 × 12 matrix; the tuples denote the positions in the processor mesh of the processors owning the data blocks). It can be seen that row 5 is distributed in row group $R_1$ and that column 7 is distributed in column group $C_1$.

Using a checkerboard distribution with distribution vector (7.7), the computation of $A^{(k)}$ has the following implementation, which has a different communication pattern than the previous implementation. Figure 7.5 illustrates the communication and computation phases of the Gaussian elimination with checkerboard distribution: (1) determination of the local pivot elements, (2) determination of the global pivot element, (3) exchange of the pivot row, (4) broadcast of the pivot row, (5) computation of the elimination factors, (5a) broadcast of the elimination factors, and (6) computation of the matrix elements.

1. Determination of the local pivot element: Since column $k$ is distributed across the processors of column group $Co(k)$, only these processors determine the element with the largest absolute value for rows ≥ k within their local elements of column k.

2. Determination of the global pivot element: The processors in group $Co(k)$ perform a single-accumulation operation within this group, for which each processor in the group provides its local pivot element from phase 1. The reduction operation is the maximum operation, also determining the index of the pivot row (and not the number of the owning processor as before). The root processor of the single-accumulation operation is the processor owning the element $a_{kk}^{(k)}$. After the single-accumulation, the root processor knows the pivot element $a_{rk}^{(k)}$ and its row index. This information is sent to all other processors.

3. Exchange of the pivot row: The pivot row $r$ containing the pivot element $a_{rk}^{(k)}$ is distributed across row group $Ro(r)$. Row $k$ is distributed across the row group $Ro(k)$, which may be different from $Ro(r)$. If $Ro(r) = Ro(k)$, the processors of $Ro(k)$ exchange the elements of the rows $k$ and $r$ locally within the columns they own. If $Ro(r) \neq Ro(k)$, each processor in $Ro(k)$ sends its part of row $k$ to the corresponding processor in $Ro(r)$; this is the unique processor which belongs to the same column group.

4. Distribution of the pivot row: The pivot row is needed for the recalculation of matrix $A$, but each processor needs only those elements with column indices for which it owns elements. Therefore, each processor in $Ro(r)$ performs a group-oriented single-broadcast operation within its column group, sending its part of the pivot row to the other processors.

5. Computation of the elimination factors: The processors of column group $Co(k)$ locally compute the elimination factors $l_{ik}$ for their elements $i$ of column $k$ according to Formula (7.2).

5a. Distribution of the elimination factors: The elimination factors $l_{ik}$ are needed by all processors in the row group $Ro(i)$. Since the elements of row $i$ are distributed across the row group $Ro(i)$, each processor of column group $Co(k)$ performs a group-oriented single-broadcast operation in its row group $Ro(i)$ to broadcast its elimination factors $l_{ik}$ within this row group.

6. Computation of the matrix elements: Each processor locally computes the elements of $A^{(k+1)}$ and $b^{(k+1)}$ using its elements of $A^{(k)}$ and $b^{(k)}$ according to Formulas (7.3) and (7.4).
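Before turning to the backward substitution, a small sketch shows how the group structure defined above might be realized. The 0-based index formulas are the 1-based formulas of the text shifted by one; the use of MPI_Comm_split() and the row-major mesh numbering are assumptions for illustration.

```c
#include <mpi.h>

/* Row i belongs to row group (i / b1) % p1, column j to column group
 * (j / b2) % p2 (0-based variant of the formulas for Ro(i), Co(j)). */
int row_group(int i, int b1, int p1) { return (i / b1) % p1; }
int col_group(int j, int b2, int p2) { return (j / b2) % p2; }

/* One possible realization of the row and column groups as MPI
 * communicators: processes in the same mesh row (or column) obtain a
 * common communicator for the group-oriented operations. */
void build_groups(int p2, MPI_Comm *row_comm, MPI_Comm *col_comm)
{
  int me;
  MPI_Comm_rank(MPI_COMM_WORLD, &me);
  int my_row = me / p2, my_col = me % p2;  /* mesh position G(me) */
  MPI_Comm_split(MPI_COMM_WORLD, my_row, my_col, row_comm);
  MPI_Comm_split(MPI_COMM_WORLD, my_col, my_row, col_comm);
}
```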
The backward substitution for computing the $n$ elements of the result vector $x$ is done in $n$ consecutive steps, where each step $k$ consists of the following computations:

1. Each processor of the row group $Ro(k)$ computes that part of the sum $\sum_{j=k+1}^{n} a_{kj}^{(n)} x_j$ which contains its local elements of row $k$.

2. The entire sum $\sum_{j=k+1}^{n} a_{kj}^{(n)} x_j$ is determined by the processors of row group $Ro(k)$ by a group-oriented single-accumulation operation with the processor $P_q$ storing the element $a_{kk}^{(n)}$ as root. Addition is used as reduction operation.

3. Processor $P_q$ computes the value of $x_k$ according to Formula (7.5).

4. Processor $P_q$ sends the value of $x_k$ to all other processors by a single-broadcast operation (see the sketch after this list).
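A sketch of one such step follows, with assumed names and parameterization; the book's program is Fig. 7.6. It assumes Formula (7.5) denotes $x_k = (b_k^{(n)} - \sum_{j=k+1}^{n} a_{kj}^{(n)} x_j)/a_{kk}^{(n)}$, and that a row communicator spanning $Ro(k)$ exists as in the sketch after the six phases above.

```c
#include <mpi.h>

/* One backward-substitution step k. local_part is this process's share
 * of the sum over a[k][j]*x[j] from step 1; row_comm spans Ro(k);
 * row_root is the rank within row_comm of the owner of a[k][k], and
 * bcast_root is that owner's global rank. Only members of Ro(k) call
 * with in_Rok != 0 (akk, bk are only meaningful on the root). */
double backward_step(int in_Rok, double local_part, double akk, double bk,
                     int row_root, int my_row_rank, MPI_Comm row_comm,
                     int bcast_root)
{
  double sum = 0.0, xk = 0.0;
  if (in_Rok) {
    /* step 2: group-oriented single-accumulation with + as reduction */
    MPI_Reduce(&local_part, &sum, 1, MPI_DOUBLE, MPI_SUM,
               row_root, row_comm);
    if (my_row_rank == row_root)
      xk = (bk - sum) / akk;              /* step 3: Formula (7.5) */
  }
  /* step 4: single-broadcast of x_k to all processes */
  MPI_Bcast(&xk, 1, MPI_DOUBLE, bcast_root, MPI_COMM_WORLD);
  return xk;
}
```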
A pseudocode for an SPMD program in C notation with MPI operations implementing the Gaussian elimination with checkerboard distribution of matrix $A$ is given in Fig. 7.6 (program of the Gaussian elimination with checkerboard distribution). The computations correspond to those given in the pseudocode for the row-cyclic distribution in Fig. 7.2, but the pseudocode additionally uses several functions organizing the computations on the groups of processors. The functions Co(k) and Ro(k) denote the column or row group owning column k or row k, respectively. The function member(me,G) determines whether processor me belongs to group G. The function grp_leader() determines the first processor in a group. The functions Cop(q) and Rop(q) determine the column or row group, respectively, to which a processor q belongs. The function rank(q,G) returns the local processor number (rank) of a processor in a group G.

1. Determination of the local pivot element: The determination of the local pivot element is performed only by the processors in column group Co(k).

2. Determination of the global pivot element: The global pivot element is again computed by an MPI_MAXLOC reduction operation, but in contrast to Fig. 7.2 the index of the row of the pivot element is calculated and not the number of the processor owning the pivot element. The reason is that all processors which own a part of the pivot row need to know that some of their data belongs to the current pivot row; this information is used in further communication.

3. Exchange of the pivot row: For the exchange and distribution of the pivot row r, the cases Ro(k) == Ro(r) and Ro(k) != Ro(r) are distinguished:

• When the pivot row and the row k are stored by the same row group, each processor of this group exchanges its data elements of row k and row r locally using the function exchange_row_loc() and copies the elements of the pivot row (now row k) into the buffer buf using the function copy_row_loc(). Only the elements in column k or higher are considered.

• When the pivot row and the row k are stored by different row groups, communication is required for the exchange of the pivot row. The function compute_partner(Ro(r),me) computes the communication partner of the calling processor me, which is the processor q ∈ Ro(r) belonging to the same column group as me. The function compute_size(n,k,Ro(k)) computes the number of elements of the pivot row that the calling processor stores in columns greater than k; this number depends on the size of the row group Ro(k), the block size, and the position k (see the sketch below). The same function is used later to determine the number of elimination factors to be communicated.

4. Distribution of the pivot row: For the distribution of the pivot row r, a processor takes part in a single-broadcast operation in its column group. The roots of the broadcast operations performed in parallel are the processors q ∈ Ro(r). The participants of a broadcast are the processors q' ∈ Cop(q), either as root when q' ∈ Ro(r) or as recipient otherwise.
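The following sketch shows what a helper in the role of compute_size() has to deliver; the 0-based formulation and the names are assumptions, not the signature of Fig. 7.6.

```c
/* Number of elements a[k][j], j > k, stored by the calling process
 * under the block-cyclic column distribution with block size b2 over
 * p2 column groups (a closed-form count is possible; the loop keeps
 * the sketch simple). */
int local_pivot_elems(int n, int k, int b2, int p2, int my_col_group)
{
  int count = 0;
  for (int j = k + 1; j < n; j++)
    if ((j / b2) % p2 == my_col_group)  /* column j stored locally */
      count++;
  return count;
}
```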
