The MIMD version is implemented for use on workstation clusters as well as multiprocessor supercomputers like the Intel Paragon. As a foundation for this implementation, the ScaLAPACK [] version 1.1 and BLACS [] libraries are used.
The ScaLAPACK version is like the other two implementations
implemented in layers, as shown in Figure
. This
version has six layers: Application, SymPLA top-level layers, MIMD
Scope class, which uses the BLACS to communicate with the ScaLAPACK
interface code on the backend. The interface code uses the BLACS and
ScaLAPACK libraries to perform the desired calculations. A full
description of this implementation is available in the implementation
document [].
Figure: Organization of SymPLA for MIMD computers
The MIMD scope class handles the data required to identify the matrix, provides the necessary public information about it, and the initial preparations for the operations. Using BLACS, the scope sends the requests and all data required to the ScaLAPACK interface code on the backend, which uses the data for the selected operation.
The backend is implemented as a separate executable that is started in an architecture specific way, either from the command line (Paragon, MPI) or from the application (PVM). This implementation has been tested on Sparc and DEC Alpha Workstation clusters, as well as the Intel Paragon, using PVM, MPI and NX communication libraries as a foundation for the BLACS.
Since the MPI environment required delayed initialization of the backend, initialization of the backend could not be done before after the main-function of the application had initialized MPI separately. Presently, this makes use of global variables of the Matrix or Vector classes impossible in applications intended to be MPI compatible; it is still permissible in the versions using the NX and PVM versions of BLACS. Initialization of the ScaLAPACK interface is done partially before main is called, and fully when the first Matrixscope or Vectorscope object is created by a Matrix or Vector object.
The initialization is handled by a class Clientcount whose counter is incremented each time a scope (or any object holding the Clientcount class) is created, and decremented each time a scope is destroyed. The first time the Clientcount counter is incremented the global ScaLAPACKclient object is completely initialized. At present, the decrementation of the counter and a zero count is not used, but may be used for debugging.
Since there exists no native C++ compiler for the Intel Paragon, a cfront based C++ compiler was configured by Intel. For the other architectures, Sparc and DEC Alpha, the SparcWorks C++ compiler and the DEC OSF/1 C++ compiler, respectively, were used. These two compilers do support exceptions, but their use has not been tested in practice, as the Intel Paragon cfront compiler was used, and for simplicity, exceptions were mostly disabled. In contrast to the other implementations, the backend part of this system uses a larger, experimental set of exception classes that is more to the point about what caused the exception to be thrown.
The scopes use the following data definition:
class Clientcount{
private:
static int counter;
};
class Matrixscope : public Clientcount{
private:
struct matstore{
int matrixdescriptor;
int size_x,size_y;
int refs;
} *storage;
int *xsel,xselcount;
int *ysel,yselcount;
int *solverpermutation,permsize;
};
class Vectorscope : public Clientcount{
private:
struct vecstore{
int vectordescriptor;
int size;
int refs;
} *storage;
int *sel,selcount;
};
The member matrixdescriptor is an integer used as a key into a list of matrices on the backend.
In addition, these internal classes are used:
struct ScaLAPACKparam{
public:
int descriptor; // Matrix descriptor
MatrixTranspose transposed; // Matrix is transposed
double scale; // scale matrix with factor
int unit; // unit diagonal (1)
int upper; // Upper (1) or lower (0) triangular part
int newsizeX,newsizeY; // gives dimension of a new matrix
int usefunction; // Number of userdefined function to be used
private:
struct{
int n;
int *index;
} Xsel,Ysel;
};
class ScaLAPACKclient{
private:
int mynode,meshsize;
int blocking;
int ID_codes_start,ID_codes_end;
int maxdim,*transmitdata;
static int initialized;
static int original;
};
class descriptordata{
private:
static int context,blocking;
int descriptor;
struct{
int sizeX,sizeY;
} local,global;
int lda[8];
double *data;
struct descriptordata *next;
};
class ScaLAPACKserver{
private:
int blocking;
static int meshsize; // Number of processors
int mynode;
static int context;
static int descriptorcount;
static int quitcount;
int ID_codes_start,ID_codes_end;
static descriptordata *descriptors,*last;
static int initialized;
static int original;
static ScaLAPACKIDcodes buffer;
// Distribution data
static int *indexX,*indexY,*transmitdata;
static int maxdim;
ScaLAPACKprocid requestnode;
};
The class ScaLAPACKparam is used for passing information about the matrix to be used. It is also used internally on the backend for argument passing. ScaLAPACKclient is used to control communication with the backend. The ScaLAPACKserver class is the backend, and uses the descriptordata class to hold the information about all declared matrices in a linked list of this class. Each matrix is identified by a unique integer. Some of the members of ScaLAPACKserver, which are dynamic arrays, are resized as the dimensions of the matrices increase, making reallocation of buffer space mostly unnecessary.
The scopes are used to set up the arguments and call a global object Matrixclient of the ScaLAPACKclient class. The ScaLAPACKclient then does the actual communication to the backend. The ScaLAPACKclient is also able to send and receive data to and from the backend. Vectors and matrices are handled by the same routines as vectors are treated as n-by-1 matrices.
The backend, which runs on a number of processors, uses the Single Program Multiple Data (SPMD) programming topology (which makes the entire application MPMD). One of the processors receives the request (and the arguments) from the frontend, and forwards them to the other processors. Each processor then acts on the information it received, performing the same operations.
The backend processes are organized in a single row of processors, but changing this will require little work as most of the implementation is already dynamic. Matrices are organized in a block cyclic manner used by ScaLAPACK. This means that the first blocking columns are stored on the first processor, the next blocking columns on the second, and so on, until we start the pattern over again on the first processor with the blocking columns following the last column stored on the P'th processor. The block factor may be set by defining the macro SCALABLOCKFACTOR when compiling the library, although a method to modify the block factor at runtime should be found, possibly by giving suggestions to the library.
The actual storage of the matrices is handled by the descriptordata objects, which are stored in a linked list. Each matrix is identified by an integer value. Positive integers identify application matrices, and negative integers identify the descriptors used for temporary matrices. The temporary matrices are resized according to the needs of the operation. There are six temporary matrices, three for operations and three for assignment between matrices. Assignment to an unused (zero sized) temporary matrix will cause automatic adjustment by the distribution routine without any resizing needed by the operation.
The backend routines retrieve, if necessary, the required data from the argument matrices into temporary matrices, then after use, write the result back if needed. The backend may, in cases where it is appropriate, use the argument matrices directly, as is done in the other two versions presented. The direct use requires that certain conditions are present:
The two last points can be decided by the operation using the argument. The other two requirements have to be met.
Most of the operations are following the general algorithm below:
If the answer is yes to any of these questions, a temporary matrix must be used.
Some operations can be handled directly on the frontend by ScaLAPACKclient or the scope.
If any step of the operation fails, the operation throws an exception. On the backend, these exceptions are either caught and an error status is returned to the frontend, or, in the case of compilers that do not support exceptions, that particular process halts, causing the the application to lock up.
Copying data from one submatrix to another is handled by the operation distribute in ScaLAPACKserver which can handle copying:
Two other possibilities, submatrix to submatrix and copy within the same matrix are handled by using a temporary matrix as intermediate storage. More than one submatrix selection is also handled with temporary matrices, but internally in the distribution routine, not by recursion.
Function evaluation is allowed to take place both on the frontend and the backend, in a similar manner as for the MasPar version. Selection between functions that will be used on the frontend, and functions that are used on the backend, is handled by the same method used by the SIMD MasPar version, a global array MatrixRemoteFunctions. This array holds the addresses of MatrixRemoteFunctionCount functions. But, as the backend is a separate application, the remote functions and the list of these functions must be linked into both the user application and the backend application. The developer must therefore create a separate backend, something which must be specially handled by the start up procedures, which in PVM requires the use of the environment variable ``MATRIX_COMMAND''.
Identification of the remote function is achieved by finding which function address entry in the MatrixRemoteFunctions array is the requested function, scanning the list on the frontend, and then transmitting that number to the backend if it is in the list. Otherwise, the evaluation is handled on the frontend by extracting the matrix data into temporary memory, evaluating the function and then loading the matrixdata back onto the backend.
The backend, if used, will use the received number to get the address of the function from its version of the table with remote functions, and will call this function for each element in the matrix. This will be done in any parallel order required by the data layout.
/* In C++ */
double func1(double, int, int); // remote function
double func2(double, int, int); // local function
double (*MatrixRemoteFunctions[])(double,int,int) = {func1};
int MatrixRemoteFunctionCount = 1;
Matrix A(m,n);
A = func1; // Evaluated on the backend;
A = func2; // Evaluated on the frontend;