MPI Wrappers
The MPI standard defines a profiling interface intended to intercept calls to MPI functions and inject custom code (see MPI-4.0 Section 15.2).
Every MPI function comes in two versions: an MPI_-prefixed version and a PMPI_-prefixed version with identical functionality.
#include <mpi.h>

int main(int argc, char **argv) {
   MPI_Init(&argc, &argv);
   int rank;
   MPI_Comm_rank(MPI_COMM_WORLD, &rank);
   MPI_Barrier(MPI_COMM_WORLD);
   MPI_Finalize();
}
#include <mpi.h>

int main(int argc, char **argv) {
   PMPI_Init(&argc, &argv);
   int rank;
   PMPI_Comm_rank(MPI_COMM_WORLD, &rank);
   PMPI_Barrier(MPI_COMM_WORLD);
   PMPI_Finalize();
}
Note that in order to compile code that calls PMPI_ symbols directly, NEC MPI requires the -mpiprof flag, whereas Open MPI works without adjusting the compile flags.
This redundancy allows one of the two functions to be overridden without losing the functionality.
The following example overrides the MPI_Barrier function to print information about the barrier and uses PMPI_Barrier to retain the barrier functionality:
#include <stdio.h>
#include <mpi.h>

int MPI_Barrier(MPI_Comm comm) {
   int rank;
   PMPI_Comm_rank(comm, &rank);
   printf("Rank %d hits Barrier\n", rank);
   return PMPI_Barrier(comm);
}
int main(int argc, char **argv) {
   MPI_Init(&argc, &argv);
   int rank;
   MPI_Comm_rank(MPI_COMM_WORLD, &rank);
   MPI_Barrier(MPI_COMM_WORLD);
   MPI_Finalize();
}
$ mpicc -o test.x test.c
$ mpirun -np 3 ./test.x
Rank 0 hits Barrier
Rank 1 hits Barrier
Rank 2 hits Barrier
Vftrace overrides every MPI communication routine for C, Fortran, and Fortran08 (MPI-3.1) in order to collect information about sent and received messages.
The wrappers check whether MPI logging is active (see Runtime Control#Config File).
If it is inactive, the MPI communication is executed directly via the PMPI_ functionality.
If it is active, a translation routine vftr_MPI_<function>_<language>2vftr is called that converts the arguments.
int MPI_Bcast(void *buffer, int count, MPI_Datatype datatype, int root, MPI_Comm comm) {
   ...
   if (vftr_no_mpi_logging()) {
      return PMPI_Bcast(buffer, count, datatype, root, comm);
   } else {
      return vftr_MPI_Bcast_c2vftr(buffer, count, datatype, root, comm);
   }
}
SUBROUTINE MPI_BCAST(BUFFER, COUNT, DATATYPE, ROOT, COMM, ERROR)
   ...
   IF (vftr_no_mpi_logging_F()) THEN
      CALL PMPI_BCAST(BUFFER, COUNT, DATATYPE, ROOT, COMM, ERROR)
   ELSE
      CALL vftr_MPI_Bcast_f2vftr(BUFFER, COUNT, DATATYPE, ROOT, COMM, ERROR)
   END IF
END SUBROUTINE MPI_BCAST
SUBROUTINE MPI_Bcast_f08(buffer, count, datatype, root, comm, error)
   ...
   IF (vftr_no_mpi_logging_f08()) THEN
      CALL PMPI_Bcast(buffer, count, datatype, root, comm, tmperror)
   ELSE
      CALL vftr_MPI_Bcast_f082vftr(buffer, count, datatype%MPI_VAL, root, comm%MPI_VAL, tmperror)
   END IF
   IF (PRESENT(error)) error = tmperror
END SUBROUTINE MPI_Bcast_f08
The translation routines (vftr_MPI_<function>_<language>2vftr) are called by the wrappers in order to translate the wrapper arguments from C, Fortran, and Fortran08 to a common format and back.
This way the profiling logic only needs to be implemented once, resulting in less error-prone and more maintainable code.
The translation layer also checks certain argument properties (inter/intra-communicators, in-place buffers, ...) and calls the corresponding profiling routines, which reduces the complexity of each individual profiling routine.
For C the translation routine hands the incoming arguments over unchanged, as they are already in the correct format.
int vftr_MPI_Bcast_c2vftr(void *buffer, int count, MPI_Datatype datatype, int root, MPI_Comm comm) {
   // determine if inter or intra communicator
   int isintercom;
   PMPI_Comm_test_inter(comm, &isintercom);
   if (isintercom) {
      return vftr_MPI_Bcast_intercom(buffer, count, datatype, root, comm);
   } else {
      return vftr_MPI_Bcast(buffer, count, datatype, root, comm);
   }
}
The Fortran wrappers require translation of the arguments to a C-compatible format. On the Fortran side, a Fortran-C interface is defined so that the translation routine itself can be implemented in C.
SUBROUTINE vftr_MPI_Bcast_f2vftr(BUFFER, COUNT, F_DATATYPE, ROOT, F_COMM, F_ERROR) &
   BIND(C, name="vftr_MPI_Bcast_f2vftr")
   IMPLICIT NONE
   INTEGER BUFFER
   INTEGER COUNT
   INTEGER F_DATATYPE
   INTEGER ROOT
   INTEGER F_COMM
   INTEGER F_ERROR
END SUBROUTINE vftr_MPI_Bcast_f2vftr
void vftr_MPI_Bcast_f2vftr(void *buffer, MPI_Fint *count, MPI_Fint *f_datatype, MPI_Fint *root, MPI_Fint *f_comm, MPI_Fint *f_error) {
   MPI_Datatype c_datatype = PMPI_Type_f2c(*f_datatype);
   MPI_Comm c_comm = PMPI_Comm_f2c(*f_comm);
   int c_error;
   int isintercom;
   PMPI_Comm_test_inter(c_comm, &isintercom);
   if (isintercom) {
      c_error = vftr_MPI_Bcast_intercom(buffer, (int)(*count), c_datatype, (int)(*root), c_comm);
   } else {
      c_error = vftr_MPI_Bcast(buffer, (int)(*count), c_datatype, (int)(*root), c_comm);
   }
   *f_error = (MPI_Fint) (c_error);
}
For Fortran08, the translation routines and interfaces look almost identical to the Fortran ones.
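As a rough illustration, the C side of the Fortran08 translation could look like the following minimal sketch. It assumes the same by-reference argument passing and handle conversion pattern as the Fortran case shown above; the exact signature used in Vftrace may differ.
void vftr_MPI_Bcast_f082vftr(void *buffer, MPI_Fint *count, MPI_Fint *f_datatype, MPI_Fint *root, MPI_Fint *f_comm, MPI_Fint *f_error) {
   // convert the integer handles passed as datatype%MPI_VAL and comm%MPI_VAL
   // back into C handles
   MPI_Datatype c_datatype = PMPI_Type_f2c(*f_datatype);
   MPI_Comm c_comm = PMPI_Comm_f2c(*f_comm);
   int isintercom;
   PMPI_Comm_test_inter(c_comm, &isintercom);
   int c_error;
   if (isintercom) {
      c_error = vftr_MPI_Bcast_intercom(buffer, (int)(*count), c_datatype, (int)(*root), c_comm);
   } else {
      c_error = vftr_MPI_Bcast(buffer, (int)(*count), c_datatype, (int)(*root), c_comm);
   }
   *f_error = (MPI_Fint) c_error;
}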
The profiling routines for blocking (e.g. bcast) and non-blocking (e.g. ibcast) communication are fundamentally different.
For blocking routines, the first action is to execute the communication via the PMPI_ functionality while measuring its duration, so that bandwidths can be computed.
int vftr_MPI_Bcast(void *buffer, int count, MPI_Datatype datatype, int root, MPI_Comm comm) {
   long long tstart = vftr_get_runtime_nsec();
   int retVal = PMPI_Bcast(buffer, count, datatype, root, comm);
   long long tend = vftr_get_runtime_nsec();
   ...
All communications are broken up into pairwise messages in order to gather rank-to-rank information.
The precise pattern of pairwise messages depends on the communication routine.
The pairwise message information is stored in the profile and written to the vfd files by vftr_store_sync_message_info.
   if (rank == root) {
      int size;
      PMPI_Comm_size(comm, &size);
      for (int i=0; i<size; i++) {
         vftr_store_sync_message_info(send, count, datatype, i, -1,
                                      comm, tstart, tend);
      }
   } else {
      vftr_store_sync_message_info(recv, count, datatype, root, -1,
                                   comm, tstart, tend);
   }
Lastly, the profiling overhead is added to the profiling information. The overhead of the wrappers and translation layers themselves is ignored, because measuring and storing it would introduce a comparatively large overhead of its own.
long long t2end = vftr_get_runtime_nsec();
vftr_accumulate_mpiprofiling_overhead(&(my_profile->mpiprof), t2end-t2start);
As in the blocking case, the first action is to execute the communication.
In contrast to the blocking case, only the start time is needed, because the PMPI_I<function> call returns immediately anyway.
The end timestamp for determining the bandwidth is taken later.
int vftr_MPI_Ibcast(void *buffer, int count, MPI_Datatype datatype, int root, MPI_Comm comm, MPI_Request *request) {
   long long tstart = vftr_get_runtime_nsec();
   int retVal = PMPI_Ibcast(buffer, count, datatype, root, comm, request);
Next, all the information needed for the profiling data is copied and stored together with the request by registering it in the open-request list vftrace.mpi_state.open_requests via the vftr_register_collective_request function.
   int *tmpcount = (int*) malloc(sizeof(int)*size);
   MPI_Datatype *tmpdatatype = (MPI_Datatype*) malloc(sizeof(MPI_Datatype)*size);
   int *tmppeer_ranks = (int*) malloc(sizeof(int)*size);
   // messages to be sent
   for (int i=0; i<size; i++) {
      tmpcount[i] = count;
      tmpdatatype[i] = datatype;
      tmppeer_ranks[i] = i;
   }
   vftr_register_collective_request(send, size, tmpcount, tmpdatatype,
                                    tmppeer_ranks, comm,
                                    *request, 0, NULL, tstart);
   // cleanup temporary arrays
   free(tmpcount);
   tmpcount = NULL;
   free(tmpdatatype);
   tmpdatatype = NULL;
   free(tmppeer_ranks);
   tmppeer_ranks = NULL;
Lastly, as with the blocking communication, the profiling overhead is measured and accumulated in the profiles.
long long t2end = vftr_get_runtime_nsec();
vftr_accumulate_mpiprofiling_overhead(&(my_profile->mpiprof), t2end-t2start);
A persistent communication consists of several steps.
First, the communication is initialized by defining the buffers, peer ranks, etc., which returns a persistent request.
The communication can then be started over and over again with MPI_Start or MPI_Startall by passing the persistent request.
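For reference, a minimal user-side example of this lifecycle (plain MPI, independent of Vftrace) might look like the following:
#include <mpi.h>

int main(int argc, char **argv) {
   MPI_Init(&argc, &argv);
   int rank;
   MPI_Comm_rank(MPI_COMM_WORLD, &rank);
   int buf = rank;
   MPI_Request request;
   // initialization: buffers, peer rank, tag, ... are fixed once
   if (rank == 0) {
      MPI_Send_init(&buf, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, &request);
   } else if (rank == 1) {
      MPI_Recv_init(&buf, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &request);
   }
   if (rank < 2) {
      // the persistent request can be started repeatedly
      for (int i=0; i<3; i++) {
         MPI_Start(&request);
         MPI_Wait(&request, MPI_STATUS_IGNORE);
      }
      MPI_Request_free(&request);
   }
   MPI_Finalize();
}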
In Vftrace the communication init routines start by calling the PMPI_ version in order to initialize the communication properly.
Timestamps are not necessary here, because the communication has not started yet and the call returns immediately.
int vftr_MPI_Send_init(const void *buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm, MPI_Request *request) {
int retVal = PMPI_Send_init(buf, count, datatype, dest, tag, comm, request);
Next, all the arguments are stored in a request, similarly to the non-blocking communication case. However, the request is marked as persistent, which will be important when requests are completed.
vftr_register_pers_p2p_request(send, count, datatype, dest, tag, comm, *request);
Lastly, the overhead is accumulated again.
long long t2end = vftr_get_runtime_nsec();
vftr_accumulate_mpiprofiling_overhead(&(my_profile->mpiprof), t2end-t2start);
return retVal;
}
When the persistent communication is actually executed by MPI_Start or MPI_Startall, the previously registered request is activated with the current timestamp as start time after executing the PMPI_ functionality.
int vftr_MPI_Start(MPI_Request *request) {
   long long tstart = vftr_get_runtime_nsec();
   int retVal = PMPI_Start(request);
   long long t2start = vftr_get_runtime_nsec();
   vftr_activate_pers_request(*request, tstart);
Lastly, the overhead is accumulated again.
long long t2end = vftr_get_runtime_nsec();
vftr_accumulate_mpiprofiling_overhead(&(my_profile->mpiprof), t2end-t2start);
return retVal;
}
A persistent request is activated by locating it in the open-request list, setting its state to active, and storing the start timestamp.
void vftr_activate_pers_request(MPI_Request request, long long tstart) {
   // search for request in open request list
   vftr_request_t *matching_request = vftr_search_request(request);
   if (matching_request != NULL) {
      matching_request->active = true;
      matching_request->tstart = tstart;
   }
}
In MPI every request needs to be finalized by a call to MPI_Wait, MPI_Waitany, MPI_Waitsome, MPI_Waitall, or MPI_Test, MPI_Testany, MPI_Testsome, MPI_Testall with a positive completion flag.
Because non-blocking communication runs in the background, the communication could already be long finished before one of the routines above is called to finalize it.
Therefore Vftrace checks all registered requests for completion every time a profiling function hook is called (see Function Hooks) by calling vftr_clear_completed_requests().
MPI requests can be queried non-destructively for completion with PMPI_Request_get_status.
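A minimal sketch of such a sweep over the open-request list could look like the following. The list layout (next and request fields) and the helper vftr_finish_completed_request are assumptions for illustration; the actual Vftrace implementation may differ.
void vftr_clear_completed_requests_sketch() {
   // walk the list of registered open requests
   vftr_request_t *req = vftrace.mpi_state.open_requests;
   while (req != NULL) {
      // remember the successor in case req is removed from the list
      vftr_request_t *next_req = req->next;
      int flag;
      MPI_Status status;
      // query completion without freeing the MPI-internal request
      PMPI_Request_get_status(req->request, &flag, &status);
      if (flag) {
         // take the end timestamp and accumulate the stored message information
         // (vftr_finish_completed_request is a hypothetical helper)
         long long tend = vftr_get_runtime_nsec();
         vftr_finish_completed_request(req, tend);
      }
      req = next_req;
   }
}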
If a request is found to be completed, the end timestamp is taken, the information stored in the request is accumulated in the appropriate profiling struct, and optionally written to the vfd files.
bool should_log_message = vftr_should_log_message_info(vftrace.mpi_state, request->rank[i]);
if (should_log_message) {
   vftr_accumulate_message_info(&(my_profile->mpiprof), request->dir, request->count[i],
                                request->type_idx[i], request->type_size[i], request->rank[i],
                                request->tag, request->tstart, tend);
   if (vftrace.config.sampling.active.value) {
      vftr_write_message_info(request->dir, request->count[i], request->type_idx[i],
                              request->type_size[i], request->rank[i], request->tag,
                              request->tstart, tend, request->callingstackID,
                              request->callingthreadID);
   }
}
If the request is a persistent one, it is deactivated but stays on the list for further MPI_Start(all) calls.
Otherwise it is deleted from the list.
The MPI-internal request is not touched by this procedure.
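Conceptually, this branch can be sketched as follows. The field name persistent and the helper vftr_remove_request are assumptions for illustration (the active field appears in vftr_activate_pers_request above).
void vftr_handle_finished_request_sketch(vftr_request_t *request) {
   if (request->persistent) {
      // persistent requests are only deactivated; they stay registered so that
      // a later MPI_Start(all) can reactivate them
      request->active = false;
   } else {
      // one-shot requests are removed from the open-request list
      // (vftr_remove_request is a hypothetical helper)
      vftr_remove_request(request);
   }
   // the MPI-internal request is deliberately left untouched here
}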
The MPI_Wait* routines in Vftrace are essentially busy loops that repeatedly check the given requests for completion.
While waiting, other requests are checked for completion as well.
Once the supplied request is completed and all profiling information has been collected, the PMPI_ function is employed to properly treat the MPI-internal request.
int vftr_MPI_Wait(MPI_Request *request, MPI_Status *status) {
   int retVal;
   // loop until the communication corresponding to the request is completed
   int flag = false;
   while (!flag) {
      // check if the communication is finished
      retVal = PMPI_Request_get_status(*request, &flag, status);
      // either the communication is completed, or not
      // other communications might be completed in the background
      // clear those from the list of open requests
      vftr_clear_completed_requests_from_wait();
   }
   // Properly set the request and status variable
   retVal = PMPI_Wait(request, status);
   return retVal;
}
The MPI_Test* routines work similarly, except that they do not loop: they check once and return.
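A correspondingly simplified sketch of a Test wrapper, mirroring the Wait wrapper above, might be structured like this. The reuse of vftr_clear_completed_requests_from_wait here is an assumption; the Test path may use its own clearing variant.
int vftr_MPI_Test(MPI_Request *request, int *flag, MPI_Status *status) {
   // single, non-destructive completion check instead of a busy loop
   int retVal = PMPI_Request_get_status(*request, flag, status);
   // record profiling information for any requests that have completed,
   // including the supplied one if its flag is set
   vftr_clear_completed_requests_from_wait();
   // finally let PMPI_Test set flag, status, and the MPI-internal request
   retVal = PMPI_Test(request, flag, status);
   return retVal;
}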