MPI Wrappers

MPI-Profiling Interface

The MPI standard defines a profiling interface intended to intercept calls to MPI functions and inject custom code (see MPI-4.0 Section 15.2). Every MPI function comes in two versions: an MPI_-prefixed version and a PMPI_-prefixed version with identical functionality.

/* Program using the regular MPI_ interface */
#include <mpi.h>
int main(int argc, char **argv) {
   MPI_Init(&argc, &argv);
   int rank;
   MPI_Comm_rank(MPI_COMM_WORLD, &rank);
   MPI_Barrier(MPI_COMM_WORLD);
   MPI_Finalize();
   return 0;
}

/* The same program using the PMPI_ interface */
#include <mpi.h>
int main(int argc, char **argv) {
   PMPI_Init(&argc, &argv);
   int rank;
   PMPI_Comm_rank(MPI_COMM_WORLD, &rank);
   PMPI_Barrier(MPI_COMM_WORLD);
   PMPI_Finalize();
   return 0;
}

Note that in order to compile code containing PMPI_ symbols, NEC-MPI requires the -mpiprof flag; OpenMPI works without adjusting the compile flags. This redundancy of functionality makes it possible to override one of the two functions without losing the functionality. The following example overrides the MPI_Barrier function to print information about the barrier and uses PMPI_Barrier to retain the barrier functionality:

#include <stdio.h>
#include <mpi.h>
int MPI_Barrier(MPI_Comm comm) {
   int rank;
   PMPI_Comm_rank(comm, &rank);
   printf("Rank %d hits Barrier\n", rank);
   return PMPI_Barrier(comm);
}
int main(int argc, char **argv) {
   MPI_Init(&argc, &argv);
   int rank;
   MPI_Comm_rank(MPI_COMM_WORLD, &rank);
   MPI_Barrier(MPI_COMM_WORLD);
   MPI_Finalize();
}
$ mpicc -o test.x test.c
$ mpirun -np 3 ./test.x
Rank 0 hits Barrier
Rank 1 hits Barrier
Rank 2 hits Barrier

MPI-Profiling in Vftrace

Vftrace overrides every MPI communication routine for C, Fortran, and Fortran08 (MPI-3.1) in order to collect information about sent and received messages.

MPI-Wrappers

The wrappers check whether MPI logging is active (see Runtime Control#Config File). If it is inactive, the MPI communication is executed directly via the PMPI_ functionality. If it is active, a translation routine vftr_MPI_<function>_<language>2vftr is called, which converts the arguments.

int MPI_Bcast(void *buffer, int count, MPI_Datatype datatype, int root, MPI_Comm comm) {
...
   if (vftr_no_mpi_logging()) {
      return PMPI_Bcast(buffer, count, datatype, root, comm);
   } else {
      return vftr_MPI_Bcast_c2vftr(buffer, count, datatype, root, comm);
   }
}
SUBROUTINE MPI_BCAST(BUFFER, COUNT, DATATYPE, ROOT, COMM, ERROR)
...
   IF (vftr_no_mpi_logging_F()) THEN
      CALL PMPI_BCAST(BUFFER, COUNT, DATATYPE, ROOT, COMM, ERROR)
   ELSE
      CALL vftr_MPI_Bcast_f2vftr(BUFFER, COUNT, DATATYPE, ROOT, COMM, ERROR)
   END IF
END SUBROUTINE MPI_BCAST
SUBROUTINE MPI_Bcast_f08(buffer, count, datatype, root, comm, error)
...
   IF (vftr_no_mpi_logging_f08()) THEN
      CALL PMPI_Bcast(buffer, count, datatype, root, comm, tmperror)
   ELSE
      CALL vftr_MPI_Bcast_f082vftr(buffer, count, datatype%MPI_VAL, root, comm%MPI_VAL, tmperror)
   END IF
   IF (PRESENT(error)) error = tmperror
END SUBROUTINE MPI_Bcast_f08

Translation Layer

The translation routines (vftr_MPI_<function>_<language>2vftr) are called by the wrappers in order to translate the wrapper arguments from C, Fortran, and Fortran08 to a common format and back. This way the profiling logic only needs to be implemented once, resulting in less error-prone and more maintainable code. The translation layer also checks certain arguments (inter/intra-communicators, in-place buffers, ...) and calls the corresponding profiling routines, which reduces the complexity of each individual profiling routine.

C-2-vftr

For C the translation routine hands the incoming arguments over unchanged, as they are already in the correct format.

int vftr_MPI_Bcast_c2vftr(void *buffer, int count, MPI_Datatype datatype, int root, MPI_Comm comm) {
   // determine if inter or intra communicator
   int isintercom;
   PMPI_Comm_test_inter(comm, &isintercom);
   if (isintercom) {
      return vftr_MPI_Bcast_intercom(buffer, count, datatype, root, comm);
   } else {
      return vftr_MPI_Bcast(buffer, count, datatype, root, comm);
   }   
}

Fortran-2-vftr

The Fortran wrappers require translation of their arguments to a C-compatible format. On the Fortran side, a Fortran-C interface needs to be defined so that the translation routine itself can be implemented in C.

SUBROUTINE vftr_MPI_Bcast_f2vftr(BUFFER, COUNT, F_DATATYPE, ROOT, F_COMM, F_ERROR) &
   BIND(C, name="vftr_MPI_Bcast_f2vftr")
   IMPLICIT NONE
   INTEGER BUFFER
   INTEGER COUNT
   INTEGER F_DATATYPE
   INTEGER ROOT
   INTEGER F_COMM
   INTEGER F_ERROR
END SUBROUTINE vftr_MPI_Bcast_f2vftr
void vftr_MPI_Bcast_f2vftr(void *buffer, MPI_Fint *count, MPI_Fint *f_datatype, MPI_Fint *root, MPI_Fint *f_comm, MPI_Fint *f_error) {
   MPI_Datatype c_datatype = PMPI_Type_f2c(*f_datatype);
   MPI_Comm c_comm = PMPI_Comm_f2c(*f_comm);
   int c_error;
   int isintercom;
   PMPI_Comm_test_inter(c_comm, &isintercom);
   if (isintercom) {
      c_error = vftr_MPI_Bcast_intercom(buffer, (int)(*count), c_datatype, (int)(*root), c_comm);
   } else {
      c_error = vftr_MPI_Bcast(buffer, (int)(*count), c_datatype, (int)(*root), c_comm);
   }   
   *f_error = (MPI_Fint) (c_error);
}

Fortran08-2-vftr

The translation routines and interfaces look almost identical to the ones for Fortran. The main difference is that the Fortran08 wrapper already passes the integer handles (e.g. comm%MPI_VAL) to the translation routine.
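
The following sketch illustrates what the C side of such a Fortran08 translation routine could look like. It is an assumed analogue of the f2vftr routine above, not the exact Vftrace source; the signature of vftr_MPI_Bcast_f082vftr is an assumption based on the integer handles passed by the f08 wrapper.

void vftr_MPI_Bcast_f082vftr(void *buffer, MPI_Fint *count, MPI_Fint *f_datatype, MPI_Fint *root, MPI_Fint *f_comm, MPI_Fint *f_error) {
   // convert the integer handles received from Fortran08 into C handles
   MPI_Datatype c_datatype = PMPI_Type_f2c(*f_datatype);
   MPI_Comm c_comm = PMPI_Comm_f2c(*f_comm);
   int c_error;
   int isintercom;
   PMPI_Comm_test_inter(c_comm, &isintercom);
   if (isintercom) {
      c_error = vftr_MPI_Bcast_intercom(buffer, (int)(*count), c_datatype, (int)(*root), c_comm);
   } else {
      c_error = vftr_MPI_Bcast(buffer, (int)(*count), c_datatype, (int)(*root), c_comm);
   }
   *f_error = (MPI_Fint) (c_error);
}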

Profiling Routines

The profiling routines for blocking (e.g. MPI_Bcast) and non-blocking (e.g. MPI_Ibcast) communication are fundamentally different.

Blocking Communication

The first action in the profiling routines is to execute the communication via the PMPI_ functionality, while measuring the time of the communication in order to compute bandwidths.

int vftr_MPI_Bcast(void *buffer, int count, MPI_Datatype datatype, int root, MPI_Comm comm) {
   long long tstart = vftr_get_runtime_nsec();
   int retVal = PMPI_Bcast(buffer, count, datatype, root, comm);
   long long tend = vftr_get_runtime_nsec();
...

All communications are broken up into pairwise messages in order to gather rank-to-rank information. The precise pattern of pairwise messages depends on the communication routine in question. The pairwise message information is stored in the profile and written to the vfd-files by vftr_store_sync_message_info.

   if (rank == root) {
      int size;
      PMPI_Comm_size(comm, &size);
      for (int i=0; i<size; i++) {
         vftr_store_sync_message_info(send, count, datatype, i, -1, 
                                      comm, tstart, tend);
      }   
   } else {
      vftr_store_sync_message_info(recv, count, datatype, root, -1, 
                                   comm, tstart, tend);
   }

Last, the profiling overhead is added to the profiling information. The overhead of the wrappers and translation layers is ignored, because measuring and storing it would itself introduce a comparatively large overhead.

   long long t2end = vftr_get_runtime_nsec();
   vftr_accumulate_mpiprofiling_overhead(&(my_profile->mpiprof), t2end-t2start);

Non-Blocking Communication

As in the blocking case, the first action is to execute the communication. In contrast to the blocking case, only the start time is needed, because the PMPI_ call returns immediately anyway. The end timestamp for determining the bandwidth is taken later.

int vftr_MPI_Ibcast(void *buffer, int count, MPI_Datatype datatype, int root, MPI_Comm comm, MPI_Request *request) {
   long long tstart = vftr_get_runtime_nsec();
   int retVal = PMPI_Ibcast(buffer, count, datatype, root, comm, request);

Next, all the information needed for the profiling data is copied and stored together with the request by registering it in the open-request list vftrace.mpi_state.open_requests via the vftr_register_collective_request function.

    int *tmpcount = (int*) malloc(sizeof(int)*size);
    MPI_Datatype *tmpdatatype = (MPI_Datatype*) malloc(sizeof(MPI_Datatype)*size);
    int *tmppeer_ranks = (int*) malloc(sizeof(int)*size);
    // messages to be sent
    for (int i=0; i<size; i++) {
       tmpcount[i] = count;
       tmpdatatype[i] = datatype;
       tmppeer_ranks[i] = i;
    }   
    vftr_register_collective_request(send, size, tmpcount, tmpdatatype,
                                     tmppeer_ranks, comm,
                                     *request, 0, NULL, tstart);
    // cleanup temporary arrays
    free(tmpcount);
    tmpcount = NULL;
    free(tmpdatatype);
    tmpdatatype = NULL;
    free(tmppeer_ranks);
    tmppeer_ranks = NULL;

Last, as with the blocking communication profiling, the overhead is measured and accumulated in the profiles.

    long long t2end = vftr_get_runtime_nsec();
    vftr_accumulate_mpiprofiling_overhead(&(my_profile->mpiprof), t2end-t2start);

Persistent Communication

A persistent communication consists of several steps. First, the communication is initialized by defining the buffers, peer ranks, etc., which returns a persistent request. The communication can then be started over and over again with MPI_Start or MPI_Startall by passing that persistent request.
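
For illustration, a minimal stand-alone program using persistent point-to-point communication might look as follows. This example is independent of Vftrace; the tag, message size, and iteration count are arbitrary.

#include <mpi.h>
int main(int argc, char **argv) {
   MPI_Init(&argc, &argv);
   int rank;
   MPI_Comm_rank(MPI_COMM_WORLD, &rank);
   int buf = 0;
   MPI_Request request;
   // initialize the persistent communication once
   if (rank == 0) {
      MPI_Send_init(&buf, 1, MPI_INT, 1, 42, MPI_COMM_WORLD, &request);
   } else if (rank == 1) {
      MPI_Recv_init(&buf, 1, MPI_INT, 0, 42, MPI_COMM_WORLD, &request);
   }
   // start and complete the same persistent request repeatedly
   if (rank < 2) {
      for (int i=0; i<5; i++) {
         MPI_Start(&request);
         MPI_Wait(&request, MPI_STATUS_IGNORE);
      }
      MPI_Request_free(&request);
   }
   MPI_Finalize();
   return 0;
}

Run with at least two ranks, e.g. mpirun -np 2.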

In Vftrace the communication-init routines start by calling the PMPI_ version in order to initialize the communication properly. Timestamps are not necessary, because the communication has not started yet and the call returns immediately.

int vftr_MPI_Send_init(const void *buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm, MPI_Request *request) {
   int retVal = PMPI_Send_init(buf, count, datatype, dest, tag, comm, request);

Next, all arguments are stored in a request, similar to the non-blocking communication case. However, the request is marked as persistent, which becomes important when requests are completed.

   vftr_register_pers_p2p_request(send, count, datatype, dest, tag, comm, *request);

Last, the overhead is accumulated again.

   long long t2end = vftr_get_runtime_nsec();
   vftr_accumulate_mpiprofiling_overhead(&(my_profile->mpiprof), t2end-t2start);
   return retVal;
}

When the persistent communication is actually executed by MPI_Start or MPI_Startall, the previously registered request is activated with the current timestamp as start time after executing the PMPI_ functionality.

int vftr_MPI_Start(MPI_Request *request) {
   long long tstart = vftr_get_runtime_nsec();
   int retVal = PMPI_Start(request);
   long long t2start = vftr_get_runtime_nsec();
   vftr_activate_pers_request(*request, tstart);

Last, the overhead is accumulated again.

   long long t2end = vftr_get_runtime_nsec();
   vftr_accumulate_mpiprofiling_overhead(&(my_profile->mpiprof), t2end-t2start);
   return retVal;
}

A persistent request is activated by locating it in the open-request list, setting its state to active, and storing the timestamp.

void vftr_activate_pers_request(MPI_Request request, long long tstart) {
   // search for request in open request list
   vftr_request_t *matching_request = vftr_search_request(request);
   if (matching_request != NULL) {
      matching_request->active = true;
      matching_request->tstart = tstart;
   }
}

Clearing requests

In MPI every request needs to be finalized by a call to MPI_Wait, MPI_Waitany, MPI_Waitsome, MPI_Waitall, or by a call to MPI_Test, MPI_Testany, MPI_Testsome, MPI_Testall that returns a positive completion flag. Because non-blocking communication runs in the background, the communication may already have finished long before one of these routines is called to finalize it. Therefore Vftrace checks all registered requests for completion every time a profiling function hook is called (see Function Hooks) by calling vftr_clear_completed_requests(). MPI requests can be queried for completion non-destructively with PMPI_Request_get_status. If a request is found to be completed, the end timestamp is taken, the information in the request is accumulated in the appropriate profiling struct and optionally written to the vfd-files.

   bool should_log_message = vftr_should_log_message_info(vftrace.mpi_state, request->rank[i]);
   if (should_log_message) {
      vftr_accumulate_message_info(&(my_profile->mpiprof), request->dir, request->count[i], request->type_idx[i], request->type_size[i], request->rank[i], request->tag, request->tstart, tend);
      if (vftrace.config.sampling.active.value) {
         vftr_write_message_info(request->dir, request->count[i], request->type_idx[i], request->type_size[i], request->rank[i], request->tag, request->tstart, tend, request->callingstackID, request->callingthreadID);
      }
   }

If the request is a persistent one, it is deactivated but stays on the list for further MPI_Start(all) calls. Otherwise it is deleted from the list. The MPI-internal request is not touched by this procedure.
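
A minimal sketch of how such a completion sweep over the open-request list could look is given below. The struct and its field names are illustrative assumptions and do not necessarily match the actual vftr_request_t definition; the accumulation and unlinking steps are only indicated by comments.

#include <stdbool.h>
#include <mpi.h>

// illustrative stand-in for an entry of the open-request list
// (field names are assumptions, not the actual vftr_request_t layout)
typedef struct sketch_request {
   MPI_Request request;
   bool persistent;
   bool active;
   struct sketch_request *next;
} sketch_request_t;

void sketch_clear_completed_requests(sketch_request_t **list_head) {
   sketch_request_t *req = *list_head;
   while (req != NULL) {
      sketch_request_t *next = req->next;
      int flag;
      MPI_Status status;
      // non-destructive completion check of the MPI-internal request
      PMPI_Request_get_status(req->request, &flag, &status);
      if (flag) {
         // here: take the end timestamp, accumulate the message info
         // in the profile, and optionally write it to the vfd-files
         if (req->persistent) {
            // persistent requests stay registered for further MPI_Start calls
            req->active = false;
         } else {
            // non-persistent requests would be unlinked from the list here
         }
      }
      req = next;
   }
}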

Wait/Test

The MPI_Wait* routines in Vftrace are essentially busy loops that repeatedly check the given requests for completion. While waiting, other requests are checked for completion as well. Once the supplied request is completed and all profiling information has been collected, the PMPI_ function is employed to properly treat the MPI-internal request.

int vftr_MPI_Wait(MPI_Request *request, MPI_Status *status) {
   int retVal;
   // loop until the communication corresponding to the request is completed
   int flag = false;
   while (!flag) {
      // check if the communication is finished
      retVal = PMPI_Request_get_status(*request, &flag, status);
      // whether or not this communication is completed,
      // other communications might have completed in the background;
      // clear those from the list of open requests
      vftr_clear_completed_requests_from_wait();
   }
   // properly set the request and status variable
   retVal = PMPI_Wait(request, status);
   return retVal;
}

The MPI_Test* routines work similarly, except that they do not loop. They check once and return.
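
A hedged sketch of how such a Test-style wrapper could be structured is shown below. The function name is hypothetical, and reusing vftr_clear_completed_requests_from_wait here is an assumption borrowed from the wait wrapper above; the actual Vftrace implementation may differ.

int sketch_MPI_Test_wrapper(MPI_Request *request, int *flag, MPI_Status *status) {
   // check once, non-destructively, whether the communication is finished
   int retVal = PMPI_Request_get_status(*request, flag, status);
   // completed requests (this one or others) are cleared from the open-request list
   vftr_clear_completed_requests_from_wait();
   if (*flag) {
      // let MPI properly set the request and status variable
      retVal = PMPI_Test(request, flag, status);
   }
   return retVal;
}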