MPI-3.0 is Coming—an Overview of new (and old) Features

UPDATE: The new MPI-3 book appeared. This book describes all information on this page in a well-written form (including other advanced MPI features) and with examples for practitioners. More information. Direct link to Amazon.

I am involved in the MPI Forum which is developing and ratifying the Message Passing Interface standards. Actually, I managed to attend every single MPI Forum meeting (27 so far) since the Forum reconvened in Jan. 2008 and I also co-authored MPI-2.1 and MPI-2.2.

The MPI Forum strives to release MPI-3.0 asap (which may mean in a year or so ;-) ), so most, if not all significant proposals are in a feature-freeze and polishing stage. I’ll try to summarize the hot topics in MPI-3.0 (in no particular order) here. The Forum is public (join us!) and all meetings and activities are documented at http://meetings.mpi-forum.org/. However, the wiki and meeting structure is hard to follow for people who do not regularly attend the meetings (actually, it’s even hard for people who do so).

The MPI-3.0 ticket process is relatively simple: a ticket is put together by a subgroup or an individual and discussed in the chapter working group. Then it is brought forward for discussion to the full Forum, formally read in a plenary session and voted twice. The reading and the votes happen at different meetings, i.e., a ticket needs at least six months to be ratified (this gives the Forum time to check it for correctness). Non-trivial changes are also not possible after a reading. Both votes have to be passed for the ticket to be ratified. Then, it is integrated into the draft standard by the chapter author(s). Finally, at the end of the process, each chapter is voted by the Forum, and after that (mostly a formality) there will be a vote for the whole standard. Votes are by organization and an organization has to participate regularly in the Forum to be eligible to vote (has to be presented on two of the three meetings before the vote). Input from the public is generally valued and should be communicated through the mailinglist or Forum members.

Keep in mind that this list and the comments are representing my personal view and only the final standard is the last word! You can find the original tickets by appending the ticket ID to https://svn.mpi-forum.org/trac/mpi-forum-web/ticket/ , e.g., https://svn.mpi-forum.org/trac/mpi-forum-web/ticket/109 for nonblocking collective operations.

1) Nonblocking Collective Operations #109

Status: passed

Nonblocking collectives #109 was the first proposal voted into MPI-3.0 more than a year ago (Oct. 2010). The proposal dates back to one of the first meetings in 2008 (I wanted it in MPI-2.2 but we decided to save this “major change” for MPI-3 to make the adoption of MPI-2.2 faster). Since this proposal came first, it was used to define much of the process for MPI-3.0 and it was also probably scrutinized most :-) . Actually, it seems rather simple but there were some subtle corner-cases that needed to be defined. But after all, it allows one to issue “immediate” (that’s where the “I” comes from) collective operations, such as:

MPI_Ibcast(buf, count, type, root, comm, &request);
... // compute
MPI_Wait(&request, &status);

This can be used to overlap computation and communication and enables several use-cases such as software pipelining (cf. Hoefler, Gottschling, Lumsdaine: “Leveraging Non-blocking Collective Communication in High-performance Applications”) or also interesting parallel protocols that require the nonblocking semantics (cf. Hoefler, Siebert, Lumsdaine: “Scalable Communication Protocols for Dynamic Sparse Data Exchange”).

A reference implementation is available with LibNBC and I’m looking forward to optimized platform-specific versions with full asynchronous progression! The latest (svn) version of MPICH2 is already supporting them (while some other MPI implementations are still working on MPI-2.2 compliance).

2) Neighborhood Collectives #258

Status: passed

Neighborhood (formerly aka. sparse) collective operations are extending the distributed graph and Cartesian process topologies with additional communication power. A user can now statically define a communication topology and also perform communication functions between neighbors in this topology! For example:


// create a 3d topology
MPI_Cart_create(comm, 3, {2,2,2}, {1,1,1}, 1, &newcomm);
... // read input data according to process order in newcomm
while(!converged) {
// start neighbor communication
MPI_Ineighbor_alltoall(..., &newcomm, &req);
... // compute inner parts
MPI_Wait(&req, MPI_STATUS_IGNORE);
... // compute outer parts
}

This obviously simplifies the MPI code quite a bit (compared to the old “north, south, west, east” exchanges with extra pack/send/recv/unpack code for each direction) and often improves performance. This can also be nicely combined with MPI datatypes (neighbor_alltoallw) to offer a very high abstraction level. Distributed graph communicators enable the specification of completely arbitrary communication relations. A more complex (application) example is described in Hoefler, Lorenzen, Lumsdaine: “Sparse Non-Blocking Collectives in Quantum Mechanical Calculations”.

The MPI implementation can optimize the topology and the message schedule for those functions in the graph or Cartesian communicator creation call. Optimization opportunities and a neighborhood_reduce call (which the Forum decided to remove from the proposal) are discussed in Hoefler, Traeff: “Sparse Collective Operations for MPI”.

3) Matched probe #38

Status: passed

One of the oldest tickets that we (Doug Gregor, who originally identified the problem when providing C# bindings for MPI, and I) proposed to MPI-2.2. It was deferred to MPI-3.0 for various reasons. This ticket fixes an old bug in MPI-2 where one could not probe for messages in a multi-threaded environment. The issue is somewhat subtle and complex to explain. For a good examples and a description of the complexity of the problem and the performance of the solution, refer to Hoefler, Bronevetsky, Barrett, de Supinski, Lumsdaine: “Efficient MPI Support for Advanced Hybrid Programming Models”.

The new interface works by removing the message at probe time from the matching queue and allowing the receiver to match it later with a special call:


MPI_Mprobe(source, tag, comm, &message, &status);
... // prepare buffer etc.
MPI_Mrecv(buf, count, type, &message, &status);

This avoids “bad” thread interleavings that lead to erroneous receives. Jeff has a good description of the problem in his blog.

4) MPIT Tool Interface #266

Status: passed

The new MPI tool interface allows the MPI implementation to expose certain internal variables, counters, and other states to the user (most likely performance tools). The huge difference to the various predecessor proposals is that it does not impose any specific structure or implementation choice (such as having an eager protocol) on the MPI implementations. Another side-effect of this is that is doesn’t really have to offer anything to the user :-) . However, a “high quality” MPI implementation may use this interface to expose relevant state.

It will certainly be very useful for tools and advanced MPI users to investigate performance issues.

5) C Const Correctness #140

Status: passed

This sounds rather small but came with a major pain to pass it (anybody remembers why?). This ticket basically makes the C interface const-correct, i.e., adds the const qualifier to all C interface functions. All C++ functions already have const qualifiers.

This turns

int MPI_Gather(void* , int, MPI_Datatype, void*, int, MPI_Datatype, int, MPI_Comm);

into

int MPI_Gather(const void* , int, MPI_Datatype, void*, int, MPI_Datatype, int, MPI_Comm);

and thus allows several compiler optimizations and prevents some user errors (produces compiler warnings at least).

6) Updated One Sided Chapter #270

Status: passed

This proposal killed probably half of my free-time in the last years. It started at the Portland Forum meeting in 2009 where another group was proposing to essentially rewrite the MPI-2 One Sided chapter from scratch. I disagreed vehemently because the proposed text would not allow an implementation on systems that were not cache coherent. MPI-2 handled the cache coherency issue very elegantly but was in many places hard to use and even harder to understand.

After a night of re-writing the existing chapter to differentiate between two memory models (essentially “cache-coherent” and “not cache-coherent” in MPI lingo “unified” and “separate” public and private window) a new proposal was born. A subgroup started to bang on it (and erased essentially 80% of the ideas, replacing them with better ones!) and two years later we had what is probably the hairiest part of MPI (memory models are extremely complex). The RMA working group was a lot of fun, many inspiring discussions lead us to a good solution! This chapter was a good example of an excellent group effort!

The new chapter offers:

  • two memory models: one supporting cache-coherent systems (similar to many PGAS languages) and the other one is essentially the “old” MPI-2 model
  • different ordering modes for accumulate accesses (warning: the safe default mode may be easy to reason about but slower)
  • MPI_Win_allocate, a collective window creation function that allocates (potentially symmetric or specialized) memory for faster One Sided access
  • MPI_Win_create_dynamic, a mechanism to create a window that spans the whole address space together with functions to register (MPI_Win_attach) and deregister (MPI_Win_detach) memory locally
  • MPI_Get_accumulate, a fetch-and-accumulate function to atomically fetch and apply an operation to a variable
  • MPI_Fetch_and_op, a more specialized version of MPI_Get_accumulate with less parameters for atomic access to scalars only
  • MPI_Compare_and_swap, a CAS function as we know it from shared memory multiprogramming
  • MPI_R{put,get,accumulate,get_accumulate}, request-based MPI functions for local completion checking without window synchronization functions
  • MPI_Win_{un}lock_all, a function to (un)lock all processes in a window from a single process (not collective!)
  • MPI_Win_flush{_all}, a way to complete all outstanding operations to a specific target process (or all processes). Upon return of this function, the operation completed at the target (either in the private or public window copy)
  • MPI_Win_flush_local{_all}, a function to complete all operations locally to a specified process (or all processes). This does not include remote completion but local buffers can be re-used
  • conflicting accesses are now allowed but the outcome is undefined (and may corrupt the window). This is similar to the C++ memory model

Of course, nobody can understand the power of the new One Sided interface based on this small list without examples. The One Sided working group is working on more documentation and similar posts, I plan to link or mirror them here!

7) Allocating a Shared Memory Window #284

Status: read

Several groups wanted the ability to create shared memory in MPI. This would allow to share data-structures across all MPI processes in a multicore node similarly to OpenMP. However, unlike OpenMP, one would just share single objects (arrays etc.) and not the whole address space. The idea here is to combine this with One Sided and allow to create a window which is accessible with load/store (ISA) instructions to all participating processes.

This extends the already complex One Sided chapter (semantics) with the concept of local and remote memory. The proposal is still under discussion and may change. Currently, one can create such a window with MPI_Win_allocate_shared(size, info, comm, baseptr, win) and then use One Sided synchronization (flush and friends) to access it.

By default, the allocated memory is contiguous across process boundaries (process x’s memory starts after process x-1′s memory ends). The info argument alloc_shared_noncontig can be used to relax this and allow the implementation to allocate memory close to a process (on NUMA systems). Then, the user has to use the function MPI_Win_shared_query() to determine the base address of remote processes’ memory segment.

MPI-3.0 will also offer a special communicator split function that can be used to create a set of communicators which only include processes that can create a shared memory window (i.e., mutually share memory).

8 ) Noncollective Communicator Creation #286

Status: read

A very interesting proposal to allow a group of processes to create a communicator “on their own”, i.e., without involving the full parent communicator. This would ve very useful for MPI fault tolerance, where it could be used to “fix” a broken communicator (create a communicator with less processes). Compare this to Gropp, Lusk: “Fault Tolerance in MPI Programs”. This could be achieved with current functions but would be slow and cumbersome, see Dinan et al.: Noncollective Communicator Creation in MPI.

9) Nonblocking MPI_Comm_dup #168

Status: read

This very simple proposal allows to duplicate communicators in a nonblocking way. This allows to overlap the communicator creation latency and also implement “purely” nonblocking functions without initialization calls (cf. Hoefler, Snir: “Writing Parallel Libraries with MPI – Common Practice, Issues, and Extensions”). There is not much more to say about this simple call :-) .

10) Fortran Bindings #229 (+24 more!)

Status: voted

This is a supposedly simple ticket that was developed to improve the Fortran bindings and add Fortran 2008 bindings. It offers type-safety and tries to resolve the issue with Fortran code movement (relying on Fortran TR 29113). I am not a Fortran expert (preferring C++ for scientific computing instead) so I can’t really speak to it. Jeff has a good post on this.

That’s it! Well, I am 100% sure that I forgot several proposals (some may even be still in the pipeline or below the radar) and I’ll add them here as they show up. We also already postponed several features to MPI-3.1, so the MPI train continues to run.

The Forum is also actively seeking feedback from the community. If you are interested in any of the described features, please give the draft standard a read and let us know if you have concerns (or praise :-) )!

Torsten Hoefler

Some statistics for 2012 – my average speed was 17.4 km/h

I just looked at some statistics from 2011 :-) .

I completed 670 tasks (that were enough effort to put them on my tasklist), about 2 tasks a day. I received (and read) 20555 emails (after filtering mailinglists and spam!), about 56 emails/day, I sent 5688 emails, about 16 emails/day. Way too many, it’s already consuming a significant fraction of my time! I flew a total of approximately 95.000 miles (~176.000 km), which made me travel at an average speed of 17.4 km/h over the year. I feel bad for the caused carbon footprint but this means that I also spent at least 300 hours (about 12 days) in planes (this is a lower bound estimate). Oh well! I hope 2012 gets quieter :-) .