Advanced MPI-2.2 and MPI-3.0 Tutorial in Lugano

I gave my first tutorial about advanced MPI-2.2 usage and the upcoming MPI-3.0 standard in Lugano this week. Even though it was a lot of work to prepare the slides and especially the hands-on exercises, I have to say that it was a lot of fun. The user interaction was great and I learned a lot about how (Swiss) applications use MPI and what is required from future interfaces. The people at CSCS are also exceptionally nice and I very much enjoyed dinner with some!

The agenda and the slides of the tutorial are available on my tutorial website for this course. CSCS recorded everything and there may be a slide-cast soon.

The hands-on experience was also great! I was not expecting that so many folks would complete the exercises. I also realized that some things are somewhat non-intuitive in MPI. A good learning experience for me!

I’m looking forward to presenting more of these tutorials! If you missed the one in Lugano, I will present a slightly shorter version of the same tutorial with Martin Schulz at ISC’12.

My time in Lugano was also great, it’s just such a beautiful place:
lugano1

And the train ride back to Zurich was also wonderful. Some impressions:
train1

“Interim Technical Program Manager Applications” for Blue Waters

Actually, I have been a manager since March 1st (I wish I had more time for blogging and other things). Who would have thought that? I was promoted into this interim position after Bob Fiedler, who held it before, decided to retire from UI (he now works for Cray). I will most likely hold this position until I leave for ETH at the end of July, and I see it as a very interesting opportunity to gain some important experience in the five-month period.

One part of this job is to manage the Advanced Application and User Support (AuS) group in the Blue Waters project. This is not your usual run-of-the-mill user support group but 11 domain experts at masters and Ph.D. level who can talk to the application developers and users as peers. Each so-called “point of contact” (PoC) is an expert in a particular domain (e.g., CFD, quantum chemistry, MD, QCD, …) and can advise users at a very high domain-specific level. This is possible because Blue Waters has, as a national resource, a very small user community (approx. 30 teams); each PoC works with 2-3 teams and ensures that the system is used efficiently and effectively. It was already very interesting to get AuS up and running for the first early users in mid-March. We had full support from day one and it went rather smoothly!

I am also responsible for application and benchmark performance and certifying SoW items. This sounds much less exciting than it actually is, well, most of it at least :-). It’s a big and fascinating system and network!

Even though my research output may be hurt for some months (I hope my students keep it up ;-)), I think this is a great opportunity for gaining experience in managing a team of professionals and accepting a large-scale supercomputer. Thanks for your trust NCSA!

And even better, I got a new and awesome office! IMHO, one of the best offices in the building:

office1

office2
A large whiteboard! Finally 🙂

office3
A premium view! Well, sometimes the sun annoys me a bit ;-).

office4
And a great view of the beautiful Siebel Center!

SIAM SIAG/SC Junior Scientist Prize

I received the SIAM Supercomputing Junior Scientist Prize this year! The description of this biennial award (via SIAM) is:

The SIAM Activity Group on Supercomputing (SIAG/SC) Junior Scientist Prize, established in 2009, is awarded to an outstanding junior researcher in the field of algorithms research and development for parallel scientific and engineering computing, for distinguished contributions to the field in the three calendar years prior to the year of the award.

I was invited to the biennial SIAM-PP (Parallel Processing) conference to give a plenary award lecture. I talked about a holistic approach to the optimization of parallel codes, similar to optimizations in serial codes. I hope we will soon be able to automate this process for parallel (at least MPI?) codes!

The presentation is archived at https://live.blueskybroadcast.com/bsb/client/CL_DEFAULT.asp?Client=975312&PCAT=4072&CAT=4075 .

I am obviously very happy that I was selected as the 2012 recipient and want to thank everybody who worked with me (and contributed in this way) over the last years. It was an amazing experience to work with Andrew and colleagues in the Open Systems Lab (now CREST); it certainly widened my horizon significantly. Andrew also always supported me in pursuing crazy and not-so-crazy ideas, and the freedom I had was unparalleled. I was almost sad when I had to leave for UIUC, but everybody has to go (temporarily in the case of OSL :-)), and there was just no way to argue with or resist Marc Snir (if you know him, you know what I am talking about). I am extremely thankful for the many afternoons Marc spent with me talking about parallel computing and providing amazing insights into the background or “structure” of the problems. I am so happy to be able to work with him. At UIUC, Bill Gropp also provided me with a huge amount of guidance and advice (professionally as well as personally) and opportunities to collaborate. I also enjoyed (and enjoy now even more!) the guidance of Bill Kramer in terms of management abilities and problem solving. It’s simply an amazing environment!

Here are some impressions from the award:
audience
The talk was well attended even though it was after 7pm :-).

torsten1_small
Kamesh Madduri and David Bader presented me with the award.

siam-award2
The plaque!

siam-plaque
Another plaque!

MPI-3.0 chugging along

Here are some updates from the March MPI Forum. We decided that the door has closed for new proposals, so MPI-3.0 could be ratified in the December meeting if everything goes well!

Otherwise, we made huge progress on many small things. There were many readings and votes on minor tickets; the results can be found here. The most interesting proposals for me were #284 (Allocate a shared memory window), #286 (Noncollective Communicator Creation), and #168 (Nonblocking Communicator Duplication), which all passed their first vote. The Fortran bindings ticket #229 passed its second vote! Scalable vector collectives (#264) were postponed to the next MPI version because the Forum felt that they would need more investigation of several alternative options.

I explained those and other interesting tickets in my last post on MPI-3.0.

We also made substantial progress on Fault Tolerance (which remains a controversial topic for several reasons) and a lot of cleanup (thanks Rolf!). The next meeting in Japan will be exciting!

MPI-3.0 is Coming—an Overview of new (and old) Features

UPDATE: The new MPI-3 book appeared. This book describes all information on this page in a well-written form (including other advanced MPI features) and with examples for practitioners. More information. Direct link to Amazon.

I am involved in the MPI Forum which is developing and ratifying the Message Passing Interface standards. Actually, I managed to attend every single MPI Forum meeting (27 so far) since the Forum reconvened in Jan. 2008 and I also co-authored MPI-2.1 and MPI-2.2.

The MPI Forum strives to release MPI-3.0 asap (which may mean in a year or so ;-)), so most, if not all, significant proposals are in a feature-freeze and polishing stage. I’ll try to summarize the hot topics in MPI-3.0 (in no particular order) here. The Forum is public (join us!) and all meetings and activities are documented at http://meetings.mpi-forum.org/. However, the wiki and meeting structure are hard to follow for people who do not regularly attend the meetings (actually, it’s even hard for people who do so).

The MPI-3.0 ticket process is relatively simple: a ticket is put together by a subgroup or an individual and discussed in the chapter working group. Then it is brought forward for discussion to the full Forum, formally read in a plenary session, and voted on twice. The reading and the votes happen at different meetings, i.e., a ticket needs at least six months to be ratified (this gives the Forum time to check it for correctness). Non-trivial changes are also not possible after a reading. Both votes have to pass for the ticket to be ratified. Then, it is integrated into the draft standard by the chapter author(s). Finally, at the end of the process, each chapter is voted on by the Forum, and after that (mostly a formality) there will be a vote on the whole standard. Votes are by organization, and an organization has to participate regularly in the Forum to be eligible to vote (it has to be present at two of the three meetings before the vote). Input from the public is generally valued and should be communicated through the mailing list or Forum members.

Keep in mind that this list and the comments are representing my personal view and only the final standard is the last word! You can find the original tickets by appending the ticket ID to https://svn.mpi-forum.org/trac/mpi-forum-web/ticket/ , e.g., https://svn.mpi-forum.org/trac/mpi-forum-web/ticket/109 for nonblocking collective operations.

1) Nonblocking Collective Operations #109

Status: passed

Nonblocking collectives #109 was the first proposal voted into MPI-3.0 more than a year ago (Oct. 2010). The proposal dates back to one of the first meetings in 2008 (I wanted it in MPI-2.2 but we decided to save this “major change” for MPI-3 to make the adoption of MPI-2.2 faster). Since this proposal came first, it was used to define much of the process for MPI-3.0 and it was also probably scrutinized most :-). Actually, it seems rather simple but there were some subtle corner-cases that needed to be defined. But after all, it allows one to issue “immediate” (that’s where the “I” comes from) collective operations, such as:

MPI_Ibcast(buf, count, type, root, comm, &request);
... // compute
MPI_Wait(&request, &status);

This can be used to overlap computation and communication and enables several use-cases such as software pipelining (cf. Hoefler, Gottschling, Lumsdaine: “Leveraging Non-blocking Collective Communication in High-performance Applications”) or also interesting parallel protocols that require the nonblocking semantics (cf. Hoefler, Siebert, Lumsdaine: “Scalable Communication Protocols for Dynamic Sparse Data Exchange”).

A reference implementation is available with LibNBC and I’m looking forward to optimized platform-specific versions with full asynchronous progression! The latest (svn) version of MPICH2 is already supporting them (while some other MPI implementations are still working on MPI-2.2 compliance).

2) Neighborhood Collectives #258

Status: passed

Neighborhood (formerly known as sparse) collective operations extend the distributed graph and Cartesian process topologies with additional communication power. A user can now statically define a communication topology and also perform communication functions between neighbors in this topology! For example:


// create a 3d topology (2x2x2, periodic in all dimensions)
int dims[3] = {2,2,2}, periods[3] = {1,1,1};
MPI_Cart_create(comm, 3, dims, periods, 1, &newcomm);
... // read input data according to process order in newcomm
while(!converged) {
  // start neighbor communication
  MPI_Ineighbor_alltoall(..., newcomm, &req);
  ... // compute inner parts
  MPI_Wait(&req, MPI_STATUS_IGNORE);
  ... // compute outer parts
}

This obviously simplifies the MPI code quite a bit (compared to the old “north, south, west, east” exchanges with extra pack/send/recv/unpack code for each direction) and often improves performance. This can also be nicely combined with MPI datatypes (neighbor_alltoallw) to offer a very high abstraction level. Distributed graph communicators enable the specification of completely arbitrary communication relations. A more complex (application) example is described in Hoefler, Lorenzen, Lumsdaine: “Sparse Non-Blocking Collectives in Quantum Mechanical Calculations”.

The MPI implementation can optimize the topology and the message schedule for those functions in the graph or Cartesian communicator creation call. Optimization opportunities and a neighborhood_reduce call (which the Forum decided to remove from the proposal) are discussed in Hoefler, Traeff: “Sparse Collective Operations for MPI”.

3) Matched probe #38

Status: passed

This is one of the oldest tickets; we (Doug Gregor, who originally identified the problem when providing C# bindings for MPI, and I) proposed it for MPI-2.2. It was deferred to MPI-3.0 for various reasons. This ticket fixes an old bug in MPI-2 where one could not probe for messages in a multi-threaded environment. The issue is somewhat subtle and complex to explain. For good examples and a description of the complexity of the problem and the performance of the solution, refer to Hoefler, Bronevetsky, Barrett, de Supinski, Lumsdaine: “Efficient MPI Support for Advanced Hybrid Programming Models”.

The new interface works by removing the message at probe time from the matching queue and allowing the receiver to match it later with a special call:


MPI_Mprobe(source, tag, comm, &message, &status);
... // prepare buffer etc.
MPI_Mrecv(buf, count, type, &message, &status);

This avoids “bad” thread interleavings that lead to erroneous receives. Jeff has a good description of the problem in his blog.

4) MPIT Tool Interface #266

Status: passed

The new MPI tool interface allows the MPI implementation to expose certain internal variables, counters, and other states to the user (most likely performance tools). The huge difference compared with the various predecessor proposals is that it does not impose any specific structure or implementation choice (such as having an eager protocol) on the MPI implementations. A side effect of this is that it doesn’t really have to offer anything to the user :-). However, a “high quality” MPI implementation may use this interface to expose relevant state.

It will certainly be very useful for tools and advanced MPI users to investigate performance issues.

5) C Const Correctness #140

Status: passed

This sounds rather small, but it was a major pain to get it passed (does anybody remember why?). This ticket basically makes the C interface const-correct, i.e., it adds the const qualifier to the input (read-only) arguments of the C interface functions. All C++ functions already have const qualifiers.

This turns

int MPI_Gather(void* , int, MPI_Datatype, void*, int, MPI_Datatype, int, MPI_Comm);

into

int MPI_Gather(const void* , int, MPI_Datatype, void*, int, MPI_Datatype, int, MPI_Comm);

and thus allows several compiler optimizations and prevents some user errors (produces compiler warnings at least).
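The practical effect can be shown without MPI at all. In the sketch below, copy_like_gather is a hypothetical stand-in for a send-style routine (it is not an MPI function): its input buffer is const-qualified, so read-only data can be passed without a cast, and an accidental write through the input pointer is rejected by the compiler.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical stand-in for a send-style routine: the input buffer is
 * const-qualified, the output buffer is not. */
void copy_like_gather(const void *sendbuf, void *recvbuf, size_t bytes)
{
    assert(sendbuf != NULL && recvbuf != NULL);
    memcpy(recvbuf, sendbuf, bytes);
    /* Writing through sendbuf here would not compile without an explicit
     * cast that removes the const qualifier. */
}

/* Read-only data can be passed directly, without casting away const. */
static const int table[4] = {1, 2, 3, 4};

int first_after_copy(void)
{
    int out[4];
    copy_like_gather(table, out, sizeof(table));
    return out[0];
}
```

With the old non-const prototype, passing table would at least trigger a warning about discarding the qualifier, which is why many codes carried casts on every send call.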

6) Updated One Sided Chapter #270

Status: passed

This proposal probably killed half of my free time in the last years. It started at the Portland Forum meeting in 2009, where another group was proposing to essentially rewrite the MPI-2 One Sided chapter from scratch. I disagreed vehemently because the proposed text would not allow an implementation on systems that were not cache coherent. MPI-2 handled the cache coherency issue very elegantly but was in many places hard to use and even harder to understand.

After a night of rewriting the existing chapter to differentiate between two memory models (essentially “cache-coherent” and “not cache-coherent”, or in MPI lingo “unified” and “separate” public and private window), a new proposal was born. A subgroup started to bang on it (and erased essentially 80% of the ideas, replacing them with better ones!) and two years later we had what is probably the hairiest part of MPI (memory models are extremely complex). The RMA working group was a lot of fun, and many inspiring discussions led us to a good solution! This chapter was a good example of an excellent group effort!

The new chapter offers:

  • two memory models: one supporting cache-coherent systems (similar to many PGAS languages) and the other one is essentially the “old” MPI-2 model
  • different ordering modes for accumulate accesses (warning: the safe default mode may be easy to reason about but slower)
  • MPI_Win_allocate, a collective window creation function that allocates (potentially symmetric or specialized) memory for faster One Sided access
  • MPI_Win_create_dynamic, a mechanism to create a window that spans the whole address space together with functions to register (MPI_Win_attach) and deregister (MPI_Win_detach) memory locally
  • MPI_Get_accumulate, a fetch-and-accumulate function to atomically fetch and apply an operation to a variable
  • MPI_Fetch_and_op, a more specialized version of MPI_Get_accumulate with fewer parameters for atomic access to scalars only
  • MPI_Compare_and_swap, a CAS function as we know it from shared memory multiprogramming
  • MPI_R{put,get,accumulate,get_accumulate}, request-based MPI functions for local completion checking without window synchronization functions
  • MPI_Win_{un}lock_all, a function to (un)lock all processes in a window from a single process (not collective!)
  • MPI_Win_flush{_all}, a way to complete all outstanding operations to a specific target process (or all processes). Upon return of this function, the operations have completed at the target (either in the private or public window copy)
  • MPI_Win_flush_local{_all}, a function to complete all operations locally to a specified process (or all processes). This does not include remote completion but local buffers can be re-used
  • conflicting accesses are now allowed but the outcome is undefined (and may corrupt the window). This is similar to the C++ memory model

Of course, nobody can understand the power of the new One Sided interface based on this small list without examples. The One Sided working group is working on more documentation and similar posts, I plan to link or mirror them here!

7) Allocating a Shared Memory Window #284

Status: read

Several groups wanted the ability to create shared memory in MPI. This would allow sharing data structures across all MPI processes in a multicore node, similar to OpenMP. However, unlike OpenMP, one would just share single objects (arrays etc.) and not the whole address space. The idea here is to combine this with One Sided and allow creating a window that is accessible with load/store (ISA) instructions to all participating processes.

This extends the already complex One Sided chapter (semantics) with the concept of local and remote memory. The proposal is still under discussion and may change. Currently, one can create such a window with MPI_Win_allocate_shared(size, info, comm, baseptr, win) and then use One Sided synchronization (flush and friends) to access it.

By default, the allocated memory is contiguous across process boundaries (process x’s memory starts after process x-1’s memory ends). The info argument alloc_shared_noncontig can be used to relax this and allow the implementation to allocate memory close to a process (on NUMA systems). Then, the user has to use the function MPI_Win_shared_query() to determine the base address of remote processes’ memory segment.

MPI-3.0 will also offer a special communicator split function that can be used to create a set of communicators which only include processes that can create a shared memory window (i.e., mutually share memory).

8) Noncollective Communicator Creation #286

Status: read

A very interesting proposal to allow a group of processes to create a communicator “on their own”, i.e., without involving the full parent communicator. This would be very useful for MPI fault tolerance, where it could be used to “fix” a broken communicator (create a communicator with fewer processes). Compare this to Gropp, Lusk: “Fault Tolerance in MPI Programs”. This could be achieved with current functions but would be slow and cumbersome, see Dinan et al.: “Noncollective Communicator Creation in MPI”.

9) Nonblocking MPI_Comm_dup #168

Status: read

This very simple proposal allows duplicating communicators in a nonblocking way. This makes it possible to overlap the communicator creation latency and also to implement “purely” nonblocking functions without initialization calls (cf. Hoefler, Snir: “Writing Parallel Libraries with MPI – Common Practice, Issues, and Extensions”). There is not much more to say about this simple call :-).

10) Fortran Bindings #229 (+24 more!)

Status: voted

This is a supposedly simple ticket that was developed to improve the Fortran bindings and add Fortran 2008 bindings. It offers type-safety and tries to resolve the issue with Fortran code movement (relying on Fortran TR 29113). I am not a Fortran expert (preferring C++ for scientific computing instead) so I can’t really speak to it. Jeff has a good post on this.

That’s it! Well, I am 100% sure that I forgot several proposals (some may even be still in the pipeline or below the radar) and I’ll add them here as they show up. We also already postponed several features to MPI-3.1, so the MPI train continues to run.

The Forum is also actively seeking feedback from the community. If you are interested in any of the described features, please give the draft standard a read and let us know if you have concerns (or praise :-))!

Torsten Hoefler

Some statistics for 2011 – my average speed was 17.4 km/h

I just looked at some statistics from 2011 :-).

I completed 670 tasks (that were enough effort to put them on my task list), about 2 tasks a day. I received (and read) 20555 emails (after filtering mailing lists and spam!), about 56 emails/day, and sent 5688 emails, about 16 emails/day. Way too many; email is already consuming a significant fraction of my time! I flew a total of approximately 95.000 miles (~153.000 km), which made me travel at an average speed of 17.4 km/h over the year. I feel bad about the resulting carbon footprint, and this also means that I spent at least 300 hours (about 12 days) in planes (this is a lower-bound estimate). Oh well! I hope 2012 gets quieter :-).

SJC TSA: “Flight crews on duty don’t go through the scanner”

Another WOW from TSA. So I was on my usual travel from San Jose back home. As usual, I refused to go through the millimeter wave detection scanner (also known as “cancer machine”). While being searched and patted down, I saw a full flight crew walking through the metal detector instead of the empty (!!!) scanner (there was no line at all).

They were also neither patted down nor searched. So I asked about that. The agent replied “Flight crews on duty never go through the scanner.”. First, I asked how they know that they’re flight crews (cf. “Catch me if you can”). Well, apparently a uniform and some form of plastic badge is ok (no list or barcode or anything). I didn’t get a reply to my question “why” (especially if there was no line!) :-).

I guess people dressed up in uniforms with some form of airline ID are much more trustworthy than other fellow travelers. Also, this may be some form of admitting that those scanners are affecting people’s health. Well well.

Now it’s official!

ETH also announced it in English and German.

Even the hpc-ch blog and Inside-IT.ch reported on it.

The ETH Board has appointed the following individuals as professors: Torsten Hoefler […], currently Adjunct Assistant Professor at the University of Illinois in Urbana-Champaign, USA, as Assistant Professor (Tenure Track) of Computational Science. Torsten Hoefler is internationally regarded as one of the leading young scientists in the field of high-performance computing. At the University of Illinois, he is currently involved in the development of one of the world’s most efficient supercomputers. His research interests focus on system design, programming and efficiency analysis. Torsten Hoefler will provide the Department of Computer Science, the research focus “Scientific Computing and Simulation” and the CSCS (Swiss National Supercomputing Centre) with important stimuli. […]

HPC and Supercomputing Conference and Journal Ranking

Ranking conferences and journals is indeed a complex task. Different metrics exist, and so does a plethora of free and commercial rankings. My favorite ranking so far was the AUS conference ranking, which is based mostly on opinions of researchers. While this is probably the best metric, it can be very biased (Australian researchers only?) and some conferences (such as Euro-Par) are just not listed. Another metric, the average citations per paper (often called “impact factor”), may be useful. While this may be biased towards older conferences (which may have higher rankings by this metric), it also shows how many active researchers follow a particular conference series.

I listed some conferences and journals that are in the HPC or Supercomputing field (I published in most of them) with their average citations per paper and their AUS ranking. This data was mainly for my own reference; however, several people asked me to publish this list, so I am doing that here. This list is not intended to be complete and, most importantly, the rankings do not represent my personal opinion (you have to ask me for this :-)); all data is based on the database from Microsoft Academic Search, queried on Dec. 3rd 2011. The age of the conference (as shown in MS Academic Search) is in brackets. Feel free to contact me if you think that anything is missing, incomplete or erroneous.

Conferences

Name | avg. citations per paper (age) | AUS Ranking
SPAA | 16.0 (22) | A
HPDC | 14.7 (19) | A
Supercomputing | 14.4 (27) | A
PPoPP | 14.3 (27) | A
ACM ICS | 13.3 (33) | A
PACT | 12.8 (16) | A
LCPC | 11.5 (22) | A
Hot Interconnects | 10.5 (20) | B
ICPP | 6.6 (36) | A
CCGRID | 6.0 (10) | A
IPDPS | 5.2 (20) | A
IEEE Cluster | 4.1 (13) | A
Euro-Par | 3.3 (16) | not rated
EuroMPI | 3.2 (15) | C
ISPDC | 3.1 (15) | C
PARA | 2.4 (17) | not rated
HIPC | 2.4 (12) | A
VECPAR | 1.8 (15) | B
ICCS | 1.6 (11) | A
PARCO | 1.6 (18) | C
ISPA | 1.0 (8) | B
HPCC | 0.9 (17) | B

Related Conferences

Name | avg. citations per paper (age) | AUS Ranking
SIGCOMM | 40.8 (41) | A
ISCA | 30.8 (37) | A
PODC | 20.6 (28) | A
SIGMETRICS | 19.6 (38) | not rated
CGO | 13.38 (8) | not rated
Comp. Frontiers | 3.97 (6) | not rated
ARCS | 1.2 (41) | not rated

Journals

Name | avg. citations per paper (age) | AUS Ranking
TPDS | 14.3 (26) | A*
IJHPCA | 12.5 (24) | B
COMPUTER | 11.7 (41) | B (not sure)
JPDC | 9.0 (28) | A*
CONCURRENCY | 8.0 (22) | A
Elsevier PARCO | 6.9 (27) | A
Cluster Computing | 6.2 (18) | not rated
FGCS | 4.7 (27) | A
Journal of Supercomputing | 4.4 (27) | B
PPL | 4.2 (27) | B
IJPEDS | 2.6 (17) | B
SIMPRA | 1.7 (17) | not rated

Black Friday

Well, Black Friday is one of those things one has to do while living in the US. I have failed so far … but this year it shall be different (after a delicious Thanksgiving meal in Chicago). The problem is that camping on the street the night before, waiting in line forever, and fighting for the very best deals isn’t quite my thing. But there is a good trick to avoid all this and still get reasonable deals if you’re living in a village like Champaign :-).

So I got a brand-new 32 inch Toshiba flat-screen TV for just $260 (for the Europeans, that’s about 190 EUR) instead of $340 … I think it’s an amazing deal. The way to get it without a line is to drive to a small shop in a small town which still has one or two Black Friday deals. In this case, it was a very small Radio Shack in Savoy, IL. Worked great!

toshiba

The TV even has two HDMI and one VGA input. Actually, I’m not really using it as TV anyway, just as a very large (and cheap) monitor :-).