Advanced MPI Programming Tutorial at Supercomputing 2013

Pavan Balaji, Jim Dinan, Rajeev Thakur and I are giving our Advanced MPI Programming tutorial at Supercomputing 2013 on Sunday November 17th.

Are you wondering about the new MPI-3 standard? How does it affect you as a scientific or HPC programmer, and which of the nice new features can you use to make your life easier and your applications faster? Then you should not miss our tutorial.

Our abstract summarizes the main topics:

The vast majority of production parallel scientific applications today use MPI and run successfully on the largest systems in the world. For example, several MPI applications are running at full scale on the Sequoia system (on ~1.6 million cores) and achieving 12 to 14 petaflops of sustained performance. At the same time, the MPI standard itself is evolving (MPI-3 was released late last year) to address the needs and challenges of future extreme-scale platforms as well as applications. This tutorial will cover several advanced features of MPI, including new MPI-3 features, that can help users program modern systems effectively. Using code examples based on scenarios found in real applications, we will cover several topics including efficient ways of doing 2D and 3D stencil computation, derived datatypes, one-sided communication, hybrid (MPI + shared memory) programming, topologies and topology mapping, and neighborhood and nonblocking collectives. Attendees will leave the tutorial with an understanding of how to use these advanced features of MPI and guidelines on how they might perform on different platforms and architectures.

This tutorial is about advanced use of MPI. It will cover several advanced features that are part of
MPI-1 and MPI-2 (derived datatypes, one-sided communication, thread support, topologies and topology
mapping) as well as new features that were recently added to MPI as part of MPI-3 (substantial additions
to the one-sided communication interface, neighborhood collectives, nonblocking collectives, support for
shared-memory programming).

Implementations of MPI-2 are widely available both from vendors and open-source projects. In addition,
the latest release of the MPICH implementation of MPI supports all of MPI-3. Vendor implementations
derived from MPICH will soon support these new features. As a result, users will be able to use in practice
what they learn in this tutorial.

The tutorial will be example driven, reflecting scenarios found in real applications. We will begin with
a 2D stencil computation with a 1D decomposition to illustrate simple Isend/Irecv-based communication.
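
For a taste of what this first version looks like, here is a minimal, self-contained sketch of such an Isend/Irecv halo exchange (array sizes, variable names, and the neighbor setup are placeholders, not the tutorial's actual code):

#include <mpi.h>
#include <stdlib.h>

/* Sketch: 2D stencil with a 1D (row-block) decomposition.  Each rank
 * owns 'local_rows' rows of N doubles plus two ghost rows and swaps
 * its boundary rows with its upper and lower neighbors. */
int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int N = 1024;          /* global width (placeholder)  */
    const int local_rows = 256;  /* rows owned by this rank     */
    double *u = calloc((size_t)(local_rows + 2) * N, sizeof(double));

    int up   = (rank > 0)        ? rank - 1 : MPI_PROC_NULL;
    int down = (rank < size - 1) ? rank + 1 : MPI_PROC_NULL;

    MPI_Request req[4];
    /* receive ghost rows from both neighbors */
    MPI_Irecv(&u[0],                    N, MPI_DOUBLE, up,   0, MPI_COMM_WORLD, &req[0]);
    MPI_Irecv(&u[(local_rows + 1) * N], N, MPI_DOUBLE, down, 1, MPI_COMM_WORLD, &req[1]);
    /* send own boundary rows to both neighbors */
    MPI_Isend(&u[1 * N],          N, MPI_DOUBLE, up,   1, MPI_COMM_WORLD, &req[2]);
    MPI_Isend(&u[local_rows * N], N, MPI_DOUBLE, down, 0, MPI_COMM_WORLD, &req[3]);
    MPI_Waitall(4, req, MPI_STATUSES_IGNORE);

    /* ... apply the stencil to the interior, swap buffers, iterate ... */

    free(u);
    MPI_Finalize();
    return 0;
}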

We will then use a 2D decomposition to illustrate the need for MPI derived datatypes. We will introduce
a simple performance model to demonstrate what performance can be expected and compare it with actual
performance measured on real systems. This model will be used to discuss, evaluate, and motivate the rest
of the tutorial.
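
A minimal sketch of the kind of derived datatype a 2D decomposition calls for: the left/right halos are now strided columns, which MPI_Type_vector can describe without manual packing. (The helper function, its arguments, and the row-major layout with one ghost cell on each side are illustrative assumptions, not the tutorial's code.)

#include <mpi.h>

/* Hypothetical helper: exchange the left/right ghost columns of a
 * (ny+2) x (nx+2) local block stored row-major in 'u'.  'left' and
 * 'right' are the neighbor ranks (MPI_PROC_NULL at the boundary). */
static void exchange_columns(double *u, int nx, int ny,
                             int left, int right, MPI_Comm comm)
{
    MPI_Datatype column_t;
    /* ny blocks of 1 double, separated by one full padded row */
    MPI_Type_vector(ny, 1, nx + 2, MPI_DOUBLE, &column_t);
    MPI_Type_commit(&column_t);

    /* first interior column -> left neighbor; fill right ghost column */
    MPI_Sendrecv(&u[1 * (nx + 2) + 1],      1, column_t, left,  0,
                 &u[1 * (nx + 2) + nx + 1], 1, column_t, right, 0,
                 comm, MPI_STATUS_IGNORE);
    /* last interior column -> right neighbor; fill left ghost column */
    MPI_Sendrecv(&u[1 * (nx + 2) + nx], 1, column_t, right, 1,
                 &u[1 * (nx + 2) + 0],  1, column_t, left,  1,
                 comm, MPI_STATUS_IGNORE);

    MPI_Type_free(&column_t);
}
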
We will use the same 2D stencil example to illustrate various ways of doing one-sided communication in
MPI and discuss the pros and cons of the different approaches as well as regular point-to-point
communication. We will then discuss a 3D stencil without getting into complicated code details.
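
As a rough illustration of one of those variants, here is a sketch of the 1D halo exchange done with MPI_Put and fence synchronization, the simplest of the synchronization modes covered (sizes and neighbor setup are placeholders, not the tutorial's code):

#include <mpi.h>

/* Sketch: the same 1D halo exchange done with one-sided communication,
 * using MPI_Win_fence as the (simplest) synchronization mode. */
int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int N = 1024, local_rows = 256;   /* placeholders */
    double *u;
    MPI_Win win;
    /* expose the whole local array (including ghost rows) in a window */
    MPI_Win_allocate((MPI_Aint)(local_rows + 2) * N * sizeof(double),
                     sizeof(double), MPI_INFO_NULL, MPI_COMM_WORLD, &u, &win);

    int up   = (rank > 0)        ? rank - 1 : MPI_PROC_NULL;
    int down = (rank < size - 1) ? rank + 1 : MPI_PROC_NULL;

    MPI_Win_fence(0, win);
    /* put my top boundary row into the 'up' neighbor's bottom ghost row */
    MPI_Put(&u[1 * N], N, MPI_DOUBLE, up,
            (MPI_Aint)(local_rows + 1) * N, N, MPI_DOUBLE, win);
    /* put my bottom boundary row into the 'down' neighbor's top ghost row */
    MPI_Put(&u[local_rows * N], N, MPI_DOUBLE, down,
            0, N, MPI_DOUBLE, win);
    MPI_Win_fence(0, win);

    /* ... compute on the interior, then repeat ... */

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}
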
We will use examples of distributed linked lists and distributed locks to illustrate some of the new advanced
one-sided communication features, such as the atomic read-modify-write operations.
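
To give a flavor of these atomics without the full linked-list or lock code, here is a hypothetical sketch of a shared counter on rank 0 that every process increments atomically with MPI_Fetch_and_op; the same building blocks (together with MPI_Compare_and_swap) underlie the distributed locks and lists covered in the tutorial.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Win win;
    long *counter;
    /* only rank 0 contributes memory to the window */
    MPI_Aint winsize = (rank == 0) ? sizeof(long) : 0;
    MPI_Win_allocate(winsize, sizeof(long), MPI_INFO_NULL,
                     MPI_COMM_WORLD, &counter, &win);
    if (rank == 0) {
        MPI_Win_lock(MPI_LOCK_EXCLUSIVE, 0, 0, win);
        *counter = 0;                 /* initialize the shared counter */
        MPI_Win_unlock(0, win);
    }
    MPI_Barrier(MPI_COMM_WORLD);

    long one = 1, prev;
    MPI_Win_lock(MPI_LOCK_SHARED, 0, 0, win);
    /* atomically read-and-increment the counter on rank 0 */
    MPI_Fetch_and_op(&one, &prev, MPI_LONG, 0, 0, MPI_SUM, win);
    MPI_Win_unlock(0, win);

    printf("rank %d got ticket %ld\n", rank, prev);

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}
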
We will discuss the support for threads and hybrid programming in MPI and provide two hybrid versions
of the stencil example: MPI+OpenMP and MPI+MPI. The latter uses the new features in MPI-3 for
shared-memory programming. We will also discuss performance and correctness guidelines for hybrid
programming.
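
For the MPI+MPI variant, the core MPI-3 ingredients are MPI_Comm_split_type and MPI_Win_allocate_shared; the following sketch (buffer sizes and the neighbor access are illustrative assumptions) shows how ranks on the same node obtain direct load/store access to each other's memory:

#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    MPI_Comm nodecomm;
    /* communicator containing only the ranks on this shared-memory node */
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                        MPI_INFO_NULL, &nodecomm);

    int nrank;
    MPI_Comm_rank(nodecomm, &nrank);

    const MPI_Aint local_elems = 1024;   /* elements per rank (placeholder) */
    double *mybase;
    MPI_Win win;
    /* one shared segment per rank, allocated collectively on the node */
    MPI_Win_allocate_shared(local_elems * sizeof(double), sizeof(double),
                            MPI_INFO_NULL, nodecomm, &mybase, &win);

    if (nrank > 0) {
        MPI_Aint size;
        int disp_unit;
        double *leftbase;
        /* query the base address of the left on-node neighbor's segment */
        MPI_Win_shared_query(win, nrank - 1, &size, &disp_unit, &leftbase);
        /* leftbase can now be accessed with plain loads and stores,
         * given proper synchronization (e.g., MPI_Win_sync plus a barrier) */
        (void)leftbase;
    }

    MPI_Win_free(&win);
    MPI_Comm_free(&nodecomm);
    MPI_Finalize();
    return 0;
}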

We will introduce process topologies, topology mapping, and the new “neighborhood” collective functions
added in MPI-3. These collectives are particularly intended to support stencil computations in a scalable
manner, both in terms of memory consumption and performance.
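
A minimal sketch of this combination, assuming a non-periodic 2D grid and a fixed halo size (both placeholders): MPI_Cart_create builds the topology (optionally letting the implementation reorder ranks to match the machine), and MPI_Neighbor_alltoall then exchanges one block with each of the four neighbors in a single call.

#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int size;
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int dims[2] = {0, 0}, periods[2] = {0, 0};
    MPI_Dims_create(size, 2, dims);        /* pick a balanced 2D grid */

    MPI_Comm cart;
    /* reorder = 1 allows the implementation to remap ranks onto the machine */
    MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 1, &cart);

    const int HALO = 256;                  /* doubles per neighbor (placeholder) */
    /* neighbor order on a Cartesian communicator: -dim0, +dim0, -dim1, +dim1 */
    double *sendbuf = calloc(4 * HALO, sizeof(double));
    double *recvbuf = calloc(4 * HALO, sizeof(double));

    /* one call replaces four Isend/Irecv pairs and exposes the whole
     * communication pattern to the MPI library */
    MPI_Neighbor_alltoall(sendbuf, HALO, MPI_DOUBLE,
                          recvbuf, HALO, MPI_DOUBLE, cart);

    free(sendbuf);
    free(recvbuf);
    MPI_Comm_free(&cart);
    MPI_Finalize();
    return 0;
}
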
We will conclude with a discussion of other features in MPI-3 not explicitly covered in this tutorial
(interface for tools, Fortran 2008 bindings, etc.) as well as a summary of recent activities of the MPI Forum
beyond MPI-3.

Our planned agenda for the day is:

  1. Introduction (8.30–10.00)
    • Background: What is MPI
    • MPI-1, MPI-2, MPI-3
    • 2D stencil code with 1D decomposition: Isend/Irecv version
    • 2D stencil code with 2D decomposition: Introduce derived datatypes
    • Introduce simple performance modeling and measurement
  2. One-Sided Communication (10.30–12.00)
    • Basics of one-sided communication or remote memory access (RMA)
    • 2D stencil code with 1D decomposition: RMA with 3 forms of synchronization
    • 3D stencil: What changes and what to pay attention to
    • Introduce other features of MPI-3 RMA
    • Linked list or distributed lock example demonstrating new MPI-3 RMA features
  3. Lunch (12.00–1.30)
  4. MPI and Threads (1.30–3.00)
    • What does the MPI standard specify about threads
    • How does it enable hybrid programming
    • Hybrid (MPI+OpenMP) version of 2D stencil code
    • Hybrid (MPI+MPI) version of 2D stencil code using MPI-3 shared-memory support
    • Performance and correctness guidelines for hybrid programming
  5. Topologies, Neighborhood/Nonblocking Collectives (3.30–5.00)
    • Topologies and topology mapping
    • 2D stencil code with 2D decomposition using neighborhood collectives
    • MPI-3 nonblocking collectives with example
    • Summary of other features in MPI-3
    • Summary of recent activities of the MPI Forum
    • Conclusions

We’re looking forward to many interesting discussions!

EuroMPI 2013 & Best Paper Award

EuroMPI is a very nice conference for the specialized sub-field of MPI, the Message Passing Interface. I’m a long-term attendee since I work a lot on MPI and its standardization. We had a little more than 100 attendees this year in Madrid, and the organization was just outstanding!

We listened to 25 paper talks and five invited talks about MPI. For example, Jesper Traeff discussed how to generalize datatypes towards collective operations:

Or Rajeev Thakur, who explained how we will get to exascale and that MPI is essentially ready:

Besides the many great talks, we also had some fun, like the city walking tour organized by the conference

the evening reception, a very nice networking event

or more networking in the Retiro park

followed by the traditional dinner.

On the last day, SPCL’s Timo Schneider presented our award-winning paper on runtime compilation for MPI datatypes

with a provocative start (there were many vendors in the room :-) )

but an agreeing end.

The award ceremony followed right after the talk.

The conference closed with the announcement of next year’s venue: EuroMPI will move to Japan, for the first time outside of Europe.

All in all, a very nice conference! Kudos to the organizers.

The one weird thing about Madrid though … I got hit in the face by a random woman in the subway on my way back. Looks like she claimed I had stolen her seat (not sure why/how that happened and many other seats were empty) but she didn’t speak English and kept swearing at me. Weird people! :-)

First SPCL Excursion on Mount Rigi

Since the lab moved to Switzerland, we decided to do a hiking trip in the pre-Alps. It’s incredibly beautiful and efficient (only 30 mins by car :-) ). We decided to climb Mount Rigi with Timo, Maciej, Tobias, and Natalia. Maciej, our local mountaineer, heroically volunteered to carry all the food up :-) . Unfortunately, we picked a very hot day (about 35 degrees Celsius). It was of course painful but a lot of fun at the same time :-) . Rigi is amazing and Switzerland is absolutely beautiful!

Here are some impressions of our first lab excursion.


This is our complete trip from top to bottom: more than 1 km of height difference over about 13 km. Time to go up: ~3:30 hrs; time to go down: 3:45 hrs. Here are some more detailed stats (pdf).


Before we start! We still look fine :)


The first sign. Btw., Swiss signage is generally horrible; Rigi is OK. We also had multiple cell phones and even multiple Internet connections :-) .


Mac carrying all the stuff, still smiling :-) .


First nice view at a lame altitude.


Getting better ….


First hut, still lame altitude. People are having beers here, so can’t be too hard!


Tracking …


And the path vanished somewhat :-/.


Better views.


Strange upwards path (rather steep).


Yes, Mac is alive.


Even better views, we gained some altitude.


Find Mac


Find Mac (easier)


Resti, 1198 meters.


Nice view (and find Mac)


Lame people take the train …


More views.


First rest – nice shade!


Timo doing well (with his new hat)


Water :-)


Maybe a bit too much.


The whole path was full of barbed wire … it’ll entangle you and rescue you if you fall down :-) .


Yes, even at tight passages … a bit weird.


Discussing animals :-) .


Some locals.


Birds coming rather close. I didn’t have the energy to get the good lens out of the bag, sorry!


More views. The right one could be our next target!


The top … 3hrs upstairs.


And then the lame surprise … completely commercialized, a train station and tons of overweight tourists. Btw., we saw *NOBODY* walking up even though we moved slowly. Some people were coming down but they looked way too fresh (they must have taken the train up).


The train.


Very nice unobstructed views towards Germany.


Lake Zug (I think).


The alps (more hiking!).


Timo found a friend.


Love at first sight (look at the ears).


Our new lab member, donk.


Yes, we made it!


Military drill to pass electrified fences :-) .


Views ….


More locals watching us.


Our hickory fridge. So it was 35 degrees Celsius and we walked for four hours or more. How do you prevent your barbecue meat from going bad? Well, Mac had the idea that saved us: deep-freeze everything and pack six bottles of frozen water into a thermo bag. It worked very well; all the meat was still slightly frozen. Even better, we had nicely chilled water for the way back :-) .


Professional barbecue equipment :) .


Food.


More food.


Cervelas on the fire.


And Hamburgers on the grill.


Goooooood!


Back at the fun :-) .


Good teamwork.


Beautiful views all over.


.. and everything prepared to survive.


On the way back.


More views.


Don’t fall :-) .


Find Tobias and Timo.


Making contact with the locals.


Ohoh …


Is the cow laughing?


Ok, it’s not hostile.


Nature.


Getting late … but still ok.


The target!

And done … it was a rather nice trip. We may do it again :-) . Thanks all for making it so much fun!

Alps view

Since I live slightly north of Zurich, nobody believes me when I claim that I can see the Alps from my bed. But well, it’s true. I have to admit that they look rather small and are about 60 km away, but I can indeed look right into the center of the Alps! The valley between Altholz and Buechholz (which is full of great mushrooms) is nicely aligned and enables the view.

Here is a coarse map:
coarse_view

and a finer one:
fine_view
(credits: Google maps)

The two visible mountains are Gross Ruchen and Gross Windgaellen. Well, they’re visible every two weeks or so, when the weather is good :-) . I’ll see if I can take a picture next time.

Jealous?

You know that a program committee failed if …

I had the worst experience with conference reviews in my short scientific career (I’ll not name names, but it’s an “A”-ranked conference with a reasonable reputation; you’d better ask me over a beer). I’m trying to take it with humor and share some of the funniest parts here.

So, you know that a program committee failed if …

  1. You receive a one-paragraph review which goes like this (a verbatim citation; only the scheme has been replaced to preserve the anonymity of the venue):

    “This paper proposes [technique X]. It is a good idea to use [X]. However, it is difficult to understand that [X works in context Y].”

    Yes, that’s it! The final evaluation was a weak accept.

  2. You submit a paper on a programming environment for HPC and you get a comment like:

    “The importance to the field is fair because programmers are easily able to exploit the optimizations to achieve the better execution time of real applications because the optimization can be reused through standard MPI API and the authors showed the speed-up of real applications including [application X].”

    Yes, the system is considered bad if it’s easy to use, portable and backwards compatible :-) . Reminds me of “Parallel machines are hard to program and we should make them even harder – to keep the riff-raff off them.”

  3. Your paper receives the scores accept, accept, weak accept, and reject with three reasonable, in fact nice, and encouraging reviews. The reject review is completely unreasonable and it criticizes the writing style while having at least one or two English mistakes in *every single sentence* :-) .
  4. You receive (3) and a completely unnecessary and offensive sentence at the end of the reject review which says “The only good thing in this paper is [X]” where [X] is absolutely unrelated (and in fact not even existing or reasonably conceivable).
  5. You receive (4) and rebut the hell out of this completely unreasonable review (which wasn’t even consistent in itself, in addition to being offensive). Assume the rebuttal took you a day since you had to interpret the review’s twisted English and strange criticisms and rebut it in a technical and polite way (which seems hard); AND the rebuttal is *completely* ignored, i.e., the review is neither updated nor do you receive a note from the chair about what happened.

  6. You call up some friends who attended the TPC meeting and (1)-(5) are reinforced.

So, all in all, there is now one more conference that I will not recommend to anyone for a while. On the other hand, I may be spoiled since I received absolutely outstanding reviews for my submissions before that (not all of which were accepted, but most :-) ).

DFSSSP: Fast (high-bandwidth) Deadlock-Free Routing for InfiniBand Networks

The OpenFabrics Alliance just released a new version of the Open Subnet Manager (OpenSM) including our Deadlock-Free SSSP (DFSSSP) routing algorithm for InfiniBand [2]!

This new version fixes several minor bugs, adds support for base/enhanced switch port 0, and further improves the routing performance, but it lacks support for multicast routing (see ‘Update’ below).

DFSSSP is a new routing algorithm that can be used to route InfiniBand networks with OpenSM 3.3.16 [1] and later. It generally performs better than the default Min Hop algorithm and avoids deadlocks by routing through different virtual lanes (VLs). Due to the above-mentioned problems, we do not recommend using the DFSSSP version that is included in the OFED 3.2 and 3.5 releases.

DFSSSP can lead to up to 50% higher routing performance for dense (bisection-limited) communication patterns such as all-to-all, and thus directly accelerates applications with dense communication, such as the Graph500 benchmark [4]. The following figure shows a direct comparison with other routing algorithms on a 726-node cluster running MPI with one process per node (in the 1024-process case, some nodes run two processes):

netgauge_deimos

This comparison uses Netgauge’s effective bisection bandwidth benchmark, an approximation of the real bisection bandwidth of a network.

MPI_Alltoall performance is similarly improved over Min Hop and LASH routing as can be observed in the following figure (using 128 nodes):

mpialltoall

The new DFSSSP algorithm can be used with OpenSM version 3.3.16 by starting it with ‘-R dfsssp’ on the command line or by setting ‘routing_engine dfsssp’ in the configuration file. Besides configuring the routing algorithm, you will have to enable QoS with a uniform distribution (see [A1]) and enable service level query support within your MPI environment (see [A2] for OpenMPI).

You should compare the bandwidth yourself: effective bisection bandwidth and all-to-all can be measured with Netgauge; however, real application measurements are always best!

Now you may be wondering why DFSSSP is faster than Min Hop, since Min Hop already minimizes the number of hops between all endpoints. The trick is that DFSSSP optimizes the *global bandwidth* in addition to the distance between endpoints. This is achieved with a simple greedy algorithm described in detail in [3]. Deadlock-freedom is then added by using different virtual lanes for the communication, as described in [2]. By the way, Min Hop does not guarantee deadlock freedom! If you want to know more, read [2] and [3] or come to the HPC Advisory Council Switzerland Conference 2013 in March, where I’ll give a talk about the principles behind DFSSSP and how to use it in practice.

DFSSSP is developed in a collaboration between Jens Domke (the main developer) at the Tokyo Institute of Technology and Torsten Hoefler of the Scalable Parallel Computing Lab at ETH Zurich.

[1]: opensm-3.3.16.patched.tar.gz
[2]: J. Domke, T. Hoefler and W. Nagel: Deadlock-Free Oblivious Routing for Arbitrary Topologies
[3]: T. Hoefler, T. Schneider and A. Lumsdaine: Optimized Routing for Large-Scale InfiniBand Networks
[4]: Graph 500: www.graph500.org
[5]: openmpi-1.6.4.patched.tar.gz

[A1] Possible QoS configuration for OpenSM + DFSSSP with 8 VLs:

qos TRUE
qos_max_vls 8
qos_high_limit 4
qos_vlarb_high 0:64,1:64,2:64,3:64,4:64,5:64,6:64,7:64
qos_vlarb_low 0:4,1:4,2:4,3:4,4:4,5:4,6:4,7:4
qos_sl2vl 0,1,2,3,4,5,6,7,0,1,2,3,4,5,6,7

[A2] Enable SL queries for the path setup within OpenMPI:

a) configure OpenMPI with "--with-openib --enable-openib-dynamic-sl"
b) run your application with "--mca btl_openib_ib_path_record_service_level 1"

PS: we experienced some trouble with old HCA firmware, which did not support sending userspace MAD requests on VLs other than 0.
You can test the following command (as root) on some nodes and see if you get a response from the subnet manager:

saquery -p --src-to-dst LID1:LID2

In case the command stalls or returns with an error, you might have to update the firmware.

Update:
The fix for multicast routing has been implemented and tested. Please use our patched version of opensm-3.3.16 (see [1]) instead of the default version from the OFED websites. Besides the multicast patch, this version contains a slightly enhanced implementation of the VL balancing. Future releases by the OpenFabrics Alliance (>= 3.3.17) will ship with both patches.
Besides the multicast problem, we have also identified a bug in OpenMPI related to the connection management of the openib BTL. We provide a patched version of OpenMPI as well (see [5]).

Rosca de Reyes

My Mexican friend Edgar told me about a Mexican tradition: eating “Rosca de Reyes” (Kings’ Cake) on January 6th. Since I really enjoy Mexican food and traditions, we just made one! Here’s how it went:

cake1
The dough :-)

cake2
The “rosca”?

cake3
After rising (it grew quite a bit, maybe too much?).

cake4
And the final product! It tasted awesome!

Happy Kings’ Day :-) .

MPI-3.0 (mostly) finalized — public draft available

Finally, after the last meeting in Chicago a couple of weeks ago and some more minor edits, we (the MPI Forum) were able to release our first public draft of MPI-3.0. Jeff has an explanation of why it took a while :-) .



mpi3v2

This draft includes everything that has been voted into MPI-3.0. The standard is closed for major changes, so all features are in place. We put this draft out to the public to allow for comments until the final vote (on each chapter) at the September meeting. We plan to ratify the standard at that meeting if there are no other delays (one never knows!).

Nevertheless, we remain open to changes (especially bug fixes) and feature requests (which will go into future versions, e.g., MPI-3.1). Minor changes that can still influence MPI-3.0 include any kind of bug in the released document (minor or major) and small explanations or additions that don’t change semantics significantly. However, we’re trying to keep the changes to the document minimal, so only absolutely necessary changes will be considered.

The draft is available at http://meetings.mpi-forum.org/draft_standard/mpi3.0_draft_2.pdf.

If you find issues, either contact the Forum member of your choice or broadcast to a larger group via the mailing list (http://lists.mpi-forum.org/mailman/listinfo.cgi/mpi-comments; you need to be subscribed). We accept comments until September 6th, and we thank you for your support!

Happy reviewing and exploring of all the new features! PS: my earlier posts on the new MPI-3.0 features remain completely valid and provide a good overview of what is new in MPI-3.0. If you are interested in learning more about MPI-3, you may also consider looking through some of the tutorial materials that Martin and I provided! We are planning to offer this tutorial at future conferences, so watch out!