EuroMPI 2013 & Best Paper Award

EuroMPI is a very nice conference for a specialized sub-field: MPI, the Message Passing Interface. I’m a long-term attendee since I work a lot on MPI and its standardization. We had a little more than 100 attendees this year in Madrid and the organization was just outstanding!

We listened to 25 paper talks and five invited talks around MPI. For example, Jesper Traeff, who discussed how to generalize datatypes towards collective operations:

Or Rajeev Thakur, who explained how we get to Exascale and that MPI is essentially ready:

Besides the many great talks, we also had some fun, like the city walking tour organized by the conference

the evening reception, a very nice networking event

or more networking in the Retiro park

followed by the traditional dinner.

On the last day, SPCL’s Timo Schneider presented our award-winning paper on runtime compilation for MPI datatypes

with a provocative start (there were many vendors in the room :-)

but an agreeing end.

The award ceremony followed right after the talk.

The conference later closed with the announcement for next year, when EuroMPI will move to Japan (for the first time outside of Europe).

After all, a very nice conference! Kudos to the organizers.

The one weird thing about Madrid though … I got hit in the face by a random woman in the subway on my way back. Looks like she claimed I had stolen her seat (not sure why/how that happened and many other seats were empty) but she didn’t speak English and kept swearing at me. Weird people! :-)

First SPCL Excursion on Mount Rigi

Since the lab moved to Switzerland, we decided to do a hiking trip in the pre-Alps. It’s incredibly beautiful and efficient (only 30 mins by car :-) ). We decided to climb Mount Rigi with Timo, Maciej, Tobias, and Natalia. Maciej, our local mountaineer, heroically volunteered to carry all the food up :-) . Unfortunately, we picked a very hot day (about 35 Celsius). It was of course painful but a lot of fun at the same time :-) . Rigi is amazing and Switzerland is absolutely beautiful!

Here are some impressions of our first lab excursion.

This is our complete trip from top to bottom. More than 1 km height difference over about 13 km. Time to go up ~3:30hrs, time to go down 3:45hrs. Here are some more detailed stats (pdf).

Before we start! We still look fine :)

The first sign, btw., Swiss signage is generally horrible. Rigi is ok, we also had multiple cell phones and even multiple Internet connections :-) .

Mac carrying all the stuff, still smiling :-) .

First nice view at a lame altitude.

Getting better ….

First hut, still lame altitude. People are having beers here, so can’t be too hard!

Tracking …

And the path vanished somewhat :-/.

Better views.

Strange upwards path (rather steep).

Yes, Mac is alive.

Even better views, we gained some altitude.

Find Mac

Find Mac (easier)

Resti, 1198 meters.

Nice view (and find Mac)

Lame people take the train …

More views.

First rest – nice shade!

Timo doing well (with his new hat)

Water :-)

Maybe a bit too much.

The whole path was full of barbed wire … it’ll entangle you and rescue you if you fall down :-) .

Yes, even at tight passages … a bit weird.

Discussing animals :-) .

Some locals.

Birds coming rather close. I didn’t have the energy to get the good lens out of the bag, sorry!

More views. The right one could be our next target!

The top … 3hrs upstairs.

And then the lame surprise … completely commercialized, a train station and tons of overweight tourists. Btw., we saw *NOBODY* walking up even though we moved slowly. Some people were coming down but they looked way too fresh (they must have taken the train up).

The train.

Very nice unobstructed views towards Germany.

Lake Zug (I think).

The alps (more hiking!).

Timo found a friend.

Love at first sight (look at the ears).

Our new lab member, donk.

Yes, we made it!

Military drill to pass electrified fences :-) .

Views ….

More locals watching us.

Our hickory fridge. So it was 35 Celsius and we walked for four hours or more. How do you prevent your barbecue meat from going bad? Well, Mac had the rescuing idea: deep-freeze everything and pack six bottles of frozen water into a thermo-bag. It lasted very well, all meat was still slightly frozen. Even better, we had nicely chilled water for the way back :-) .

Professional barbecue equipment :) .


More food.

Cervelas on the fire.

And Hamburgers on the grill.


Back at the fun :-) .

Good teamwork.

Beautiful views all over.

.. and everything prepared to survive.

On the way back.

More views.

Don’t fall :-) .

Find Tobias and Timo.

Making contact with the locals.

Ohoh …

Is the cow laughing?

Ok, it’s not hostile.


Getting late … but still ok.

The target!

And done … it was a rather nice trip. We may do it again :-) . Thanks all for making it so much fun!

Alps view

Since I’m living slightly north of Zurich, nobody believes me when I claim that I can see the Alps from my bed. But well, it’s true. I have to admit that they’re rather small and about 60 km away but I can indeed look right into the center of the Alps! The valley between Altholz and Buechholz (which is full of great mushrooms) is nicely aligned and enables the view.

Here is a coarse map:

and a finer one:
(credits: Google maps)

The two visible mountains are Gross Ruchen and Gross Windgaellen. Well, they’re visible every two weeks, when the weather is good :-) . I’ll see if I can take a picture next time.


You know that a program committee failed if …

I had the worst experience with conference reviews in my short scientific career (I’ll not name names but it’s an “A” ranked conference with a reasonable reputation, you better ask me over a beer). I’m trying to take it with humor and share some of the funniest parts here.

So, you know that a program committee failed if …

  1. You receive a one-paragraph review which goes like (this is an original citation, only the scheme has been replaced to guarantee anonymity of the venue):

    “This paper proposes [technique X]. It is a good idea to use [X]. However, it is difficult to understand that [X works in context Y].”

    Yes, that’s it! The final evaluation was a weak accept.

  2. You submit a paper on a programming environment for HPC and you get a comment like:

    “The importance to the field is fair because programmers are easily able to exploit the optimizations to achieve the better execution time of real applications because the optimization can be reused through standard MPI API and the authors showed the speed-up of real applications including [application X].”

    Yes, the system is considered bad if it’s easy to use, portable and backwards compatible :-) . Reminds me of “Parallel machines are hard to program and we should make them even harder – to keep the riff-raff off them.”

  3. Your paper receives the scores accept, accept, weak accept, and reject with three reasonable, in fact nice, and encouraging reviews. The reject review is completely unreasonable and it criticizes the writing style while having at least one or two English mistakes in *every single sentence* :-) .
  4. You receive (3) and a completely unnecessary and offensive sentence at the end of the reject review which says “The only good thing in this paper is [X]” where [X] is absolutely unrelated (and in fact not even existing or reasonably conceivable).
  5. You received (4) and rebutted the hell out of this completely unreasonable review (which wasn’t even consistent in itself in addition to being offensive). Assume the rebuttal took you a day since you had to interpret the review’s twisted English and strange criticisms and rebut it in a technical and polite way (which seems hard); AND the rebuttal was *completely* ignored, i.e., neither the review was updated nor did you receive a note from the chair about what happened.

  6. You call up some friends who attended the TPC meeting and (1)-(5) are reinforced.

So after all, there is now one more conference that I may not recommend to anyone for a while. On the other hand, I may be spoiled since I received just absolutely outstanding reviews for the submissions before that (where not all were accepted, but most :-) ).

DFSSSP: Fast (high-bandwidth) Deadlock-Free Routing for InfiniBand Networks

The Open Fabrics Alliance just released a new version of the Open Subnet Manager including our Deadlock-Free SSSP Routing for InfiniBand (DFSSSP) routing algorithm [2]!

This new version fixes several minor bugs, adds the support for base/enhanced switch port 0 and improves the routing performance further, but lacks support for multicast routing (see ‘Update’ below).

DFSSSP is a new routing algorithm that can be used to route InfiniBand networks with OpenSM 3.3.16 [1] and later. It generally performs better than the default Min Hop algorithm and avoids deadlocks by routing through different virtual lanes (VLs). Due to the above-mentioned problems, we don’t recommend using the DFSSSP routing algorithm that is included in the OFED 3.2 and 3.5 releases.

DFSSSP can lead to up to 50% higher routing performance for dense (bisection-limited) communication patterns such as all-to-all, and thus directly accelerates dense communication applications such as the Graph500 benchmark [4]. The following figure shows a direct comparison with other routing algorithms on a 726 node cluster running MPI with one process per node (in the 1024 process case, some nodes have two processes):


This comparison uses Netgauge’s effective bisection bandwidth benchmark, an approximation of the real bisection bandwidth of a network.
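To give a feeling for what such a measurement looks at, here is a minimal Python sketch of the idea behind effective bisection bandwidth: split the nodes into two random halves, let each node talk to one partner in the other half, and average the bandwidth the pairs would get over many such random patterns. This is a toy model, not Netgauge's code: the function names, the `route` callback, and the congestion model (a link is shared equally by all flows on it, and a flow is limited by its most congested link) are assumptions made up for this illustration.

```python
import random
from collections import Counter


def random_bisection(nodes, rng):
    """Split the nodes into two random halves and pair them up across the halves."""
    shuffled = list(nodes)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return list(zip(shuffled[:half], shuffled[half:2 * half]))


def effective_bisection_bandwidth(nodes, route, samples=1000, seed=0):
    """Average per-pair bandwidth (fraction of link speed) over random bisections.

    route(src, dst) returns the list of (hashable) links used from src to dst.
    Toy congestion model (an assumption of this sketch): every link is shared
    equally by the flows crossing it; a flow gets the share of its worst link.
    """
    rng = random.Random(seed)
    total, count = 0.0, 0
    for _ in range(samples):
        pairs = random_bisection(nodes, rng)
        paths = [route(src, dst) for src, dst in pairs]
        load = Counter(link for path in paths for link in path)
        for path in paths:
            worst = max((load[link] for link in path), default=1)
            total += 1.0 / worst
            count += 1
    return total / count


# Hypothetical toy topologies: four hosts on one switch, or two hosts on each
# of two switches that are joined by a single trunk link per direction.
def single_switch_route(src, dst):
    return [(src, "up"), (dst, "down")]


def two_switch_route(src, dst):
    path = [(src, "up")]
    if src // 2 != dst // 2:  # hosts 0,1 sit on switch 0; hosts 2,3 on switch 1
        path.append(("trunk", src // 2, dst // 2))
    path.append((dst, "down"))
    return path


if __name__ == "__main__":
    hosts = [0, 1, 2, 3]
    print("single switch:", effective_bisection_bandwidth(hosts, single_switch_route))
    print("two switches: ", effective_bisection_bandwidth(hosts, two_switch_route))
```

In this model the single switch never congests, while the two-switch setup loses bandwidth whenever both pairs have to cross the trunk in the same direction. A better routing algorithm shows up as fewer such collisions on a real topology.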

MPI_Alltoall performance is similarly improved over Min Hop and LASH routing as can be observed in the following figure (using 128 nodes):


The new DFSSSP algorithm can be used with OpenSM version 3.3.16 by starting it with ‘-R dfsssp’ on the command line or by setting ‘routing_engine dfsssp’ in the configuration file. Besides the configuration of the routing algorithm, you will have to enable QoS with a uniform distribution (see [A1]) and you will have to enable service level query support within your MPI environment (see [A2] for OpenMPI).

You should compare the bandwidth yourself. Effective bisection bandwidth and all-to-all can be measured with Netgauge; however, real application measurements are always best!

Now you may be wondering why DFSSSP is faster than Min Hop since Min Hop is already minimizing the number of hops between all endpoints. The trick is that DFSSSP optimizes the *global bandwidth* in addition to the distance between endpoints. This is achieved with a simple greedy algorithm described in detail in [3]. Deadlock-freedom is then added by using different virtual lanes for the communication as described in [2]. By the way, Min Hop does not guarantee deadlock freedom! If you want to know more, read [2] and [3] or come to the HPC Advisory Council Switzerland Conference 2013 in March where I’ll give a talk about the principles behind DFSSSP and how to use it in practice.
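The greedy idea can be sketched in a few lines of Python. This is a simplified illustration and not the OpenSM implementation: for each destination it computes a shortest-path tree over the current link weights, routes all other endpoints along that tree, and then makes every used link more expensive so that later destinations prefer less loaded links. The function names, the initial weight of 1, the increment of 1 per route, and the toy topology are assumptions of this sketch; the actual weighting scheme is the one in [3], and the virtual-lane assignment from [2] that provides deadlock freedom is not shown.

```python
import heapq
from collections import defaultdict


def shortest_path_tree(adj, weight, dest):
    """Dijkstra from dest; returns parent[node] = next hop from node towards dest."""
    dist = {dest: 0}
    parent = {}
    done = set()
    heap = [(0, dest)]
    while heap:
        d, u = heapq.heappop(heap)
        if u in done:
            continue
        done.add(u)
        for v in adj[u]:
            # Traffic flows v -> u (towards dest), so charge that direction.
            nd = d + weight[(v, u)]
            if v not in dist or nd < dist[v]:
                dist[v] = nd
                parent[v] = u
                heapq.heappush(heap, (nd, v))
    return parent


def sssp_routing(adj, terminals):
    """Greedy, load-aware routing: one shortest-path tree per destination."""
    weight = {(u, v): 1 for u in adj for v in adj[u]}  # assumed initial weight
    load = defaultdict(int)   # number of routes crossing each directed link
    next_hop = {}             # (node, dest) -> neighbor to forward to
    for dest in terminals:
        parent = shortest_path_tree(adj, weight, dest)
        for node, hop in parent.items():
            next_hop[(node, dest)] = hop
        for src in terminals:
            node = src
            while node != dest:
                hop = parent[node]
                load[(node, hop)] += 1
                weight[(node, hop)] += 1  # used links get more expensive
                node = hop
    return next_hop, load


# Hypothetical toy fabric: hosts a,b on switch s1, hosts c,d on switch s2,
# and two equally short paths between s1 and s2 (via m1 or via m2).
adj = {
    "a": ["s1"], "b": ["s1"], "c": ["s2"], "d": ["s2"],
    "s1": ["a", "b", "m1", "m2"],
    "s2": ["c", "d", "m1", "m2"],
    "m1": ["s1", "s2"], "m2": ["s1", "s2"],
}

if __name__ == "__main__":
    next_hop, load = sssp_routing(adj, ["a", "b", "c", "d"])
    for link in [("s1", "m1"), ("s1", "m2"), ("s2", "m1"), ("s2", "m2")]:
        print(link, load[link])
```

In the toy fabric both middle switches are equally close, so a pure hop-count routing with naive tie-breaking could send everything through one of them. The weight updates spread the destinations over both.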

DFSSSP is developed in collaboration between the main developer Jens Domke at the Tokyo Institute of Technology, and Torsten Hoefler of the Scalable Parallel Computing Lab at ETH Zurich.

[1]: opensm-3.3.16.patched.tar.gz
[2]: J. Domke, T. Hoefler and W. Nagel: Deadlock-Free Oblivious Routing for Arbitrary Topologies
[3]: T. Hoefler, T. Schneider and A. Lumsdaine: Optimized Routing for Large-Scale InfiniBand Networks
[4]: Graph 500:
[5]: openmpi-1.6.4.patched.tar.gz

[A1] Possible QoS configuration for OpenSM + DFSSSP with 8 VLs:

qos TRUE
qos_max_vls 8
qos_high_limit 4
qos_vlarb_high 0:64,1:64,2:64,3:64,4:64,5:64,6:64,7:64
qos_vlarb_low 0:4,1:4,2:4,3:4,4:4,5:4,6:4,7:4
qos_sl2vl 0,1,2,3,4,5,6,7,0,1,2,3,4,5,6,7
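If your fabric supports a different number of VLs, the same uniform pattern can be generated instead of typed by hand. The small Python helper below is only a sketch: the function name and its parameters are made up for this example, and extending the 8-VL settings above to other VL counts (equal arbitration weights, service levels mapped round-robin onto the VLs) is an assumption, so please check the result against the OpenSM documentation for your version.

```python
def dfsssp_qos_config(num_vls=8, num_sls=16, high_weight=64, low_weight=4, high_limit=4):
    """Build uniform QoS option lines in the style of the 8-VL example above."""
    vls = range(num_vls)
    lines = [
        "qos TRUE",
        f"qos_max_vls {num_vls}",
        f"qos_high_limit {high_limit}",
        # Every VL gets the same arbitration weight (uniform distribution).
        "qos_vlarb_high " + ",".join(f"{vl}:{high_weight}" for vl in vls),
        "qos_vlarb_low " + ",".join(f"{vl}:{low_weight}" for vl in vls),
        # Service levels are mapped round-robin onto the available VLs.
        "qos_sl2vl " + ",".join(str(sl % num_vls) for sl in range(num_sls)),
    ]
    return "\n".join(lines)


if __name__ == "__main__":
    print(dfsssp_qos_config())
```

With the default arguments this reproduces the six lines listed above.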

[A2] Enable SL queries for the path setup within OpenMPI:

a) configure OpenMPI with "--with-openib --enable-openib-dynamic-sl"
b) run your application with "--mca btl_openib_ib_path_record_service_level 1"

PS: we experienced some trouble with old HCA firmware, which did not support sending userspace MAD requests on VLs other than 0.
You can test the following command (as root) on some nodes and see if you get a response from the subnet manager:

saquery -p --src-to-dst LID1:LID2

In case the command stalls or returns with an error you might have to update the firmware.

Update: The fix for multicast routing has been implemented and tested. Please use our patched version of opensm-3.3.16 (see [1]) instead of the default version from the OFED websites. Besides the multicast patch, this version contains a slightly enhanced implementation of the VL balancing. Future releases by the Open Fabrics Alliance (>= 3.3.17) will be shipped with both patches.
Besides the multicast problem, we have identified a bug in OpenMPI related to the connection management of the openib BTL. We provide a patched version of OpenMPI as well (see [5]).

Rosca de Reyes

My Mexican friend Edgar told me about a Mexican tradition: to eat “Rosca de Reyes” (Kings Cake) on the 6th. Since I really enjoy Mexican food and traditions, we just made one! Here’s how it went:

The dough :-)

The “rosca”?

After the “going” (it grew quite a bit (too much?)).

And the final product! It tasted awesome!

Happy king’s day :-) .

MPI-3.0 (mostly) finalized — public draft available

Finally, after the last meeting in Chicago a couple of weeks ago and some more minor edits, we (the MPI Forum) were able to release our first public draft of MPI-3.0. Jeff has an explanation why it took a while :-) .


This draft includes everything that has been voted into MPI-3.0. The standard is closed for major changes, so all features are in place. We put out this draft to the public to allow for comments until we take the final vote (on each chapter) at the September meeting. We plan to ratify the standard at that meeting if there are no other delays (one never knows!).

Nevertheless, we remain open for changes (especially bug fixes) and feature requests (which will go into future versions, e.g., MPI-3.1). Minor changes that can still influence MPI-3.0 include fixes for any kind of bug in the released document (minor or major) or small explanations and additions that don’t change semantics significantly. However, we’re trying to keep the changes to the document minimal, so only absolutely necessary changes will be considered.

The draft is available at

If you find issues, either contact the Forum member of your choice or broadcast to a larger group via the mailing list (you need to be subscribed). We accept comments until September 6th and we thank you for your support!

Happy reviewing and exploring of all the new features! PS: my earlier posts on the new MPI-3.0 features remain completely valid and provide a good overview of the news in MPI-3.0. If you are interested in learning more about MPI-3, you may also consider looking through some of the tutorial materials that Martin and I provided! We are planning to offer this tutorial at future conferences, so watch out!

Moving Part 2 – nearly got killed!

Oh well, who thought that a move could actually be dangerous?

Well, first, this is the first move in my life where I had to throw away most of my things. Well, we could have taken them to Switzerland but I thought they’re not quite valuable enough to justify the transport costs (~$8000). The only problem in this equation is that my employer would have paid for the move but I will now pay for the new acquisitions. Oh well, I guess I’m just nice!

I disposed of two full industry-grade (apartment) trashcans like this:

A lot of good stuff :-( . But most of this stuff was still from my student time, so time to get new things!

The apartment was so clean after the move that our landlord even gave us $5 more than our deposit with the comment “that is the first apartment I don’t have to clean” :-) .

Everything else was packed into a bunch of suitcases (will be fun to check them all). So a lot of nice luggage:

But before leaving the US, I had to go to the Blue Waters Extreme Scaling Workshop and the MPI Forum. That meant hauling the luggage around a couple of times … but also life-threatening danger.

So the scaling workshop (where I’m at right now) is held in the northwest suburbs (Des Plaines). On the evening of the first day, we decided to walk around a bit to catch some fresh air. It was 9pm-ish. Well, we shouldn’t have gone into that side road where a group of about 10-15 people were walking on the sidewalk. Obviously some kind of Hispanic gang (like in the movies, seriously). So while we discussed how to avoid them (there was no other street side), a black SUV drove by, and suddenly (without apparent reason), the gang started to throw stones and other things at the car. Seriously!? Right in front of us (10 meters away). Since they were running after the car (on the street), it solved our problem of avoiding them :-) . But still, that was $4000 of damage right there. And as if this wasn’t enough, the car came back (!?) and got another round of stones (don’t ask) and the gang was running in our direction before they dispersed. I guess we were right in the middle of a gang fight. No shots fired, yet. The car stopped in front of us and we decided to go back (the stone-throwers were gone).

The police were at the scene seconds later … on the way back to the hotel, we heard a shot and the sound of a bullet ricocheting (most likely off metal) on the same street. Oh man, don’t go for a walk in the suburbs.

More to come … off to the MPI Forum tomorrow.