htor | Torsten Hoefler's blog

May 18, 2013

Alps view

Since I’m living slightly north of Zurich, nobody believes me when I claim that I can see the Alps from my bed. But well, it’s true. I have to admit that they’re rather small and about 60 km away, but I can indeed look right into the center of the Alps! The valley between Altholz and Buechholz (which is full of great mushrooms) is nicely aligned and enables the view.

Here is a coarse map:

and a finer one:

(credits: Google maps)

The two visible mountains are Gross Ruchen and Gross Windgaellen. Well, they’re visible every two weeks, when the weather is good :-). I’ll see if I can take a picture next time.

Jealous?

Apr 2, 2013

You know that a program committee failed if …

htor | Science

I had the worst experience with conference reviews in my short scientific career (I’ll not name names but it’s an “A” ranked conference with a reasonable reputation, you better ask me over a beer). I’m trying to take it with humor and share some of the funniest parts here.

So, you know that a program committee failed if …

1. You receive a one-paragraph review which goes like (this is an original citation, only the scheme has been replaced to guarantee anonymity of the venue):
   “This paper proposes [technique X]. It is a good idea to use [X]. However, it is difficult to understand that [X works in context Y].”
   Yes, that’s it! The final evaluation was a weak accept.

2. You submit a paper on a programming environment for HPC and you get a comment like:
   “The importance to the field is fair because programmers are easily able to exploit the optimizations to achieve the better execution time of real applications because the optimization can be reused through standard MPI API and the authors showed the speed-up of real applications including [application X].”
   Yes, the system is considered bad if it’s easy to use, portable and backwards compatible :-). Reminds me of “Parallel machines are hard to program and we should make them even harder – to keep the riff-raff off them.”

3. Your paper receives the scores accept, accept, weak accept, and reject with three reasonable, in fact nice, and encouraging reviews. The reject review is completely unreasonable and it criticizes the writing style while having at least one or two English mistakes in *every single sentence* :-).

4. You receive (3) and a completely unnecessary and offensive sentence at the end of the reject review which says “The only good thing in this paper is [X]” where [X] is absolutely unrelated (and in fact not even existing or reasonably conceivable).

5. You received (4) and rebutted the hell out of this completely unreasonable review (which wasn’t even consistent in itself in addition to being offensive). Assume the rebuttal took you a day since you had to interpret the review’s twisted English and strange criticisms and rebut it in a technical and polite way (which seems hard); AND the rebuttal was *completely* ignored, i.e., neither the review was updated nor did you receive a note from the chair about what happened.

6. You call up some friends who attended the TPC meeting and (1)-(5) are reinforced.

So after all, there is now one more conference that I may not recommend to anyone for a while. On the other hand, I may be spoiled since I received just absolutely outstanding reviews for the submissions before that (where not all were accepted, but most :-)).

Feb 19, 2013

DFSSSP: Fast (high-bandwidth) Deadlock-Free Routing for InfiniBand Networks

htor | HPC • MPI • Science

The Open Fabrics Alliance just released a new version of the Open Subnet Manager including our Deadlock-Free SSSP Routing for InfiniBand (DFSSSP) routing algorithm [2]!

This new version fixes several minor bugs, adds support for base/enhanced switch port 0 and improves the routing performance further, but lacks support for multicast routing (see ‘Update’ below).

DFSSSP is a new routing algorithm that can be used to route InfiniBand networks with OpenSM 3.3.16 [1] and later. It generally performs better than the default Min Hop algorithm and avoids deadlocks by routing through different virtual lanes (VLs). Because of the problems mentioned above, we don’t recommend using the DFSSSP routing algorithm that is included in the OFED 3.2 and 3.5 releases.

DFSSSP can lead to up to 50% higher routing performance for dense (bisection-limited) communication patterns, such as all-to-all and thus directly accelerates dense communication applications such as the Graph500 benchmark [4]. The following figure shows a direct comparison with other routing algorithms on a 726 node cluster running MPI with 1 process (in the 1024 process case, some nodes have two processes) per node:

netgauge_deimos

This comparison uses Netgauge’s effective bisection bandwidth benchmark, an approximation of the real bisection bandwidth of a network.

MPI_Alltoall performance is similarly improved over Min Hop and LASH routing as can be observed in the following figure (using 128 nodes):

mpialltoall

The new DFSSSP algorithm can be used with OpenSM version 3.3.16 by starting it with ‘-R dfsssp’ on the command line or by setting ‘routing_engine dfsssp’ in the configuration file. Besides configuring the routing algorithm, you will have to enable QoS with a uniform distribution (see [A1]) and enable service level query support within your MPI environment (see [A2] for OpenMPI).

You should compare the bandwidth yourself. Effective bisection bandwidth and all-to-all can be measured with Netgauge; however, real application measurements are always best!

Now you may be wondering why DFSSSP is faster than Min Hop since Min Hop is already minimizing the number of hops between all endpoints. The trick is that DFSSSP optimizes the *global bandwidth* in addition to the distance between endpoints. This is achieved with a simple greedy algorithm described in detail in [3]. Deadlock-freedom is then added by using different virtual lanes for the communication as described in [2]. By the way, Min Hop does not guarantee deadlock freedom! If you want to know more, read [2] and [3] or come to the HPC Advisory Council Switzerland Conference 2013 in March where I’ll give a talk about the principles behind DFSSSP and how to use it in practice.
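
To illustrate the idea of balancing routes globally, here is a minimal Python sketch. It is not the OpenSM implementation and it simplifies the algorithm from [3]: it computes a shortest-path tree toward each destination and then raises the weight of every link by the number of routes that were just placed on it, so later destinations prefer less-used links. The function names, the exact weight update rule, and the four-switch ring are assumptions made for this example; virtual lanes and deadlock-freedom are not modeled at all.

```python
import heapq
from collections import defaultdict


def shortest_path_tree(adj, weight, dest):
    """Dijkstra rooted at dest; returns the next hop toward dest for every node."""
    dist = {dest: 0}
    next_hop = {}
    heap = [(0, dest)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue  # stale heap entry
        for v in adj[u]:
            # Traffic flows v -> u, so use the weight of that directed link.
            nd = d + weight[(v, u)]
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                next_hop[v] = u
                heapq.heappush(heap, (nd, v))
    return next_hop


def route_all(adj, balance=True):
    """Route every node to every destination; return routes per directed link."""
    weight = {(u, v): 1 for u in adj for v in adj[u]}
    load = defaultdict(int)
    for dest in sorted(adj):
        next_hop = shortest_path_tree(adj, weight, dest)
        for src in adj:
            node = src
            while node != dest:
                hop = next_hop[node]
                load[(node, hop)] += 1
                if balance:
                    # Assumed greedy update: used links get more expensive.
                    weight[(node, hop)] += 1
                node = hop
    return load


# Hypothetical topology: four switches connected in a ring.
ring = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [0, 2]}
print("max routes per link, hop count only:", max(route_all(ring, balance=False).values()))
print("max routes per link, balanced:      ", max(route_all(ring, balance=True).values()))
```

With balance=False the weights never change, so the sketch degenerates to plain hop counting with arbitrary tie-breaking, which is the point of comparison with Min Hop made above.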

DFSSSP is developed in collaboration between the main developer Jens Domke at the Tokyo Institute of Technology, and Torsten Hoefler of the Scalable Parallel Computing Lab at ETH Zurich.

[1]: opensm-3.3.16.patched.tar.gz
[2]: J. Domke, T. Hoefler and W. Nagel: Deadlock-Free Oblivious Routing for Arbitrary Topologies
[3]: T. Hoefler, T. Schneider and A. Lumsdaine: Optimized Routing for Large-Scale InfiniBand Networks
[4]: Graph 500: www.graph500.org
[5]: openmpi-1.6.4.patched.tar.gz

[A1] Possible QoS configuration for OpenSM + DFSSSP with 8 VLs:
qos TRUE
qos_max_vls 8
qos_high_limit 4
qos_vlarb_high 0:64,1:64,2:64,3:64,4:64,5:64,6:64,7:64
qos_vlarb_low 0:4,1:4,2:4,3:4,4:4,5:4,6:4,7:4
qos_sl2vl 0,1,2,3,4,5,6,7,0,1,2,3,4,5,6,7
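
The settings in [A1] follow a regular pattern, so they can be generated with a few lines of Python. This is only a convenience sketch: the function name and its parameters are made up for this example, and whether the same weights and qos_high_limit are sensible for VL counts other than 8 is an assumption that should be checked against the OpenSM documentation and your hardware.

```python
def uniform_qos_config(num_vls=8, high_weight=64, low_weight=4,
                       high_limit=4, num_sls=16):
    """Build OpenSM QoS option lines with every VL weighted equally.

    Hypothetical helper: the defaults reproduce the 8-VL example in [A1].
    """
    vls = range(num_vls)
    lines = [
        "qos TRUE",
        f"qos_max_vls {num_vls}",
        f"qos_high_limit {high_limit}",
        "qos_vlarb_high " + ",".join(f"{vl}:{high_weight}" for vl in vls),
        "qos_vlarb_low " + ",".join(f"{vl}:{low_weight}" for vl in vls),
        # Spread the service levels round-robin over the available VLs.
        "qos_sl2vl " + ",".join(str(sl % num_vls) for sl in range(num_sls)),
    ]
    return "\n".join(lines)


print(uniform_qos_config(8))
```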

[A2] Enable SL queries for the path setup within OpenMPI:
a) configure OpenMPI with "--with-openib --enable-openib-dynamic-sl"
b) run your application with "--mca btl_openib_ib_path_record_service_level 1"

PS: we experienced some trouble with old HCA firmware, which did not support sending userspace MAD requests on VLs other than 0.
You can test the following command (as root) on some nodes and see if you get a response from the subnet manager:
saquery -p --src-to-dst LID1:LID2
If the command stalls or returns with an error, you might have to update the firmware.

Update:
The fix for multicast routing has been implemented and tested. Please use our patched version of opensm-3.3.16 (see [1]) instead of the default version from the OFED websites. Besides the multicast patch, this version contains a slightly enhanced implementation of the VL balancing. Future releases by the Open Fabrics Alliance (>= 3.3.17) will be shipped with both patches.
Besides the multicast problem, we have identified a bug in OpenMPI related to the connection management of the openib BTL. We provide a patched version of OpenMPI as well (see [5]).

Jan 6, 2013

Rosca de Reyes

htor | Uncategorized

My Mexican friend Edgar told me about a Mexican tradition: to eat “Rosca de Reyes” (Kings Cake) on the 6th. Since I really enjoy Mexican food and traditions, we just made one! Here’s how it went:

cake1
The dough 🙂

cake2
The “rosca”?

cake3
After the “going” (rising); it grew quite a bit (too much?).

cake4
And the final product! It tasted awesome!

Happy king’s day :-).

Dec 11, 2012

Three new ACM fellows with strong ties to HPC

htor | Uncategorized

Three of the 53 ACM fellows named for 2012 have a strong HPC background!

Congratulations to Keshav Pingali, Robert Schreiber, and Kathy Yelick!

A good sign for the small but growing field!

Sep 21, 2012

MPI-3.0 unanimously ratified today!

htor | Uncategorized

Actually, everything has been said … see my earlier posts for the features included (and not included). I also gave a keynote at the Multicore Challenge Conference last Tuesday talking about the challenges that MPI-3.0 reacted to (talk slides).

The complete official standard can be found at http://www.mpi-forum.org/docs/docs.html .

Kudos MPI Forum!

Aug 3, 2012

MPI-3.0 (mostly) finalized — public draft available

htor | Uncategorized

Finally, after the last meeting in Chicago a couple of weeks ago and some more minor edits, we (the MPI Forum) were able to release our first public draft of MPI-3.0. Jeff has an explanation why it took a while :-).

This draft includes everything that has been voted into MPI-3.0. The standard is closed for major changes, so all features are in place. We put out this draft to the public to allow for comments until we take the final vote (on each chapter) in the September meeting. We plan to ratify the standard at that meeting if there are no other delays (one never knows!).

Nevertheless, we remain open for changes (especially bug fixes) and feature requests (which will go into future versions, e.g., MPI-3.1). Minor changes that can still influence MPI-3.0 include any kind of bug in the released document (minor or major) or small explanations and additions that don’t change semantics significantly. However, we’re trying to keep the changes to the document minimal, so only absolutely necessary changes will be considered.

The draft is available at http://meetings.mpi-forum.org/draft_standard/mpi3.0_draft_2.pdf.

If you find issues, either contact the Forum member of your choice or broadcast to a larger group via the mailing list (http://lists.mpi-forum.org/mailman/listinfo.cgi/mpi-comments, you need to be subscribed). We accept comments until September 6th and we thank you for your support!

Happy reviewing and exploring of all the new features! PS: my earlier posts on the new MPI-3.0 features remain completely valid and provide a good overview of the news in MPI-3.0. If you are interested in learning more about MPI-3, you may also consider looking through some of the tutorial materials that Martin and I provided! We are planning to offer this tutorial at future conferences, so watch out!

Jul 16, 2012

Moving Part 2 – nearly got killed!

htor | Uncategorized

Oh well, who thought that a move could actually be dangerous?

Well, first, this is the first move in my life where I had to throw away most of my things. Well, we could have taken them to Switzerland but I thought they’re not quite valuable enough to justify the transport costs (~$8000). The only problem in this equation is that my employer would have paid for the move but I will now pay for the new acquisitions. Oh well, I guess I’m just nice!

I disposed of two full industry-grade (apartment) trashcans like this:
move0

A lot of good stuff :-(. But most of this stuff was still from my student time, so time to get new things!

The apartment was so clean after the move that our landlord even gave us $5 more than our deposit with the comment “that is the first apartment I don’t have to clean” :-).

Everything else was packed into a bunch of suitcases (will be fun to check them all). So a lot of nice luggage:
move1

But before leaving the US, I had to go to the Blue Waters Extreme Scaling Workshop and the MPI Forum. That meant hauling the luggage around a couple of times … but also life-threatening danger.

So the scaling workshop (where I’m at right now) is held in the northwest suburbs (Des Plaines). On the evening of the first day, we decided to walk around a bit to catch some fresh air. It was 9pm-ish. Well, we shouldn’t have gone into that side road where a group of about 10-15 people were walking on the sidewalk. Obviously some kind of Hispanic gang (like in the movies, seriously). So while we discussed how to avoid them (there was no other street side), a black SUV drove by, and suddenly (without apparent reason), the gang started to throw stones and other things at the car. Seriously!? Right in front of us (10 meters away). Since they were running after the car (on the street), it solved our problem of avoiding them :-). But still, that was $4000 of damage right there. And if this wasn’t enough, the car came back (!?) and got another round of stones (don’t ask) and the gang was running in our direction before they dispersed. I guess we were right in the middle of a gang fight. No shots fired, yet. The car stopped in front of us and we decided to go back (the stone-throwers were gone).

The police were at the scene seconds later … on the way back to the hotel, we heard a shot and the sound of a bullet ricocheting (most likely off metal) on the same street. Oh man, don’t go for a walk in the suburbs.

More to come … off to the MPI Forum tomorrow.

Jul 12, 2012

Moving to Switzerland – Part 1

htor | Travel • US

Oh well … the move comes closer. It’s actually much easier to move within the US (surprise!). The first part was packing everything (check!). I left nine bags of stuff (including my weightlifting set :-)) at a friend’s house, readily packed to be checked onto a plane (thanks Cristina!). All bulkier and lighter items are in the trunk of my car (including the TV, the bar of the barbell and other random stuff).

The highlight so far was the pickup of the car. The car is shipped with Schumacher Logistics. Everything went fine so far and the people are really nice. The trucker showed up a day early, which was fine, because I just returned from Germany (well, a looong day). He drove a gigantic truck and got lost in Champaign (he drove around while talking to me on the phone, dude …). So finally, he made it on Springfield and didn’t want to drive down the street to not scratch the cars on his truck with the trees (oh well, he had to back up). But that went well.

This middle-lane in the US is actually really useful, for example, for loading cars :-). Here are some pictures:

loading8
The truck – a seven-car carrier! Ridiculously huge.

loading0
Two cars already loaded when he came.

loading0
Inspecting our car for ridiculously small scratches. Man, this guy found all kinds of super-small scratches all over. Hope nothing more happens …

loading0
Ok, first try to load :-). I told him it wouldn’t fit …

loading0
Well, yeah, it didn’t fit (surprise). 🙂

loading0
Ok, 2nd try, lower deck. I was assuming he drives it into the middle … but …

loading0
He kinda stopped at the very end!? Oh well, I’m not a shipping pro.

loading0
The car was strapped to the truck with gigantic chains (the tires showed serious signs of pressure …).

loading0
I hope there are no bumpy streets, the distance between street and exhaust was not that great …

loading0
And there he goes … hoping for the best!

I’ll keep you posted!

May 30, 2012

MPI-3.0 is coming soon! Updates from the Japan Meeting.

htor | HPC • MPI

The Japan MPI Forum was rather “mild” until the last day where we had all the votes. Several controversial things came up for vote and many things that were not ready were pushed for a vote. We were 16 organizations eligible for voting and each ticket would only need 9 yes votes to get in, rather small imho.

While I am not 100% sure about the decision making process in the Forum, I think we made mostly sane decisions (some exceptions are of course proving the rule :-)).

Executive summary:

no fault tolerance for MPI-3.0: the Forum decided against the proposal
no “true” nonblocking I/O functions in MPI-3.0
no helper threads in MPI-3.0
removing the C++ bindings passed the first vote — scary!

Now to the detailed actions/votes:
First Votes

the small fixes #187 and #192 passed
#194 (allow non-rectangular MPI_Dims_create) was withdrawn based on comments
#195 (topology awareness in MPI_Dims_create) was rejected because the ticket was obviously not ready/clean
#217 (helper Threads) was rejected, it was always controversial
#256 (MPI_PROC_NULL behavior for MPI_PROBE) passed
#271 (functions to query MPI_Info object) passed, George raised an issue with the naming that we should fix before the final release
#273 (immediate versions of nonblocking collective I/O routines) was rejected, got very close but raised the concerns that there is no optimized implementation even for the split I/O right now
#278 (update examples to not use deprecated constructs) passed
#281 (remove C++ bindings) passed, unfortunately, long live C++ exceptions!
#294 (MPI_UNWEIGHTED should not be NULL) passed (clearly)
#300 (minor issue in One Sided) passed (clearly)
#303 (move MPI-2 deprecated functions to new “Removed interfaces”) was rejected after some discussion
#313 (fixing init and finalize) passed
#310 (clarify MPI behavior when multiple MPI processes run in the same address space) was rejected because it was felt that it’s inconsistent with #313
#317 (correct error related to MPI_REQUEST_FREE) passed (trivial fix)
#323 (new FT proposal) was rejected, it was controversial before and seemed to be edited until the last second. #326 and #327 were withdrawn based on that vote. So this means essentially no FT for MPI-3.0
#328 (fix MPI_PROC_NULL behavior for mprobe/improbe/mrecv/imrecv) passed

Second Votes

the two simple (nearly ticket-0) RMA changes #308 and #309 passed unanimously
#284 (allocate shared memory window) passed – woohoo! I thought the idea was dead a long time ago!
#280 (hindexed_block) passed, as expected
#168 (nonblocking communicator duplication) passed
#272 (remove C++ bindings in nonblocking colls) passed 🙁
#286 (noncollective communicator creation) passed!
#305 (update MPI_Intercomm_create to use collective tag space) passed