Science | Torsten Hoefler's blog

Feb192013

DFSSSP: Fast (high-bandwidth) Deadlock-Free Routing for InfiniBand Networks

The Open Fabrics Alliance just released a new version of then Open Subnet Manager including our Deadlock-Free SSSP Routing for InfiniBand (DFSSSP) routing algorithm [2]!

This new version fixes several minor bugs, adds the support for base/enhanced switch port 0 and improves the routing performance further, but lacks support for multicast routing (see ‘Update’ below).

DFSSSP is a new routing algorithm that can be used to route InfiniBand networks with OpenSM 3.3.16 [1] and later. It performs generally better than the default Min Hop algorithm and avoids deadlocks by routing through different virtual lanes (VLs). Due to the above-mentioned problems, we don’t recommend to use the DFSSSP routing algorithm which is included in the OFED 3.2 and 3.5 releases.

DFSSSP can lead to up to 50% higher routing performance for dense (bisection-limited) communication patterns, such as all-to-all and thus directly accelerates dense communication applications such as the Graph500 benchmark [4]. The following figure shows a direct comparison with other routing algorithms on a 726 node cluster running MPI with 1 process (in the 1024 process case, some nodes have two processes) per node:

netgauge_deimos

This comparison uses Netgauge’s effective bisection bandwidth benchmark, an approximation of the real bisection bandwidth of a network.

MPI_Alltoall performance is similarly improved over Min Hop and LASH routing as can be observed in the following figure (using 128 nodes):

mpialltoall

The new DFSSSP algorithm can be used with OpenSM version 3.3.16 starting it with ‘-R dfsssp’ on the command line or setting ‘routing_engine dfsssp’ in the configuration file. Despite the configuration of the routing algorithm, you will have to enable QoS with an uniform distribution (see [A1]) and you will have to enable service level query support within your MPI environment (see [A2] for OpenMPI).

You should compare the bandwidth yourself. Effective bisection bandwidth and all-to-all can be measured with Netgauge, however, real application measurements are always best!

Now you may be wondering why DFSSSP is faster than Min Hop since Min Hop is already minimizing the number of hops between all endpoints. The trick is that DFSSSP optimizes the *global bandwidth* in addition to the distance between endpoints. This is achieved with a simple greedy algorithm described in detail in [3]. Deadlock-freedom is then added by using different virtual lanes for the communication as described in [2]. By the way, Min Hop does not guarantee deadlock freedom! If you want to know more, read [2] and [3] or come to the HPC Advisory Council Switzerland Conference 2013 conference in March where I’ll give a talk about the principles behind DFSSSP and how to use it in practice.

DFSSSP is developed in collaboration between the main developer Jens Domke at the Tokio Institute of Technology, and Torsten Hoefler of the Scalable Parallel Computing Lab at ETH Zurich.

[1]: opensm-3.3.16.patched.tar.gz
[2]: J. Domke, T. Hoefler and W. Nagel: Deadlock-Free Oblivious Routing for Arbitrary Topologies
[3]: T. Hoefler, T. Schneider and A. Lumsdaine: Optimized Routing for Large-Scale InfiniBand Networks
[4]: Graph 500: www.graph500.org
[5]: openmpi-1.6.4.patched.tar.gz

[A1] Possible QoS configuration for OpenSM + DFSSSP with 8 VLs:
qos TRUE qos_max_vls 8 qos_high_limit 4 qos_vlarb_high 0:64,1:64,2:64,3:64,4:64,5:64,6:64,7:64 qos_vlarb_low 0:4,1:4,2:4,3:4,4:4,5:4,6:4,7:4 qos_sl2vl 0,1,2,3,4,5,6,7,0,1,2,3,4,5,6,7

[A2] Enable SL queries for the path setup within OpenMPI:
a) configure OpenMPI with "--with-openib --enable-openib-dynamic-sl" b) run your application with "--mca btl_openib_ib_path_record_service_level 1"

PS: we experienced some trouble with old HCA firmware, which did not support sending userspace MAD request on VLs other than 0.
You can test the following command (as root) on some nodes and see if you get an response from the subnet manager:
saquery -p --src-to-dst LID1:LID2
In case the command stalls or returns with an error you might have to update the firmware.

Update:
The fix for multicast routing has been implemented and tested. Please, use our patched version of opensm-3.3.16 (see [1]) instead of the default version from the OFED websites. Besides the multicast patch, this version contains a slightly enhanced implementation of the VL balancing. Future releases by the Open Fabrics Alliance (>= 3.3.17) will be shipped with both patches.
Besides the multicast problem, we have identified a bug in OpenMPI related to the connection management of the openib BTL. We provide a patched version of OpenMPI as well (see [5]).

Mar102012

MPI-3.0 chugging along

htorHPC • MPI • Science

Here are some updates from the March MPI Forum. We decided that the door has closed for new proposals, so MPI-3.0 could be ratified in the December meeting if everything goes well!

Otherwise, we made huge progress on many small things. Many readings and votes on minor tickets and the results can be found here. The most interesting proposals for me were #284 (Allocate a shared memory window), #286 (Noncollective Communicator Creation), and #168 (Nonblocking Communicator Duplication), which all passed their first vote. The Fortran bindings ticket #229 passed it’s second vote! Scalable vector collectives (#264) were postponed to the next MPI version because the Forum felt that they would need more investigation of several alternative options.

I explained those and other interesting tickets in my last post on MPI-3.0.

We also made substantial progress on Fault Tolerance (which remains a controversial topic for several reasons) and a lot of cleanup (thanks Rolf!). The next meeting in Japan will be exciting!

Feb62012

MPI-3.0 is Coming—an Overview of new (and old) Features

htorHPC • MPI • Science

UPDATE: The new MPI-3 book appeared. This book describes all information on this page in a well-written form (including other advanced MPI features) and with examples for practitioners. More information. Direct link to Amazon.

I am involved in the MPI Forum which is developing and ratifying the Message Passing Interface standards. Actually, I managed to attend every single MPI Forum meeting (27 so far) since the Forum reconvened in Jan. 2008 and I also co-authored MPI-2.1 and MPI-2.2.

The MPI Forum strives to release MPI-3.0 asap (which may mean in a year or so ;-)), so most, if not all significant proposals are in a feature-freeze and polishing stage. I’ll try to summarize the hot topics in MPI-3.0 (in no particular order) here. The Forum is public (join us!) and all meetings and activities are documented at http://meetings.mpi-forum.org/. However, the wiki and meeting structure is hard to follow for people who do not regularly attend the meetings (actually, it’s even hard for people who do so).

The MPI-3.0 ticket process is relatively simple: a ticket is put together by a subgroup or an individual and discussed in the chapter working group. Then it is brought forward for discussion to the full Forum, formally read in a plenary session and voted twice. The reading and the votes happen at different meetings, i.e., a ticket needs at least six months to be ratified (this gives the Forum time to check it for correctness). Non-trivial changes are also not possible after a reading. Both votes have to be passed for the ticket to be ratified. Then, it is integrated into the draft standard by the chapter author(s). Finally, at the end of the process, each chapter is voted by the Forum, and after that (mostly a formality) there will be a vote for the whole standard. Votes are by organization and an organization has to participate regularly in the Forum to be eligible to vote (has to be presented on two of the three meetings before the vote). Input from the public is generally valued and should be communicated through the mailinglist or Forum members.

Keep in mind that this list and the comments are representing my personal view and only the final standard is the last word! You can find the original tickets by appending the ticket ID to https://svn.mpi-forum.org/trac/mpi-forum-web/ticket/ , e.g., https://svn.mpi-forum.org/trac/mpi-forum-web/ticket/109 for nonblocking collective operations.

1) Nonblocking Collective Operations #109

Status: passed

Nonblocking collectives #109 was the first proposal voted into MPI-3.0 more than a year ago (Oct. 2010). The proposal dates back to one of the first meetings in 2008 (I wanted it in MPI-2.2 but we decided to save this “major change” for MPI-3 to make the adoption of MPI-2.2 faster). Since this proposal came first, it was used to define much of the process for MPI-3.0 and it was also probably scrutinized most :-). Actually, it seems rather simple but there were some subtle corner-cases that needed to be defined. But after all, it allows one to issue “immediate” (that’s where the “I” comes from) collective operations, such as:
MPI_Ibcast(buf, count, type, root, comm, &request); ... // compute MPI_Wait(&request, &status);

This can be used to overlap computation and communication and enables several use-cases such as software pipelining (cf. Hoefler, Gottschling, Lumsdaine: “Leveraging Non-blocking Collective Communication in High-performance Applications”) or also interesting parallel protocols that require the nonblocking semantics (cf. Hoefler, Siebert, Lumsdaine: “Scalable Communication Protocols for Dynamic Sparse Data Exchange”).

A reference implementation is available with LibNBC and I’m looking forward to optimized platform-specific versions with full asynchronous progression! The latest (svn) version of MPICH2 is already supporting them (while some other MPI implementations are still working on MPI-2.2 compliance).

2) Neighborhood Collectives #258

Status: passed

Neighborhood (formerly aka. sparse) collective operations are extending the distributed graph and Cartesian process topologies with additional communication power. A user can now statically define a communication topology and also perform communication functions between neighbors in this topology! For example:

// create a 3d topology MPI_Cart_create(comm, 3, {2,2,2}, {1,1,1}, 1, &newcomm); ... // read input data according to process order in newcomm while(!converged) { // start neighbor communication MPI_Ineighbor_alltoall(..., &newcomm, &req); ... // compute inner parts MPI_Wait(&req, MPI_STATUS_IGNORE); ... // compute outer parts }

This obviously simplifies the MPI code quite a bit (compared to the old “north, south, west, east” exchanges with extra pack/send/recv/unpack code for each direction) and often improves performance. This can also be nicely combined with MPI datatypes (neighbor_alltoallw) to offer a very high abstraction level. Distributed graph communicators enable the specification of completely arbitrary communication relations. A more complex (application) example is described in Hoefler, Lorenzen, Lumsdaine: “Sparse Non-Blocking Collectives in Quantum Mechanical Calculations”.

The MPI implementation can optimize the topology and the message schedule for those functions in the graph or Cartesian communicator creation call. Optimization opportunities and a neighborhood_reduce call (which the Forum decided to remove from the proposal) are discussed in Hoefler, Traeff: “Sparse Collective Operations for MPI”.

3) Matched probe #38

Status: passed

One of the oldest tickets that we (Doug Gregor, who originally identified the problem when providing C# bindings for MPI, and I) proposed to MPI-2.2. It was deferred to MPI-3.0 for various reasons. This ticket fixes an old bug in MPI-2 where one could not probe for messages in a multi-threaded environment. The issue is somewhat subtle and complex to explain. For a good examples and a description of the complexity of the problem and the performance of the solution, refer to Hoefler, Bronevetsky, Barrett, de Supinski, Lumsdaine: “Efficient MPI Support for Advanced Hybrid Programming Models”.

The new interface works by removing the message at probe time from the matching queue and allowing the receiver to match it later with a special call:

MPI_Mprobe(source, tag, comm, &message, &status); ... // prepare buffer etc. MPI_Mrecv(buf, count, type, &message, &status);

This avoids “bad” thread interleavings that lead to erroneous receives. Jeff has a good description of the problem in his blog.

4) MPIT Tool Interface #266

Status: passed

The new MPI tool interface allows the MPI implementation to expose certain internal variables, counters, and other states to the user (most likely performance tools). The huge difference to the various predecessor proposals is that it does not impose any specific structure or implementation choice (such as having an eager protocol) on the MPI implementations. Another side-effect of this is that is doesn’t really have to offer anything to the user :-). However, a “high quality” MPI implementation may use this interface to expose relevant state.

It will certainly be very useful for tools and advanced MPI users to investigate performance issues.

5) C Const Correctness #140

Status: passed

This sounds rather small but came with a major pain to pass it (anybody remembers why?). This ticket basically makes the C interface const-correct, i.e., adds the const qualifier to all C interface functions. All C++ functions already have const qualifiers.

This turns
int MPI_Gather(void* , int, MPI_Datatype, void*, int, MPI_Datatype, int, MPI_Comm);
into
int MPI_Gather(const void* , int, MPI_Datatype, void*, int, MPI_Datatype, int, MPI_Comm);
and thus allows several compiler optimizations and prevents some user errors (produces compiler warnings at least).

6) Updated One Sided Chapter #270

Status: passed

This proposal killed probably half of my free-time in the last years. It started at the Portland Forum meeting in 2009 where another group was proposing to essentially rewrite the MPI-2 One Sided chapter from scratch. I disagreed vehemently because the proposed text would not allow an implementation on systems that were not cache coherent. MPI-2 handled the cache coherency issue very elegantly but was in many places hard to use and even harder to understand.

After a night of re-writing the existing chapter to differentiate between two memory models (essentially “cache-coherent” and “not cache-coherent” in MPI lingo “unified” and “separate” public and private window) a new proposal was born. A subgroup started to bang on it (and erased essentially 80% of the ideas, replacing them with better ones!) and two years later we had what is probably the hairiest part of MPI (memory models are extremely complex). The RMA working group was a lot of fun, many inspiring discussions lead us to a good solution! This chapter was a good example of an excellent group effort!

The new chapter offers:

two memory models: one supporting cache-coherent systems (similar to many PGAS languages) and the other one is essentially the “old” MPI-2 model
different ordering modes for accumulate accesses (warning: the safe default mode may be easy to reason about but slower)
MPI_Win_allocate, a collective window creation function that allocates (potentially symmetric or specialized) memory for faster One Sided access
MPI_Win_create_dynamic, a mechanism to create a window that spans the whole address space together with functions to register (MPI_Win_attach) and deregister (MPI_Win_detach) memory locally
MPI_Get_accumulate, a fetch-and-accumulate function to atomically fetch and apply an operation to a variable
MPI_Fetch_and_op, a more specialized version of MPI_Get_accumulate with less parameters for atomic access to scalars only
MPI_Compare_and_swap, a CAS function as we know it from shared memory multiprogramming
MPI_R{put,get,accumulate,get_accumulate}, request-based MPI functions for local completion checking without window synchronization functions
MPI_Win_{un}lock_all, a function to (un)lock all processes in a window from a single process (not collective!)
MPI_Win_flush{_all}, a way to complete all outstanding operations to a specific target process (or all processes). Upon return of this function, the operation completed at the target (either in the private or public window copy)
MPI_Win_flush_local{_all}, a function to complete all operations locally to a specified process (or all processes). This does not include remote completion but local buffers can be re-used
conflicting accesses are now allowed but the outcome is undefined (and may corrupt the window). This is similar to the C++ memory model

Of course, nobody can understand the power of the new One Sided interface based on this small list without examples. The One Sided working group is working on more documentation and similar posts, I plan to link or mirror them here!

7) Allocating a Shared Memory Window #284

Status: read

Several groups wanted the ability to create shared memory in MPI. This would allow to share data-structures across all MPI processes in a multicore node similarly to OpenMP. However, unlike OpenMP, one would just share single objects (arrays etc.) and not the whole address space. The idea here is to combine this with One Sided and allow to create a window which is accessible with load/store (ISA) instructions to all participating processes.

This extends the already complex One Sided chapter (semantics) with the concept of local and remote memory. The proposal is still under discussion and may change. Currently, one can create such a window with MPI_Win_allocate_shared(size, info, comm, baseptr, win) and then use One Sided synchronization (flush and friends) to access it.

By default, the allocated memory is contiguous across process boundaries (process x’s memory starts after process x-1’s memory ends). The info argument alloc_shared_noncontig can be used to relax this and allow the implementation to allocate memory close to a process (on NUMA systems). Then, the user has to use the function MPI_Win_shared_query() to determine the base address of remote processes’ memory segment.

MPI-3.0 will also offer a special communicator split function that can be used to create a set of communicators which only include processes that can create a shared memory window (i.e., mutually share memory).

8 ) Noncollective Communicator Creation #286

Status: read

A very interesting proposal to allow a group of processes to create a communicator “on their own”, i.e., without involving the full parent communicator. This would ve very useful for MPI fault tolerance, where it could be used to “fix” a broken communicator (create a communicator with less processes). Compare this to Gropp, Lusk: “Fault Tolerance in MPI Programs”. This could be achieved with current functions but would be slow and cumbersome, see Dinan et al.: “Noncollective Communicator Creation in MPI“.

9) Nonblocking MPI_Comm_dup #168

Status: read

This very simple proposal allows to duplicate communicators in a nonblocking way. This allows to overlap the communicator creation latency and also implement “purely” nonblocking functions without initialization calls (cf. Hoefler, Snir: “Writing Parallel Libraries with MPI – Common Practice, Issues, and Extensions”). There is not much more to say about this simple call :-).

10) Fortran Bindings #229 (+24 more!)

Status: voted

This is a supposedly simple ticket that was developed to improve the Fortran bindings and add Fortran 2008 bindings. It offers type-safety and tries to resolve the issue with Fortran code movement (relying on Fortran TR 29113). I am not a Fortran expert (preferring C++ for scientific computing instead) so I can’t really speak to it. Jeff has a good post on this.

That’s it! Well, I am 100% sure that I forgot several proposals (some may even be still in the pipeline or below the radar) and I’ll add them here as they show up. We also already postponed several features to MPI-3.1, so the MPI train continues to run.

The Forum is also actively seeking feedback from the community. If you are interested in any of the described features, please give the draft standard a read and let us know if you have concerns (or praise :-))!

Torsten Hoefler

Nov192011

An SC11 story – walking by OccupySeattle and Space Needle …

htorHPC • Science • Travel • US

A couple of days ago was one of those nights where I went back from the parties to my (slightly remote) hotel. I was passing OccupySeattle every day … but this time it was full of police. The funniest part was that the cops told me to leave while I was just walking by … I mean, seriously, I didn’t even stop. Actually, they stopped me in order to tell me to leave. I didn’t think it was worth mentioning until I saw this. I think the whole movement is really fed by such stories in a very grim way.

The movement is interesting and the Seattle one is especially noteworthy since the weather is really bad.

Well, well … very strange. Btw., the conference was absolutely great! Well, it was a bit too small (crowded) and the Party in the Space Needle was rather disastrous (reminded me a bit of the SC08 Texas thing without food). It was the opposite though this time — there was a lot of excellent food and drinks, but there was simply no space to stand. And getting up the needle was a 1-hour effort, well, Jim and I found a secret shortcut ;-). Here is the proof:

573273-1565463-797-L

SC11 is over now! And we even had a 1.5 hours break before SC12 started for the committee :-). It was a great show, bigger, better everything. ~5k people in the technical program and ~12k total.

PS: I know that Mount Rainier and the Space Needle can not fit on a picture like this … it’s called artistic freedom!

Nov142011

Next year’s plans

htorScience

Many knew that I was planning my return to Europe since the beginning of this year. I want to remark again that my move is exclusively motivated by personal reasons; I like the US and especially UIUC and my alma mater IU very much. I applied about a year ago to the Assistant Professor position at ETH and some time after that to the Helmholtz (HGF) Junior Investigator group program with the Juelich Supercomputing Center. Btw., this was the first (and second) time that I applied for something in my
life—I somewhat slid into all other jobs and programs.

Against all odds, both applications succeeded and both offers were extremely close regarding the funding. Now I was facing the paradox of choice (see Schwartz and his talk). This was horrible because I knew that both places invested in me and I had to disappoint somebody. I hate this, seriously.

Juelich is *the* top supercomputing center in Europe and ETH is *the* top research university in (mainland) Europe (with people like Einstein as alumni). It was a very hard choice and I took some time to make it final.

Since my career-goal is to be a professor at a research university, I decided to accept the position at ETH. The environment and colleagues also seem very very promising. I hope my friends in Juelich understand my decision.

So now it’s official. I’ll move to Zuerich, Switzerland (most likely) in August next year to start a position as Assistant Professor at the ETH Zuerich.

juelich_forschungszentrum
A view of Juelich. I like the natural forest setting a lot!

ethz
zurich
Views of ETH and Zurich.

Nov62011

The new ACM SIGHPC!

htorHPC • Science

We (the CS researchers working on parallel computing and HPC) finally made it! We got our own special interest group in ACM — SIGHPC. I think it’s a great opportunity for us to have a forum for the HPC researchers. I joined immediately.

Oct282011

UIUC’s CITES – wow …

htorHPC • Science

I know, it seems like I’m only complaining in my blog in the last weeks … but seriously, something like this never happened to me before. You know, UIUC is a top-5 department in computer science and it is really astonishing how incompetent CITES (Campus Information Technologies and Educational Services) is.

For some unclear reason, they decided that running an IMAP and STMPS server must be too hard so they switched the mail service to Microsoft Exchange (yes, a public university!!!). Well, I am rather sure that every grad student in the department would be able to run an IMAP or SMTPS service. But ok, anyway, so they switched my mail account. The account was 100% forwarded to another mailserver (NCSA) with a good old .forward file :-). So I was assuming it didn’t work and sent regular test-emails to my mailadress. They made it.

But I realized today that some emails (every 10th, most come from @illinois.edu) are not forwarded but delivered to my Outlook Inbox (which I never checked). All of them went to the same email address htor@university.edu. This is so unbelievably bad that I missed a whole thread of important discussions with admins (involving grant money). This is really bad …. and they won’t let me run my own SMTP server :-(.

Thanks for the trouble CITES!

Sep12011

HotI Aftermath

htorScience • Travel

Here is a nice video-interview of Fabrizio during HotI this year: http://insidehpc.com/2011/08/26/22573/. Enjoy!

Jul292011

Hot Interconnects 2011 teaser and highlights!

htorHPC • MPI • Science

Rich Brueckner from insideHPC interviews the Co-Chairs of HotI’11 (Fabrizio Petrini, Patrick Geoffray, and I) in a podcast. Citing one of the chairs “Listen to it, it’s fun! Three horrible accents but a great program!” :-).

http://insidehpc.com/2011/07/28/podcast-hot-interconnects-conference-looks-to-the-future-of-networks/

Jun12011

Visiting ETH Zurich

htorScience • Travel

I visited ETH Zurich at the end of last month. Zurich is probably one of the most beautiful cities in Europe and definitely one of (if not the) most expensive city :-). Public transit is just a dream, I believe one really doesn’t need a car around the city (I know multiple people who live there and don’t have a car). The city is very hilly (I had to climb 1000 stairs to get from the main train station to my hotel, quite workout :-)). ETH itself sits majestically on top of a hill and offers a beautiful view. Especially the Professor’s lunch area, on top of the main building has an absolutely stunning view on the city and the lake. I also had many very interesting technical discussions and like the CS department there. Here are some pictures:

IMGP1245_s
The gate to the computer science and chemistry building.

IMGP1246_s
The entrance to the computer science building, all very impressive. The doors must weigh a ton but they are automatic (not like in Chemnitz where one really has to move the heavy doors ;-)).