What are the real differences between RDMA, InfiniBand, RMA, and PGAS?

I often get the question how the concepts of Remote Direct Memory Access (RDMA), InfiniBand, Remote Memory Access (RMA), and Partitioned Global Address Space (PGAS) relate to each other. In fact, I see a lot of confusion in papers of some communities which discovered these concepts recently. So let me present my personal understanding here; of course open for discussions! So let’s start in reverse order :-) .

PGAS is a concept relating to programming large distributed memory machines with a shared memory abstraction that distinguishes between local (cheap) and remote (expensive) memory accesses. PGAS is usually used in the context of PGAS languages such as Co-Array Fortran (CAF) or Unified Parallel C (UPC) where language extensions (typically distributed arrays) allow the user to specify local and remote accesses. In most PGAS languages, remote data can be used like local data, for example, one can assign a remote value to a local stack variable (which may reside in a register) — the compiler will generate the needed code to imlement the assignment. A PGAS language can be compiled seamlessly to target a global load/store system.

RMA is very similar to PGAS in that it is a shared memory abstraction that distinguishes between local and remote memory accesses. RMA is often used in the context of the Message Passing Interface standard (even though it does not deal with passing messages ;-) ). So why then not just calling it PGAS? Well, there are some subtle differences to PGAS: MPI RMA is a library interface for moving data between local and remote memories. For example, it cannot move data into registers directly and may be subject to additional overheads on a global load/store machine. It is designed to be a slim and portable layer on top of lower-level data-movement APIs such as OFED, uGNI, or DMAPP. One main strength is that it integrates well with the remainder of MPI. In the MPI context, RMA is also known as one-sided communication.

So where does RDMA now come in? Well, confusingly, it is equally close to both PGAS and it’s Hamming-distance-one name sibling RMA. RDMA is a mechanism to directly access data in remote memories across an interconnection network. It is, as such, very similar to machine-local DMA (Direct Memory Access), so the D is very significant! It means that memory is accessed without involving the CPU or Operating System (OS) at the destination node, just like DMA. It is as such different from global load/store machines where CPUs perform direct accesses. Similarly to DMA, the OS controls protection and setup in the control path but then removes itself from the fast data path. RDMA always comes with OS bypass (at the data plane) and thus is currently the fastest and lowest-overhead mechanism to communicate data across a network. RDMA is more powerful than RMA/PGAS/one-sided: many RDMA networks such as InfiniBand provide a two-sided message passing interface as well and accelerate transmissions with RDMA techniques (direct data transfer from source to remote destination buffer). So RDMA and RMA/PGAS do not include each other!

What does this now mean for programmers and end-users? Both RMA and PGAS are programming interfaces for end-users and offer several higher-level constructs such as remote read, write, accumulates, or locks. RDMA is often used to implement these mechanisms and usually offers a slimmer interface such as remote read, write, or atomics. RDMA is usually processed in hardware and RMA/PGAS usually try to use RDMA as efficiently as possible to implement their functions. RDMA programming interfaces are often not designed to be used by end-users directly and are thus often less documented.

InfiniBand is just a specific network architecture offering RDMA. It wasn’t the first architecture offering RDMA and will probably not be the last one. Many others exist such as Cray’s RDMA implementation in Gemini or Aries endpoints. You may now wonder what RoCE (RDMA over Converged Ethernet) is. It’s simply an RDMA implementation over (lossless data center) Ethernet which is somewhat competing with InfiniBand as a wire-protocol while using the same verbs interface as API.

More precise definitions can be found in Remote Memory Access Programming in MPI-3 and Fault Tolerance for Remote Memory Access Programming Models. I discussed some ideas for future Active RDMA systems in Active RDMA – new tricks for an old dog.

How many measurements do you need to report a performance number?

The following figure from the paper “Scientific Benchmarking of Parallel Computing Systems” shows the completion times for multiple identical runs of a tuned version of the high-performance Linpack (HPL) on the same system. It illustrates how important correct measurements are. Here, one may report 77.4 Tflop/s but when repeating the benchmark see as little as 61.2 Tflop/s. It suggests that one should use sound statistics when reporting any performance result.

Computer science is often about measuring computer systems. Be it time, energy, or performance, all these metrics are often non-deterministic in real computer systems and a single measurement may or may not provide a reliable result. So if you are not sloppy when measuring your system, you will measure several executions and report an aggregate measure such as the arithmetic or geometric average or the median. Well, but now the question is: “how many is several”? And this is where it gets less clear.

Typically, “several” is defined very informally, so if the measurement is cheap (such as a network latency measurement), it can be 1,000 or even 1,000,000. If it’s expensive (such as full-scale supercomputer runs), we’re very quickly back to a single measurement. But does it make sense to define the number of measurements based on the execution cost? Of course not — it should depend on the variability of the data! Who would have thought that …?

Unfortunately, most benchmarkers do not take the data variability into account at all in practice. Why not? Isn’t that somewhat clear that one needs to? Yes, it is, but it’s also hard! But actually, it’s not that hard if one knows some basic statistics. The simplest way is to check if one has enough measurements for a given variability in the result. But how to assess the variability? Well, one needs to look at some samples — ah, a catch 22? I need samples to know how many samples I need? Yes, that is true — in fact, the more samples I have, the higher my confidence in the variability and the correctness of my reported number.

A simple technique to assess the confidence of my measurement (we are simplifying this somewhat here) is to compute the confidence interval. Confidence Intervals (CIs)
are a tool to provide a range of values that include the true mean with a given probability p depending on the estimation procedure. So if the measurement is 1 second and the 95% CI is the range [0.9;1.1] then there is a 95% probability that the true mean is within that interval. There are two basic types of CIs: (1) confidence intervals around the mean assuming a normal distribution and (2) nonparametric confidence intervals around the median without assumptions on the distribution. The former CI one is simplest to compute: [mean-t(n-1,p/2)/sqrt(n); mean+t(n-1,p/2)/sqrt(n)] where mean is the arithmetic mean, n is the number of samples, and t(x,p) is student’s t distribution with x degrees of freedom. So it’s easy to see that the interval quickly gets tighter when the number of samples grows. But which computing system generates measurements following a standard distribution, which means that it’s equally likely to become faster than slower. Well, my computers are certainly more often becoming slower than faster leading to a right-skewed distribution.

So how do we get to confidence intervals of non-normally distributed measurements? Well, first of all, if the data is not normally distributed, the average makes little sense as it will be skewed as well. So one usually reports the median (the n/2-th element in the sorted set of all n measurements) as the most likely value to be observed in practice. But how to get to our confidence interval? Since we cannot assume any distribution of the values, we work on the sorted set of measurements and call the rank-i value the ith value in the set. Now we identify rank floor((n-z(p/2)*sqrt(n))/2) to ceil(1+(n+z(p/2)*sqrt(n))/2) as the conservative CI which is commonly asymmetric as well.

So ok, we can now compute this CI as statistical measure of certainty of our reported median. Median what? Don’t we like averages? Well, again, averages are not too useful for non-normally distributed data *unless* you care about only an accumulation of many measurements, i.e., you only want to know how expensive 1,000 iterations are and you do not care about every single one. Well, if this is the case, just measure the 1,000. If you’re well-versed in statistics, you will now recognize the connection to the Central Limit Theorem :-) .

But now again, how many measurements do we actually need?? To answer this, we’d first need to define a needed level of certainty, for example 95%. Then, we define an accepted error in our reporting around the median, for example 1%. In other words, we would like to have enough measurements to be 95% sure that the real median is within 1% of our reported value. Hey, so we’re now back to a single reported value just together with a certainty! So how do we achieve this? Well, for normally distributed data in the case (1), one could compute the number of needed measurements. But that doesn’t work with real computers, so let’s skip this here. In the nonparametric case, no explicit formula is known to us, so we would need to recompute the confidence interval after each measurement (or a set of measurements) and we could stop measuring once the 95% CI is within the 1% interval around the mean.

Wow, so now we know how to *really* measure and report performance! In fact, in practice, we often need less than 1,000 measurements to reach a tight interval with high confidence. So if they’re cheap, we can as well do them and check afterwards of the statistics make sense. But what if we are running out of benchmarking budget before we reach the required accuracy — for example, each measurement takes a day and we only have four days but after four days, the CI is still wider than we’d like it to be? Well, bad luck! In that case, we can only report the wide CI and leave it up wo the reader/observer to conclude if our measurements make sense in his context.

I wish you happy (and correct) measuring! Torsten Hoefler

This blog post summarizes a part of the paper “Scientific Benchmarking of Parallel Computing Systems” which appeared at IEEE/ACM Supercomputing 2015. The full paper provides more insight and references around this topic and also the equation for the number of measurements assuming a normal distribution. The paper also establishes more rules for sound performance analyses that I may blog on later. Spread the word and cite the paper if you find these rules useful :-) .

11 SPCL@ETH activities at SC14

The Intl. Supercomputing (SC) conference is clearly the main event in HPC. It’s program is broad and more than 10k people attend annually. SPCL is mainly focused on the technical program which makes SC the top-tier conference in HPC. It is the main conference of a major ACM SIG (SIGHPC).

This year, SPCL members co-authored three technical papers in the very competitive program with several thousand attendees! One was even nominated for the best paper award — and to take it upfront, we got it! Congrats Maciej! All talks were very well attended (more than 100 people in the room).

All of these talks were presented by collaborators, so I was hoping to be off the hook. Well, not quite, because I gave seven (7!) invited talks at various events and participated in teaching a full-day tutorial on advanced MPI. The highlight was a keynote at the LLVM workshop. I was also running around all the time because I co-organized the overall workshop program (with several thousand attendees) at SC14.

So let me share my experience of all these exciting events in chronological order!

1) Sunday: IA3 Workshop on Irregular Applications: Architectures & Algorithms

This workshop was very nice. Kicked off by top-class keynotes from Onur Mutlu (CMU) and Keshav Pingali (UT) through great paper talks and a panel in the afternoon. I served on the panel with some top-class people and it was a lot of fun!


Giving my panel presentation on accelerators for graph computing.


Arguing during the panel discussion (Hadoop right now) with (left to right): Keshav Pingali (UT Austin), John Shalf (Berkeley), me (ETH), Clayton Chandler (DOD), Benoit Dupont de Dinechin (Kalray), Onur Mutlu (CMU, Maya Gokhale (LLNL). A rather argumentative group :-) .

My slides can be found here.

2) Monday – LLVM Workshop

It was long overdue to discuss the use of LLVM in the context of HPC. So thanks to Hal Finkel and Jeff Hammond for organizing this fantastic workshop! I kicked it off with some considerations about runtime-recompilation and how to improve codes.

The volunteers counted around 80 attendees in the room! Not too bad for a workshop. My slides on “A case for runtime recompilation in HPC” are here.

3) Monday – Advanced MPI Tutorial

Our tutorial attendee numbers keep growing! More than 67 people registered but it felt like more were showing up for the tutorial. We also released the new MPI books, especially the “Using Advanced MPI” book which shortly after became the top new release on Amazon in the parallel processing category.

4) Tuesday – Graph 500 BoF

There, I released the fourth Green Graph 500 list. Not much new happened on the list (same as for the Top500 and Graph500) but the BoF
was still fun! Peter Kogge presented some interesting views on the data of the list. My slides can be found here.

5) Tuesday – LLVM BoF

Concurrently with the Graph 500 BoF was the LLVM BoF, so I had to speak at both at the same time. Well, that didn’t go too well (I’m still only one person — apologies to Jim). I only made 20% of this BoF but it was great! Again, very good turnout, LLVM is certainly becoming more important every year. My slides are here.

6) Tuesday – Simulation BoF

There are many simulators in HPC! Often for different purposes but also sometimes for similar ones. We discussed how to collaborate and focus our efforts better. I represented LogGOPSim, SPCL’s discrete event simulator for parallel applications.

My talk summarized features and achievements and slides can be found here.

7) Tuesday – Paper Talk “Slim Fly: A Cost Effective Low-Diameter Network Topology”

Our paper was up for Best Student Paper and Maciej did a great job presenting it. But no need to explain, go and read it here!


Maciej presenting the paper! Well done.

8) Wednesday – PADAL BoF – Programming Abstractions for Data Locality

Programming has to become more data-centric as architectures evolve. This BoF followed an earlier workshop in Lugano on the same topic. It was great — no slides this time, just an open discussion! I hope I didn’t upset David Padua :-) .


Didem Unat moderated and the panelists were — Paul Kelly (Imperial), Brad Chamberlain (Cray), Naoya Maruyama (TiTech), David Padua (UIUC), me (ETH), Michael Garland (NVIDIA). It was a truly lively BoF :-) .

But hey, I just got it in writing from the Swiss that I’m not qualified to talk about this topic — bummer!


The room was packed and the participation was great. We didn’t get to the third question! I loved the education question, we need to change the way we teach parallel computing.

9) Wednesday – Paper Talk “Understanding the Effects of Communication and Coordination on Checkpointing at Scale”

Kurt Ferreira, a collaborator from Sandia was speaking on unexpected overheads of uncoordinated checkpointing analyzed using LogGOPSim (it’s a cool name!!). Go read the paper if you want to know more!


Kurt speaking.

10) Thursday – Paper Talk “Fail-in-Place Network Design: Interaction between Topology, Routing Algorithm and Failures”

Presented by Jens Domke, a collaborator from Tokyo Tech (now at TU Dresden). A nice analysis of what happens to a network when links or routers fail. Read about it here.


Jens speaking.

11) Thursday – Award Ceremony

Yes, somewhat unexpectedly, we go the best student paper award. The second major technical award in a row for SPCL (after last year’s best paper).


Happy :-) .

Coverage by Michele @ HPC-CH and Rich @ insideHPC.

The MPI 3.0 Book – Using Advanced MPI

Our book on “Using Advanced MPI” will appear in about a month — now it’s the time to pre-order on Amazon at a reduced price. It is released by the prestigious MIT Press, a must read for parallel computing experts.

The book contains everything advanced MPI users need to know. It presents all important concepts of MPI 3.0 (including all newly added functions such as nonblocking collectives and the largely extended One Sided functionality). But the key is that the book is written in an example-driven style. All functions are motivated with use-cases and working code is available for most. This follows the successful tradition of the “Using MPI” series lifting it to MPI-3.0 and hopefully makes it an exciting read!

David Bader’s review hits the point

With the ubiquitous use of multiple cores to accelerate applications ranging from science and engineering to Big Data, programming with MPI is essential. Every software developer for high performance applications will find this book useful for programming on modern multicore, cluster, and cloud computers.

Here is a quick overview of the contents:

Section 1: “Introduction” provides a brief overview of the history of MPI and briefly summarizes the basic concepts.

Section 2: “Working with Large Scale Systems” contains examples of how to create highly-scalable systems using nonblocking collective operations, the new distributed graph topology for MPI topology mapping, neighborhood collectives, and advanced communicator creation functions. It equips readers with all information to write codes that are highly-scalable. It even describes how fault-tolerant applications could be written using a high-quality MPI implementation.

Section 3: “Introduction to Remote Memory Operations” is a gentle and light introduction to RMA (One Sided) programming using MPI-3.0. It starts with the concepts of memory exposure (windows) and simple data movement. It presents various example problems followed by practical advice to avoid common pitfalls. It concludes with a discussion on performance.

Section 4: “Advanced Remote Memory Access” will make you a clear expert in RMA programming, it covers advanced concepts such as passive target mode, allocating MPI windows using various examples. It also discusses memory models and scalable synchronization approaches.

Section 5: “Using Shared Memory with MPI” explains MPI’s strategy to shared memory. MPI-3.0 added support for allocating shared memory which essentially enables the new hybrid programming model “MPI+MPI“. This section explains guarantees that MPI provides (and what it does not provide) and several use-cases for shared memory windows.

Section 6: “Hybrid Programming” provides a detailed discussion on how to use MPI in cooperation with other programming models, for example threads or OpenMP. Hybrid programming is emerging to a standard technique and MPI-3.0 introduces several functions to ease the cooperation with others.

Section 7: “Parallel I/O” is most important in the future Big Data world. MPI provides a large set of facilities to support operations on large distributed data sets. We discuss how MPI supports contiguous and noncontiguous accesses as well as the consistency of file operations. Furthermore, we provide hints for improving the performance of MPI I/O.

Section 8: “Coping with Large Data” once Big Data sets are in main memory, we may need to communicate them. MPI-3.0 supports handling large data (>2 GiB) through derived datatypes. We explain how to enable this support and limitations of the current interface.

Section 9: “Support for Performance and Correctness Debugging” is addressed at very advanced programmers as well as tool developers. It describes the MPI tools interface which allows to introspect internals of the MPI library. Its flexible interface supports performance counter and control variables to influence the behavior of MPI. Advanced expert programmers will love this interface for architecture-specific tuning!

Section 10: “Dynamic Process Management” explains how processes can be created and managed. This feature enables growing and shrinking of MPI jobs during their execution and fosters new programming paradigms if it is supported by the batch systems. We only discuss the MPI part in this chapter though.

Section 11: “Working with Modern Fortran” is a must-read for Fortran programmers! How does MPI support type-safe programming and what are the remaining pitfalls and problems in Fortran?

Section 12: “Features for Libraries” addresses advanced library writers and described principles how to develop portable high-quality MPI libraries.

ExaMPI’13 Workshop at SC13

I wanted to highlight the ExaMPI’13 workshop at SC13. It was a while ago but it is worth reporting!

The workshop’s theme was “Exascale MPI” and the workshop addressed several topics on how to move MPI to the next big divisible-by-10^3 floating point number. Actually, for Exascale, it’s unclear if it’s only FLOPs, maybe it’s data now, but then, we easily have machines with Exabytes :-) . Anyway, MPI is an viable candidate to run on future large-scale machines, maybe at a low level.

A while ago, some colleagues and I summarized the issues that MPI faces in going to large scale: “MPI on Millions of Cores“. The conclusion was that it’s possible to move forward but some non-scalable elements need to be removed or avoided in MPI. This was right on topic for this workshop, and indeed, several authors of the paper were speaking!

The organizers invited me to give a keynote to kick off the event. I was talking about large-scale MPI and large-scale graph analysis and how this could be done in MPI. [Slides]

The very nice organizers sent me some pictures that I want to share here:


My keynote on large-scale MPI and graph algorithms.


The gigantic room was well filled (I’d guess more than 50 people).


Jesper talking about the EPIGRAM project to address MPI needs for the future large scales.


The DEEP strategy of Julich using inter-communicators (the first users I know of).


Pavan on our heterogeneous future, very nice insights.

All in all, a great workshop with a very good atmosphere. I received many good questions and had very good discussions afterwards.

Kudos to the organizers!

Emerging Technologies ramping up at SC13

A new element in this year’s Supercomputing SC13 conference, Emerging Technologies, is emerging at SC13 right now. The booth of impressive size (see below) features 17 diverse high-impact projects that will change the future of Supercomputing!

et_booth_bringup

Emerging Technologies (ET) is part of the technical program and all proposals have been reviewed in an academically rigorous process. However, as opposed to the standard technical program, ET will be located at the main showfloor (booth #3547). This enables to demonstrate technologies and innovations that would otherwise not reach the showfloor.

The standing exhibit is complemented by a series of short talks about the technologies. Those talks will be during the afternoons on Tuesday and Wednesday in the neighboring “HPC Impact Showcase” theater (booth #3947).

Check out http://sc13.supercomputing.org/content/emerging-technologies for the booth talks program!

Bob Lucas and I have been organizing the technical exhibit this year and Satoshi Matsuoka will run it for SC14 :-) .

So make sure to swing by if you’re at SC13. It’ll definitely be a great experience!

Advanced MPI Programming Tutorial at Supercomputing 2013

Pavan Balaji, Jim Dinan, Rajeev Thakur and I are giving our Advanced MPI Programming tutorial at Supercomputing 2013 on Sunday November 17th.

Are you wondering about the new MPI-3 standard? How it affects you as a scientific or HPC programmer and what nice new features you can use to make your life easier and your application faster? Then you should not miss our tutorial.

Our abstract summarizes the main topics:

The vast majority of production parallel scientific applications today use MPI and run successfully on the largest systems in the world. For example, several MPI applications are running at full scale on the Sequoia system (on ?1.6 million cores) and achieving 12 to 14 petaflops/s of sustained performance. At the same time, the MPI standard itself is evolving (MPI-3 was released late last year) to address the needs and challenges of future extreme-scale platforms as well as applications. This tutorial will cover several advanced features of MPI, including new MPI-3 features, that can help users program modern systems effectively. Using code examples based on scenarios found in real applications, we will cover several topics including efficient ways of doing 2D and 3D stencil computation, derived datatypes, one-sided communication, hybrid (MPI + shared memory) programming, topologies and topology mapping, and neighborhood and nonblocking collectives. Attendees will leave the tutorial with an understanding of how to use these advanced features of MPI and guidelines on how they might perform on different
platforms and architectures.

This tutorial is about advanced use of MPI. It will cover several advanced features that are part of
MPI-1 and MPI-2 (derived datatypes, one-sided communication, thread support, topologies and topology
mapping) as well as new features that were recently added to MPI as part of MPI-3 (substantial additions
to the one-sided communication interface, neighborhood collectives, nonblocking collectives, support for
shared-memory programming).

Implementations of MPI-2 are widely available both from vendors and open-source projects. In addition,
the latest release of the MPICH implementation of MPI supports all of MPI-3. Vendor implementations
derived from MPICH will soon support these new features. As a result, users will be able to use in practice
what they learn in this tutorial.

The tutorial will be example driven, reflecting scenarios found in real applications. We will begin with
a 2D stencil computation with a 1D decomposition to illustrate simple Isend/Irecv based communication.

We will then use a 2D decomposition to illustrate the need for MPI derived datatypes. We will introduce
a simple performance model to demonstrate what performance can be expected and compare it with actual
performance measured on real systems. This model will be used to discuss, evaluate, and motivate the rest
of the tutorial.
We will use the same 2D stencil example to illustrate various ways of doing one-sided communication in
MPI and discuss the pros and cons of the different approaches as well as regular point-to-point communica-
tion. We will then discuss a 3D stencil without getting into complicated code details.
We will use examples of distributed linked lists and distributed locks to illustrate some of the new ad-
vanced one-sided communication features, such as the atomic read-modify-write operations.
We will discuss the support for threads and hybrid programming in MPI and provide two hybrid ver-
sions of the stencil example: MPI+OpenMP and MPI+MPI. The latter uses the new features in MPI-3 for
shared-memory programming. We will also discuss performance and correctness guidelines for hybrid pro-
gramming.

We will introduce process topologies, topology mapping, and the new “neighborhood” collective func-
tions added in MPI-3. These collectives are particularly intended to support stencil computations in a scalable
manner, both in terms of memory consumption and performance.
We will conclude with a discussion of other features in MPI-3 not explicitly covered in this tutorial
(interface for tools, Fortran 2008 bindings, etc.) as well as a summary of recent activities of the MPI Forum
beyond MPI-3.

Our planned agenda for the day is

  1. Introduction (8.30–10.00)
    • Background: What is MPI
    • MPI-1, MPI-2, MPI-3
    • 2D stencil code with 1D decomposition: Isend/Irecv version
    • 2D stencil code with 2D decomposition: Introduce derived datatypes
    • Introduce simple performance modeling and measurement
  2. One-Sided Communication (10.30–12.00)
    • Basics of one-sided communication or remote memory access (RMA)
    • 2D stencil code with 1D decomposition: RMA with 3 forms of synchronization
    • 3D stencil: What changes and what to pay attention to
    • Introduce other features of MPI-3 RMA
    • Linked list or distributed lock example demonstrating new MPI-3 RMA features
  3. Lunch (12.00–1.30)
  4. MPI and Threads (1.30–3.00)
    • What does the MPI standard specify about threads
    • How does it enable hybrid programming
    • Hybrid (MPI+OpenMP) version of 2D stencil code
    • Hybrid (MPI+MPI) version of 2D stencil code using MPI-3 shared-memory support
    • Performance and correctness guidelines for hybrid programming
  5. Topologies, Neighborhood/Nonblocking Collectives (3.30-5.00)
    • Topologies and topology mapping
    • 2D stencil code with 2D decomposition using neighborhood collectives
    • MPI-3 nonblocking collectives with example
    • Summary of other features in MPI-3
    • Summary of recent activities of the MPI Forum
    • Conclusions

We’re looking forward to many interesting discussions!

EuroMPI 2013 & Best Paper Award

EuroMPI is a very nice conference for the specialized sub-field MPI, namely the Message Passing Interface. I’m a long-term attendee since I’m working much on MPI and also standardization. We had a little more than 100 attendees this year in Madrid and the organization was just outstanding!

We were listening to 25 paper talks and five invited talks around MPI. For example Jesper Traeff, who discussed how to generalize datatypes towards collective operations:

Or Rajeev Thakur, who explained how we get to Exascale and that MPI is essentially ready:

Besides the many great talks, we also had some fun, like the city walking tour organized by the conference

the evening reception, a very nice networking event

or more networking in the Retiro park

followed by the traditional dinner.

On the last day, SPCL’s Timo Schneider presented our award-winning paper on runtime compilation for MPI datatypes

with a provocative start (there were many vendors in the room :-)

but an agreeing end.

The award ceremony followed right after the talk.

The conference was later closed by the announcement of next year, when EuroMPI will move to Japan (for the first time outside of Europe).

After all, a very nice conference! Kudos to the organizers.

The one weird thing about Madrid though … I got hit in the face by a random woman in the subway on my way back. Looks like she claimed I had stolen her seat (not sure why/how that happened and many other seats were empty) but she didn’t speak English and kept swearing at me. Weird people! :-)

DFSSSP: Fast (high-bandwidth) Deadlock-Free Routing for InfiniBand Networks

The Open Fabrics Alliance just released a new version of then Open Subnet Manager including our Deadlock-Free SSSP Routing for InfiniBand (DFSSSP) routing algorithm [2]!

This new version fixes several minor bugs, adds the support for base/enhanced switch port 0 and improves the routing performance further, but lacks support for multicast routing (see ‘Update’ below).

DFSSSP is a new routing algorithm that can be used to route InfiniBand networks with OpenSM 3.3.16 [1] and later. It performs generally better than the default Min Hop algorithm and avoids deadlocks by routing through different virtual lanes (VLs). Due to the above-mentioned problems, we don’t recommend to use the DFSSSP routing algorithm which is included in the OFED 3.2 and 3.5 releases.

DFSSSP can lead to up to 50% higher routing performance for dense (bisection-limited) communication patterns, such as all-to-all and thus directly accelerates dense communication applications such as the Graph500 benchmark [4]. The following figure shows a direct comparison with other routing algorithms on a 726 node cluster running MPI with 1 process (in the 1024 process case, some nodes have two processes) per node:

netgauge_deimos

This comparison uses Netgauge’s effective bisection bandwidth benchmark, an approximation of the real bisection bandwidth of a network.

MPI_Alltoall performance is similarly improved over Min Hop and LASH routing as can be observed in the following figure (using 128 nodes):

mpialltoall

The new DFSSSP algorithm can be used with OpenSM version 3.3.16 starting it with ‘-R dfsssp’ on the command line or setting ‘routing_engine dfsssp’ in the configuration file. Despite the configuration of the routing algorithm, you will have to enable QoS with an uniform distribution (see [A1]) and you will have to enable service level query support within your MPI environment (see [A2] for OpenMPI).

You should compare the bandwidth yourself. Effective bisection bandwidth and all-to-all can be measured with Netgauge, however, real application measurements are always best!

Now you may be wondering why DFSSSP is faster than Min Hop since Min Hop is already minimizing the number of hops between all endpoints. The trick is that DFSSSP optimizes the *global bandwidth* in addition to the distance between endpoints. This is achieved with a simple greedy algorithm described in detail in [3]. Deadlock-freedom is then added by using different virtual lanes for the communication as described in [2]. By the way, Min Hop does not guarantee deadlock freedom! If you want to know more, read [2] and [3] or come to the HPC Advisory Council Switzerland Conference 2013 conference in March where I’ll give a talk about the principles behind DFSSSP and how to use it in practice.

DFSSSP is developed in collaboration between the main developer Jens Domke at the Tokio Institute of Technology, and Torsten Hoefler of the Scalable Parallel Computing Lab at ETH Zurich.

[1]: opensm-3.3.16.patched.tar.gz
[2]: J. Domke, T. Hoefler and W. Nagel: Deadlock-Free Oblivious Routing for Arbitrary Topologies
[3]: T. Hoefler, T. Schneider and A. Lumsdaine: Optimized Routing for Large-Scale InfiniBand Networks
[4]: Graph 500: www.graph500.org
[5]: openmpi-1.6.4.patched.tar.gz

[A1] Possible QoS configuration for OpenSM + DFSSSP with 8 VLs:

qos TRUE
qos_max_vls 8
qos_high_limit 4
qos_vlarb_high 0:64,1:64,2:64,3:64,4:64,5:64,6:64,7:64
qos_vlarb_low 0:4,1:4,2:4,3:4,4:4,5:4,6:4,7:4
qos_sl2vl 0,1,2,3,4,5,6,7,0,1,2,3,4,5,6,7

[A2] Enable SL queries for the path setup within OpenMPI:

a) configure OpenMPI with "--with-openib --enable-openib-dynamic-sl"
b) run your application with "--mca btl_openib_ib_path_record_service_level 1"

PS: we experienced some trouble with old HCA firmware, which did not support sending userspace MAD request on VLs other than 0.
You can test the following command (as root) on some nodes and see if you get an response from the subnet manager:

saquery -p --src-to-dst LID1:LID2

In case the command stalls or returns with an error you might have to update the firmware.

Update:
The fix for multicast routing has been implemented and tested. Please, use our patched version of opensm-3.3.16 (see [1]) instead of the default version from the OFED websites. Besides the multicast patch, this version contains a slightly enhanced implementation of the VL balancing. Future releases by the Open Fabrics Alliance (>= 3.3.17) will be shipped with both patches.
Besides the multicast problem, we have identified a bug in OpenMPI related to the connection management of the openib BTL. We provide a patched version of OpenMPI as well (see [5]).

MPI-3.0 is coming soon! Updates from the Japan Meeting.

mpi3v2

The Japan MPI Forum was rather “mild” until the last day where we had all the votes. Several controversial things came up for vote and many things that were not ready were pushed for a vote. We were 16 organizations eligible for voting and each ticket would only need 9 yes votes to get in, rather small imho.

While I am not 100% sure about the decision making process in the Forum, I think we made mostly sane decisions (some exception are of course strengthening this rule :-) ).

Executive summary:

  • no fault tolerance for MPI-3.0: the Forum decided against the proposal
  • no “true” nonblocking I/O functions in MPI-3.0
  • no helper threads in MPI-3.0
  • removing the C++ bindings passed the first vote — scary!

Now to the detailed actions/votes:
First Votes

  • the small fixes #187 and #192 passed
  • #194 (allow non-rectangular MPI_Dims_create) was withdrawn based on comments
  • #195 (topology awareness in MPI_Dims_create) was rejected because the ticket was obviously not ready/clean
  • #217 (helper Threads) was rejected, it was always controversial
  • #256 (MPI_PROC_NULL behavior for MPI_PROBE) passed
  • #271 (functions to query MPI_Info object) passed, George raised an issue with the naming that we should fox before the final release
  • #273 (immediate versions of nonblocking collective I/O routines) was rejected, got very close but raised the concerns that there is no optimized implementation even for the split I/O right now
  • #278 (update examples to not use deprecated constructs) passed
  • #281 (remove C++ bindings) passed, unfortunately, long live C++ exceptions!
  • #294 (MPI_UNWEIGHTED should not be NULL) passed (clearly)
  • #300 (minor issue in One Sided) passed (clearly)
  • #303 (move MPI-2 deprecated functions to new “Removed interfaces”) was rejected after some discussion
  • #313 (fixing init init and finalize) passed
  • #310 (clarify MPI behavior when multiple MPI processes run in the same address space) was rejected because it was felt that it”s inconsistent with #313
  • #317 (correct error related to MPI_REQUEST_FREE) passed (trivial fix)
  • #323 (new FT proposal) was rejected, it was controversial before and seemed to be edited until the last second. #326 and #327 were withdrawn based on that vote. So this means essentially no FT for MPI-3.0
  • #328 (fix MPI_PROC_NULL behavior for mprobe/improbe/mrecv/imrecv) passed

Second Votes

  • the two simple (nearly ticket-0) RMA changes #308 and #309 passed unanimously
  • #284 (allocate shared memory window) passed – woohoo! I thought the idea was dead a long time ago!
  • #280 (hindexed_block) passed, as expected
  • #168 (nonblocking communicator duplication) passed
  • #272 (remove C++ bindings in nonblocking colls) passed :-(
  • #286 (noncollective communicator creation) passed!
  • #305 (update MPI_Intercomm_create to use collective tag space) passed