Nue Routing: fast, 100% fault-tolerant, 100% applicable, 100% deadlock-free

The OFA just released a new Open Subnet Manager version (v3.3.21) for InfiniBand, including many interesting features:

  • Support for HDR link speed and 2x link width
  • New routing algorithm: Nue routing
  • Support for ignoring throttled links for Nue [1,2] and (DF)SSSP [3,4] routing
  • …and many more internal enhancements to OpenSM.

Nue Routing

Deadlock-freedom in general, and the limited number of virtual channels provided in modern interconnects in particular, have been long-standing problems for network researchers and engineers.
Nue routing is not just yet another new algorithm for statically routed high-performance interconnects, but a revolutionary step with respect to deadlock-freedom and fault-tolerance.

Our goal was to combine advantages of existing routing algorithms, primarily the flexibility of Up/Down routing and outstanding global path balancing of SSSP routing [5], while guaranteeing deadlock-freedom regardless of number of virtual channels/lanes or network type or size.
The incarnation of this effort, called Nue routing, derived from the legendary Japanese chimera, is the first algorithm capable of delivering high throughput, low latency, fast path calculation, and 100% guaranteed deadlock-freedom for any type of topology and network size.
All of this is enabled by the fundamental switch from calculating the routing within a graph representing the network to a new graph representation: the complete channel dependency graph.
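
To make the notion of a channel dependency graph concrete, here is a minimal Python sketch. It is not Nue's algorithm (see [1] for that); it only illustrates the classic criterion that a routing is deadlock-free if its channel dependency graph is acyclic. The route format (lists of switch names), the function names, and the four-switch example are assumptions made for this illustration.

```python
from collections import defaultdict

def channel_dependency_graph(routes):
    """Map each channel (u, v) to the channels (v, w) that some route
    requests while still holding (u, v)."""
    cdg = defaultdict(set)
    for path in routes:
        channels = list(zip(path, path[1:]))       # directed links used by this route
        for held, wanted in zip(channels, channels[1:]):
            cdg[held].add(wanted)
        for ch in channels:                        # make sure every channel is a node
            cdg.setdefault(ch, set())
    return dict(cdg)

def has_cycle(graph):
    """Iterative three-color DFS; True if the directed graph contains a cycle."""
    WHITE, GREY, BLACK = 0, 1, 2
    color = {node: WHITE for node in graph}
    for start in graph:
        if color[start] != WHITE:
            continue
        color[start] = GREY
        stack = [(start, iter(graph[start]))]
        while stack:
            node, successors = stack[-1]
            for nxt in successors:
                if color[nxt] == GREY:             # back edge: cycle found
                    return True
                if color[nxt] == WHITE:
                    color[nxt] = GREY
                    stack.append((nxt, iter(graph[nxt])))
                    break
            else:                                  # all successors explored
                color[node] = BLACK
                stack.pop()
    return False

# Hypothetical 4-switch ring, every route going clockwise for two hops.
ring_routes = [["A", "B", "C"], ["B", "C", "D"], ["C", "D", "A"], ["D", "A", "B"]]
# Same switches, but no route ever uses the D-A wrap-around link.
line_routes = [["A", "B", "C"], ["B", "C", "D"], ["D", "C", "B"], ["C", "B", "A"]]

print(has_cycle(channel_dependency_graph(ring_routes)))  # True: potential deadlock
print(has_cycle(channel_dependency_graph(line_routes)))  # False: deadlock-free
```

In the ring example each of the four channels waits for the next one, which closes a cycle; removing the wrap-around link from the routes breaks it.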

Without going into detail about the inner workings, which can be found in our HPDC’16 publication [1] and Jens’ dissertation [2; Chapter 6], we will highlight Nue’s capabilities with the next two figures.

The figure below compares many existing routing algorithms of the OpenSM (we excluded MinHop and DOR, since these are only deadlock-free under certain constraints) to our Nue routing for a variety of network topologies, hosting roughly between 1000 and 2000 compute nodes each.
We have been using a cycle-accurate InfiniBand simulator to obtain these results.
Each bar represents the simulated communication throughput for an MPI_Alltoall operation (2KB payload per node) executed on all compute nodes of the topology, and hence a pretty accurate estimate of the capabilities of the network and how well the routing is able to utilize the available resources.
For many subplots only a subset of OpenSM’s routing engines is shown alongside Nue, because we filtered out instances where the routing engine was not able to create valid routing tables.
Above each bar we list the number of virtual channels this routing will consume to achieve a deadlock-free routing configuration.
Furthermore, the achievable network throughput under the given traffic pattern is shown for Nue routing with different numbers of virtual channels, ranging from 1 (equivalent to the absence of VCs) to 8.


In summary, the figure shows that Nue routing is competitive with the best performing routing for each individual topology, and offers between 84% (for the 10-ary 3-tree) and 121% (for the Cascade network) of that routing's throughput.
Occasionally, depending on the given number of virtual channels, Nue is able to outperform the best competitor.
While our original design goals never included the ambition to beat each and every other routing on its home turf, we are glad to see that we can outperform most of them given a sufficient number of channels.
However, this figure also demonstrates Nue's high flexibility w.r.t. the given number of channels.
Take, for example, the Kautz network (left; middle row), where Nue can create a decent deadlock-free routing configuration without virtual channels, while DFSSSP needs 8 VCs and LASH needs at least 5 VCs; with just 5 VCs, Nue is also able to outperform both.

The next figure demonstrates Nue’s fault-tolerance as well as the relatively fast path calculation in comparison to other topology-agnostic routing engines (DFSSSP/LASH) and the topology-aware Torus2QOS engine.
For this test we used regular 3D tori networks of different sizes and randomly injected 1% switch-to-switch link failures into each topology.
The runtime for calculating all n-to-n paths in the network was measured for each routing engine and plotted, but only in cases where the engine was capable of producing a valid routing within the realistic 8VC constraint.


Thanks to its O(n^2 * log n) runtime complexity and efficient implementation, Nue starts to outperform DFSSSP and LASH with respect to runtime even for relatively small tori.
But more importantly, Nue can always create deadlock-free routing tables, while all other engines (even the semi-fault-tolerant and topology-aware Torus2QOS) eventually fail for larger networks.

Overall the advantages of Nue routing are manifold:

  • Allows a “fire-and-forget” approach to network administration, i.e., it works 100% regardless of network failures, which is ideal for fail-in-place networks
  • Low runtime and memory complexity (O(n^2 * log n) and O(n^2), respectively)
  • Guaranteed deadlock-freedom and highly configurable in terms of VC usage
  • VCs not necessary for deadlock-freedom, which extends possible application to NoC and other interconnects which don’t support virtual channels
  • Completely topology-agnostic and yet very good path balancing under the given deadlock-freedom constraint
  • Support for QoS and deadlock-freedom simultaneously (both realized in InfiniBand via VCs)
  • Theoretically applicable to other (HPC) interconnects: RoCEv2, NoC, OPA, …

Everyone can now test and use Nue routing with the OpenSM v3.3.21 release by choosing it either via command line option:

--routing_engine nue   [and optionally: --nue_max_num_vls <given #VCs>]

or via OpenSM configuration file:

routing_engine nue
nue_max_num_vls <given #VCs>

The default nue_max_num_vls for Nue is assumed to be equal to 1 to enforce deadlock-freedom even if QoS is not enabled.

For less adventurous admins ☺, or systems with specifically optimized routing, we still recommend always using Nue as fallback (in case the primary routing fails) via:

routing_engine <primary>,nue

to ensure maximum fault-tolerance and uninterrupted operation of the system until the hardware failures are fixed (which is definitely better than the default fallback behavior to the deadlock-prone MinHop by OpenSM).

A more detailed description of OpenSM’s options for Nue is provided in the documentation, and for more fine-grained control over the virtual channel configuration we recommend reading our previous blog post for the DFSSSP routing engine.
(Note: it is HIGHLY advised to install/use the METIS library with OpenSM (enforced via --enable-metis configure flag when building OpenSM) for improved path balancing in Nue.)

Avoiding throttled links

The second new feature we were able to push upstream is designed to ease the job of system admins in case of temporary or long-term link degradation.

More often than one would wish, one or multiple links in large-scale InfiniBand installations get throttled from their intended speed (e.g., 100Gbps EDR) to much lower speeds, like 8Gbps SDR.
While this IB feature is designed to keep the fabric and connectivity up, we argue that such a throttled link will be a major bottleneck to all application and storage traffic, and hence should be avoided.
Usually, HPC networks, especially fat-trees, have enough path redundancy that moving all paths away from the affected link(s) and distributing them to other links should degrade performance less than keeping the link at low speed.
However, identifying, disabling, and ultimately replacing “bad” cables takes time.

So, we added a check to the SSSP, DFSSSP, and Nue routing engines to identify such degraded links, which prevents these routings from placing any path onto the links, essentially instantly “disabling” the link and issuing a warning in the logs for the system admin.
This feature can be turned on or off in the configuration file of the subnet manager by switching the avoid_throttled_links parameter to TRUE or FALSE, respectively.
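
The idea can be sketched in a few lines of Python. This is an illustration only, not OpenSM's implementation: the link table, the `expected_gbps` threshold, and the function names are assumptions made for the example. Links running below the expected speed are dropped before the shortest-path computation and reported as warnings.

```python
import heapq

def usable_links(links, expected_gbps):
    """Split links into (kept, throttled) by comparing the active speed
    against the speed the fabric is supposed to run at."""
    kept, throttled = [], []
    for u, v, gbps in links:
        (kept if gbps >= expected_gbps else throttled).append((u, v, gbps))
    return kept, throttled

def hop_distances(links, source):
    """Dijkstra with unit link weights over an undirected link list."""
    adj = {}
    for u, v, _ in links:
        adj.setdefault(u, []).append(v)
        adj.setdefault(v, []).append(u)
    dist = {source: 0}
    queue = [(0, source)]
    while queue:
        d, node = heapq.heappop(queue)
        if d > dist[node]:
            continue                               # stale queue entry
        for nxt in adj.get(node, []):
            if d + 1 < dist.get(nxt, float("inf")):
                dist[nxt] = d + 1
                heapq.heappush(queue, (d + 1, nxt))
    return dist

# Hypothetical 4-switch fabric; the s1-s2 link fell back from EDR to SDR.
links = [("s1", "s2", 8), ("s1", "s3", 100), ("s3", "s4", 100), ("s4", "s2", 100)]
kept, throttled = usable_links(links, expected_gbps=100)
for u, v, gbps in throttled:
    print(f"warning: avoiding throttled link {u}-{v} ({gbps} Gbps)")
print(hop_distances(kept, "s1"))  # s2 is now 3 hops away instead of 1
```

The detour costs two extra hops in this toy fabric, but all traffic stays on full-speed links.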

Nue and DFSSSP were developed in collaboration between the main developer Jens Domke at the Matsuoka Laboratory, Tokyo Institute of Technology, and Torsten Hoefler of the Scalable Parallel Computing Lab at ETH Zurich.
We would like to acknowledge Hal Rosenstock, the maintainer of OpenSM, who is always supportive of new ideas, and we greatly appreciated his comments and help during the integration of Nue into the official OpenSM.

[1]: J. Domke, T. Hoefler and S. Matsuoka: Routing on the Dependency Graph: A New Approach to Deadlock-Free High-Performance Routing
[2]: J. Domke: Routing on the Dependency Graph: A New Approach to Deadlock-Free, Destination-Based, High-Performance Routing for Lossless Interconnection Networks (Dissertation)
[3]: J. Domke, T. Hoefler and W. Nagel: Deadlock-Free Oblivious Routing for Arbitrary Topologies
[4]: Our prev. DFSSSP blog post: DFSSSP: Fast (high-bandwidth) Deadlock-Free Routing for InfiniBand Networks
[5]: T. Hoefler, T. Schneider and A. Lumsdaine: Optimized Routing for Large-Scale InfiniBand Networks

SPCL’s activities at ISC’18

Just a brief overview of SPCL’s (non-NDA) ongoing and upcoming activities at ISC’18.

1) We’re in the middle of the Advanced MPI Tutorial

With Antoni Pena from Barcelona Supercomputing Center, Tweet

2) Wednesday, 26.06., 11:15am, Talk: Automatic compiler-driven GPU acceleration with Polly-ACC

Part of the session “Challenges for Developing & Supporting HPC Applications” organized by Bill Gropp. (Related work)

3) Wednesday, 26.06., 1:45pm, Torsten organizes the session “Data Centric Computing” with speakers Anshu Dubey, Felix Wolf, John Shalf, and Keshav Pingali

4) Thursday, 28.06., 10:00am, Talk: High-level Code Transformations for Generating Fast Hardware
(Megabyte room)

At Post Moore’s Law HPC Computing (HCPM) workshop (Related work)

5) Thursday, 28.06., 12:20pm, Talk: Demystifying Parallel and Distributed Deep Learning: An In-Depth Concurrency Analysis
(Gold 3 room)

At Workshop on the Convergence of Large Scale Simulation and Artificial Intelligence (Related work)

6) Thursday, 28.06., 3:20pm, Talk: A Network Accelerator Programming Interface
(Megabyte room)

At Post Moore Interconnects (Beyond CMOS) Workshop (Related work)

7) Thursday, 28.06., Panel: Performance Analysis and Instrumentation of Networks
(Basalt room)

At International Workshop on Communication Architectures for HPC, Big Data, Deep Learning and Clouds at Extreme Scale (Related work)

8) Friday, 29.06., European Processor Initiative (EPI) Steering Meeting

In addition to these public appearances, we’re involved in many meetings, vendor presentations, booth appearances, and other activities. Meet us around the conference and booths!

SC18’s improved reviewing process – call for papers and comments

Disclaimer: This blog post is not binding for the SC18 submission process. It attempts to explain the background and history of the innovations. For authoritative answers regarding the process, authors MUST refer to the SC18 webpage and FAQ!

What many of us know can also be shown with numbers: The SC conference is the most prestigious conference in High Performance Computing (HPC). It is listed as rank 6 in the “Computing Systems” Category in Google Scholar’s Metrics (with H-index 47 on January 21st 2018). It is only topped by TPDS, FGCS, NSDI, ISCA, and ASPLOS and thus the highest ranked HPC conference! The next one is arguably PPoPP with H-index 37 and rank 20.

The SC conference routinely attracts more than 10,000 attendees and nearly 50% indicated in a representative survey that attending technical presentations was within their top-3 activities. This makes it definitely the HPC conference where speakers reach the largest audience. I speak from experience: my talk at SC17 probably had more than 400 listeners in the audience and its twitter announcement quickly surpassed 10,000 views. So it definitely is the conference where big things start.

This year, I am honored to be SC18’s program chair, with the enormous help of my vice chair Todd Gamblin from LLNL. To make this great conference even greater, especially for authors and readers/attendees, we plan some major changes to the submission process: In addition to rebuttals, we introduce two different types of revisions during the submission. This allows the authors to address reviewer issues right within the paper draft while they may also add new data to support their discoveries. Rebuttals are still possible but will probably become less important because misunderstandings can be clarified right in the draft. Whether the paper is accepted or rejected, the authors will have an improved version. The revision process leads to an increased interaction between the committee and the authors, which eventually will increase the quality of the publications and talks at the conference. The overall process could be described as an attempt to merge the best parts of the journal review process (expert reviewers and revisions) with the conference review process (fixed schedule and quick turnaround).

This process has been tested and introduced to the HPC field by David Keyes and myself at the ACM PASC 2016 conference in Switzerland. We were inspired by top-class conferences in the field of architecture and databases but adapted their process to the HPC community. The established PASC review process motivated the addition of revisions for IPDPS 2018 (through the advocacy of Marc Snir). Now, we introduce similar improvements scaled to the Supercomputing conference series.

The key innovations of the PASC review process were (1) no standing committee (the committee was established by the chairs based on the submissions, similar to a journal); (2) fully double-blind reviews (not even the TPC chairs knew the identity of the authors); (3) short revisions of papers (the authors could submit revised manuscripts with highlighted changes), and (4) expert reviewers (the original reviewers were asked to suggest experts in the topic for a second round of reviews). The results are documented in a presentation and a paper.

My personal highlight was a paper in my area that improved its ranking drastically from the first to the second review because it was largely rewritten during the revision process. In general, the revision seemed highly effective as the statistics show: of the 105 first reviews, 19 improved their score by 1 point, and 2 improved it by two points in the second review. Points ranged from 1 (strong reject) to 5 (strong accept). These changes show how revisions improved many reviewers’ opinions of the papers and turned good papers into great papers. The revision even enabled the relatively high acceptance rate of 27% without compromising quality. The expert reviews also had a significant effect, which is analyzed in detail in the paper.

The Supercomputing conference has a long history and an order of magnitude more submissions and thus a much larger committee with a fixed structure spanning many areas. Furthermore, the conference is aligned to a traditional schedule. All this allows us to only adopt a part of the changes successfully tested at PASC. Luckily, double-blind reviews were already introduced in 2016 and 78% of the attendees surveyed preferred them over non-double-blind reviews. Thus, we can focus our attention on introducing the revision process as well as the consideration of expert reviews.

Adapting the revision process to SC was not a simple task because schedules are set years in advance. For example, the deadline cannot be moved earlier than the end of March due to the necessary coordination with other top-class conferences such as ACM HPDC and ACM ICS (which is already tight, but doable, this year). We will also NOT grant the “traditional” one week extension. Let me repeat: there will be NO EXTENSIONS this year (like in many other top-class CS conferences). Furthermore, the TPC meeting has already been scheduled for the beginning of June and could not be moved for administrative reasons. The majority of the decisions have to be made during that in-person TPC meeting. We will also have to stay within the traditional acceptance rates of SC. We conclude that significant positive changes are possible within the limited options.

To fit the revision process into the SC schedule, we allow authors to submit a featherweight revision two weeks after receiving the initial reviews. This is a bit more time than for the rebuttal but may not be enough for a full revision. But the authors are free to prepare it before receiving the reviews. Even in the case of a later rejection I personally believe that improving a paper is useful. Each featherweight revision should be marked up with the changes very clearly (staying within the page limit). The detailed technology is left to the authors. In addition, the limited-length rebuttal could be used to discuss the changes. The authors need to keep in mind that the reviewers will have *very little* time (less than one week before the TPC meeting) to review the featherweight revision. In fact, they will have barely more time than for reviewing a rebuttal. So the more obviously the changes are marked and presented, the better the chances for a reconsideration by the committee. Furthermore, due to these unfortunate time limitations, we cannot provide a second round of reviews for the featherweight revision (reviewers are free to amend their reviews but we cannot require them to). Nevertheless, we strongly believe that all authors can use this new freedom to improve their papers significantly. We are also trying to provide some feedback on the paper’s relative ranking to the authors if the system allows this.

During the in-person TPC meeting, the track chairs will moderate the discussion of each paper and rank each in one of the following categories: Accept, Minor Revision, Major Revision, or Reject. An accepted paper is deemed suitable for direct publication in the SC proceedings; we expect the top 3-5% of the submitted papers to fall into that category. A Minor Revision is similar to a shepherded paper and is accepted with minor amendments, pending a final review of the shepherd; we expect about 10% of the submitted papers to fall into this category. This higher-than-traditional number of shepherded papers is consistent with top conferences in adjacent fields such as OSDI, NSDI, SOSP, SIGMOD, etc. The new grade is Major Revision, which invites the authors to submit a majorly changed paper within one month. A major revision typically requires additional results or analyses. We expect no more than 10% of the initial submissions to fall in this category; about 5% will finally be accepted (depending on the final quality). Major revision papers will be reviewed again and a final decision will be made during an online TPC discussion, moderated by the respective track chair. Finally, Rejected papers at any stage will not appear in the SC proceedings.

Regarding expert reviews, we may invite additional reviewers during any stage of the process. Thus, we ask authors to specify all strong conflicts (even people outside the current committee) during the initial submission. Furthermore, we are planning to have reviewers review reviews by the other reviewers to improve the quality of the process in the long run.

At the end of this discussion, let me place a shameless plug for efforts to improve performance interpretability :-) : We hope that the state of performance reporting can be improved at SC18. While many submissions use excellent scientific methods for evaluating performance on parallel computing systems, some can be improved following very simple rules. I made an attempt to formalize a set of basic rules for performance reporting in the SC15 State-of-the-Practice paper “Scientific Benchmarking of Parallel Computing Systems”. I invite all authors to follow these rules to improve their submissions to any conference (they are of course NOT a prerequisite for SC18 but generally useful ;-) ).

We are very much looking forward to working with the technical papers team to make SC18 the best technical program ever and consolidate the leading position of the SC conference series in the field of HPC. Please let me or Todd know if you have any comments, make sure to submit your best work to SC18 before March 28, and help us to make SC18 have the strongest paper track ever!

I want to especially thank David Keyes for advice and help during PASC’16, Todd Gamblin for the great support for the organization of SC18, and Bronis de Supinski for ideas regarding the adaptation of the PASC process to the SC18 conference. Most thanks go to the track chairs and vice chairs that will support the implementation of the process during the SC18 paper selection process (in the order of the tracks): Aydin Buluc, Maryam Mehri Dehnavi, Erik Draeger, Allison Baker, Si Hammond, Madeleine Glick, Lavanya Ramakrishnan, Ioan Raicu, Rob Ross, Kelly Gaither, Felix Wolf, Laura Carrington, Pat McCormick, Naoya Maruyama, Bronis de Supinski, Ashley Barker, Ron Brightwell, and Rosa Badia. And last but not least the 200+ reviewers of the SC18 technical papers program!

SPCL activities at SC17

SC17 is over, and even though it was my 10th anniversary, it wasn’t the best of the SC series. Actually, if you ask me personally, probably the worst but I promised to not discuss details here. Fortunately, I’ll be tech papers chair with Todd Gamblin as a vice next year, so we’ll make sure to remain purely technical. The SC series is and remains strong!

SPCL was again present in many areas across the technical program. Konstantin, Tobias, Salvatore, and I were involved in many things. Here are the fourteen most significant appearances:

1) Sunday: Torsten presented Capability Models for Manycore Memory Systems: A Case-Study with Xeon Phi KNL and the COSMO Weather Code at the Intel HPC Developer’s conference

Room was packed and people were standing :-) . Slides

2) Sunday: Salvatore presented LogGOPSim version 2 at the ExaMPI workshop

3) Monday: Tobias talks about “Improved Loop Distribution in LLVM Using Polyhedral Dependences” at the LLVM workshop [program]

4) Monday: Torsten co-presents the Advanced MPI Tutorial [program]

5) Monday: Torsten presents at the Early Career Panel how to publish [program]

6) Monday: Salvatore presents his work on SimFS at the PDSW workshop

7) Tuesday: Torsten presents the sPIN talk at the TiTech booth

8) Tuesday: Torsten talks at the 25 years of MPI and 20 years of OpenMP celebration at the Intel booth

MPI+MPI or MPI+OpenMP is the question :-) .

9) Tuesday: Torsten appears at the SIGHPC annual members meeting as an elected member (slightly late due to the Intel celebration)

10) Tuesday: Konstantin presents his poster Unifying Replication and Erasure Coding to Rule Resilience in KV-Stores at the poster reception

11) Wednesday: Torsten presents the sPIN paper in the technical program

Room was full. Unfortunately, the session chair’s clock was wrong, so we started 5 mins early and people streamed in late :-( . Sorry! But that was the smallest thing that was wrong with this …

12) Wednesday: Salvatore presents his poster on Virtualized Big Data: Reproducing Simulation Output on Demand as ACM SRC semi finalist

13) Thursday: Edgar presents the paper Scaling Betweenness Centrality Using Communication-Efficient Sparse Matrix Multiplication in the technical program

14) Friday: Torsten co-organizes the H2RC workshop

Triple room was packed (~150-200 people during the keynote).

Persistent Collective Operations in MPI-3 for free!


We discussed persistent collectives at the MPI Forum last week. It was a great meeting and the discussions were very insightful. I really like persistent collectives and believe that MPI implementors should support them!

In that context, I wanted to note that implementors can do this easily and elegantly in MPI-3 without any changes to the standard. We used this technique already in 2012 in the paper “Optimization Principles for Collective Neighborhood Communications”. But let me recap the idea here.

The key ingredients are communicators (MPI’s name for immutable process groups) and Info objects. Info objects are a mechanism for users to pass additional information about how they will use MPI to the library. Info objects are very similar to pragmas in C/C++. Some info strings are defined by the standard itself but MPI libraries may add arbitrary strings to it.

So one way to specify a persistent collective is now to duplicate the communicator to create a new name, e.g., my_persistent_comm. On this communicator, the user can specify an info object to make specific operations persistent, e.g., mympi_bcast_is_persistent. The MPI library is encouraged to choose a prefix specific to itself (in this case “mympi”).

The library can now set a flag on the communicator and check it at each broadcast call to determine whether the call is persistent. By passing this info object, the user guarantees that the function arguments passed to the specific call (e.g., bcast) on this communicator will always be the same. Thus, the MPI library can make the call specific to the arguments (i.e., implement all optimizations possible for persistence) once it has seen the first invocation of MPI_Ibcast().
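
A toy model of the library side, written in Python rather than against a real MPI, may make the mechanism clearer. Everything here (`ToyComm`, `dup_with_info`, `ibcast`, the binomial-tree schedule, the `"true"` value) is hypothetical and only mimics the shape of duplicating a communicator with an info object; the key `mympi_bcast_is_persistent` is the example key from above.

```python
class ToyComm:
    """Stand-in for a communicator; NOT an MPI binding."""

    def __init__(self, size, info=None):
        self.size = size
        self.info = dict(info or {})
        # Flag set once at creation time, checked at every broadcast call.
        self.bcast_persistent = self.info.get("mympi_bcast_is_persistent") == "true"
        self._cached = None        # (arguments, schedule) of the first call
        self.schedules_built = 0   # counts how often setup work was done

    def dup_with_info(self, info):
        """Mimics duplicating a communicator while attaching an info object."""
        return ToyComm(self.size, info)

    def _build_schedule(self, root):
        """Binomial-tree broadcast as a list of (sender, receiver) pairs."""
        self.schedules_built += 1
        schedule, step = [], 1
        while step < self.size:
            for rel in range(step):
                if rel + step < self.size:
                    sender = (rel + root) % self.size
                    receiver = (rel + step + root) % self.size
                    schedule.append((sender, receiver))
            step *= 2
        return schedule

    def ibcast(self, count, root):
        args = (count, root)
        if not self.bcast_persistent:
            return self._build_schedule(root)      # regular path: set up every time
        if self._cached is None:
            # First invocation: specialize to the arguments the user
            # promised never to change on this communicator.
            self._cached = (args, self._build_schedule(root))
        cached_args, schedule = self._cached
        assert cached_args == args, "user broke the persistence promise"
        return schedule

world = ToyComm(size=8)
my_persistent_comm = world.dup_with_info({"mympi_bcast_is_persistent": "true"})
for _ in range(3):
    world.ibcast(count=1024, root=0)
    my_persistent_comm.ibcast(count=1024, root=0)
print(world.schedules_built, my_persistent_comm.schedules_built)  # 3 1
```

In this model the regular communicator rebuilds its schedule on every call, while the persistent one pays the setup cost once, which is the saving the info key is meant to unlock.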

This interface is very flexible: one could even imagine various levels of persistence as defined in our 2012 paper: (1) persistent topology (this is implicit in normal and neighborhood collectives), (2) persistent message sizes, and (3) persistent buffer (sizes and addresses). We describe in the paper optimizations for each level. These levels should be considered in any MPI specification effort.

I agree that having some official support for persistence in the standard would be great but these levels and info arguments should at least be discussed as an alternative. It seems like big parts of the MPI Forum are not aware of this idea (this is part of why I write this post ;-) ).

Furthermore, I am mildly concerned about feature-inflation in MPI. Adding more and more features that are not optimized because they are not used, because they have not been optimized, because they were not used …. may not be the best strategy. Today’s MPIs are not great at asynchronous progression of nonblocking collectives, and the performance of neighborhood collectives and MPI-3 RMA is mostly unconvincing. Maybe the community needs some time to optimize and use those features. At the 25 years of MPI symposium, it became clear that big parts of the community share a similar concern.

Keep the great discussions up!

SPCL Activities at SC16

After the stress of SC16 is finally over, let me summarize SPCL’s activities at the conference.

In a nutshell, we participated in two tutorials, two panels, the organization of the H2RC workshop, I gave three invited talks and my students and collaborators presented our four papers at the SC papers program. Not to mention the dozens of meetings :-) . Some chronological impressions are below:

1) Tutorial “Insightful Automatic Performance Modeling” with A. Calotoiu, F. Wolf, M. Schulz

2) Panel at Sixth Workshop on Irregular Applications: Architectures and Algorithms (IA^3)

I was part of a panel discussion on irregular vs. regular structures for graph computations.

The opening

Discussions :-)


3) Tutorial “Advanced MPI” with B. Gropp, R. Thakur, P. Balaji

I was co-presenting the long running successful tutorial on advanced MPI.

The section on collectives and topologies

4) Second International Workshop on Heterogeneous Computing with Reconfigurable Logic (H2RC) with Michaela Blott, Jason Bakos, Michael Lysaght

We organized the FPGA workshop for the second time. It was a big success, people were standing in the back of the room. We even convinced database folks (here, my colleague Gustavo Alonso) to attend SC for the first time!

Gustavo’s opening

Full house

5) Invited talk at LLVM-HPC workshop organized by Hal Finkel

I gave a talk about Polly-ACC (Tobias Grosser’s work) at the workshop, quite interesting feedback!

Nice audience

Great feedback

6) Panel at LLVM-HPC workshop

Later, we had a nice panel about what to improve in LLVM to deal with new languages and/or accelerators.

7) SIGHPC annual member’s meeting

As elected member at large, I attended the annual members meeting at SC16.

8) Collaborator Jens Domke from Dresden presented our first paper “Scheduling-Aware Routing for Supercomputers”

Huge room, nicely filled.

9) Booth Talk at Tokyo Institute of Technology booth

Was an interesting experience :-) . First, you talk to two people, towards the end, there was a crowd. Even though most people missed the beginning, I got very nice questions.

10) Collaborator Bill Tang presented our paper “Extreme Scale Plasma Turbulence Simulations on Top Supercomputers Worldwide”

11) SPCL student Tobias Gysi presented our paper “dCUDA: Hardware Supported Overlap of Computation and Communication”

12) Collaborator Maxime Martinasso presented our paper “A PCIe Congestion-Aware Performance Model for Densely Populated Accelerator Servers”

But as usual, it’s always the informal, sometimes even secret, meetings that make up the SC experience. The two SPCL students Greg and Tobias did a great job learning and representing SPCL while I was running around between meetings. I am so glad I didn’t have to present any papers this year (i.e., that I could rely on my collaborators and students :-) ). Yet, it’s a bit worrying that my level of busyness (measured by the number of parallel meetings and overbooked calendar slots) is getting worse each year. Oh well :-) .

Keynote at HPC China and Public lecture at ETH on Scientific Performance Engineering in HPC

In the last two weeks I gave two presentations on scientific performance engineering, a theme that describes best what we do at my lab (SPCL) at ETH. The first lecture was a keynote at HPC China, the largest conference on High-Performance Computing in Asia (and probably the second largest world-wide). I have to say that this was definitely the best conference that I attended this year due to several reasons :-) .

Here is an impression from the impressive conference.

Shortly after that, I presented a similar talk at my home university ETH Zurich as the last step in a long process ;-) . It was great as well — the room was packed (capacity ~250) and people who came late even complained that there were not enough seats — well, their fault, there were some in the front :-) .

Here are some impressions from this important talk:

My department head Prof. Emo Welzl introducing the talk with some personal connections and overlapping interests

Some were even paying attention!

One of the larger lecture rooms in ETH’s main building

In case you missed it, I gave a longer version of the same talk at Cluster 2016 in Taipei (more content for free!).

SPCL barbecue version 3 (beach edition)

The next iteration in our celebration of SPCL successes since January was completed successfully! This time (based on popular demand) with a beach component where students could swim, fight, and be bitten by interesting naval creatures.

We celebrated our successes at HPDC, ICS, HOTI, and SC16!

Even with some action — boats speeding by rather closely ;-) .

Later, we moved to a barbecue place a bit up a hill to get some real meat :-) .

We first had to conquer the place, but eventually we succeeded (maybe send somebody earlier next time to occupy it and start a fire).

We had 7kg of Swiss cow this time!

And a much more professional fireplace!

Of course, some studying was also involved in the woods — wouldn’t be SPCL otherwise.

Including the weirdest (e.g., “hanging”) competitions.

We were around 20 people and consumed (listed here for the next planning iteration):

  • 6 x 1.5l water – 1l left at the end
  • 18×0.33l beer, 6×0.5l beer – all gone (much already at the beach)
  • 7l wine
  • 2l vodka
  • 7kg cow meat (4.5kg steaks, 2kg cevapcici, 0.5kg sausage)
  • 2 large Turkish-style breads (too quickly gone)
  • 1 quiche (too quickly gone)
  • 12 American cookies + 16 scones (both home-made)
  • 3/4 large watermelon
  • 0.5kg dates
  • 2kg grapes, 3kg peaches, 5 cucumbers
  • 0.5kg grill pepper, 1kg mushrooms

What are the real differences between RDMA, InfiniBand, RMA, and PGAS?

I often get the question how the concepts of Remote Direct Memory Access (RDMA), InfiniBand, Remote Memory Access (RMA), and Partitioned Global Address Space (PGAS) relate to each other. In fact, I see a lot of confusion in papers of some communities which discovered these concepts recently. So let me present my personal understanding here; of course open for discussions! So let’s start in reverse order :-) .

PGAS is a concept relating to programming large distributed memory machines with a shared memory abstraction that distinguishes between local (cheap) and remote (expensive) memory accesses. PGAS is usually used in the context of PGAS languages such as Co-Array Fortran (CAF) or Unified Parallel C (UPC) where language extensions (typically distributed arrays) allow the user to specify local and remote accesses. In most PGAS languages, remote data can be used like local data, for example, one can assign a remote value to a local stack variable (which may reside in a register); the compiler will generate the needed code to implement the assignment. A PGAS language can be compiled seamlessly to target a global load/store system.

RMA is very similar to PGAS in that it is a shared memory abstraction that distinguishes between local and remote memory accesses. RMA is often used in the context of the Message Passing Interface standard (even though it does not deal with passing messages ;-) ). So why not just call it PGAS? Well, there are some subtle differences to PGAS: MPI RMA is a library interface for moving data between local and remote memories. For example, it cannot move data into registers directly and may be subject to additional overheads on a global load/store machine. It is designed to be a slim and portable layer on top of lower-level data-movement APIs such as OFED, uGNI, or DMAPP. One main strength is that it integrates well with the remainder of MPI. In the MPI context, RMA is also known as one-sided communication.

So where does RDMA now come in? Well, confusingly, it is equally close to both PGAS and its Hamming-distance-one name sibling RMA. RDMA is a mechanism to directly access data in remote memories across an interconnection network. It is, as such, very similar to machine-local DMA (Direct Memory Access), so the D is very significant! It means that memory is accessed without involving the CPU or Operating System (OS) at the destination node, just like DMA. It is as such different from global load/store machines where CPUs perform direct accesses. Similarly to DMA, the OS controls protection and setup in the control path but then removes itself from the fast data path. RDMA always comes with OS bypass (at the data plane) and thus is currently the fastest and lowest-overhead mechanism to communicate data across a network. RDMA is more powerful than RMA/PGAS/one-sided: many RDMA networks such as InfiniBand provide a two-sided message passing interface as well and accelerate transmissions with RDMA techniques (direct data transfer from source to remote destination buffer). So RDMA and RMA/PGAS do not include each other!

What does this now mean for programmers and end-users? Both RMA and PGAS are programming interfaces for end-users and offer several higher-level constructs such as remote read, write, accumulates, or locks. RDMA is often used to implement these mechanisms and usually offers a slimmer interface such as remote read, write, or atomics. RDMA is usually processed in hardware and RMA/PGAS usually try to use RDMA as efficiently as possible to implement their functions. RDMA programming interfaces are often not designed to be used by end-users directly and are thus often less documented.

InfiniBand is just a specific network architecture offering RDMA. It wasn’t the first architecture offering RDMA and will probably not be the last one. Many others exist such as Cray’s RDMA implementation in Gemini or Aries endpoints. You may now wonder what RoCE (RDMA over Converged Ethernet) is. It’s simply an RDMA implementation over (lossless data center) Ethernet which is somewhat competing with InfiniBand as a wire-protocol while using the same verbs interface as API.

More precise definitions can be found in Remote Memory Access Programming in MPI-3 and Fault Tolerance for Remote Memory Access Programming Models. I discussed some ideas for future Active RDMA systems in Active RDMA – new tricks for an old dog.

How many measurements do you need to report a performance number?

The following figure from the paper “Scientific Benchmarking of Parallel Computing Systems” shows the completion times for multiple identical runs of a tuned version of the high-performance Linpack (HPL) on the same system. It illustrates how important correct measurements are. Here, one may report 77.4 Tflop/s but when repeating the benchmark see as little as 61.2 Tflop/s. It suggests that one should use sound statistics when reporting any performance result.

Computer science is often about measuring computer systems. Be it time, energy, or performance, all these metrics are often non-deterministic in real computer systems and a single measurement may or may not provide a reliable result. So if you are not sloppy when measuring your system, you will measure several executions and report an aggregate measure such as the arithmetic or geometric average or the median. Well, but now the question is: “how many is several”? And this is where it gets less clear.
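To make the difference between these aggregates concrete, here is a minimal Python sketch using only the standard library. The completion times are made up for illustration (one slow run among otherwise similar ones) and are not taken from any real system.

```python
import statistics

# Hypothetical completion times in seconds; the 5.0 is a made-up slow outlier.
times = [1.0, 1.1, 1.0, 1.2, 5.0]

def summarize(samples):
    """Return the three aggregates mentioned above for positive samples."""
    return {
        "arithmetic_mean": statistics.mean(samples),
        "geometric_mean": statistics.geometric_mean(samples),
        "median": statistics.median(samples),
    }

if __name__ == "__main__":
    for name, value in summarize(times).items():
        print(f"{name}: {value:.3f}")
```

A single outlier pulls the arithmetic mean far away from the typical run, the geometric mean less so, and the median hardly at all.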

Typically, “several” is defined very informally, so if the measurement is cheap (such as a network latency measurement), it can be 1,000 or even 1,000,000. If it’s expensive (such as full-scale supercomputer runs), we’re very quickly back to a single measurement. But does it make sense to define the number of measurements based on the execution cost? Of course not — it should depend on the variability of the data! Who would have thought that …?

Unfortunately, most benchmarkers do not take the data variability into account at all in practice. Why not? Isn’t that somewhat clear that one needs to? Yes, it is, but it’s also hard! But actually, it’s not that hard if one knows some basic statistics. The simplest way is to check if one has enough measurements for a given variability in the result. But how to assess the variability? Well, one needs to look at some samples — ah, a catch 22? I need samples to know how many samples I need? Yes, that is true — in fact, the more samples I have, the higher my confidence in the variability and the correctness of my reported number.

A simple technique to assess the confidence of my measurement (we are simplifying this somewhat here) is to compute the confidence interval. Confidence Intervals (CIs) are a tool to provide a range of values that includes the true mean with a given probability p depending on the estimation procedure. So if the measurement is 1 second and the 95% CI is the range [0.9;1.1] then there is a 95% probability that the true mean is within that interval. There are two basic types of CIs: (1) confidence intervals around the mean assuming a normal distribution and (2) nonparametric confidence intervals around the median without assumptions on the distribution. The former is the simplest to compute: [mean-t(n-1,α/2)*s/sqrt(n); mean+t(n-1,α/2)*s/sqrt(n)] where mean is the arithmetic mean, s is the sample standard deviation, n is the number of samples, α=1-p is the accepted error probability, and t(x,q) is the upper-q critical value of Student’s t distribution with x degrees of freedom. So it’s easy to see that the interval quickly gets tighter when the number of samples grows. But which computing system generates measurements that follow a normal distribution, which would mean that a run is as likely to be faster as it is to be slower? Well, my computers are certainly more often becoming slower than faster, leading to a right-skewed distribution.
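The mean-based interval, i.e., mean ± t·s/sqrt(n) with s the sample standard deviation, can be sketched in a few lines of Python. Python's standard library has no Student's t quantile function, so this sketch assumes a simple numerical approach (Simpson integration of the density plus bisection); the helper names `t_pdf`, `t_cdf`, `t_critical`, and `mean_ci` are made up for this example. With SciPy available, a library quantile function would be the more natural choice.

```python
import math
import statistics

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_norm - (df + 1) / 2 * math.log1p(x * x / df))

def t_cdf(x, df, steps=2000):
    """CDF for x >= 0 via Simpson's rule on [0, x] (steps must be even)."""
    h = x / steps
    total = t_pdf(0.0, df) + t_pdf(x, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 + total * h / 3

def t_critical(df, tail):
    """Value c with P(T > c) = tail (0 < tail < 0.5), found by bisection."""
    lo, hi = 0.0, 1.0
    while t_cdf(hi, df) < 1 - tail:  # grow the bracket until it contains c
        hi *= 2
    for _ in range(60):
        mid = (lo + hi) / 2
        if t_cdf(mid, df) < 1 - tail:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def mean_ci(samples, confidence=0.95):
    """CI around the arithmetic mean, assuming normally distributed samples."""
    n = len(samples)
    mean = statistics.mean(samples)
    s = statistics.stdev(samples)  # sample standard deviation (n-1 in the divisor)
    half = t_critical(n - 1, (1 - confidence) / 2) * s / math.sqrt(n)
    return mean - half, mean + half
```

For example, `mean_ci([1.0, 1.1, 0.9, 1.0, 1.0])` gives an interval of roughly 1.0 ± 0.09 seconds. The interval is only meaningful if the normality assumption actually holds for the data.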

So how do we get to confidence intervals of non-normally distributed measurements? Well, first of all, if the data is not normally distributed, the average makes little sense as it will be skewed as well. So one usually reports the median (the n/2-th element in the sorted set of all n measurements) as the most likely value to be observed in practice. But how to get to our confidence interval? Since we cannot assume any distribution of the values, we work on the sorted set of measurements and call the ith value in the set the rank-i value. Now we identify rank floor((n-z(α/2)*sqrt(n))/2) to ceil(1+(n+z(α/2)*sqrt(n))/2) as the conservative CI, which is commonly asymmetric as well. Here, z(q) is the upper-q critical value of the standard normal distribution and α=1-p, so z(α/2) is about 1.96 for a 95% CI.
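The rank rule above translates almost directly into code. The following is a minimal sketch using Python's standard library; the function names are made up for this example, ranks are treated as 1-based, and returning `None` when the ranks fall outside 1..n (too few samples for the requested confidence) is a choice of this sketch, not something the rule itself prescribes.

```python
import math
from statistics import NormalDist, median

def median_ci_ranks(n, confidence=0.95):
    """1-based ranks (lo, hi) of the conservative CI, or None if n is too small."""
    # z is the normal critical value, about 1.96 for a 95% CI
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    lo = math.floor((n - z * math.sqrt(n)) / 2)
    hi = math.ceil(1 + (n + z * math.sqrt(n)) / 2)
    if lo < 1 or hi > n:
        return None
    return lo, hi

def median_ci(samples, confidence=0.95):
    """Return (median, lower bound, upper bound), or None if n is too small."""
    ranks = median_ci_ranks(len(samples), confidence)
    if ranks is None:
        return None
    ordered = sorted(samples)
    lo, hi = ranks
    return median(ordered), ordered[lo - 1], ordered[hi - 1]
```

For n = 100 and 95% confidence, the rule selects the values at ranks 40 and 61 of the sorted measurements. With fewer than eight samples, no valid pair of ranks exists at this confidence level.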

So ok, we can now compute this CI as a statistical measure of certainty of our reported median. Median what? Don’t we like averages? Well, again, averages are not too useful for non-normally distributed data *unless* you care about only an accumulation of many measurements, i.e., you only want to know how expensive 1,000 iterations are and you do not care about every single one. Well, if this is the case, just measure the 1,000. If you’re well-versed in statistics, you will now recognize the connection to the Central Limit Theorem :-) .
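The Central Limit Theorem connection can be illustrated with a small simulation. This sketch uses synthetic data (a constant plus exponentially distributed noise, which is an assumption made here to get a right-skewed distribution) and compares the skewness of single measurements with the skewness of totals over batches of 100 iterations.

```python
import math
import random

def skewness(values):
    """Sample skewness: mean cubed deviation divided by the cubed std deviation."""
    n = len(values)
    m = sum(values) / n
    sd = math.sqrt(sum((v - m) ** 2 for v in values) / n)
    return sum((v - m) ** 3 for v in values) / (n * sd ** 3)

def batch_totals(samples, batch):
    """Sum consecutive groups of `batch` samples (cost of `batch` iterations)."""
    return [sum(samples[i:i + batch])
            for i in range(0, len(samples) - batch + 1, batch)]

rng = random.Random(42)  # fixed seed so the run is repeatable
# Synthetic right-skewed "measurements": 1 time unit plus exponential noise.
single = [1.0 + rng.expovariate(1.0) for _ in range(50_000)]
totals = batch_totals(single, 100)

if __name__ == "__main__":
    print(f"skewness of single measurements: {skewness(single):.2f}")
    print(f"skewness of 100-iteration totals: {skewness(totals):.2f}")
```

The single measurements are strongly right-skewed, while the batch totals are much closer to symmetric, which is why an average is a reasonable summary when only the accumulated cost matters.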

But now again, how many measurements do we actually need? To answer this, we’d first need to define a needed level of certainty, for example 95%. Then, we define an accepted error in our reporting around the median, for example 1%. In other words, we would like to have enough measurements to be 95% sure that the real median is within 1% of our reported value. Hey, so we’re now back to a single reported value just together with a certainty! So how do we achieve this? Well, for normally distributed data in the case (1), one could compute the number of needed measurements. But that doesn’t work with real computers, so let’s skip this here. In the nonparametric case, no explicit formula is known to us, so we would need to recompute the confidence interval after each measurement (or a set of measurements) and we could stop measuring once the 95% CI is within the 1% interval around the median.
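The stopping rule for the nonparametric case could look roughly like the following Python sketch. The names `measure_until_tight`, `min_samples`, and `max_samples` are made up for this example, the callable `measure` stands in for whatever benchmark is being run, and the measured values are assumed to be positive. The CI is recomputed after every measurement here; for cheap measurements one could just as well recompute it after each batch.

```python
import math
from statistics import NormalDist, median

def median_ci(samples, confidence=0.95):
    """Conservative rank-based CI: (median, lo, hi), or None if n is too small."""
    n = len(samples)
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    lo = math.floor((n - z * math.sqrt(n)) / 2)
    hi = math.ceil(1 + (n + z * math.sqrt(n)) / 2)
    if lo < 1 or hi > n:
        return None
    ordered = sorted(samples)
    return median(ordered), ordered[lo - 1], ordered[hi - 1]

def measure_until_tight(measure, rel_err=0.01, confidence=0.95,
                        min_samples=10, max_samples=10_000):
    """Call measure() until the CI lies within rel_err of the median.

    Returns (median, ci_low, ci_high, number_of_samples, converged).
    Assumes positive measurements and max_samples >= 1.
    """
    samples = []
    while len(samples) < max_samples:
        samples.append(measure())
        if len(samples) < min_samples:
            continue
        ci = median_ci(samples, confidence)
        if ci is None:
            continue
        med, lo, hi = ci
        if lo >= med * (1 - rel_err) and hi <= med * (1 + rel_err):
            return med, lo, hi, len(samples), True
    # Budget exhausted: report whatever interval we have and let the reader judge.
    ci = median_ci(samples, confidence)
    med, lo, hi = ci if ci else (median(samples), None, None)
    return med, lo, hi, len(samples), False
```

A call such as `measure_until_tight(run_benchmark)` (with `run_benchmark` a hypothetical function returning one timing) then yields the median, its CI, and the number of runs that were needed. If the budget runs out first, the last element of the result is `False` and the wider interval is reported as is.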

Wow, so now we know how to *really* measure and report performance! In fact, in practice, we often need fewer than 1,000 measurements to reach a tight interval with high confidence. So if they’re cheap, we can as well do them and check afterwards if the statistics make sense. But what if we run out of benchmarking budget before we reach the required accuracy? For example, each measurement takes a day and we only have four days, but after four days the CI is still wider than we’d like it to be. Well, bad luck! In that case, we can only report the wide CI and leave it up to the reader/observer to conclude if our measurements make sense in their context.

I wish you happy (and correct) measuring! Torsten Hoefler

This blog post summarizes a part of the paper “Scientific Benchmarking of Parallel Computing Systems” which appeared at IEEE/ACM Supercomputing 2015. The full paper provides more insight and references around this topic and also the equation for the number of measurements assuming a normal distribution. The paper also establishes more rules for sound performance analyses that I may blog on later. Spread the word and cite the paper if you find these rules useful :-) .