Torsten Hoefler's blog | nothing spectacular

Jun 20, 2026

Stop Saying FLOPS: A Tiny Rant About a Very Real Unit Problem

A modest proposal for saving scientific computing from typographic entropy.

(This post is a summary of the discussion of rigorous performance notation and benchmarking methodology in our paper Scientific Benchmarking of Parallel Computing Systems (Section 2.1.2) and the common criminal misuse of terms in our communities.)

High-performance computing has given humanity many beautiful things: weather forecasts, protein simulations, large-scale machine learning, and the ability to turn an electricity bill into a leaderboard entry. It has also given us one small but surprisingly persistent unit-writing mess: FLOP, FLOPS, flop, flops, and the mysterious creature known as flop/s.

The problem is not that people do not know what they mean. Usually they do. The problem is that they often write it in a way that mixes up a count of floating-point operations with a rate of floating-point operations per second. That is rather like mixing up joules and watts, except with more GPUs and fewer household appliances.

So let us ask a deliberately pedantic question: what if “floating point operation” were treated as a proper unit? If it were, the clean notation would be simple:

flop for a count of floating-point operations.
flop/s for floating-point operations per second.
Not FLOP, not FLOPS, and not the informal plural flops.

This is not because lowercase letters are morally superior, although some metrologists may quietly suspect so. It is because scientific notation works best when symbols are stable, unambiguous, and boring in exactly the right way.

The unit-symbol rule: symbols are not abbreviations

The International System of Units, maintained by the BIPM, exists to make measurement communication consistent across science, technology, industry, and trade. The BIPM publishes the SI Brochure as the official reference for the system. NIST summarizes the relevant writing rules very clearly: unit symbols are written in lowercase unless they are derived from a person’s name, such as W for watt or Pa for pascal.

That rule alone already makes FLOP suspicious. If flop were a unit symbol, what famous scientist named “Flop” are we honoring? Florence L. O. Processor? Friedrich Ludwig Operand von Pipeline? No. There is no Person of Flop. Therefore, by analogy with ordinary SI-style unit symbols, the symbol should be lowercase: flop.

NIST also states that unit symbols are never pluralized: 250 mm means 250 millimeters, not 250 mms. This is the key point. If flop is treated as a unit symbol, then 10 flop, 10¹⁵ flop, and 1.2e21 flop are correct in the same way that 10 m, 10 kg, and 10 J are correct. The quantity changes; the symbol does not.

The distinction is not just stylistic. NIST’s SI guidance emphasizes that unit symbols are symbols, not ordinary English abbreviations. They obey mathematical-symbol rules, not plural-noun rules. In other words, a unit symbol is not a little word that needs an s when it becomes lonely in a crowd. It is a symbol. It remains calm.

“Watts”, “kgs”, and other small crimes against clarity

Consider a few everyday examples. In formal technical writing, we would normally write:

2000 W, not 2000 Watts when using the symbol.
2 kg, not 2 kgs.
10 km, not 10 kms.

There is a small nuance here: spelling out a unit name as an English noun can require a plural, so “2000 watts” is grammatically fine in prose. NIST explicitly says plural unit names are used when English grammar requires them. But unit symbols do not get plural endings: W, kg, and km remain unchanged regardless of the number.

This is why “many flop” is the right style if flop is the unit symbol. It may sound odd at first, just as “12 fish” sounds odd if you were expecting “fishes,” but metrology is not here to optimize for vibes. It is here to prevent ambiguity, preferably before the benchmark marketing department discovers it.

The real confusion: flop versus flop/s

The strongest reason to avoid flops is that it collides with the traditional acronym FLOPS, commonly expanded as “floating-point operations per second.” Industry articles still explain FLOPS as a rate, not the plural of a single operation. That means “flops” is a dangerously overloaded spelling: it can be read as a plural count of operations or as a rate of operations per second. Those are not the same quantity.

A count and a rate differ by a dimension of time. 10¹⁸ flop is a number of operations. 10¹⁸ flop/s is a speed. Confusing them is like confusing “I drove 100 km” with “I drove 100 km/h.” One tells you how far you went. The other tells you how fast you were going. Only one of them will impress the traffic police.

The ambiguity is not theoretical. AMD’s ROCm documentation notes that terms such as peak FLOPs, max-achievable FLOPs, and delivered FLOPs have historically been used interchangeably, creating confusion and incorrect comparisons. The same source defines peak FLOPs through a formula involving frequency, cores, and operations per cycle, i.e., a rate-like hardware capability rather than a mere count of operations.
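To see why such a "peak" figure is a rate rather than a count, here is a minimal sketch of the usual back-of-the-envelope formula. The function name and the example numbers are hypothetical and are not taken from AMD's documentation.

```python
def peak_flop_per_s(clock_hz: float, cores: int, flop_per_cycle_per_core: float) -> float:
    """Theoretical peak rate in flop/s: (cycles/s) * cores * (flop/cycle/core)."""
    return clock_hz * cores * flop_per_cycle_per_core


# Hypothetical device: 2 GHz clock, 64 cores, 32 flop per cycle per core.
peak = peak_flop_per_s(2.0e9, 64, 32)
print(f"{peak:.3e} flop/s")
```

The per-second dimension enters through the clock frequency, so the result can only be a flop/s value, never a bare flop count.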

This is exactly why flop and flop/s are preferable. They make the dimensional distinction visible:

3.4e12 flop: a workload size, operation count, or algorithmic cost.
3.4e12 flop/s: a performance rate.

The slash is doing real work. It is not decoration. It is the tiny typographic hinge on which dimensional sanity swings.
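A minimal sketch of the dimensional relationship, with a hypothetical helper name: dividing a count (flop) by a rate (flop/s) leaves seconds, which only works out if the two quantities are kept apart.

```python
def runtime_seconds(work_flop: float, rate_flop_per_s: float) -> float:
    """Time = count / rate; flop divided by flop/s reduces to seconds."""
    return work_flop / rate_flop_per_s


work = 3.4e12  # flop   (a count: how much work there is)
rate = 3.4e12  # flop/s (a rate: how fast the work is done)
print(runtime_seconds(work, rate), "s")
```

The two inputs carry the same number but different units, and the slash is what tells them apart.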

But isn’t “FLOP” an acronym?

Historically, yes: people often write FLOP for “floating-point operation” and FLOPS for “floating-point operations per second.” That convention is widespread. It is also exactly the convention that creates the mess.

If we are writing ordinary English prose, acronyms are fine. But if we are writing values of quantities, it is better to behave like we are using units. NIST’s SI guidance recommends expressing quantity values using Arabic numerals paired with unit symbols. Once we decide that a floating-point operation count is a measurable quantity, the natural notation is not acronym style but unit-symbol style.

That means:

Use flop when counting floating-point operations.
Use flop/s when reporting floating-point operations per second.
Avoid FLOP because unit symbols are not capitalized unless derived from names.
Avoid flops because unit symbols are not pluralized and because it is visually confusable with flop/s.
Avoid FLOPS when you can write the dimensionally explicit flop/s.

Recommended style guide

Meaning	Recommended	Avoid	Why
One floating-point operation	`1 flop`	`1 FLOP`	Lowercase follows unit-symbol style unless derived from a proper name.
Many floating-point operations	`10⁹ flop`	`10⁹ flops`	Unit symbols are not pluralized.
Floating-point operation rate	`10⁹ flop/s`	`10⁹ FLOPS`, `10⁹ flops`	`flop/s` explicitly shows “operations per second” and avoids count/rate ambiguity.
Metric-prefixed rate	`1 Tflop/s`	`1 TFLOPS`, `1 teraflops`	SI prefixes attach directly to symbols without spaces, and symbols are not pluralized.
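
The table can be turned into a small formatting helper. This is only a sketch: the function name, the prefix list, and the number formatting are assumptions made for illustration, not an established convention.

```python
# SI prefixes from largest to smallest; kilo is lowercase "k" in SI.
PREFIXES = [(1e18, "E"), (1e15, "P"), (1e12, "T"), (1e9, "G"), (1e6, "M"), (1e3, "k")]


def format_flop(value: float, per_second: bool = False) -> str:
    """Format a count (flop) or a rate (flop/s) with an SI prefix, never pluralized."""
    unit = "flop/s" if per_second else "flop"
    for factor, prefix in PREFIXES:
        if value >= factor:
            # The prefix attaches directly to the symbol, without a space.
            return f"{value / factor:g} {prefix}{unit}"
    return f"{value:g} {unit}"


print(format_flop(1e12, per_second=True))  # 1 Tflop/s
print(format_flop(1e8))                    # 100 Mflop
print(format_flop(10))                     # 10 flop
```

Making the caller state explicitly whether a value is a rate keeps the count/rate decision out of the string and in the code.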

A few examples

Instead of:

The model required 3e23 FLOPs to train and achieved 500 TFLOPS.

Write:

The model required 3e23 flop to train and achieved 500 Tflop/s.

Instead of:

This kernel performs two FLOPS per cycle.

Write:

This kernel performs 2 flop/cycle.

Instead of:

We counted 100 megaFLOPs.

Write:

We counted 100 Mflop.

The point is not to win an argument at the typography conference, although that is a noble side quest. The point is to make the units carry the meaning precisely.

The punchline

If floating-point operations were treated as a real unit, the notation would be beautifully simple:

Use flop for work, and flop/s for speed.

Everything else is either shouting, pluralizing a symbol, or hiding a division by seconds in an acronym. FLOP looks like an acronym. FLOPS looks like a plural and a rate at the same time. flops looks casual, but it quietly smuggles ambiguity into the room. And ambiguity, like thermal throttling, usually arrives just when the benchmark is about to get interesting.

So let us be precise. Let us be boring in the way good notation is boring. Let us write:

1 flop
10¹⁵ flop
10¹⁵ flop/s

Your readers will understand the difference between a count and a rate. Your units will stop wearing unnecessary capital letters. And somewhere, perhaps, a metrologist will smile quietly into a perfectly lowercase cup of coffee.

For readers interested in the broader question of how to measure, report, and compare parallel computing performance rigorously, the paper Scientific Benchmarking of Parallel Computing Systems contains a much more detailed treatment of benchmarking methodology, performance analysis, and the many ways seemingly simple metrics can mislead.

Feb 16, 2020

How AT&T charged me up to $7.50 per second ($450/minute) and says it’s completely normal

I am traveling to the US often enough that it pays off to have a local phone number. But I’m not there often enough to have a monthly subscription. So AT&T’s prepaid (“pay as you go”) offering seemed like a reasonable choice. Indeed it was until last December. The deal is as follows: you pay $2/day for phone service and $1/100 MiB data service.

I’m using a Galaxy Note phone and get a 4G LTE connection, so these 100 MiB could be gone in less than 30 seconds with AT&T’s fast (~30 Mbit/s) network. Indeed, I saw such speeds in practice – kudos to the operator!

This was no big deal last year because after the 100 MiB data package was gone, the network simply disconnected and asked me to purchase another one. In fact, this is written in AT&T’s terms and conditions as “A Data Add-On is available for an additional charge for all device types. Once Data Add-On is depleted, basic/messaging phone users will automatically revert to pay-per-use data. To continue using data on a smartphone, customer must purchase another Data Add-On or be restricted to Wi-Fi.”.

But unfortunately, in December 2019, that behavior somehow changed: instead of disconnecting me, the service automatically switched to the pay-per-use mode, which charges 1 cent per 5 kiB. And of course, the text message that tells you that your data is depleted arrives only hours after the fact; about half the time it doesn’t arrive at all. So without knowing it, I was suddenly paying up to $7.50 per second (!!) whenever I used data on a good connection: 30 Mbit/s = 3.75 MB/s ≈ 3,750 kiB/s = 750 × 5 kiB/s, and 750 × 1 cent = $7.50 per second. That is up to $450 per minute! Fortunately, I did not enable AutoPay for my prepaid account ;-).
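
The arithmetic can be reproduced in a few lines. This is a rough sketch: the function name is made up, and it treats kB and kiB as interchangeable, which is off by about 2% but does not change the picture.

```python
def usd_per_second(link_mbit_per_s: float, cents_per_block: float = 1.0,
                   block_kb: float = 5.0) -> float:
    """Pay-per-use cost per second at full link speed (kB and kiB treated as equal)."""
    kb_per_s = link_mbit_per_s * 1000 / 8   # 30 Mbit/s -> 3750 kB/s
    blocks_per_s = kb_per_s / block_kb      # 3750 kB/s -> 750 billable blocks/s
    return blocks_per_s * cents_per_block / 100


rate = usd_per_second(30)
print(f"${rate:.2f} per second, ${rate * 60:.2f} per minute")
print(f"a $55 balance lasts about {55 / rate:.1f} s at full speed")
```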

Of course, this is unworkable in practice. The first time this happened, it sucked my prepaid account from $55 down to zero in seconds. I called and complained about it and got a $30 refund but the problem was not fixed. I went to AT&T stores twice and they were somewhat helpful but the software quality of their internal system was so bad that it was constantly losing the browser session (the system uses some web-based API). All they found was a cryptic entry “OCSPPTK” on December 5th – about the time when the issues started. But nobody knew what that meant. So they advised me to call support.

Then, I lived with the danger – but it triggered again yesterday (some app on my phone went crazy and updated itself) and boom – $30 gone, again down to zero. I called and discussed the issue for about one hour and was eventually told that I don’t know what I’m talking about and that this is perfectly normal. The person was actually quoting the text above to me but we did not agree on the interpretation (namely, that service should be turned off for a smartphone).

Unfortunately, I have to say that the customer support is not impressive – they talk to me as if I were a child who does not understand what data means (“you should not use instagram” etc. – I have never used it …). I told them that I’m teaching computer science at one of the world’s highest ranked departments but to no avail. The supervisor was most annoying in that he didn’t even try to understand my problem but immediately went into script mode telling me how I’m wrong. When I directly asked him whether he wanted to actually help me or just explain to me how charging $7.50 per second is normal business practice, he was quite clear that his goal was the latter. So not “customer support” but “customer repel” – if they had just hung up, it would have been much less frustrating. For the record, this person’s identifier was qpzvjw8.

I assume this is just a bug in my account and may be simple to fix. Unfortunately, the interface that AT&T provides makes it impossible to work with.

My goal was to get at least parts of the $55 I lost over this back and, more importantly, fix my account to disconnect once the 100 MiB are depleted, as I believe the terms and conditions above imply. This would have been quite simple for them but, well, my problem is still not solved and I will have to file a complaint with the FCC.

Oct 16, 2019

SPCL’s papers at SC19

SPCL is back at SC – after a forced break last year where we could not submit because I was technical papers co-chair, we celebrate our comeback this year (with some backlog from last year of course). The SPCL team will have eight papers in the main track and many other appearances (co-organizing workshops, several workshop papers, two tutorials, panels, BoFs, Gordon Bell finalist etc.). But the highlight is still the main track contributions. Here is an overview in a random order:

1) Tal Ben-Nun, Johannes de Fine Licht, Alexandros Nikolaos Ziogas, Timo Schneider, Torsten Hoefler: “Stateful Dataflow Multigraphs: A Data-Centric Model for Performance Portability on Heterogeneous Architectures”

We present Stateful Dataflow Multigraphs (SDFGs), a new intermediate representation that can generate top-performing code for CPU, GPU, and FPGA. The SDFG is a data-centric intermediate representation that enables separating program definition from its optimization, thereby allowing performance engineers to interactively optimize applications independently from the source code. Several languages and frameworks can be compiled to SDFGs, including Python (with a numpy-like interface), TensorFlow, and a subset of MATLAB. We show that SDFGs deliver competitive performance on a wide variety of applications — from fundamental kernels to graph analytics — allowing domain scientists to develop applications naturally and port them to approach peak hardware performance, without sacrificing the clarity of the original code.

2) Grzegorz Kwasniewski, Marko Kabic, Maciej Besta, Joost VandeVondele, Raffaele Solcà, Torsten Hoefler: “Red-Blue Pebbling Revisited: Near Optimal Parallel Matrix-Matrix Multiplication”

Starting from the red-blue pebble game abstraction, we established a constructive proof for I/O lower bounds of sequential and parallel matrix multiplication for all combinations of parameters – matrix dimensions, memory sizes, and number of processors. Combined with a series of implementation optimizations, we were able to outperform state-of-the-art, highly tuned libraries like ScaLAPACK and CTF in all scenarios by a factor of 2.2x on average.

3) Maciej Besta, Simon Weber, Lukas Gianinazzi, Robert Gerstenberger, Andrey Ivanov, Yishai Oltchik, Torsten Hoefler: “Slim Graph: Practical Lossy Graph Compression for Approximate Graph Processing, Storage, and Analytics”

We developed Slim Graph, the first programming model and framework for practical lossy graph compression. Slim Graph is the result of the analysis of more than 500 existing papers on graph compression, and it enables expressing major graph compression classes such as spanners, spectral sparsifiers, edge sampling, or lossy summarizations. Within Slim Graph, we propose a class of graph compression called Triangle Reduction that enables preserving different graph properties, based on user needs. We also provide metrics for assessing the quality of lossy graph compression. Our design enables analyzing tradeoffs between performance, storage, and accuracy in the context of approximate graph processing, storage, and analytics.

4) Di Girolamo, Taranov, Kurth, Schaffner, Schneider, Beranek, Besta, Benini, Roweth, Hoefler: “Network-Accelerated Non-Contiguous Memory Transfers”

We keep exploring the network stream processing model with a work on offloading MPI Derived Datatypes (DDT) processing to the network cards with sPIN. With network-accelerated DDTs, the NIC writes the data directly in its final position in the receive buffer, avoiding additional copies. We achieve up to 10x speedups for real-application DDTs, saving up to 3.8x memory traffic on the host.

5) De Sensi, Di Girolamo, Hoefler: “Mitigating Network Noise on Dragonfly Networks through Application-Aware Routing”

A large part of application performance variability on Dragonfly networks is caused by the adaptive routing algorithm. We designed and implemented a software-only solution to automatically tune the routing algorithm according to the application characteristics. We validated our solution on microbenchmarks and real-world applications on both the Piz Daint and Cori supercomputers, showing a significant reduction in performance variability, and up to 2x speedup.

6) De Matteis, de Fine Licht, Beránek, Hoefler: “Streaming Message Interface: High-Performance Distributed Memory Programming on Reconfigurable Hardware”

We propose the Streaming Message Interface (SMI), a communication model and API for distributed memory programming in multi-FPGA systems. SMI unifies message passing with a hardware-oriented programming model: instead of bulk transmission, messages are streamed across the network during computation, allowing communication to be seamlessly integrated into pipelined FPGA designs. We present a high-level synthesis implementation of SMI targeting a dedicated FPGA interconnect, exposing runtime-configurable routing with support for arbitrary network topologies, and implement a set of distributed memory benchmarks.

7) Ziogas, Ben-Nun, Fernández, Schneider, Luisier, Hoefler: “Optimizing the Data Movement in Quantum Transport Simulations via Data-Centric Parallel Programming”

Using the Data-centric Parallel Programming (DAPP) framework, we optimized the quantum transport simulator OMEN, a two-time Gordon Bell prize finalist. The data-centric viewpoint facilitated modeling the performance and communication attributes of the application, based on its coarse and fine-grained data-movement characteristics. The optimizations uncovered in the process led to a performance improvement of two orders of magnitude, achieving sustained 85.45 Pflop/s on 4,560 nodes of Summit (42.55% of the peak) in double precision, and 90.89 Pflop/s in mixed precision.

8) Renggli, Alistarh, Aghagolzadeh, Hoefler: “SparCML: High-Performance Sparse Communication for Machine Learning”

In a collaboration with IST Austria and Microsoft, we developed a communication library taking advantage of sparsity in machine learning and deep learning workloads. By only sending the top-k largest components of the gradient vector, we save up to 95% of the communication leading to speedups of up to 20x. We also showed how to optimize the training of a large speech assistant workload at Microsoft from weeks to days.

Preprints will be available on our publications webpage and on arXiv soon!

Jun 7, 2019

The Swiss railway (SBB) in the age of the Internet :-( – experiences of a criminalized business traveler

This blog post purely reflects my personal opinions. Even though it happened during business travel on behalf of ETH, it has nothing to do with the university. Yet, one may assume that a professor of computer science is technically savvy (or not?).

I used to be the biggest fan of SBB since I came to Switzerland nearly a decade ago. So much that I moved quite far out of Zurich and commute every day for more than an hour (mountains are nice!). The trains are great for getting work done, the REs have cell phone repeaters and reasonable Internet, and they’re usually on time. Yet, for the first time in seven years something really annoying, and frankly, unacceptable, happened :-(. It seems like this has been a problem for some time.

I needed a new phone which I got last week and set everything up as usual, including the SBB app to buy tickets. I also logged into my “Swisspass” account so that all details are set like they used to be on the old phone (including my payment information – at least that’s what worked for ALL other apps). I felt safe to take the train to the airport the next day. After all, I have been using this app for years now and quite liked it. It was innovative (the touch scheduling is amazing!) and reliable.

Well, it turns out that it’s not reliable at all, in a really unexpected way. The UI design is remarkably bad: once you click on “buy ticket” (which is the last step to buy a ticket, so you CANNOT test it without actually buying a ticket), it suddenly requires an additional login. Nothing explained why my Swisspass login from before suddenly doesn’t work even though it is THE SAME PAGE!? Ok, well, I enter the password and it doesn’t work (yes, the same password that worked the day before)? Fine, I try a password reset and the email is not delivered.

After tinkering again and again, my train is of course arriving and I need to take it to connect to my flight to the US! So I get nervous and run to the ticket machine, which was occupied by a group of Asians (nothing against those fellows, I have been in that situation myself in China and the SBB machines are really not simple to use). So this didn’t work. I had to board the train or risk missing my plane ($1.5k value)! The plan was to use the old phone in my bag through tethering and buy the ticket on the train.

Of course, the moment I open the SBB app on the old phone, somebody comes to check my ticket. I explain everything and even provide evidence (I demonstrate the Swisspass login problems, I show my old phone where I was in the process of buying the ticket etc.). No mercy, I was an immediate criminal. Gosh! I reiterated my case and was told that I need to argue this with the service center.

The call with the service center was not too helpful – the person did not quite understand what I was talking about (do these people know this app? Have they ever used it? What is the training?). I nevertheless got a 40% reduction of the penalty but am still marked in their system for an indefinite time. So I don’t only pay 7x the transportation cost out of my pocket but am also considered suspicious by SBB just because their app is absolutely silly!

Also, there is NO way to check whether it works (i.e., debug) without buying an actual ticket!!! Now my account seems to be locked completely (even though my Swisspass login in the app is fine!!??). On the way back, I still didn’t have it working and needed paper tickets (the lines of Asians seem to be a recurrent theme these days so I missed my train this morning) :-(.

Note that no way to contact them is provided (no phone, no email). The Swisspass webpage is even less helpful. What does “Swiss Pass Contact Center” even mean!? At least Google doesn’t know about it. I have no clue how to resolve this! My time investment in the seemingly simple issue of having a new phone begins to be annoyingly high.

This begins to look like fraud – “everything looks good but surprise, you won’t get a ticket today!”. The SBB app is definitely nothing for busy business travelers with little time to tinker around :-(. I also lost all my trust in it; it is highly unprofessional.

Nov 8, 2018

Twelve ways to fool the masses when reporting performance of deep learning workloads

Torsten Hoefler

Due to its widespread success in many hard machine learning tasks, deep learning quickly became one of the most important and demanding compute workloads today. In fact, much of the success of deep learning stems from the high compute performance of today’s devices (and the massive amounts of data available). Despite the high compute capabilities, important tasks can take weeks to train in practical settings. When it comes to improving the performance of deep learning workloads, the HPC community plays an important role: high-performance accelerators as well as high-performance networks that enable the necessary massively parallel computation have both been developed and pioneered in the context of high-performance computing. The similarity of deep learning workloads and more traditional dense linear algebra, both expressible as tensor contractions (modulo some simple nonlinearities), is striking.

It thus seems natural that the HPC community embarks on the endeavour to solve larger and larger learning problems in industrial and scientific contexts. We are just at the beginning of potential discoveries to be made by training larger and more complex networks to perform tasks at super-human capabilities. One of the most important aspects in HPC is, as the middle name suggests, performance. Thus, many of the conferences, competitions, and science results focus on the performance aspects of a computation. Today, most of the performance improvement stems from growing parallelism, coming from wider vectorization, multi-threading, many-core, in the form of accelerators with massively parallel units, or large-scale parallelism at the cluster level. Accurately reporting and arguing about performance of today’s complex systems is a daunting task and requires scientific rigor, as explained in our earlier paper “Scientific Benchmarking of Parallel Computing Systems”.

Yet, in the machine learning community, the spotlight is on the capability of a model to perform useful predictions, and performance is mainly a catalyst. Learning workloads are usually of a statistical nature and thus relatively resistant to perturbations in the data and the computation. Thus, in general, one can trade off accuracy with performance. It’s trivially clear that one can train a model faster when using less data; however, the quality suffers. Many other, more intricate, aspects of training can be accelerated by introducing approximations, for example, to enable higher scalability. Many of these aspects are new to HPC and somewhat specific to (deep) learning, and the HPC community may lack experience to assess performance results in this area.

I collected these thoughts over the last two years and was motivated to finalize them during the IPAM workshop “HPC for Computationally and Data-Intensive Problems” organized by Joachim Buhmann, Jennifer Chayes, Vipin Kumar, Yann LeCun, and Tandy Warnow. Thanks for the great discussions during the workshop (and sorry the discussion after that last evening talk took much longer than planned). I updated this post with thoughts brought up during the discussion and thank all participants!

Here, we report in a humorous way on some ways to “improve” one’s performance results (“floptimization”) when reporting performance of deep learning workloads. Any similarity with existing papers or competitions is of course purely by chance :-)!

1) Ignore accuracy when scaling up!

Our first guideline to report highest performance is seemingly one of the most common ones. Scaling deep learning is very tricky because the best performing optimizer, stochastic gradient descent (SGD), is mostly sequential. Data parallelism can be achieved by processing the elements of a minibatch in parallel; however, the best size of the minibatch is determined by the statistical properties of the process and is thus limited. However, when one ignores the quality (or convergence in general), data-parallel SGD will scale wonderfully to any size system out there! Weak scaling by adding more data can benefit this further; after all, we can process all that data in parallel. In practice, unfortunately, test accuracy matters, not how much data one processed.

One way around this may be to only report time for a small number of iterations because, at large scale, it’s too expensive to run to convergence, right?

2) Do not report test accuracy!

The SGD optimization method fits the function that the network represents to the dataset used for learning. This minimizes the so-called training error. However, it is not clear whether the training error is a useful metric. After all, the network could just learn all examples without any capability to work on unseen examples. This is a classic case of overfitting. Thus, real-world users typically report test accuracy on an unseen dataset because machine learning is not optimization!

Yet, when scaling deep learning computations, one must tune many so-called hyperparameters (batch size, learning rate, momentum, …) to enable convergence of the model. It may not be clear whether the best setting of those parameters benefits the test accuracy as well. In fact, there is evidence that careful tuning of hyperparameters may decrease the test accuracy by overfitting to a specific problem.

3) Do not report all training runs needed to tune hyperparameters!

Of course, hyperparameters heavily depend on the dataset and the network used for training. Thus, optimizing the parameters for a specific task will enable you to achieve highest performance. It’s not clear whether these parameter values are good for training any other model/data or if the parameters themselves are overfitted to the problem instance :-). Thus, after consuming millions of compute hours to tune specific hyperparameters, one simply reports the number of the fastest run!

4) Compare outdated hardware with special-purpose hardware!

A classic one, but very popular in deep learning: make sure to compare some old single-core CPU to your new GPU-tuned algorithm. Oh, and if you have specialized hardware then make sure to never compare to the latest available GPU but pick one from some years back. After all, that’s when you started developing, right?

5) Show only kernels/subsets when scaling!

Another classic that seems to be very popular. For example, run the operations (processing layers, communicating, updating gradients) in isolation and only report scaling numbers of those. This elegantly avoids questions about the test accuracy, after all, one just worries about a part of the calculation, no?

6) Do not consider I/O!

The third classic — deep learning often requires large amounts of data. Of course, when training on a large distributed system, only the computation matters, no? So loading all that data can safely be ignored :-).

7) Report highest ops numbers (whatever that means)!

Exaops sounds sexy, doesn’t it? So make sure to reduce the precision until you reach them. But what if I tell you that my laptop performs exaops/s if we consider the 3e9 transistors switching a binary digit each at 2.4e9 Hz? I have an exaops (learning) appliance and I’ll sell it for $10k! Basically the whole deal about low-precision “exaops” is a marketing stunt and should be (dis)regarded as such – flops have always been 64 bits and lowering the precision is not getting closer to the original target of exascale (or any other target). What’s even better is to mention “mixed precision” but never talk about what fraction of the workload was performed at what precision :-).

This is especially deceiving when talking about low precision flop/s – a nice high rate of course but we won’t talk about how many more of those operations are needed to achieve convergence as long as we have a “sustained” xyz-flop/s. It’s application progress, isn’t it?
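
The laptop estimate above is easy to check. The sketch below only multiplies the two numbers quoted in the text; the function name is made up, and counting every transistor switch as an "op" is of course the joke, not a real performance model.

```python
def naive_bit_ops_per_s(transistors: float, clock_hz: float) -> float:
    """Tongue-in-cheek 'ops' rate: one binary switch per transistor per clock cycle."""
    return transistors * clock_hz


rate = naive_bit_ops_per_s(3e9, 2.4e9)  # numbers quoted in the text
print(f"{rate / 1e18:.1f} exa-'ops' per second")
```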

8) Show performance when enabling option set A and show accuracy when enabling option set B!

From the discussion above, it’s obvious that readers may expect you to report both, accuracy and performance. One way to report highest performance is now to report performance for the best performance configuration and accuracy for the most accurate one.

One may think that this is an obvious no-no but I was surprised how many examples there are.

9) Train on unreasonably large inputs!

This is my true favorite, the pinnacle of floptimization! It took me a while to recognize and it’s quite powerful. The image classification community is almost used to scaling down high-resolution images to ease training. After all, scaling to 244×244 pixels retains most of the features and gains a quadratic factor (in the image width/height) of computation time. However, such small images are rather annoying when scaling up because they require too little compute. Especially for small minibatch sizes, scaling is limited because processing a single small picture on each node is very inefficient. Thus, if flop/s are important then one shall process large, e.g., “high-resolution”, images. Each node can easily process a single example now and the 1,000x increase in needed compute comes nicely to support scaling and overall flop/s counts! A win-win unless you really care about the science done per cost or time.

In general, when processing very large inputs, there should be a good argument why: one teraflop of compute per example may be excessive.
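The quadratic factor is easy to check with a small sketch. It assumes that per-example compute grows linearly with the pixel count (roughly true for convolutional layers, but an assumption nonetheless); the function name and the "high-resolution" size are made up for illustration.

```python
def compute_factor(width_px: int, height_px: int, base_px: int = 244) -> float:
    """Per-example compute relative to a base_px x base_px image.

    Assumption: cost is proportional to the number of pixels.
    """
    return (width_px * height_px) / (base_px * base_px)

# Doubling width and height quadruples the work.
print(compute_factor(488, 488))

# A hypothetical "high-resolution" input, chosen only for illustration.
print(round(compute_factor(7700, 7700)))
```

Under this assumption, an image roughly 32 times wider and taller than the downscaled one already costs about 1,000 times more compute per example.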

10) Run training just for the right time!

When showing scalability with processors make sure to show training for a fixed wall-time. So you can cram twice as many flop/s on twice as many processors. Who cares about application/convergence speedup after all as long as we have flop/s? If your convergence plots behave oddly (e.g., diverge after some time), just cut them off at random points.

If this is all too complex, then just separate speedup plots from convergence plots. Show convergence plots for the processor counts where they look best and scalability plots for, of course, much larger numbers of processes! There are also many tricks when plotting the number of epochs with varying batch size and varying numbers of processes (when the batch size changes the number of iterations).

In general, now seriously, convergence speed should always be bound to the number of operations (i.e., epochs or number of examples processed).
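As a minimal sketch of what binding convergence to work done looks like, the snippet below converts iteration counts into examples and epochs processed. All numbers are invented for illustration and are not measurements; the function names are hypothetical.

```python
def examples_processed(iterations: int, global_batch_size: int) -> int:
    """Total training examples consumed after `iterations` steps."""
    return iterations * global_batch_size

def epochs_processed(iterations: int, global_batch_size: int,
                     dataset_size: int) -> float:
    """Same quantity expressed in passes over the dataset."""
    return examples_processed(iterations, global_batch_size) / dataset_size

DATASET_SIZE = 1_280_000  # hypothetical dataset size

# Hypothetical runs: iterations needed to reach the same target accuracy.
runs = {
    "small batch": {"global_batch_size": 256, "iterations": 10_000},
    "large batch": {"global_batch_size": 4096, "iterations": 1_000},
}

for name, run in runs.items():
    n = examples_processed(run["iterations"], run["global_batch_size"])
    e = epochs_processed(run["iterations"], run["global_batch_size"], DATASET_SIZE)
    print(f"{name}: {run['iterations']} iterations, {n} examples, {e:.1f} epochs")
```

Plotted against iterations, the large-batch run looks ten times faster; plotted against examples processed, it needed 1.6 times more work to reach the same accuracy.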

11) Minibatch sizing for fun and profit – weak vs. strong scaling.

We all know about weak vs. strong scaling, i.e., the simpler case when the input size scales with the number of processes and the harder case when the input size is constant. In the end, deep learning is all strong scaling because the model size is fixed and the total number of examples is fixed. However, one can cleverly utilize the minibatch sizes. Here, weak scaling keeps the minibatch size per process constant, which essentially grows the global minibatch size. Yet, the total epoch size remains constant, which causes fewer iterations per epoch and thus fewer overall communication rounds. Strong scaling keeps the global minibatch size constant. Both have VERY different effects on convergence: weak scaling eventually worsens convergence because it reduces stochasticity, and strong scaling does not.

In seriousness, however, microbatching that doesn’t change the statistical convergence properties is always fine.
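The two minibatch regimes can be sketched in a few lines. The dataset size, batch sizes, and function names below are hypothetical and only serve to show how the global minibatch and the iterations per epoch change with the process count.

```python
import math

def iterations_per_epoch(dataset_size: int, global_batch: int) -> int:
    return math.ceil(dataset_size / global_batch)

def weak_scaling(dataset_size: int, per_process_batch: int, processes: int):
    """Per-process minibatch fixed: the global minibatch grows with processes."""
    global_batch = per_process_batch * processes
    return global_batch, iterations_per_epoch(dataset_size, global_batch)

def strong_scaling(dataset_size: int, global_batch: int, processes: int):
    """Global minibatch fixed: each process gets a smaller share of it."""
    per_process_batch = global_batch / processes
    return per_process_batch, iterations_per_epoch(dataset_size, global_batch)

DATASET_SIZE = 1_000_000  # hypothetical

for p in (1, 8, 64):
    gb, it_weak = weak_scaling(DATASET_SIZE, per_process_batch=32, processes=p)
    pb, it_strong = strong_scaling(DATASET_SIZE, global_batch=256, processes=p)
    print(f"p={p}: weak -> global batch {gb}, {it_weak} iterations/epoch; "
          f"strong -> per-process batch {pb}, {it_strong} iterations/epoch")
```

In the weak-scaling case the number of iterations (and thus communication rounds) per epoch shrinks as processes are added, while the optimizer sees an ever larger global minibatch; in the strong-scaling case the optimizer sees the same minibatch regardless of the process count.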

12) Select carefully how to compare to the state of the art!

Last but not least, another obvious case: very often, deep learning is used as a replacement for an existing technique. If this is the case, you should only compare accuracy *or* performance. Especially if it’s unlikely that your model is good in both ;-).

Here are the slides presented at the IPAM workshop.

Sep272018

Nue Routing: fast, 100% fault-tolerant, 100% applicable, 100% deadlock-free


The OFA just released a new Open Subnet Manager version (v3.3.21) for InfiniBand, including many interesting features:

Support for HDR link speed and 2x link width
New routing algorithm: Nue routing
Support for ignoring of throttled links for Nue [1,2] and (DF)SSSP [3,4] routing
…and many more internal enhancements to OpenSM.

Nue Routing

Deadlock-freedom in general, but also the limited number of virtual channels provided in modern interconnects, has been a long-standing problem for network researchers and engineers.
Nue routing is not just yet another new algorithm for statically routed high-performance interconnects, but a revolutionary step with respect to deadlock-freedom and fault-tolerance.

Our goal was to combine advantages of existing routing algorithms, primarily the flexibility of Up/Down routing and outstanding global path balancing of SSSP routing [5], while guaranteeing deadlock-freedom regardless of number of virtual channels/lanes or network type or size.
The incarnation of this effort, called Nue routing, derived from the legendary Japanese chimera, is the first algorithm capable of delivering high throughput, low latency, fast path calculation, and 100% guaranteed deadlock-freedom for any type of topology and network size.
All of this is enabled by the fundamental switch from calculating the routing within a graph representing the network to a new graph representation: the complete channel dependency graph.

Without going into detail about the inner workings, which can be found in our HPDC’16 publication [1] and Jens’ dissertation [2; Chapter 6], we will highlight Nue’s capabilities with the next two figures.

The figure below compares many existing routing algorithms of the OpenSM (we excluded MinHop and DOR, since these are only deadlock-free under certain constraints) to our Nue routing for a variety of network topologies, hosting roughly between 1000 and 2000 compute nodes each.
We have been using a cycle-accurate InfiniBand simulator to obtain these results.
Each bar represents the simulated communication throughput for an MPI_Alltoall operation (2KB payload per node) executed on all compute nodes of the topology, and hence a pretty accurate estimate of the capabilities of the network and how well the routing is able to utilize the available resources.
For many subplots, only a subset of OpenSM’s routing engines is shown alongside Nue, because we filtered out instances where the routing engine was not able to create valid routing tables.
Above each bar we list the number of virtual channels this routing will consume to achieve a deadlock-free routing configuration.
Furthermore, the achievable network throughput under the given traffic pattern is shown for Nue routing with different numbers of virtual channels, ranging from 1 (equivalent to the absence of VCs) to 8.

nue-perf

In summary, the figure shows that Nue routing is competitive with the best-performing routing for each individual topology, and offers between 84% (for the 10-ary 3-tree) and 121% (for the Cascade network) of that routing’s throughput.
Occasionally, depending on the given number of virtual channels, Nue is able to outperform the best competitor.
While our original design goals never included the ambition to beat each and every other routing on its home turf, we are glad to see that we can outperform most of them given a sufficient number of channels.
However, this figure also demonstrates the high flexibility w.r.t. the given number of channels.
Take for example the Kautz network (left; middle row), where Nue can create a decent deadlock-free routing configuration without virtual channels, while DFSSSP needs 8 VCs and LASH needs at least 5 VCs, but Nue is also able to outperform both with just 5 VCs.

The next figure demonstrates Nue’s fault-tolerance as well as the relatively fast path calculation in comparison to other topology-agnostic routing engines (DFSSSP/LASH) and the topology-aware Torus2QOS engine.
For this test we used regular 3D tori networks of different sizes and randomly injected 1% switch-to-switch link failures into each topology.
The runtime for calculating all n-to-n paths in the network was measured for each routing engine and plotted, but only in cases where the engine was capable of producing a valid routing within the realistic 8VC constraint.
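The failure-injection setup described above can be sketched as follows. This is not the simulator or tooling used for the figure; the function names, the seed, and the torus size are assumptions for illustration only.

```python
import itertools
import random

def torus_links(dims):
    """Undirected switch-to-switch links of a 3D torus.

    Each switch is an (x, y, z) tuple. Assumes every dimension is >= 3,
    so that wrap-around links are distinct from the regular ones.
    """
    links = set()
    for node in itertools.product(*(range(d) for d in dims)):
        for axis, size in enumerate(dims):
            neighbor = list(node)
            neighbor[axis] = (node[axis] + 1) % size
            links.add(frozenset((node, tuple(neighbor))))
    return links

def inject_failures(links, fraction, seed=0):
    """Randomly remove `fraction` of the links; returns (remaining, failed)."""
    rng = random.Random(seed)
    n_fail = round(len(links) * fraction)
    # Sort first so that the random choice is reproducible for a given seed.
    failed = set(rng.sample(sorted(links, key=sorted), n_fail))
    return links - failed, failed

links = torus_links((5, 5, 5))
remaining, failed = inject_failures(links, fraction=0.01)
print(len(links), len(failed), len(remaining))
```

A k×k×k torus has 3k³ links, so even 1% failures on a moderately sized torus already breaks the regular structure that topology-aware engines rely on.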

nue-runtime

Thanks to its O(n² * log n) runtime complexity and efficient implementation, Nue is starting to outperform DFSSSP and LASH with respect to runtime already for relatively small tori.
But more importantly, Nue can always create deadlock-free routing tables, while all other engines (even the semi-fault-tolerant and topology-aware Torus2QOS) eventually fail for larger networks.

Overall the advantages of Nue routing are manifold:

Allowing “fire-and-forget” approach for network administration, i.e., works 100% regardless of network failures which is ideal for fail-in-place networks
Low runtime and memory complexity (O(n² * log n) and O(n²), respectively)
Guaranteed deadlock-freedom and highly configurable in terms of VC usage
VCs not necessary for deadlock-freedom, which extends possible application to NoC and other interconnects which don’t support virtual channels
Completely topology-agnostic and yet very good path balancing under the given deadlock-freedom constraint
Support for QoS and deadlock-freedom simultaneously (both realized in InfiniBand via VCs)
Theoretically applicable to other (HPC) interconnects: RoCEv2, NoC, OPA, …

Everyone can now test and use Nue routing with the OpenSM v3.3.21 release by choosing it either via a command line option:

--routing_engine nue [and optionally: --nue_max_num_vls <given #VCs>]

or via OpenSM configuration file:

routing_engine nue
nue_max_num_vls <given #VCs>

The default nue_max_num_vls for Nue is assumed to be equal to 1 to enforce deadlock-freedom even if QoS is not enabled.

For less adventurous admins ☺, or systems with specifically optimized routing, we still recommend always using Nue as a fallback (in case the primary routing fails) via:

routing_engine <primary>,nue

to ensure maximum fault-tolerance and uninterrupted operation of the system until the hardware failures are fixed (which is definitely better than the default fallback behavior to the deadlock-prone MinHop by OpenSM).

A more detailed description of OpenSM’s options for Nue is provided in the documentation, and for more fine-grained control over the virtual channel configuration we recommend reading our previous blog post on the DFSSSP routing engine.
(Note: it is HIGHLY advised to install/use the METIS library with OpenSM (enforced via --enable-metis configure flag when building OpenSM) for improved path balancing in Nue.)

Avoiding throttled links

The second new feature we were able to push upstream is designed to ease the job of system admins in case of temporary or long-term link degradation.

More often than one would wish, one or multiple links in large-scale InfiniBand installations get throttled from their intended speed (e.g., 100Gbps EDR) to much lower speeds, like 8Gbps SDR.
While this IB feature is designed to keep the fabric and connectivity up, we argue that such a throttled link will be a major bottleneck to all application and storage traffic, and hence should be avoided.
Usually, HPC networks, especially fat-trees, have enough path-redundancy, such that moving all paths from the affected link(s) and distributing them to other links should have less performance degradation effects than keeping the link in low speed.
However, identifying, disabling, and ultimately replacing “bad” cables takes time.

So, we added a check to the SSSP, DFSSSP, and Nue routing engines to identify such degraded links, which prevents these routings from placing any path onto the links, essentially instantly “disabling” the link and issuing a warning in the logs for the system admin.
This feature can be turned on or off in the configuration file of the subnet manager by switching the avoid_throttled_links parameter to TRUE or FALSE, respectively.
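The idea can be illustrated with a toy shortest-path computation that simply refuses to use degraded links. This is not OpenSM’s implementation: the criterion for a throttled link (any speed below the nominal speed), the hop-count metric, and all names except avoid_throttled_links are assumptions made for this sketch.

```python
import heapq

def shortest_paths(links, source, nominal_gbps, avoid_throttled_links=True):
    """Hop-count Dijkstra on an undirected fabric.

    links: iterable of (switch_a, switch_b, speed_gbps).
    Returns (distance per switch, list of warnings for skipped links).
    """
    adjacency, warnings = {}, []
    for a, b, speed in links:
        if avoid_throttled_links and speed < nominal_gbps:
            warnings.append(f"link {a}-{b} throttled to {speed} Gbps, not used")
            continue
        adjacency.setdefault(a, []).append(b)
        adjacency.setdefault(b, []).append(a)

    dist = {source: 0}
    queue = [(0, source)]
    while queue:
        d, u = heapq.heappop(queue)
        if d > dist[u]:
            continue  # stale queue entry
        for v in adjacency.get(u, []):
            if d + 1 < dist.get(v, float("inf")):
                dist[v] = d + 1
                heapq.heappush(queue, (d + 1, v))
    return dist, warnings

# Four switches in a ring; the s2-s3 link has dropped from 100 to 8 Gbps.
fabric = [("s1", "s2", 100), ("s2", "s3", 8), ("s3", "s4", 100), ("s4", "s1", 100)]
dist, warnings = shortest_paths(fabric, "s2", nominal_gbps=100)
print(dist["s3"], warnings)
```

With the check enabled, traffic from s2 to s3 takes the three-hop detour over healthy links instead of the single throttled hop, and the admin gets a log entry pointing at the bad cable.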

Nue and DFSSSP were developed in collaboration between the main developer Jens Domke at the Matsuoka Laboratory, Tokyo Institute of Technology, and Torsten Hoefler of the Scalable Parallel Computing Lab at ETH Zurich.
We would like to acknowledge Hal Rosenstock, the maintainer of OpenSM, who is always supportive of new ideas, and we greatly appreciated his comments and help during the integration of Nue into the official OpenSM.

[1]: J. Domke, T. Hoefler and S. Matsuoka: Routing on the Dependency Graph: A New Approach to Deadlock-Free High-Performance Routing
[2]: J. Domke: Routing on the Dependency Graph: A New Approach to Deadlock-Free, Destination-Based, High-Performance Routing for Lossless Interconnection Networks (Dissertation)
[3]: J. Domke, T. Hoefler and W. Nagel: Deadlock-Free Oblivious Routing for Arbitrary Topologies
[4]: Our prev. DFSSSP blog post: DFSSSP: Fast (high-bandwidth) Deadlock-Free Routing for InfiniBand Networks
[5]: T. Hoefler, T. Schneider and A. Lumsdaine: Optimized Routing for Large-Scale InfiniBand Networks

Jun242018

SPCL’s activities at ISC’18


Just a brief overview of SPCL’s (non-NDA) ongoing and upcoming activities at ISC’18.

1) We’re in the middle of the Advanced MPI Tutorial

With Antonio Peña from Barcelona Supercomputing Center, Tweet

2) Wednesday, 26.06., 11:15am, Talk: Automatic compiler-driven GPU acceleration with Polly-ACC

Part of the session “Challenges for Developing & Supporting HPC Applications” organized by Bill Gropp. (Related work)

3) Wednesday, 26.06., 1:45pm, Torsten organizes the session “Data Centric Computing” with speakers Anshu Dubey, Felix Wolf, John Shalf, and Keshav Pingali

4) Thursday, 28.06., 10:00am, Talk: High-level Code Transformations for Generating Fast Hardware
(Megabyte room)

At Post Moore’s Law HPC Computing (HCPM) workshop (Related work)

5) Thursday, 28.06., 12:20pm, Talk: Demystifying Parallel and Distributed Deep Learning: An In-Depth Concurrency Analysis
(Gold 3 room)

At Workshop on the Convergence of Large Scale Simulation and Artificial Intelligence (Related work)

6) Thursday, 28.06., 3:20pm, Talk: A Network Accelerator Programming Interface
(Megabyte room)

At Post Moore Interconnects (Beyond CMOS) Workshop (Related work)

7) Thursday, 28.06., Panel: Performance Analysis and Instrumentation of Networks
(Basalt room)

At International Workshop on Communication Architectures for HPC, Big Data, Deep Learning and Clouds at Extreme Scale (Related work)

8) Friday, 29.06., European Processor Initiative (EPI) Steering Meeting

In addition to these public appearances, we’re involved in many meetings, vendor presentations, booth appearances, and other activities. Meet us around the conference and booths!

Jan272018

SC18’s improved reviewing process – call for papers and comments


Disclaimer: This blog post is not binding for the SC18 submission process. It attempts to explain the background and history of the innovations. For authoritative answers regarding the process, authors MUST refer to the SC18 webpage and FAQ!

What many of us know can also be shown with numbers: The SC conference is the most prestigious conference in High Performance Computing (HPC). It is listed as rank 6 in the “Computing Systems” Category in Google Scholar’s Metrics (with H-index 47 on January 21st 2018). It is only topped by TPDS, FGCS, NSDI, ISCA, and ASPLOS and thus the highest ranked HPC conference! The next one is arguably PPoPP with H-index 37 and rank 20.

The SC conference routinely attracts more than 10,000 attendees and nearly 50% indicated in a representative survey that attending technical presentations was within their top-3 activities. This makes it definitely the HPC conference where speakers reach the largest audience. I speak from experience: my talk at SC17 probably had more than 400 listeners in the audience and its twitter announcement quickly surpassed 10,000 views. So it definitely is the conference where big things start.

This year, I am honored to be SC18’s program chair, with the enormous help of my vice chair Todd Gamblin from LLNL. To make this great conference even greater, especially for authors and readers/attendees, we plan some major changes to the submission process: In addition to rebuttals, we introduce two different types of revisions during the submission. This allows the authors to address reviewer issues right within the paper draft while they may also add new data to support their discoveries. Rebuttals are still possible but will probably become less important because misunderstandings can be clarified right in the draft. Whether the paper is accepted or rejected, the authors will have an improved version. The revision process leads to an increased interaction between the committee and the authors, which eventually will increase the quality of the publications and talks at the conference. The overall process could be described as an attempt to merge the best parts of the journal review process (expert reviewers and revisions) with the conference review process (fixed schedule and quick turnaround).

This process has been tested and introduced to the HPC field by David Keyes and myself at the ACM PASC 2016 conference in Switzerland. We were inspired by top-class conferences in the fields of architecture and databases but adapted their process to the HPC community. The established PASC review process motivated the addition of revisions for IPDPS 2018 (through the advocacy of Marc Snir). Now, we introduce similar improvements scaled to the Supercomputing conference series.

The key innovations of the PASC review process were (1) no standing committee (the committee was established by the chairs based on the submissions, similar to a journal); (2) fully double-blind reviews (not even the TPC chairs knew the identity of the authors); (3) short revisions of papers (the authors could submit revised manuscripts with highlighted changes), and (4) expert reviewers (the original reviewers were asked to suggest experts in the topic for a second round of reviews). The results are documented in a presentation and a paper.

My personal highlight was a paper in my area that improved its ranking drastically from the first to the second review because it was largely rewritten during the revision process. In general, the revision seemed highly effective as the statistics show: of the 105 first reviews, 19 improved their score by 1 point, and 2 improved it by two points in the second review. Points ranged from 1 (strong reject) to 5 (strong accept). These changes show how revisions improved many reviewers’ opinions of the papers and turned good papers into great papers. The revision even enabled the relatively high acceptance rate of 27% without compromising quality. The expert reviews also had a significant effect, which is analyzed in detail in the paper.

The Supercomputing conference has a long history and an order of magnitude more submissions and thus a much larger committee with a fixed structure spanning many areas. Furthermore, the conference is aligned to a traditional schedule. All this allows us to only adopt a part of the changes successfully tested at PASC. Luckily, double-blind reviews were already introduced in 2016 and 78% of the surveyed attendees preferred them over non-double-blind reviews. Thus, we can focus our attention on introducing the revision process as well as the consideration of expert reviews.

Adapting the revision process to SC was not a simple task because schedules are set years in advance. For example, the deadline cannot be moved earlier than the end of March due to the necessary coordination with other top-class conferences such as ACM HPDC and ACM ICS (which is already tight, but doable, this year). We will also NOT grant the “traditional” one week extension. Let me repeat: there will be NO EXTENSIONS this year (like in many other top-class CS conferences). Furthermore, the TPC meeting has already been scheduled for the beginning of June and could not be moved for administrative reasons. The majority of the decisions have to be made during that in-person TPC meeting. We will also have to stay within the traditional acceptance rates of SC. We conclude that significant positive changes are possible within the limited options.

To fit the revision process into the SC schedule, we allow authors to submit a featherweight revision two weeks after receiving the initial reviews. This is a bit more time than for the rebuttal but may not be enough for a full revision. But the authors are free to prepare it before receiving the reviews. Even in the case of a later rejection I personally believe that improving a paper is useful. Each featherweight revision should be marked up with the changes very clearly (staying within the page limit). The exact markup method is left to the authors. In addition, the limited-length rebuttal could be used to discuss the changes. The authors need to keep in mind that the reviewers will have *very little* time (less than one week before the TPC meeting) to review the featherweight revision. In fact, they will have barely more time than for reviewing a rebuttal. So the more obviously the changes are marked and presented, the better the chances for a reconsideration by the committee. Furthermore, due to these unfortunate time limitations, we cannot provide a second round of reviews for the featherweight revision (reviewers are free to amend their reviews but we cannot require them to). Nevertheless, we strongly believe that all authors can use this new freedom to improve their papers significantly. We are also trying to provide some feedback on the paper’s relative ranking to the authors if the system allows this.

During the in-person TPC meeting, the track chairs will moderate the discussion of each paper and rank each in one of the following categories: Accept, Minor Revision, Major Revision, or Reject. An accepted paper is deemed suitable for direct publication in the SC proceedings; we expect the top 3-5% of the submitted papers to fall into that category. A Minor Revision is similar to a shepherded paper and is accepted with minor amendments, pending a final review by the shepherd; we expect about 10% of the submitted papers to fall into this category. This higher-than-traditional number of shepherded papers is consistent with top conferences in adjacent fields such as OSDI, NSDI, SOSP, SIGMOD, etc. The new grade is Major Revision, which invites the authors to submit a majorly changed paper within one month. A major revision typically requires additional results or analyses. We expect no more than 10% of the initial submissions to fall in this category; about 5% will finally be accepted (depending on the final quality). Major revision papers will be reviewed again and a final decision will be made during an online TPC discussion, moderated by the respective track chair. Finally, papers rejected at any stage will not appear in the SC proceedings.

Regarding expert reviews, we may invite additional reviewers during any stage of the process. Thus, we ask authors to specify all strong conflicts (even people outside the current committee) during the initial submission. Furthermore, we are planning to have reviewers review reviews by the other reviewers to improve the quality of the process in the long run.

At the end of this discussion, let me place a shameless plug for efforts to improve performance interpretability :-): We hope that the state of performance reporting can be improved at SC18. While many submissions use excellent scientific methods for evaluating performance on parallel computing systems, some can be improved following very simple rules. I made an attempt to formalize a set of basic rules for performance reporting in the SC15 State-of-the-Practice paper “Scientific Benchmarking of Parallel Computing Systems”. I invite all authors to follow these rules to improve their submissions to any conference (they are of course NOT a prerequisite for SC18 but generally useful 😉 ).

We are very much looking forward to working with the technical papers team to make SC18 the best technical program ever and consolidate the leading position of the SC conference series in the field of HPC. Please let me or Todd know if you have any comments, make sure to submit your best work to SC18 before March 28, and help us to make SC18 have the strongest paper track ever!

I want to especially thank David Keyes for advice and help during PASC’16, Todd Gamblin for the great support for the organization of SC18, and Bronis de Supinski for ideas regarding the adaptation of the PASC process to the SC18 conference. Most thanks go to the track chairs and vice chairs that will support the implementation of the process during the SC18 paper selection process (in the order of the tracks): Aydin Buluc, Maryam Mehri Dehnavi, Erik Draeger, Allison Baker, Si Hammond, Madeleine Glick, Lavanya Ramakrishnan, Ioan Raicu, Rob Ross, Kelly Gaither, Felix Wolf, Laura Carrington, Pat McCormick, Naoya Maruyama, Bronis de Supinski, Ashley Barker, Ron Brightwell, and Rosa Badia. And last but not least the 200+ reviewers of the SC18 technical papers program!

Nov192017

SPCL activities at SC17

htorHPC • Science • US

SC17 is over, and even though it was my 10th anniversary, it wasn’t the best of the SC series. Actually, if you ask me personally, probably the worst but I promised to not discuss details here. Fortunately, I’ll be tech papers chair with Todd Gamblin as a vice next year, so we’ll make sure to remain purely technical. The SC series is and remains strong!

SPCL was again present in many areas across the technical program. Konstantin, Tobias, Salvatore, and I were involved in many things. Here are the fourteen most significant appearances:

1) Sunday: Torsten presented Capability Models for Manycore Memory Systems: A Case-Study with Xeon Phi KNL and the COSMO Weather Code at the Intel HPC Developer’s conference

Room was packed and people were standing :-). Slides

2) Sunday: Salvatore presented LogGOPSim version 2 at the ExaMPI workshop

3) Monday: Tobias talks about “Improved Loop Distribution in LLVM Using Polyhedral Dependences” at the LLVM workshop [program]

4) Monday: Torsten co-presents the Advanced MPI Tutorial [program]

5) Monday: Torsten presents at the Early Career Panel how to publish [program]

6) Monday: Salvatore presents his work on SimFS at the PDSW workshop

7) Tuesday: Torsten presents the sPIN talk at the TiTech booth

8) Tuesday: Torsten talks at the 25 years of MPI and 20 years of OpenMP celebration at the Intel booth

MPI+MPI or MPI+OpenMP is the question :-).

9) Tuesday: Torsten appears at the SIGHPC annual members meeting as an elected member (slightly late due to the Intel celebration)

10) Tuesday: Konstantin presents his poster Unifying Replication and Erasure Coding to Rule Resilience in KV-Stores at the poster reception

11) Wednesday: Torsten presents the sPIN paper in the technical program

Room was full. Unfortunately, the session chair’s clock was wrong, so we started 5 mins early and people streamed in late :-(. Sorry! But that was the smallest thing that was wrong with this …

12) Wednesday: Salvatore presents his poster on Virtualized Big Data: Reproducing Simulation Output on Demand as ACM SRC semi finalist

13) Thursday: Edgar presents the paper Scaling Betweenness Centrality Using Communication-Efficient Sparse Matrix Multiplication in the technical program

14) Friday: Torsten co-organizes the H2RC workshop

Triple room was packed (~150-200 people during the keynote).

Sep262017

Persistent Collective Operations in MPI-3 for free!

htorHPC • MPI • Science

dandelion
source: thewishwall.org

We discussed persistent collectives at the MPI Forum last week. It was a great meeting and the discussions were very insightful. I really like persistent collectives and believe that MPI implementors should support them!

In that context, I wanted to note that implementors can do this easily and elegantly in MPI-3 without any changes to the standard. We used this technique already in 2012 in the paper “Optimization Principles for Collective Neighborhood Communications”. But let me recap the idea here.

The key ingredients are communicators (MPI’s name for immutable process groups) and Info objects. Info objects are a mechanism for users to pass additional information about how they will use MPI to the library. Info objects are very similar to pragmas in C/C++. Some info strings are defined by the standard itself but MPI libraries may add arbitrary strings of their own.

So one way to specify a persistent collective is now to duplicate the communicator to create a new name, e.g., my_persistent_comm. On this communicator, the user can specify an info object to make specific operations persistent, e.g., mympi_bcast_is_persistent. The MPI library is encouraged to choose a prefix specific to itself (in this case “mympi”).

The library can now set a flag on the communicator, and each broadcast call checks this flag to determine whether it is persistent. By passing this info object, the user guarantees that the function arguments passed to the specific call (e.g., bcast) on this communicator will always be the same. Thus, the MPI library can specialize the call to the arguments (i.e., implement all optimizations possible for persistence) once it has seen the first invocation of MPI_Ibcast().
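The library-side logic can be sketched with a toy model. ToyComm below is not an MPI binding and none of its methods are real MPI calls; it only mimics a communicator that carries info strings. The key mympi_bcast_is_persistent is the example from the text, while the value "true" and the shape of the cached plan are assumptions.

```python
class ToyComm:
    """Toy stand-in for an MPI communicator (NOT a real MPI API)."""

    def __init__(self, info=None):
        self.info = dict(info or {})
        self._bcast_plan = None   # cached schedule for persistent bcasts
        self.plans_built = 0      # counts how often the expensive setup ran

    def dup_with_info(self, info):
        """Duplicate the communicator and attach additional info strings."""
        return ToyComm({**self.info, **info})

    def ibcast(self, buf, count, root):
        persistent = self.info.get("mympi_bcast_is_persistent") == "true"
        if persistent and self._bcast_plan is not None:
            return self._bcast_plan          # arguments are guaranteed equal
        plan = self._build_plan(buf, count, root)
        if persistent:
            self._bcast_plan = plan          # specialize after the first call
        return plan

    def _build_plan(self, buf, count, root):
        self.plans_built += 1                # stands in for costly optimization
        return {"buffer_id": id(buf), "count": count, "root": root}


world = ToyComm()
my_persistent_comm = world.dup_with_info({"mympi_bcast_is_persistent": "true"})

buf = bytearray(1024)
for _ in range(3):
    world.ibcast(buf, 1024, root=0)
    my_persistent_comm.ibcast(buf, 1024, root=0)

print(world.plans_built, my_persistent_comm.plans_built)
```

In a real MPI-3 library, the user-facing side would go through MPI_Comm_dup_with_info or MPI_Comm_set_info, and the cached plan would hold whatever the implementation can precompute for the fixed arguments.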

This interface is very flexible; one could even imagine various levels of persistence as defined in our 2012 paper: (1) persistent topology (this is implicit in normal and neighborhood collectives), (2) persistent message sizes, and (3) persistent buffer (sizes and addresses). We describe in the paper optimizations for each level. These levels should be considered in any MPI specification effort.

I agree that having some official support for persistence in the standard would be great but these levels and info arguments should at least be discussed as an alternative. It seems like big parts of the MPI Forum are not aware of this idea (this is part of why I write this post 😉 ).

Furthermore, I am mildly concerned about feature-inflation in MPI. Adding more and more features that are not optimized because they are not used, because they have not been optimized, because they were not used … may not be the best strategy. Today’s MPIs are not great at asynchronous progression of nonblocking collectives, and the performance of neighborhood collectives and MPI-3 RMA is mostly unconvincing. Maybe the community needs some time to optimize and use those features. At the 25 years of MPI symposium, it became clear that big parts of the community share a similar concern.

Keep the great discussions up!