SPCL is back at SC – after a forced break last year, when we could not submit because I was technical papers co-chair, we celebrate our comeback this year (with some backlog from last year, of course). The SPCL team will have eight papers in the main track and many other appearances (co-organizing workshops, several workshop papers, two tutorials, panels, BoFs, a Gordon Bell finalist, etc.). But the highlight is still the main-track contributions. Here is an overview, in no particular order:
1) Tal Ben-Nun, Johannes de Fine Licht, Alexandros Nikolaos Ziogas, Timo Schneider, Torsten Hoefler: “Stateful Dataflow Multigraphs: A Data-Centric Model for Performance Portability on Heterogeneous Architectures”
We present Stateful Dataflow Multigraphs (SDFGs), a new intermediate representation that can generate top-performing code for CPU, GPU, and FPGA. The SDFG is a data-centric intermediate representation that enables separating program definition from its optimization, thereby allowing performance engineers to interactively optimize applications independently from the source code. Several languages and frameworks can be compiled to SDFGs, including Python (with a numpy-like interface), TensorFlow, and a subset of MATLAB. We show that SDFGs deliver competitive performance on a wide variety of applications — from fundamental kernels to graph analytics — allowing domain scientists to develop applications naturally and port them to approach peak hardware performance, without sacrificing the clarity of the original code.
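To give a flavor of the numpy-like Python frontend, here is a minimal sketch; it loosely follows the DaCe frontend that accompanies the paper, and the axpy kernel and symbol names are purely illustrative rather than an example from the paper:

```python
# Illustrative sketch: a numpy-like kernel that a DaCe-style frontend can parse
# into an SDFG and then compile for CPU, GPU, or FPGA.
import numpy as np

try:
    import dace  # assumes the SDFG frontend is installed

    N = dace.symbol('N')  # symbolic size, resolved when the program is called

    @dace.program
    def axpy(a: dace.float64, x: dace.float64[N], y: dace.float64[N]):
        y[:] = a * x + y  # reads and writes become explicit data movement

    x = np.random.rand(1024)
    y = np.random.rand(1024)
    axpy(2.0, x, y)          # compiles the SDFG and runs it
    sdfg = axpy.to_sdfg()    # the IR a performance engineer would transform
except ImportError:
    pass  # the sketch only illustrates the programming style
```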
2) Grzegorz Kwasniewski, Marko Kabic, Maciej Besta, Joost VandeVondele, Raffaele Solcà, Torsten Hoefler “Red-Blue Pebbling Revisited: Near Optimal Parallel Matrix-Matrix Multiplication”
Starting from the red-blue pebble game abstraction, we established a constructive proof of I/O lower bounds for sequential and parallel matrix multiplication, covering all combinations of parameters – matrix dimensions, memory sizes, and numbers of processors. Combined with a series of implementation optimizations, this allowed us to outperform state-of-the-art, highly tuned libraries such as ScaLAPACK and CTF in all scenarios, by a factor of 2.2x on average.
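For readers new to I/O lower bounds, this is only the asymptotic shape of the bounds in question; the paper proves tight constants for all ranges of these parameters, which are not reproduced in this sketch:

```latex
% Asymptotic shape only (m, n, k: matrix dimensions; S: fast-memory size;
% p: number of processors); the tight constants are derived in the paper.
\[
  Q_{\mathrm{seq}} = \Omega\!\left(\frac{m n k}{\sqrt{S}}\right),
  \qquad
  Q_{\mathrm{par}} = \Omega\!\left(\frac{m n k}{p\,\sqrt{S}}\right)
\]
```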
3) Maciej Besta, Simon Weber, Lukas Gianinazzi, Robert Gerstenberger, Andrey Ivanov, Yishai Oltchik, Torsten Hoefler: “Slim Graph: Practical Lossy Graph Compression for Approximate Graph Processing, Storage, and Analytics”
We developed Slim Graph, the first programming model and framework for practical lossy graph compression. Slim Graph is the result of an analysis of more than 500 existing papers on graph compression, and it can express major graph compression classes such as spanners, spectral sparsifiers, edge sampling, and lossy summarization. Within Slim Graph, we propose a class of graph compression schemes called Triangle Reduction that preserves different graph properties, depending on user needs. We also provide metrics for assessing the quality of lossy graph compression. Our design enables analyzing tradeoffs between performance, storage, and accuracy in the context of approximate graph processing, storage, and analytics.
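As a purely illustrative sketch of the triangle-reduction flavor (the function and removal rule below are ours and far simpler than Slim Graph's actual compression kernels): drop an edge that closes a triangle with some probability, so the edge count shrinks while connectivity is largely preserved.

```python
# Illustrative only: a naive triangle-reduction pass on an undirected graph
# stored as a dict of neighbor sets. Slim Graph's actual kernels and their
# accuracy guarantees differ.
import random

def triangle_reduce(adj, p=0.5, seed=0):
    """Probabilistically remove edges that (still) close a triangle."""
    rng = random.Random(seed)
    edges = [(u, v) for u in adj for v in adj[u] if u < v]
    for u, v in edges:
        # (u, v) still closes a triangle if u and v share a common neighbor
        if v in adj[u] and adj[u] & adj[v] and rng.random() < p:
            adj[u].discard(v)
            adj[v].discard(u)
    return adj

graph = {0: {1, 2}, 1: {0, 2, 3}, 2: {0, 1}, 3: {1}}
print(triangle_reduce(graph, p=1.0))  # drops one edge of the 0-1-2 triangle
```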
4) Di Girolamo, Taranov, Kurth, Schaffner, Schneider, Beranek, Besta, Benini, Roweth, Hoefler “Network-Accelerated Non-Contiguous Memory Transfers”
We keep exploring the network stream processing model with work on offloading MPI Derived Datatype (DDT) processing to the network card with sPIN. With network-accelerated DDTs, the NIC writes the data directly to its final position in the receive buffer, avoiding additional copies. We achieve up to 10x speedups for real-application DDTs and reduce host memory traffic by up to 3.8x.
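To make the setting concrete, here is a plain host-side mpi4py sketch of the kind of non-contiguous layout (a strided vector DDT) whose processing the paper offloads to the NIC; the sPIN offload path itself is not shown, since it is not a standard MPI call:

```python
# Host-side sketch of a non-contiguous receive via an MPI derived datatype
# (a strided vector). In the paper, processing of such layouts is offloaded
# to the NIC with sPIN instead of being unpacked by the MPI library on the CPU.
# Requires two MPI ranks.
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

# One column of an 8x4 row-major matrix: 8 blocks of 1 double with stride 4
ddt = MPI.DOUBLE.Create_vector(8, 1, 4).Commit()

if rank == 0:
    col = np.arange(8, dtype=np.float64)        # contiguous on the sender
    comm.Send([col, 8, MPI.DOUBLE], dest=1, tag=0)
elif rank == 1:
    mat = np.zeros(32, dtype=np.float64)        # scattered on the receiver
    comm.Recv([mat, 1, ddt], source=0, tag=0)   # data lands directly in place
    print(mat.reshape(8, 4)[:, 0])              # the received column

ddt.Free()
```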
5) De Sensi, Di Girolamo, Hoefler “Mitigating Network Noise on Dragonfly Networks through Application-Aware Routing”
A large part of application performance variability on Dragonfly networks is caused by the adaptive routing algorithm. We designed and implemented a software-only solution that automatically tunes the routing algorithm according to the application's characteristics. We validated our solution with microbenchmarks and real-world applications on both the Piz Daint and Cori supercomputers, showing a significant reduction in performance variability and up to a 2x speedup.
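As a toy illustration of the idea only (entirely hypothetical, not the paper's implementation or the actual routing interface): a per-message policy that biases routing depending on whether traffic is latency-sensitive or bandwidth-bound could look like this:

```python
# Entirely hypothetical sketch, not the paper's code: a per-message policy that
# biases routing according to application characteristics.
def choose_routing_bias(msg_size_bytes, latency_sensitive):
    """Toy policy: small or latency-critical traffic prefers minimal paths,
    large bandwidth-bound transfers lean on adaptive (non-minimal) routing."""
    if latency_sensitive or msg_size_bytes < 16 * 1024:
        return "bias-towards-minimal"
    return "fully-adaptive"

print(choose_routing_bias(512, latency_sensitive=True))               # minimal
print(choose_routing_bias(8 * 1024 * 1024, latency_sensitive=False))  # adaptive
```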
6) De Matteis, de Fine Licht, Beránek, Hoefler: “Streaming Message Interface: High-Performance Distributed Memory Programming on Reconfigurable Hardware”
We propose the Streaming Message Interface (SMI), a communication model and API for distributed memory programming in multi-FPGA systems. SMI unifies message passing with a hardware-oriented programming model: instead of bulk transmission, messages are streamed across the network during computation, allowing communication to be seamlessly integrated into pipelined FPGA designs. We present a high-level synthesis implementation of SMI targeting a dedicated FPGA interconnect, exposing runtime-configurable routing with support for arbitrary network topologies, and implement a set of distributed memory benchmarks.
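SMI itself is defined as a C++/HLS interface; to convey the streaming idea without reproducing that API, here is a plain-Python emulation with a bounded channel, where the names and structure are ours rather than SMI's:

```python
# Conceptual emulation of the streaming model: the sender pushes elements as it
# produces them and the receiver pops them as it consumes them, so
# communication overlaps with computation instead of happening as one bulk
# transfer at the end.
from queue import Queue
from threading import Thread

channel = Queue(maxsize=16)     # bounded, like an on-chip FIFO between ranks

def producer(n):
    for i in range(n):
        channel.put(i * i)      # "push": emit each element as soon as it exists

def consumer(n):
    total = 0
    for _ in range(n):
        total += channel.get()  # "pop": consume while the producer still runs
    print("sum of streamed elements:", total)

t = Thread(target=producer, args=(1000,))
t.start()
consumer(1000)
t.join()
```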
7) Ziogas, Ben-Nun, Fernández, Schneider, Luisier, Hoefler “Optimizing the Data Movement in Quantum Transport Simulations via Data-Centric Parallel Programming”
Using the Data-centric Parallel Programming (DAPP) framework, we optimized the quantum transport simulator OMEN, a two-time Gordon Bell Prize finalist. The data-centric viewpoint facilitated modeling the performance and communication behavior of the application based on its coarse- and fine-grained data-movement characteristics. The optimizations uncovered in the process led to a two-orders-of-magnitude performance improvement, achieving a sustained 85.45 Pflop/s on 4,560 nodes of Summit (42.55% of peak) in double precision, and 90.89 Pflop/s in mixed precision.
8) Renggli, Alistarh, Aghagolzadeh, Hoefler “SparCML: High-Performance Sparse Communication for Machine Learning”
In a collaboration with IST Austria and Microsoft, we developed a communication library that takes advantage of sparsity in machine learning and deep learning workloads. By only sending the top-k largest components of the gradient vector, we save up to 95% of the communication volume, leading to speedups of up to 20x. We also showed how to reduce the training time of a large speech assistant workload at Microsoft from weeks to days.
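A minimal sketch of the top-k idea (ours, not SparCML's implementation, which performs the selection and the sparse reductions inside an MPI-based library):

```python
# Illustrative sketch (not SparCML's code): top-k sparsification of a gradient
# before communication, keeping only the k largest-magnitude entries.
import numpy as np

def topk_sparsify(grad, k):
    idx = np.argpartition(np.abs(grad), -k)[-k:]  # indices of the k largest |g_i|
    return idx.astype(np.int64), grad[idx]        # (indices, values) to transmit

grad = np.random.randn(1_000_000)
idx, vals = topk_sparsify(grad, k=10_000)         # keep ~1% of the entries
print(f"sending {idx.nbytes + vals.nbytes} bytes instead of {grad.nbytes}")
```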
Preprints will be available on our publications webpage and on arXiv soon!