Publications of Torsten Hoefler
Martin Kuettler, Maksym Planeta, Jan Bierbaum, Carsten Weinhold, Hermann Haertig, Amnon Barak, Torsten Hoefler:

 Corrected Trees for Reliable Group Communication

(Feb. 2019, Accepted at The ACM Conference Principles and Practice of Parallel Programming 2019 (PPoPP'19))

Abstract

Driven by the ever-increasing performance demands of compute-intensive applications, supercomputing systems comprise more and more nodes. This growth places a significant burden on fast group communication primitives and also makes those systems more susceptible to failures of individual nodes. In this paper we present a two-phase fault-tolerant scheme for group communication. Using broadcast as an example, we provide a full-spectrum discussion of our approach, from a formal analysis to LogP-based simulations to a message-passing-based implementation running on a large cluster. Ultimately, we are able to reduce the complex problem of reliable and fault-tolerant collective group communication to a graph-theoretical renumbering problem. Both simulations and measurements show that our solution achieves a latency reduction of 50% with up to 6 times fewer messages sent in comparison to existing schemes.
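The abstract does not spell out the two phases, so the following is only a toy illustration of what a two-phase fault-tolerant broadcast might look like, not the authors' algorithm. It assumes a binary tree rooted at rank 0 for the first phase and a ring-style correction step for the second; the function names, the `distance` parameter, and the heap-style rank numbering are all hypothetical choices made for this sketch.

```python
def tree_phase(n, failed):
    """Phase 1 (assumed): binary-tree broadcast from rank 0.

    A failed rank neither receives nor forwards, so its whole
    subtree misses the message.
    """
    if 0 in failed:
        return set()
    reached = {0}
    frontier = [0]
    while frontier:
        next_frontier = []
        for rank in frontier:
            for child in (2 * rank + 1, 2 * rank + 2):
                if child < n and child not in failed:
                    reached.add(child)
                    next_frontier.append(child)
        frontier = next_frontier
    return reached


def correction_phase(n, failed, reached, distance):
    """Phase 2 (assumed): every rank that got the message in phase 1
    sends it once to its next `distance` neighbours on a ring."""
    corrected = set(reached)
    for rank in reached:
        for step in range(1, distance + 1):
            peer = (rank + step) % n
            if peer not in failed:
                corrected.add(peer)
    return corrected


def broadcast(n, failed, distance=1):
    """Run both phases; return the ranks reached after each phase."""
    failed = set(failed)
    after_tree = tree_phase(n, failed)
    after_fix = correction_phase(n, failed, after_tree, distance)
    return after_tree, after_fix


if __name__ == "__main__":
    n, failed = 15, {1}
    alive = set(range(n)) - failed
    for distance in (1, 2, 4):
        after_tree, after_fix = broadcast(n, failed, distance)
        print(distance, sorted(alive - after_tree), sorted(alive - after_fix))
```

In this toy setup, how far the correction step must reach depends on where the ranks of a lost subtree sit on the ring, which gives some intuition for why the numbering of ranks in the tree could matter. The actual scheme, its numbering, and its guarantees are described in the paper itself.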


BibTeX

@inproceedings{,
  author={Martin Kuettler and Maksym Planeta and Jan Bierbaum and Carsten Weinhold and Hermann Haertig and Amnon Barak and Torsten Hoefler},
  title={{Corrected Trees for Reliable Group Communication}},
  year={2019},
  month={Feb.},
  note={Accepted at The ACM Conference Principles and Practice of Parallel Programming 2019 (PPoPP'19)},
  source={http://www.unixer.de/~htor/publications/},
}


© Torsten Hoefler