Publications of Torsten Hoefler
Kartik Lakhotia, Kelly Isham, Laura Monroe, Maciej Besta, Torsten Hoefler, Fabrizio Petrini:

 In-network Allreduce with Multiple Spanning Trees on PolarFly

(In Proceedings of the 35th ACM Symposium on Parallelism in Algorithms and Architectures (SPAA'23), presented in Orlando, FL, USA, pages 165–176, Association for Computing Machinery, ISBN: 9781450395458, Jun. 2023)

Publisher Reference

Abstract

Allreduce is a fundamental collective used in parallel computing and distributed training of machine learning models, and can become a performance bottleneck on large systems. In-network computing improves Allreduce performance by reducing packets on the fly using network routers. However, the throughput of current in-network solutions is limited to the bandwidth of a single link. We develop, compare and contrast two different sets of Allreduce spanning trees embedded into PolarFly, a high-performance diameter-2 network topology. Both of our solutions offer theoretically guaranteed near-optimal performance, boosting Allreduce bandwidth by a factor equal to half the network radix of nodes. While our first set offers low latency with trees of depth 3, the second set offers a congestion-free implementation that reduces the complexity and resource requirements of in-network computing units. In doing so, we also distinguish PolarFly as a highly suitable network for distributed deep learning and other applications that employ throughput-bound large Allreductions.
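
The general idea of running an Allreduce over several spanning trees at once can be illustrated with a small software simulation. The sketch below is not the paper's construction: it assumes a toy 4-node complete graph rather than PolarFly, two hand-picked edge-disjoint spanning trees, and hypothetical helper names (`tree_allreduce`, `multi_tree_allreduce`). Each tree reduces and broadcasts its own slice of the vector, so in a real network each slice could travel over a different set of links.

```python
from typing import Dict, List, Optional

Tree = Dict[int, Optional[int]]  # node -> parent (None marks the root)


def tree_allreduce(values: Dict[int, List[int]], parent: Tree) -> Dict[int, List[int]]:
    """Sum-reduce per-node vectors up one spanning tree, then broadcast the result."""
    children: Dict[int, List[int]] = {node: [] for node in parent}
    root = None
    for node, par in parent.items():
        if par is None:
            root = node
        else:
            children[par].append(node)

    def reduce_up(node: int) -> List[int]:
        # Combine this node's vector with the partial sums of its subtrees.
        total = list(values[node])
        for child in children[node]:
            partial = reduce_up(child)
            total = [a + b for a, b in zip(total, partial)]
        return total

    result = reduce_up(root)
    # Broadcast: every node ends up with the fully reduced vector.
    return {node: list(result) for node in parent}


def multi_tree_allreduce(data: Dict[int, List[int]], trees: List[Tree]) -> Dict[int, List[int]]:
    """Split each vector into one slice per tree and reduce each slice on its own tree."""
    k = len(trees)
    n = len(next(iter(data.values())))
    bounds = [(i * n) // k for i in range(k + 1)]
    out = {node: [0] * n for node in data}
    for t, parent in enumerate(trees):
        lo, hi = bounds[t], bounds[t + 1]
        chunk = {node: vec[lo:hi] for node, vec in data.items()}
        reduced = tree_allreduce(chunk, parent)
        for node in data:
            out[node][lo:hi] = reduced[node]
    return out


# Toy example (assumption): complete graph on 4 nodes with two
# edge-disjoint spanning trees.
tree_a: Tree = {1: None, 0: 1, 2: 1, 3: 2}  # uses edges 0-1, 1-2, 2-3
tree_b: Tree = {0: None, 2: 0, 3: 0, 1: 3}  # uses edges 0-2, 0-3, 1-3

data = {
    0: [1, 2, 3, 4],
    1: [10, 20, 30, 40],
    2: [100, 200, 300, 400],
    3: [1000, 2000, 3000, 4000],
}
result = multi_tree_allreduce(data, [tree_a, tree_b])
print(result[0])  # [1111, 2222, 3333, 4444]
```

Because the two trees share no edges, the two slices never compete for the same link in this toy setting. This is the intuition behind using multiple trees to exceed the bandwidth of a single link.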

Documents

Publisher URL: https://doi.org/10.1145/3558481.3591073
 

BibTeX

@inproceedings{lakhotia2023network,
  author={Kartik Lakhotia and Kelly Isham and Laura Monroe and Maciej Besta and Torsten Hoefler and Fabrizio Petrini},
  title={{In-network Allreduce with Multiple Spanning Trees on PolarFly}},
  year={2023},
  month={Jun.},
  pages={165--176},
  booktitle={Proceedings of the 35th ACM Symposium on Parallelism in Algorithms and Architectures (SPAA'23)},
  location={Orlando, FL, USA},
  publisher={Association for Computing Machinery},
  isbn={9781450395458},
  source={http://www.unixer.de/~htor/publications/},
}

