Publications of Torsten Hoefler
Copyright Notice:

The documents distributed by this server have been provided by the contributing authors as a means to ensure timely dissemination of scholarly and technical work on a noncommercial basis. Copyright and all rights therein are maintained by the authors or by other copyright holders, notwithstanding that they have offered their works here electronically. It is understood that all persons copying this information will adhere to the terms and constraints invoked by each author's copyright. These works may not be reposted without the explicit permission of the copyright holder.

K. B. Ferreira, P. Widener, S. Levy, D. Arnold, T. Hoefler:

 Understanding the Effects of Communication and Coordination on Checkpointing at Scale

(presented in New Orleans, LA, USA, Nov. 2014, Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis (SC14))

Abstract

Fault tolerance poses a major challenge for future large-scale systems. Active research into coordinated, uncoordinated, and hybrid checkpointing systems has explored how the introduction of asynchrony can address anticipated scalability issues. However, few insights into the selection and tuning of these protocols for applications at scale have emerged. In this paper, we use a simulation-based approach to show that local checkpoint activity in resilience mechanisms can significantly affect the performance of key workloads, even when less than 1% of a local node’s compute time is allocated to resilience mechanisms (a very generous assumption). Specifically, we show that even though much work on uncoordinated checkpointing has focused on optimizing message log volumes, local checkpointing activity may dominate the overheads of this technique at scale. Our study shows that local checkpoints lead to process delays that can propagate through messaging relations to other processes, causing a cascading series of delays. We demonstrate how to tune hierarchical uncoordinated checkpointing protocols designed to reduce log volumes to significantly reduce these synchronization overheads at scale. Our work provides a critical analysis and comparison of coordinated and uncoordinated checkpointing and enables users and system administrators to fine-tune the checkpointing scheme to the application and system characteristics.
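
The delay propagation described in the abstract can be illustrated with a toy model. The sketch below is not the simulator used in the paper; it is a minimal, hypothetical example that assumes a ring of processes doing a nearest-neighbor exchange every iteration, a fixed compute time per iteration, and a checkpoint cost of 1% of that compute time. The function name `simulate` and all parameter values are assumptions chosen only for illustration.

```python
def simulate(num_procs, iters, work, ckpt, interval, staggered):
    """Completion time of a toy ring-exchange workload (hypothetical model).

    Each process can start iteration k only after it and both ring
    neighbors have finished iteration k - 1. A process that checkpoints
    in an iteration takes `ckpt` extra time units in that iteration.
    """
    finish = [0] * num_procs  # finish time of the previous iteration
    for k in range(iters):
        nxt = []
        for i in range(num_procs):
            left = finish[(i - 1) % num_procs]
            right = finish[(i + 1) % num_procs]
            start = max(finish[i], left, right)
            # Coordinated: every process checkpoints in the same iteration.
            # Staggered (uncoordinated): process i is shifted by i iterations.
            offset = i if staggered else 0
            takes_ckpt = (k - offset) % interval == 0
            nxt.append(start + work + (ckpt if takes_ckpt else 0))
        finish = nxt
    return max(finish)


# Assumed toy parameters: 8 processes, 64 iterations, 100 time units of
# compute per iteration, checkpoint cost 1, one checkpoint every 8 iterations.
ideal = 64 * 100
coordinated = simulate(8, 64, 100, 1, 8, staggered=False)
uncoordinated = simulate(8, 64, 100, 1, 8, staggered=True)
print("coordinated overhead:", coordinated - ideal)
print("staggered overhead:  ", uncoordinated - ideal)
```

In this deliberately adversarial configuration every process takes the same number of checkpoints in both runs, yet the staggered schedule accumulates 64 time units of overhead against 8 for the coordinated one, because each local delay is passed on to a neighbor that then adds its own. The numbers say nothing about the paper's measured results; they only show the cascading mechanism in its simplest form.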

BibTeX

@inproceedings{uncoordinated-cr-communication,
  author={K. B. Ferreira and P. Widener and S. Levy and D. Arnold and T. Hoefler},
  title={{Understanding the Effects of Communication and Coordination on Checkpointing at Scale}},
  year={2014},
  month={Nov.},
  location={New Orleans, LA, USA},
  note={Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis (SC14)},
  source={http://www.unixer.de/~htor/publications/},
}

© Torsten Hoefler