Publications of Torsten Hoefler
K. B. Ferreira, P. Widener, S. Levy, D. Arnold, Torsten Hoefler:
Understanding the Effects of Communication and Coordination on Checkpointing at Scale
(presented in New Orleans, LA, USA, Nov. 2014, Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis (SC14))
Abstract
Fault-tolerance poses a major challenge for future
large-scale systems. Active research into coordinated, uncoordinated, and hybrid checkpointing systems has explored how the
introduction of asynchrony can address anticipated scalability
issues. However, few insights into selection and tuning of these
protocols for applications at scale have emerged. In this paper, we
use a simulation-based approach to show that local checkpoint
activity in resilience mechanisms can significantly affect the
performance of key workloads, even when less than 1% of a
local node’s compute time is allocated to resilience mechanisms
(a very generous assumption). Specifically, we show that even
though much work on uncoordinated checkpointing has focused
on optimizing message log volumes, local checkpointing activity
may dominate the overheads of this technique at scale. Our
study shows that local checkpoints lead to process delays that
can propagate through messaging relations to other processes
causing a cascading series of delays. We demonstrate how to tune
hierarchical uncoordinated checkpointing protocols designed to
reduce log volumes to significantly reduce these synchronization
overheads at scale. Our work provides a critical analysis and
comparison of coordinated and uncoordinated checkpointing and
enables users and system administrators to fine-tune the checkpointing scheme to the application and system characteristics.
BibTeX
@inproceedings{uncoordinated-cr-communication,
  author={K. B. Ferreira and P. Widener and S. Levy and D. Arnold and Torsten Hoefler},
  title={{Understanding the Effects of Communication and Coordination on Checkpointing at Scale}},
  year={2014},
  month={Nov.},
  location={New Orleans, LA, USA},
  note={Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis (SC14)},
  source={http://www.unixer.de/~htor/publications/},
}