Publications of Torsten Hoefler
Copyright Notice:

The documents distributed by this server have been provided by the contributing authors as a means to ensure timely dissemination of scholarly and technical work on a noncommercial basis. Copyright and all rights therein are maintained by the authors or by other copyright holders, notwithstanding that they have offered their works here electronically. It is understood that all persons copying this information will adhere to the terms and constraints invoked by each author's copyright. These works may not be reposted without the explicit permission of the copyright holder.

K. Kharbas, D. Kim, T. Hoefler and F. Mueller:

 Assessing HPC Failure Detectors for MPI Jobs

(In Proceedings of the 2012 20th Euromicro International Conference on Parallel, Distributed and Network-based Processing, presented in Munich, Germany, pages 81--88, IEEE Computer Society, ISBN: 978-0-7695-4633-9, Feb. 2012)

Abstract

Reliability is one of the challenges faced by exascale computing. Components are poised to fail during large-scale executions given current mean time between failure (MTBF) projections. To cope with failures, resilience methods have been proposed as explicit or transparent techniques. For the latter techniques, this paper studies the challenge of fault detection. This work contributes a study on generic fault detection capabilities at the MPI level and beyond. The objective is to assess different detectors, which ultimately may or may not be implemented within the application’s runtime layer. A first approach utilizes a periodic liveness check while a second method promotes sporadic checks upon communication activities. The contributions of this paper are two-fold: (a) We provide generic interposing of MPI applications for fault detection. (b) We experimentally compare periodic and sporadic methods for liveness checking. We show that the sporadic approach, even though it imposes lower bandwidth requirements and utilizes lower frequency checking, results in equal or worse application performance than a periodic liveness test for larger numbers of nodes. We further show that performing liveness checks in separation from MPI applications results in lower overhead than interpositioning, as demonstrated by our prototypes. Hence, we promote separate periodic fault detection as the superior approach for fault detection.
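To make the distinction between the two detector styles concrete, the following is a minimal simulation sketch, not the paper's implementation. It assumes a hypothetical `Node` that stops answering at a fixed time, a periodic detector that probes at a fixed interval, and a sporadic detector that probes only when the application happens to communicate with the node. All names (`Node`, `periodic_detect`, `sporadic_detect`) and the time values are illustrative assumptions. The sketch only shows when each style would notice a failure; it says nothing about the overhead or application performance that the paper measures.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class Node:
    """Hypothetical simulated node that stops answering at `fails_at` (None = never fails)."""
    rank: int
    fails_at: Optional[float] = None

    def is_alive(self, now: float) -> bool:
        # A node answers liveness probes strictly before its failure time.
        return self.fails_at is None or now < self.fails_at


def periodic_detect(node: Node, period: float, horizon: float) -> Optional[float]:
    """Probe the node every `period` time units up to `horizon`.

    Returns the time of the first unanswered probe, or None if no probe failed.
    """
    tick = 1
    while tick * period <= horizon:
        now = tick * period
        if not node.is_alive(now):
            return now
        tick += 1
    return None


def sporadic_detect(node: Node, comm_times: Iterable[float]) -> Optional[float]:
    """Probe the node only at the times the application communicates with it.

    Returns the time of the first unanswered probe, or None if no probe failed.
    """
    for now in sorted(comm_times):
        if not node.is_alive(now):
            return now
    return None


# Illustrative values only: the node fails at t = 4.2.
node = Node(rank=3, fails_at=4.2)
print(periodic_detect(node, period=1.0, horizon=10.0))    # 5.0
print(sporadic_detect(node, comm_times=[0.5, 2.0, 9.0]))  # 9.0
```

In this toy run the periodic detector notices the failure at the next tick, while the sporadic detector notices it only at the next communication with the failed node, which may be much later when communication is infrequent.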

BibTeX

@inproceedings{ftdetectors,
  author={K. Kharbas and D. Kim and T. Hoefler and F. Mueller},
  title={{Assessing HPC Failure Detectors for MPI Jobs}},
  year={2012},
  month={Feb.},
  pages={81--88},
  booktitle={Proceedings of the 2012 20th Euromicro International Conference on Parallel, Distributed and Network-based Processing},
  location={Munich, Germany},
  publisher={IEEE Computer Society},
  isbn={978-0-7695-4633-9},
  source={http://www.unixer.de/~htor/publications/},
}

© Torsten Hoefler