CS498: Hot Topics in High-Performance Computing: Networking and Fault Tolerance

This is the class webpage for the course CS498 Hot Topics in High-Performance Computing: Networking and Fault Tolerance. The class is taught by Prof. Franck Cappello (Fault Tolerance) and Prof. Torsten Hoefler (Networking).

Topics: Hot Topics in High Performance Parallel Computing: Networks and Fault Tolerance. Large-scale computer systems such as Petascale or upcoming Exascale machines pose significant challenges to system and software designers. In this course, we will address two very important topics in this design: HPC networking and fault tolerance. The network will soon be the most expensive and critical part of large machines, and fault tolerance is needed to ensure correct operation as the probability that some single element fails keeps increasing. This course requires basic knowledge of graph theory and system architecture. This section is open to undergraduate and graduate students for 3 or 4 credits, respectively.
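
To make the point about failure probability concrete, here is a minimal sketch of how machine size affects reliability. It assumes independent, exponentially distributed node failures and a machine that counts as failed as soon as any one node fails; the function names and the 5-year per-node MTBF are hypothetical illustration values, not figures from the course material.

```python
import math


def system_mtbf_hours(node_mtbf_hours: float, num_nodes: int) -> float:
    """Mean time between failures of the whole machine.

    Assumption: node failures are independent and exponentially
    distributed, and any single node failure is a system failure,
    so the failure rates simply add up.
    """
    return node_mtbf_hours / num_nodes


def prob_failure_within(hours: float, node_mtbf_hours: float, num_nodes: int) -> float:
    """Probability that at least one node fails within `hours`."""
    system_rate = num_nodes / node_mtbf_hours  # failures per hour
    return 1.0 - math.exp(-system_rate * hours)


if __name__ == "__main__":
    node_mtbf = 5 * 365 * 24  # hypothetical: 5 years per node, in hours
    for nodes in (1_000, 10_000, 100_000):
        mtbf = system_mtbf_hours(node_mtbf, nodes)
        p_day = prob_failure_within(24, node_mtbf, nodes)
        print(f"{nodes:>7} nodes: system MTBF {mtbf:8.2f} h, "
              f"P(failure within 24 h) = {p_day:.3f}")
```

Under these assumptions the system MTBF shrinks in proportion to the node count, which is why a long-running job on a very large machine cannot rely on completing without some form of fault tolerance.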

Class Wiki:

The full slides, all administrative details, and additional class materials are posted in the Class Wiki.

The source books for the slides of each lecture are listed in the wiki!


Networking (Prof. Hoefler)

The lecture is divided into multiple sections that do not correspond to the class numbers (some sections span multiple classes). The slides are merely for reference; all analytic models, equations, and examples will be discussed at the whiteboard so that students can follow the constructions in detail and the class stays interactive. All students are therefore encouraged to take notes.

1. Introduction to Parallel Computer Architecture (I) [Lecture 1 - (897.01 kb)]
2. Introduction to Parallel Computer Architecture (II) [Lecture 2 - (989.46 kb)]
3. A Network-centric View on HPC [Lecture 3 - (423.94 kb)]
4. HPC Networking Basics [Lecture 4 - (299.45 kb)]
5. Advanced Network Models (I) [Lecture 5 - (233.22 kb)]
6. Advanced Network Models (II) [Lecture 6 - (461.95 kb)]
7. Network Topology (I) [Lecture 7 - (631.89 kb)]
8. Network Topology (II) [Lecture 8 - (473.95 kb)]
9. Routing [Lecture 9 - (199.01 kb)]
10. Routing Examples, Flow Control, Blue Waters Topology [Lecture 10 - (2066.6 kb)]

Fault Tolerance (Prof. Cappello)

  1. Why fault tolerance in HPC? What are errors, faults, and failures in HPC systems?
  2. Fault tolerance techniques
  3. What is checkpointing and when to checkpoint?
  4. Where to checkpoint?
  5. How to make sure that the checkpointed execution will lead to correct results after restart?
  6. How to coordinate checkpointing?
  7. Uncoordinated checkpointing
  8. Message logging protocols
  9. Hybrid protocols

© Torsten Hoefler