November | 2013 | Torsten Hoefler's blog

Today I typed the last command on my long-running server (which served www.unixer.de from 2006 until yesterday):

benten ~ $ uptime
 01:31:47 up 676 days, 16:20,  5 users,  load average: 2.08, 1.42, 1.39
benten ~ $ dd if=/dev/zero of=/dev/hda &
benten ~ $ dd if=/dev/zero of=/dev/hdb &
benten ~ $ dd if=/dev/zero of=/dev/hdc &

This machine was an old decommissioned cluster node (well, the result of
combining two half-working nodes) and has served me very well since 2006
(seven years!). Today, it was shut off.

It’s nearly historic (single-core!):

benten ~ $ cat /proc/cpuinfo
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 15
model           : 1
model name      : Intel(R) Pentium(R) 4 CPU 1.50GHz
stepping        : 2
cpu MHz         : 1495.230
cache size      : 256 KB
fdiv_bug        : no
hlt_bug         : no
f00f_bug        : no
coma_bug        : no
fpu             : yes
fpu_exception   : yes
cpuid level     : 2
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm up pebs bts
bogomips        : 2995.24
clflush size    : 64
power management:

benten ~ $ free
             total       used       free     shared    buffers     cached
Mem:        775932     766372       9560          0      85720     273792
-/+ buffers/cache:     406860     369072
Swap:       975200      57904     917296

benten ~ $ fdisk -l 
Disk /dev/hda: 20.0 GB, 20020396032 bytes
Disk /dev/hdb: 500.1 GB, 500107862016 bytes
Disk /dev/hdc: 80.0 GB, 80026361856 bytes
Disk /dev/hdd: 500.1 GB, 500107862016 bytes

benten ~ $ cat /proc/mdstat
Personalities : [raid1] 
md0 : active raid1 hdb1[0] hdd1[2](F)
      488383936 blocks [2/1] [U_]

Pavan Balaji, Jim Dinan, Rajeev Thakur, and I are giving our Advanced MPI Programming tutorial at Supercomputing 2013 on Sunday, November 17th.

Are you wondering about the new MPI-3 standard, how it affects you as a scientific or HPC programmer, and what nice new features you can use to make your life easier and your application faster? Then you should not miss our tutorial.

Our abstract summarizes the main topics:

The vast majority of production parallel scientific applications today use MPI and run successfully on the largest systems in the world. For example, several MPI applications are running at full scale on the Sequoia system (on ~1.6 million cores) and achieving 12 to 14 petaflops/s of sustained performance. At the same time, the MPI standard itself is evolving (MPI-3 was released late last year) to address the needs and challenges of future extreme-scale platforms as well as applications. This tutorial will cover several advanced features of MPI, including new MPI-3 features, that can help users program modern systems effectively. Using code examples based on scenarios found in real applications, we will cover several topics including efficient ways of doing 2D and 3D stencil computation, derived datatypes, one-sided communication, hybrid (MPI + shared memory) programming, topologies and topology mapping, and neighborhood and nonblocking collectives. Attendees will leave the tutorial with an understanding of how to use these advanced features of MPI and guidelines on how they might perform on different platforms and architectures.

This tutorial is about advanced use of MPI. It will cover several advanced features that are part of
MPI-1 and MPI-2 (derived datatypes, one-sided communication, thread support, topologies and topology
mapping) as well as new features that were recently added to MPI as part of MPI-3 (substantial additions
to the one-sided communication interface, neighborhood collectives, nonblocking collectives, support for
shared-memory programming).

Implementations of MPI-2 are widely available both from vendors and open-source projects. In addition,
the latest release of the MPICH implementation of MPI supports all of MPI-3. Vendor implementations
derived from MPICH will soon support these new features. As a result, users will be able to use in practice
what they learn in this tutorial.

The tutorial will be example driven, reflecting scenarios found in real applications. We will begin with
a 2D stencil computation with a 1D decomposition to illustrate simple Isend/Irecv based communication.
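
To make the 1D decomposition concrete, here is a minimal sketch in plain C (no MPI calls, so it compiles anywhere). It is not taken from the tutorial material: the names `slab_t` and `slab_for_rank` are made up for illustration, and the even block split is just one common choice. It computes which rows of the grid a rank owns and which ranks it would exchange halo rows with.

```c
#include <assert.h>

/* Hypothetical helper type: the slab of grid rows owned by one rank. */
typedef struct {
    int first_row;  /* global index of the first owned row */
    int num_rows;   /* number of owned rows (halo rows not included) */
    int up;         /* rank owning the rows above, or -1 if there is none */
    int down;       /* rank owning the rows below, or -1 if there is none */
} slab_t;

/* Split `rows` grid rows across `nprocs` ranks as evenly as possible.
 * The first (rows % nprocs) ranks each get one extra row. */
static slab_t slab_for_rank(int rows, int nprocs, int rank)
{
    assert(rows >= 0 && nprocs > 0 && rank >= 0 && rank < nprocs);
    int base = rows / nprocs;
    int extra = rows % nprocs;
    slab_t s;
    s.num_rows = base + (rank < extra ? 1 : 0);
    s.first_row = rank * base + (rank < extra ? rank : extra);
    s.up = (rank > 0) ? rank - 1 : -1;
    s.down = (rank < nprocs - 1) ? rank + 1 : -1;
    return s;
}
```

In an MPI program, `up` and `down` would be the peers of the Isend/Irecv calls, and the -1 sentinel would typically be replaced by `MPI_PROC_NULL`, for which sends and receives are no-ops, so boundary ranks need no special casing.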

We will then use a 2D decomposition to illustrate the need for MPI derived datatypes. We will introduce
a simple performance model to demonstrate what performance can be expected and compare it with actual
performance measured on real systems. This model will be used to discuss, evaluate, and motivate the rest
of the tutorial.
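
The abstract does not say which performance model the tutorial uses. As an assumption, the sketch below uses the common latency-bandwidth model (time = latency + bytes / bandwidth) and applies it to the halo exchange of an n x n grid of doubles. The function names and the simplification that messages are sent one after another without overlap are illustrative choices, not the tutorial's.

```c
#include <assert.h>

/* Estimated time in seconds to send one message of `bytes` bytes
 * under the latency-bandwidth model. */
static double message_time(double latency_s, double bandwidth_Bps, double bytes)
{
    assert(bandwidth_Bps > 0.0);
    return latency_s + bytes / bandwidth_Bps;
}

/* 1D decomposition: an interior rank sends two full rows of n doubles. */
static double halo_time_1d(double latency_s, double bandwidth_Bps, int n)
{
    return 2.0 * message_time(latency_s, bandwidth_Bps, n * sizeof(double));
}

/* 2D decomposition on a px x py process grid: an interior rank sends
 * four edges, two of n/px doubles and two of n/py doubles. */
static double halo_time_2d(double latency_s, double bandwidth_Bps,
                           int n, int px, int py)
{
    assert(px > 0 && py > 0);
    double edge_x_bytes = ((double)n / px) * sizeof(double);
    double edge_y_bytes = ((double)n / py) * sizeof(double);
    return 2.0 * message_time(latency_s, bandwidth_Bps, edge_x_bytes)
         + 2.0 * message_time(latency_s, bandwidth_Bps, edge_y_bytes);
}
```

With made-up parameters of 1 microsecond latency and 1 GB/s bandwidth, a 1000 x 1000 grid costs about 18 microseconds per exchange in 1D and about 12 microseconds on a 4 x 4 process grid. For a tiny 8 x 8 grid, latency dominates and the two messages of the 1D version beat the four messages of the 2D version. Comparing such predictions with measurements is the kind of exercise the paragraph above describes.
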

We will use the same 2D stencil example to illustrate various ways of doing one-sided communication in
MPI and discuss the pros and cons of the different approaches as well as regular point-to-point
communication. We will then discuss a 3D stencil without getting into complicated code details.

We will use examples of distributed linked lists and distributed locks to illustrate some of the new
advanced one-sided communication features, such as the atomic read-modify-write operations.

We will discuss the support for threads and hybrid programming in MPI and provide two hybrid
versions of the stencil example: MPI+OpenMP and MPI+MPI. The latter uses the new features in MPI-3 for
shared-memory programming. We will also discuss performance and correctness guidelines for hybrid
programming.

We will introduce process topologies, topology mapping, and the new “neighborhood” collective
functions added in MPI-3. These collectives are particularly intended to support stencil computations in a scalable
manner, both in terms of memory consumption and performance.
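
As a rough illustration of what a 2D process topology provides, the plain-C sketch below computes the four neighbors of a rank in a non-periodic px x py grid, using the row-major rank ordering that MPI Cartesian topologies use. The names `neighbors2d_t` and `grid_neighbors` are hypothetical. A real program would obtain the same information from `MPI_Cart_create` and `MPI_Cart_shift` and could then exchange halos with a neighborhood collective such as `MPI_Neighbor_alltoall`.

```c
#include <assert.h>

/* Hypothetical helper type: ranks of the four stencil neighbors,
 * with -1 meaning "no neighbor" at a non-periodic boundary. */
typedef struct {
    int north, south, west, east;
} neighbors2d_t;

/* Neighbors of `rank` in a px x py grid with row-major rank ordering:
 * rank = row * py + col, where 0 <= row < px and 0 <= col < py. */
static neighbors2d_t grid_neighbors(int px, int py, int rank)
{
    assert(px > 0 && py > 0 && rank >= 0 && rank < px * py);
    int row = rank / py;
    int col = rank % py;
    neighbors2d_t nb;
    nb.north = (row > 0)      ? (row - 1) * py + col : -1;
    nb.south = (row < px - 1) ? (row + 1) * py + col : -1;
    nb.west  = (col > 0)      ? row * py + (col - 1) : -1;
    nb.east  = (col < py - 1) ? row * py + (col + 1) : -1;
    return nb;
}
```

Each rank only ever needs this fixed-size neighbor list, regardless of the total number of processes, which is one way to read the memory-scalability point above.
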

We will conclude with a discussion of other features in MPI-3 not explicitly covered in this tutorial
(interface for tools, Fortran 2008 bindings, etc.) as well as a summary of recent activities of the MPI Forum
beyond MPI-3.

Our planned agenda for the day is:

Introduction (8.30–10.00)
- Background: What is MPI
- MPI-1, MPI-2, MPI-3
- 2D stencil code with 1D decomposition: Isend/Irecv version
- 2D stencil code with 2D decomposition: Introduce derived datatypes
- Introduce simple performance modeling and measurement
One-Sided Communication (10.30–12.00)
- Basics of one-sided communication or remote memory access (RMA)
- 2D stencil code with 1D decomposition: RMA with 3 forms of synchronization
- 3D stencil: What changes and what to pay attention to
- Introduce other features of MPI-3 RMA
- Linked list or distributed lock example demonstrating new MPI-3 RMA features
Lunch (12.00–1.30)
MPI and Threads (1.30–3.00)
- What does the MPI standard specify about threads
- How does it enable hybrid programming
- Hybrid (MPI+OpenMP) version of 2D stencil code
- Hybrid (MPI+MPI) version of 2D stencil code using MPI-3 shared-memory support
- Performance and correctness guidelines for hybrid programming
Topologies, Neighborhood/Nonblocking Collectives (3.30–5.00)
- Topologies and topology mapping
- 2D stencil code with 2D decomposition using neighborhood collectives
- MPI-3 nonblocking collectives with example
- Summary of other features in MPI-3
- Summary of recent activities of the MPI Forum
- Conclusions

We’re looking forward to many interesting discussions!
