DDTBench: Micro-Applications for Communication Data Access Patterns and MPI Datatypes


DDTBench is a suite of micro-apps that captures how parallel scientific applications from many different fields access the data they send and receive between processes. MPI Derived Datatypes (DDTs) allow these access patterns to be specified in such a way that no explicit copy operation is needed, in contrast to the pack-unpack loops found in many codes. In DDTBench we compare the packing overhead incurred by such loops to that of MPI DDTs. This is done by performing a ping-pong benchmark twice: once using MPI DDTs to specify how data should be packed, and once using the pack-unpack loops that we found in the applications. The measurement loop of the benchmark is shown below:

Measurement loop of DDTBench. Measurements are taken on process 0; a global clock is not required.
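To make the contrast between a pack loop and a datatype description concrete, here is a minimal pure-Python sketch. It is not MPI code and is not taken from DDTBench: it only models the layout that a vector datatype (as created by MPI_Type_vector with a count, a block length, and a stride) describes, and checks that the layout selects the same elements as a hand-written loop. The function names and the 4 x 5 grid are assumptions chosen for illustration.

```python
def vector_offsets(count, blocklength, stride):
    """Element offsets selected by a vector layout: `count` blocks of
    `blocklength` elements, with block starts `stride` elements apart."""
    return [block * stride + i
            for block in range(count)
            for i in range(blocklength)]


def pack_with_loop(grid, nrows, ncols, col):
    """Hand-written pack loop: copy one column of a row-major grid."""
    buf = []
    for row in range(nrows):
        buf.append(grid[row * ncols + col])
    return buf


def pack_with_layout(grid, offsets, start):
    """Gather the elements named by a layout description, beginning at `start`."""
    return [grid[start + off] for off in offsets]


# Illustrative data (an assumption): a 4 x 5 row-major grid holding 0..19.
NROWS, NCOLS, COL = 4, 5, 2
grid = list(range(NROWS * NCOLS))

# One column of the grid is NROWS blocks of 1 element, NCOLS elements apart.
column_layout = vector_offsets(count=NROWS, blocklength=1, stride=NCOLS)

manual = pack_with_loop(grid, NROWS, NCOLS, COL)
described = pack_with_layout(grid, column_layout, start=COL)
print(manual, described)
```

In this model the loop and the layout description pick out the same column. With a real MPI library the description would be handed to the send and receive calls, so the application would not need the intermediate buffer that the loop fills.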

Using the time it takes to perform each operation (each colored block in the picture above), we can calculate the overhead of packing and unpacking data with both methods. We cannot measure this overhead directly when MPI DDTs are used, because data repacking is implicit. But we can calculate the time used for transferring packed data, t_net, by subtracting the time required for manual packing and unpacking from the round-trip time of the ping-pong with manual packing. The data-repacking overhead for both cases is then obtained by subtracting t_net from the ping-pong round-trip time and dividing the result by that round-trip time. We did this for some of the micro-apps in the graph shown below:

Packing costs for different test cases
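The overhead calculation described above can be sketched in a few lines of Python. The function names and the timing values are assumptions made up for illustration; they are not measurements from DDTBench. All times are assumed to be in the same unit.

```python
def transfer_time(rtt_manual, pack_unpack_time):
    """t_net: round-trip time of the manually packed ping-pong minus the
    time spent in manual packing and unpacking."""
    return rtt_manual - pack_unpack_time


def packing_overhead(rtt, t_net):
    """Fraction of a ping-pong round trip that is not pure data transfer."""
    return (rtt - t_net) / rtt


# Invented example timings (same arbitrary unit throughout).
rtt_manual = 100.0        # round trip using the pack-unpack loops
pack_unpack_time = 40.0   # measured time spent packing and unpacking
rtt_ddt = 75.0            # round trip using MPI DDTs

t_net = transfer_time(rtt_manual, pack_unpack_time)
overhead_manual = packing_overhead(rtt_manual, t_net)
overhead_ddt = packing_overhead(rtt_ddt, t_net)
print(t_net, overhead_manual, overhead_ddt)
```

Note that t_net is derived only from the manual-packing run and is then reused for the DDT run, which is what makes the implicit repacking cost of the DDT case computable.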

The graph shows that MPI DDTs can reduce the overhead associated with data packing (e.g., from 40% to 15% in the case of NAS_LU_x, where a contiguous array is needlessly copied by the original code). The large difference between the performance of Open MPI's DDT engine and that of MVAPICH shows that there is still work to be done in improving MPI DDT implementations. We hope that DDTBench can serve implementers as a guideline on which access patterns deserve special attention. The micro-apps included in DDTBench are listed in the table below.

Application Class	Testname	Access Pattern
Atmospheric Science	WRF_x_vec	struct of 2D/3D/4D face exchanges in different directions (x,y), using different (semantically equivalent) datatypes: nested vectors (_vec) and subarrays (_sa)
	WRF_y_vec
	WRF_x_sa
	WRF_y_sa
Quantum Chromodynamics	MILC_su3_zd	4D face exchange, z direction, nested vectors
Fluid Dynamics	NAS_MG_x	3D face exchange in each direction (x,y,z) with vectors (y,z) and nested vectors (x)
	NAS_MG_y
	NAS_MG_z
	NAS_LU_x	2D face exchange in x direction (contiguous) and y direction (vector)
	NAS_LU_y
Matrix Transpose	FFT	2D FFT, different vector types on send/recv side
	SPECFEM3D_mt	3D matrix transpose
Molecular Dynamics	LAMMPS_full	unstructured exchange of different particle types (full/atomic), indexed datatypes
	LAMMPS_atomic
Geophysical Science	SPECFEM3D_oc	unstructured exchange of acceleration data for different earth layers, indexed datatypes
	SPECFEM3D_cm

DDTBench can be downloaded as ddtbench-1.2.1.tar.gz (366.19 kb). It can be compiled with "make". When the resulting binary is executed, it writes an output file "timings_test". The result file has five columns. The first column is the test name, as given in the table above. The second column specifies the type of benchmark (manual packing, DDT send/recv, packing with MPI DDTs, or reference ping-pong without packing). The third column specifies how many bytes are transferred for that particular test configuration (i.e., the MPI_Type_size() of the datatype used). The fourth column identifies the step in the benchmark (cf. the measurement-loop figure above) to which the time in the fifth column corresponds. Note that the benchmark itself performs no statistical aggregation; every measured value is reported to the user. The DDTBench tarball contains an R script that can produce plots like the one shown above.
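Because the benchmark reports raw values only, any aggregation is left to the user. The following Python sketch shows one way to compute a median per test, benchmark type, size, and step. It assumes that the five columns are separated by whitespace, that the third column is an integer, and that the fifth is a number; the exact file layout and the labels used in the sample lines ("manual", "pack") are assumptions for illustration, not the strings DDTBench actually writes.

```python
import statistics
from collections import defaultdict


def parse_timings(lines):
    """Parse five-column records: testname, method, bytes, step, time.
    Lines that do not fit this shape (blank lines, headers) are skipped."""
    records = []
    for line in lines:
        fields = line.split()
        if len(fields) != 5:
            continue
        testname, method, nbytes, step, time = fields
        try:
            records.append((testname, method, int(nbytes), step, float(time)))
        except ValueError:
            continue  # e.g. a header row with non-numeric columns
    return records


def median_times(records):
    """Median time for each (testname, method, bytes, step) combination."""
    groups = defaultdict(list)
    for testname, method, nbytes, step, time in records:
        groups[(testname, method, nbytes, step)].append(time)
    return {key: statistics.median(times) for key, times in groups.items()}


# Hypothetical sample lines; real labels and values will differ.
sample = [
    "NAS_LU_x manual 4096 pack 1.5",
    "NAS_LU_x manual 4096 pack 3.5",
    "NAS_LU_x manual 4096 pack 2.5",
    "",
]
medians = median_times(parse_timings(sample))
print(medians)
```

To process a real result file, the lines could be read with `open("timings_test")` and passed to `parse_timings` in the same way, after checking that the file matches the assumed layout.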

References

EuroMPI'12	[1] Timo Schneider, Robert Gerstenberger, Torsten Hoefler:
		Micro-Applications for Communication Data Access Patterns and MPI Datatypes. In Recent Advances in the Message Passing Interface - 19th European MPI Users' Group Meeting, EuroMPI 2012, Vienna, Austria, September 23-26, 2012, Proceedings, Vol. 7490, pages 121-131, Springer, ISBN: 978-3-642-33517-4, Sep. 2012. Invited to a journal special issue on top picks from EuroMPI'12.


© Torsten Hoefler