Design and Implementation of Parallel File Aggregation Mechanism

Jun Kato and Yutaka Ishikawa

Some high-performance computing (HPC) applications sequentially access millions of files of several megabytes each, with a total size that can exceed a terabyte. Although such an application could use a single shared file instead of this huge number of individual files, the single-shared-file approach is rarely adopted because of its inherent performance bottlenecks. This paper proposes the Parallel File Aggregation (PFA) mechanism, deployed on compute nodes, which promotes the single-shared-file approach for such applications by enhancing I/O processing on parallel file systems. PFA provides APIs based on the memory-map technique to avoid copy overhead between the user address space and the file cache in the kernel address space. It aggregates small I/O requests into a single chunk whose size is close to the file system block size, so that the aggregated data can be transferred through lock-free direct I/O. PFA also provides an incremental logging feature that enables de-duplication of data. Measured with the MPI-IO Test benchmark, PFA achieves more than five times the write bandwidth and at least double the read bandwidth of the single-shared-file approach with MPI-IO. In addition, a modified version of Athena, an application that generates a huge number of files, runs 3.8 times faster with PFA than the original program.
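The abstract does not show PFA's actual API, but the copy-avoidance idea it describes can be illustrated with plain POSIX mmap(2): a mapped file is accessed in place through the kernel page cache, with no intermediate copy into a user-space buffer as read(2) would require. The sketch below is only an illustration of that general technique, not PFA's interface; the input file name is hypothetical.

    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("input.dat", O_RDONLY);   /* hypothetical input file */
        if (fd < 0)
            return 1;

        struct stat st;
        fstat(fd, &st);

        /* Map the file: the application reads the page cache in place,
         * avoiding the copy into a user buffer that read(2) performs. */
        const char *data = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
        if (data == MAP_FAILED)
            return 1;

        long sum = 0;
        for (off_t i = 0; i < st.st_size; i++)
            sum += (unsigned char)data[i];       /* touch file bytes in place */
        printf("bytes: %lld, checksum: %ld\n", (long long)st.st_size, sum);

        munmap((void *)data, st.st_size);
        close(fd);
        return 0;
    }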
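Likewise, the aggregation idea, combining many small writes into one roughly block-sized chunk that is then issued as a single direct I/O, can be sketched with standard POSIX calls. This is an assumption-laden illustration of the general technique rather than PFA's implementation: the block size, buffer names, and the pfa_like_write helper are all hypothetical.

    #define _GNU_SOURCE                /* for O_DIRECT on Linux */
    #include <fcntl.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    #define BLOCK_SIZE 4096            /* assumed file system block size */

    static char  *agg_buf;             /* block-aligned aggregation buffer */
    static size_t agg_used;

    /* Hypothetical helper: queue one small record and flush a full,
     * block-sized chunk with a single direct write when the buffer fills. */
    static void pfa_like_write(int fd, const void *data, size_t len)
    {
        if (agg_used + len > BLOCK_SIZE) {
            write(fd, agg_buf, BLOCK_SIZE);   /* one aligned, block-sized I/O */
            agg_used = 0;
        }
        memcpy(agg_buf + agg_used, data, len);
        agg_used += len;
    }

    int main(void)
    {
        /* O_DIRECT requires block-aligned buffers, offsets, and lengths. */
        if (posix_memalign((void **)&agg_buf, BLOCK_SIZE, BLOCK_SIZE) != 0)
            return 1;
        int fd = open("aggregated.dat", O_WRONLY | O_CREAT | O_DIRECT, 0644);
        if (fd < 0)
            return 1;

        for (int i = 0; i < 1000; i++) {      /* many small records */
            char record[128];
            memset(record, 'a' + i % 26, sizeof(record));
            pfa_like_write(fd, record, sizeof(record));
        }
        /* A real implementation would pad and flush the final partial chunk. */
        close(fd);
        free(agg_buf);
        return 0;
    }

Issuing only aligned, block-sized direct writes is what allows the transfer to bypass the page cache and, on a parallel file system, to avoid the lock contention that many small unaligned writes would incur.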