Design and Implementation of Parallel File Aggregation Mechanism

Jun Kato and Yutaka Ishikawa

Some high-performance computing (HPC) applications sequentially access millions of files of several megabytes each, with a total size that can exceed a terabyte. Although such an application could use a single shared file instead of this huge number of individual files, the single-shared-file approach is rarely adopted because of its inherent performance bottlenecks. This paper proposes the Parallel File Aggregation (PFA) mechanism, deployed on compute nodes, which promotes the single-shared-file approach for such applications by enhancing I/O processing on parallel file systems. PFA provides APIs based on the memory-map technique to avoid copy overhead between the user address space and the file cache in the kernel address space. It aggregates small I/O requests into a single chunk whose size is close to the file system block size, so that the aggregated data can be transferred through lock-free direct I/O. PFA also provides an incremental logging feature that enables de-duplication of data. Measured with the MPI-IO Test benchmark, PFA achieves more than five times the write bandwidth and at least double the read bandwidth of the single-shared-file approach with MPI-IO. In addition, a modified version of Athena, an application that generates a huge number of files, runs 3.8 times faster with PFA than the original program.
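The abstract does not show PFA's actual API, but the copy-avoidance idea it describes can be illustrated with plain POSIX mmap(2): a mapped file is accessed in place through the kernel page cache, with no intermediate copy into a user-space buffer as read(2) would require. The sketch below is only an illustration of that general technique, not PFA's interface; the input file name is hypothetical.

    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("input.dat", O_RDONLY);   /* hypothetical input file */
        if (fd < 0)
            return 1;

        struct stat st;
        fstat(fd, &st);

        /* Map the file: the application reads the page cache in place,
         * avoiding the copy into a user buffer that read(2) performs. */
        const char *data = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
        if (data == MAP_FAILED)
            return 1;

        long sum = 0;
        for (off_t i = 0; i < st.st_size; i++)
            sum += (unsigned char)data[i];       /* touch file bytes in place */
        printf("bytes: %lld, checksum: %ld\n", (long long)st.st_size, sum);

        munmap((void *)data, st.st_size);
        close(fd);
        return 0;
    }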
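Likewise, the aggregation idea, combining many small writes into one roughly block-sized chunk that is then issued as a single direct I/O, can be sketched with standard POSIX calls. This is an assumption-laden illustration of the general technique rather than PFA's implementation: the block size, buffer names, and the pfa_like_write helper are all hypothetical.

    #define _GNU_SOURCE                /* for O_DIRECT on Linux */
    #include <fcntl.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    #define BLOCK_SIZE 4096            /* assumed file system block size */

    static char  *agg_buf;             /* block-aligned aggregation buffer */
    static size_t agg_used;

    /* Hypothetical helper: queue one small record and flush a full,
     * block-sized chunk with a single direct write when the buffer fills. */
    static void pfa_like_write(int fd, const void *data, size_t len)
    {
        if (agg_used + len > BLOCK_SIZE) {
            write(fd, agg_buf, BLOCK_SIZE);   /* one aligned, block-sized I/O */
            agg_used = 0;
        }
        memcpy(agg_buf + agg_used, data, len);
        agg_used += len;
    }

    int main(void)
    {
        /* O_DIRECT requires block-aligned buffers, offsets, and lengths. */
        if (posix_memalign((void **)&agg_buf, BLOCK_SIZE, BLOCK_SIZE) != 0)
            return 1;
        int fd = open("aggregated.dat", O_WRONLY | O_CREAT | O_DIRECT, 0644);
        if (fd < 0)
            return 1;

        for (int i = 0; i < 1000; i++) {      /* many small records */
            char record[128];
            memset(record, 'a' + i % 26, sizeof(record));
            pfa_like_write(fd, record, sizeof(record));
        }
        /* A real implementation would pad and flush the final partial chunk. */
        close(fd);
        free(agg_buf);
        return 0;
    }

Issuing only aligned, block-sized direct writes is what allows the transfer to bypass the page cache and, on a parallel file system, to avoid the lock contention that many small unaligned writes would incur.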