EARTH on K

Overview

EARTH on K is a derivative of EARTH (Effective Aggregation Rounds with THrottling), an optimization framework for high-performance MPI-IO. EARTH on K is available as R-CCS software for achieving high-performance MPI-IO on a local file system (LFS) of the K computer. Compared with the original MPI-IO implementation on the K computer, EARTH on K has the following advanced features:

  • Striping-layout-aware data aggregation
  • Throttling of I/O requests when accessing OSTs of the LFS (Optional)
  • Stepwise data exchanges associated with the throttling of OST accesses (Optional)

What is EARTH on K?

2.1. Two-Phase I/O

EARTH on K is an enhanced two-phase I/O optimization customized for the K computer. It provides MPI-IO optimized for the Tofu interconnect and FEFS. Tofu is a 6-D torus/mesh interconnect and FEFS is an enhanced Lustre file system; both were developed by FUJITSU Limited. Fig. 1 shows a typical example of parallel I/O among four processes. The non-contiguous access pattern in each process leads to poor I/O performance in the collective write case. To improve I/O performance in such a case, two-phase I/O was proposed in ROMIO, as shown in Fig. 2. In this example, every process acts as an aggregator that performs the I/O operations: aggregators gather data in the data exchange phase and then write their assigned data to the target file system. A minimal collective-write sketch in C follows the figures.
Fig. 1: Four processes accessing a 2-D data array
Fig. 2: Two-phase I/O among four processes
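The code below is a minimal sketch of the collective-write pattern illustrated in Fig. 2, written with standard MPI-IO calls only (it is not taken from the EARTH on K distribution): four processes each own one 4x4 block of an 8x8 integer array, describe their block with a subarray file view, and write it with MPI_File_write_all, the collective call whose two-phase implementation EARTH on K enhances. The file name output.dat and the array sizes are arbitrary example values.

/* Minimal collective-write sketch: run with exactly 4 processes.
   Each process owns one 4x4 block of a global 8x8 integer array and
   writes it through a subarray file view with MPI_File_write_all,
   which is implemented with two-phase I/O in ROMIO-based MPI-IO stacks. */
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Global 8x8 array split into 4x4 blocks over a 2x2 process grid. */
    int gsizes[2] = {8, 8};
    int lsizes[2] = {4, 4};
    int starts[2] = {(rank / 2) * 4, (rank % 2) * 4};

    MPI_Datatype filetype;
    MPI_Type_create_subarray(2, gsizes, lsizes, starts,
                             MPI_ORDER_C, MPI_INT, &filetype);
    MPI_Type_commit(&filetype);

    /* Each process fills its local block with its own rank number. */
    int local[16];
    for (int i = 0; i < 16; i++) local[i] = rank;

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "output.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
    MPI_File_set_view(fh, 0, MPI_INT, filetype, "native", MPI_INFO_NULL);

    /* Collective write: aggregators gather the non-contiguous pieces in the
       data exchange phase and issue large contiguous writes in the I/O phase. */
    MPI_File_write_all(fh, local, 16, MPI_INT, MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    MPI_Type_free(&filetype);
    MPI_Finalize();
    return 0;
}

Such a program can be built with the FUJITSU cross compiler (mpifccpx) described below and launched through the job script shown in the "How to use EARTH on K" section.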

2.2. Striping layout aware aggregator layout

  • The aggregator layout is arranged to suit striped accesses to the OSTs of the LFS.
  • This scheme alleviates contention in network and OST accesses (a hedged sketch of related ROMIO striping hints follows this list).
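For context, the sketch below shows how an application can make the file's striping layout explicit through generic ROMIO hints (striping_factor, striping_unit, cb_nodes) when opening a file. These are standard ROMIO hints for Lustre-family file systems such as FEFS, not EARTH-specific settings, and the values (48 OSTs, 1 MiB stripes, 48 aggregators) are arbitrary examples; how EARTH on K itself places aggregators relative to OSTs is handled internally by the library.

/* Illustrative sketch only: generic ROMIO hints for a Lustre-family file
   system such as FEFS.  The hint names are standard ROMIO hints, not
   EARTH-specific settings, and the values are arbitrary examples. */
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Info info;
    MPI_Info_create(&info);
    MPI_Info_set(info, "striping_factor", "48");    /* stripe over 48 OSTs    */
    MPI_Info_set(info, "striping_unit", "1048576"); /* 1 MiB stripe size      */
    MPI_Info_set(info, "cb_nodes", "48");           /* one aggregator per OST */

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "striped.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, info, &fh);

    /* Each rank collectively writes its rank number at a rank-sized offset. */
    MPI_File_write_at_all(fh, (MPI_Offset)rank * sizeof(int),
                          &rank, 1, MPI_INT, MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    MPI_Info_free(&info);
    MPI_Finalize();
    return 0;
}

Matching the number of aggregators to the number of OSTs is a common heuristic for collective buffering on Lustre-like file systems; it is shown here only to illustrate the kind of striping-to-aggregator correspondence that EARTH on K arranges internally.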

2.3. EARTH

  • Supported language environment
    • GM-1.2.0-23
    • GM-1.2.0-24
    • GM-1.2.0-25 (Current default version)
  • Installed path at the K computer
    • /opt/aics/earth/earth-1.1_GM-1.2.0-25/ (accessible from compute nodes only)

< Original MPI-IO at the K computer vs. EARTH on K >


How to use EARTH on K

  • Compile a program that uses MPI-IO directly, or an application-oriented parallel I/O API (PnetCDF, HDF5) that uses MPI-IO internally
    • Use the existing FUJITSU cross compilers (mpifccpx or mpifrtpx)
  • Prepare a job script to run a compiled program
    • Three environment variables should be (re-)defined for EARTH on K (details are described below):
      • PATH
      • LD_LIBRARY_PATH
      • OPAL_PREFIX
  • Submit a job script
#!/bin/bash -x
#PJM --rsc-list "rscgrp=small"
#PJM --rsc-list "node=2x3x4:strict"
#PJM --rsc-list "elapse=HH:MM:SS"
#PJM --stg-transfiles all
#PJM --mpi "use-rankdir"
#PJM --mpi "proc=24"
#PJM --mpi "rank-map-bychip:XYZ"
#PJM --stgin-basedir xxxxxxx
#PJM --stgin "rank=* ./a.out %r:./"
#PJM -s

. /work/system/Env_base
EARTH_ROOT=/opt/aics/earth/earth-1.1_GM-${GMVERSION}
export PATH=${EARTH_ROOT}/bin:${PATH}
export LD_LIBRARY_PATH=${EARTH_ROOT}/lib64:${LD_LIBRARY_PATH}
export OPAL_PREFIX=${EARTH_ROOT}
ENV_FLAG="-x PATH=${PATH} -x LD_LIBRARY_PATH=${LD_LIBRARY_PATH} \
  -x OPAL_PREFIX=${OPAL_PREFIX}"
Setup for the three environment variables: PATH, LD_LIBRARY_PATH, and OPAL_PREFIX
$ mpiexec ${ENV_FLAG} -n 24 ./a.out

How to tune I/O operations

  • Environment variables for EARTH on K

  • Hint for Tuning I/O Throttling

    EARTH on K provides throttling and stepwise data exchanges for further performance improvements in collective writes.

    • FEFS_THROTTLING_REQ: the number of I/O requests used in throttling. This should be less than or equal to the following upper limit on I/O requests (a small worked sketch in C follows this list):
      I/O request upper limit = ppn * Nz / 2 (K computer)
      (ppn: the number of processes per node; Nz: the number of processes along the Z direction in a 3-D logical layout)
    • e.g., 2x3x32 (= 192 nodes) with 4 processes/node at the K computer:
      I/O request upper limit = 4 * 32 / 2 = 64
      so FEFS_THROTTLING_REQ should be between 0 and 64.
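As a worked sketch of the upper-limit formula above, using the example values from this page (4 processes per node, Nz = 32), the small helper below is only an illustration; the function name io_request_upper_limit is hypothetical.

/* Worked sketch of the K computer rule of thumb quoted above:
   I/O request upper limit = ppn * Nz / 2. */
#include <stdio.h>

static int io_request_upper_limit(int ppn, int nz)
{
    return ppn * nz / 2;
}

int main(void)
{
    /* Example from the text: 4 processes/node, Nz = 32 -> limit = 64. */
    int limit = io_request_upper_limit(4, 32);
    printf("FEFS_THROTTLING_REQ should be between 0 and %d\n", limit);
    return 0;
}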

Notice for users

  • EARTH on K is built with the help of FUJITSU Limited.
  • EARTH on K is provided as a binary library package only.
  • EARTH on K is currently available only on the K computer.

Publications

  • Yuichi Tsujita, Atsushi Hori, Toyohisa Kameyama, Atsuya Uno, Fumiyoshi Shoji, and Yutaka Ishikawa, “Improving Collective MPI-IO Using Topology-Aware Stepwise Data Aggregation with I/O Throttling,” In Proceedings of HPC Asia 2018, ACM, 2018 [Link]
  • Yuichi Tsujita, Atsushi Hori, and Yutaka Ishikawa, “Locality-Aware Process Mapping for High Performance Collective MPI-IO on FEFS with Tofu Interconnect,” In Proceedings of the 21st European MPI Users’ Group Meeting, ACM, 2014 [Link]

Useful links