ParallelCluster, AWS Batch, Native libraries not found

I have a large MPI application written in Fortran and am attempting to get it running on pcluster with the awsbatch scheduler. The pcluster instance has an EFS drive mounted as /tiegcm_efs where pre-built native libraries are stored. The libraries were built on the master node of the cluster, so I was expecting that underlying dependencies, particularly Open MPI, would be consistent between the master OS and the Docker containers used in the runtime environment.

I'm using this page as a model for submitting and starting my MPI job:

https://aws-parallelcluster.readthedocs.io/en/latest/tutorials/03_batch_mpi.html#running-your-first-job-using-aws-batch

I used submit_mpi.sh as a starting point and have adjusted it for my case and to add a bunch of diagnostic output. Here are some key snippets from my submit_mpi.sh:

export LD_LIBRARY_PATH="/tiegcm_efs/dependencies/v20190320/lib:/usr/lib:/usr/lib64"

echo "Libs"
ls -l /tiegcm_efs/dependencies/v20190320/lib
ls -l /usr/lib64/openmpi/lib

echo "MPI"
/usr/lib64/openmpi/bin/mpirun -V

    cd /tiegcm_efs/home/kimyx/tiegcm.exec
...
    echo "Running main..."
    /usr/lib64/openmpi/bin/mpirun --mca btl_tcp_if_include eth0 --allow-run-as-root --machinefile "${HOME}/hostfile" ./tiegcm "${TGCMDATA}"

The pcluster job is submitted like this (I've done it with and without the -e; for now I'm hardcoding the two environment variables within submit_mpi.sh):

awsbsub -c tiegcm -n 2 -p 4 -e LD_LIBRARY_PATH,TGCMDATA -cf submit_mpi.sh
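
One thing that may be worth ruling out: Open MPI's mpirun does not by default forward arbitrary locally exported variables to ranks it starts on other hosts, and its -x option exports a named variable to them. A minimal sketch, assuming this is a factor here; the helper name mpirun_export_flags is made up for illustration:

```shell
# Hypothetical helper: turn a comma-separated list of variable names
# into the matching "-x NAME" options for Open MPI's mpirun.
mpirun_export_flags() {
  flags=""
  old_ifs="${IFS}"
  IFS=','
  for name in $1; do
    # Append "-x NAME", adding a separating space after the first entry
    flags="${flags:+${flags} }-x ${name}"
  done
  IFS="${old_ifs}"
  printf '%s\n' "${flags}"
}

# Print the options so they can be checked before being added to the mpirun line.
mpirun_export_flags "LD_LIBRARY_PATH,TGCMDATA"
```

The printed options would go on the mpirun command line in submit_mpi.sh, ahead of ./tiegcm.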

The resulting output #0 and #1 both show this:

2019-03-20T16:00:24+00:00: ./tiegcm: error while loading shared libraries: libnetcdff.so.6: cannot open shared object file: No such file or directory
2019-03-20T16:00:24+00:00: ./tiegcm: error while loading shared libraries: libmpi_usempi.so.20: cannot open shared object file: No such file or directory

The libnetcdff.so.6 exists in /tiegcm_efs/dependencies/v20190320/lib but for some reason isn't being loaded. The following is from the ls command within the submit_mpi.sh script.

ls -l /tiegcm_efs/dependencies/v20190320/lib
2019-03-20T16:00:04+00:00: lrwxrwxrwx 1 1002 1005      19 Mar 20 03:28 libnetcdff.so -> libnetcdff.so.6.1.1
2019-03-20T16:00:04+00:00: lrwxrwxrwx 1 1002 1005      19 Mar 20 03:28 libnetcdff.so.6 -> libnetcdff.so.6.1.1
2019-03-20T16:00:04+00:00: -rwxr-xr-x 1 1002 1005 1448736 Mar 20 03:28 libnetcdff.so.6.1.1
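
Since the file is visible to ls, one further diagnostic is to confirm that the search path seen at run time actually contains it. A minimal sketch, assuming it is run inside the job script just before mpirun; find_lib_dirs is a hypothetical helper, not part of any tool:

```shell
# Hypothetical helper: print every directory in a colon-separated
# search path ($2) that contains the named library file ($1).
find_lib_dirs() {
  lib_name="$1"
  old_ifs="${IFS}"
  IFS=':'
  for dir in $2; do
    # -e follows symlinks, so a dangling link counts as missing
    if [ -e "${dir}/${lib_name}" ]; then
      printf '%s\n' "${dir}"
    fi
  done
  IFS="${old_ifs}"
}

# Report what the shell that launches mpirun would see.
echo "LD_LIBRARY_PATH=${LD_LIBRARY_PATH:-<unset>}"
find_lib_dirs libnetcdff.so.6 "${LD_LIBRARY_PATH:-}"
```

If no directory is printed, the variable is unset or different in that shell, which would match the loader error even though the file exists on EFS.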

However, the libmpi_usempi.so.20 is not found in the expected location /usr/lib64/openmpi/lib, even though all the systems are running Open MPI 2.1.1. The closest matching files within docker are:

ls -l /usr/lib64/openmpi/lib
2019-03-20T15:21:51+00:00: lrwxrwxrwx 1 root root     35 Mar 18 17:22 libmpi_usempi_ignore_tkr.so -> libmpi_usempi_ignore_tkr.so.20.10.0
2019-03-20T15:21:51+00:00: lrwxrwxrwx 1 root root     35 Mar 18 17:22 libmpi_usempi_ignore_tkr.so.20 -> libmpi_usempi_ignore_tkr.so.20.10.0
2019-03-20T15:21:51+00:00: -rwxr-xr-x 1 root root  23216 Aug 29  2018 libmpi_usempi_ignore_tkr.so.20.10.0
2019-03-20T15:21:51+00:00: lrwxrwxrwx 1 root root     27 Mar 18 17:22 libmpi_usempif08.so -> libmpi_usempif08.so.20.10.0
2019-03-20T15:21:51+00:00: lrwxrwxrwx 1 root root     27 Mar 18 17:22 libmpi_usempif08.so.20 -> libmpi_usempif08.so.20.10.0
2019-03-20T15:21:51+00:00: -rwxr-xr-x 1 root root 200216 Aug 29  2018 libmpi_usempif08.so.20.10.0

whereas on the master node outside of Docker the same directory has this:

lrwxrwxrwx 1 root root     24 Jan  7 12:54 libmpi_usempi.so -> libmpi_usempi.so.20.10.0
lrwxrwxrwx 1 root root     24 Jan  7 12:54 libmpi_usempi.so.20 -> libmpi_usempi.so.20.10.0
-rwxr-xr-x 1 root root   7344 Aug 29  2017 libmpi_usempi.so.20.10.0
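
To compare the two environments more directly, ldd can be run against the executable in each one; it marks any library the loader cannot resolve as "not found". A minimal sketch, assuming ldd is available in the container; unresolved_libs is a hypothetical helper and the sample input is made up so that the snippet runs anywhere:

```shell
# Hypothetical helper: read ldd output on stdin and print the names of
# the libraries the dynamic loader could not resolve.
unresolved_libs() {
  awk '/=> not found/ { print $1 }'
}

# In the job script this would be used as:
#   ldd ./tiegcm | unresolved_libs
# Made-up sample input stands in for real ldd output here.
printf '\tlibnetcdff.so.6 => not found\n\tlibc.so.6 => /lib64/libc.so.6 (0x00007f0000000000)\n' \
  | unresolved_libs
```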

I see the Jan 7 date here; I don't think I installed openmpi myself after creating the cluster.

Unlike the sample MPI program I can't compile my big application when the job starts within Docker. For one thing, gmake isn't installed within the docker container. For another, it takes a long time to build all the dependencies.

To be clear, this application runs fine (but slowly) when I run it directly on the master EC2 using the same files and LD_LIBRARY_PATH, but skipping the mpirun wrapper.

Am I missing something about how to specify library search paths within the awsbatch environment? Let me know if you need any more details.

Thanks,
Kim

Edited by: kimyx on Mar 20, 2019 10:42 AM

kimyx
asked 5 years ago

6 Answers

I did find this regarding libmpi_usempi_ignore_tkr.so.0 and libmpi_usempif08.so.0:

https://github.com/open-mpi/ompi/issues/649

particularly:

"....You must use a "recent enough" Fortran compiler to get support for these two libraries. If you're using an older Fortran compiler (e.g., gfortran <4.9), you'll be forced into a legacy implementation of the use mpi Fortran bindings, and the mpi_f08 bindings won't be built at all....."

The openmpi on the master node created by pcluster falls into the category of not recent enough. The gfortran version seems to be 4.8.5:

[kimyx@ip-10-0-0-126 tiegcm.exec]$ gfortran -v
Using built-in specs.
COLLECT_GCC=gfortran
COLLECT_LTO_WRAPPER=/usr/libexec/gcc/x86_64-amazon-linux/4.8.5/lto-wrapper
Target: x86_64-amazon-linux
Configured with: ../configure --prefix=/usr --mandir=/usr/share/man --infodir=/usr/share/info --with-bugurl=http://bugzilla.redhat.com/bugzilla --enable-bootstrap --enable-shared --enable-threads=posix --enable-checking=release --with-system-zlib --enable-__cxa_atexit --disable-libunwind-exceptions --enable-gnu-unique-object --enable-linker-build-id --with-linker-hash-style=gnu --enable-languages=c,c++,fortran,ada,lto --enable-plugin --enable-initfini-array --disable-libgcj --with-isl=/builddir/build/BUILD/gcc-4.8.5-20150702/obj-x86_64-amazon-linux/isl-install --with-cloog=/builddir/build/BUILD/gcc-4.8.5-20150702/obj-x86_64-amazon-linux/cloog-install --enable-gnu-indirect-function --with-tune=generic --with-arch_32=x86-64 --build=x86_64-amazon-linux
Thread model: posix
gcc version 4.8.5 20150623 (Red Hat 4.8.5-28) (GCC) 

and the openmpi does not have the "recent" libraries. So perhaps a solution to the MPI library problems is to upgrade gcc/gfortran/openmpi on the master node.

I don't think this explains the issue finding the netcdff library, though.
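
The gfortran threshold quoted above can be checked in a script. A minimal sketch, assuming GNU sort (for the -V option) is available; version_ge is a hypothetical helper, and the installed compiler is only a proxy for the one Open MPI was actually built with:

```shell
# Hypothetical helper: succeed when version $1 is at least version $2.
# Relies on the -V (version sort) option of GNU sort.
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# 4.9 is the threshold quoted from the Open MPI issue above.
if command -v gfortran >/dev/null 2>&1; then
  found="$(gfortran -dumpversion)"
  if version_ge "${found}" 4.9; then
    echo "gfortran ${found}: expect the newer Fortran binding libraries"
  else
    echo "gfortran ${found}: expect the legacy libmpi_usempi bindings"
  fi
fi
```

Running this on the master node and inside the container would show whether the two sit on opposite sides of that threshold.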

kimyx
answered 5 years ago

I decided to try building my dependency packages by running the build script within the cluster's compute node's docker environment. This should lead to a consistent set of packages when the code is actually run.

First step was to install all the needed build tools within the docker container. I compared my working alinux build environment to what's already installed in the compute node and added this to my build script:

# install each build tool only if it is not already present
for pkg in make autoconf automake gcc-c++ gettext libtool; do
  if yum -q list installed "${pkg}" &>/dev/null; then
    echo "${pkg} already installed"
  else
    yum -y install "${pkg}" |& tee -a "${log_file}"
  fi
done

These install fine and the build works through the first simple package (zlib). Unfortunately it fails on the next package (hdf), which is much more complex. The gcc-7.3.1 compiler actually hangs while compiling a particular C file.

H5Tconv.c:4039:21: warning: missed loop optimization, the loop counter may overflow [-Wunsafe-loop-optimizations]
                     for (i = 0; i < tsize; i += 4) {
                     ^~~
...wait about an hour
gcc: internal compiler error: Killed (program cc1)
Please submit a full bug report,
with preprocessed source if appropriate.
See <http://bugzilla.redhat.com/bugzilla> for instructions.
make[1]: *** [H5Tconv.lo] Error 1
make[1]: Leaving directory `/tiegcm_efs/dependencies/v20190320_docker/src/hdf5-1.10.4/src'

It turns out that gcc 7.3 is not currently supported by the hdf library. The latest supported gcc versions are 4.8.5 or 4.9.3. This means I cannot use the pcluster awsbatch default compute environment, especially since Red Hat probably isn't going to be quick about addressing the gcc bug in the odd non-RHEL environment where it's running.

So I need to step back and decide how to proceed. My current model must be run in a standard, supported environment; once the current model works, there will be others to come and they will have similar version requirements. Could you help me make some decisions about how best to use pcluster?

With the pcluster awsbatch back-end, is there a way to control the docker container that it uses? Specifically, can it run a container based on the standard alinux development tools? The alinux AMI runs gcc 4.8.5. I see that alinux2 has gcc 7.3.1, but the major scientific computing packages aren't ready for that.

When pcluster is used with the traditional sge scheduler, is the compute environment based on Docker or does it run on a standard machine image? If I set base_os=alinux, will my compute jobs see the standard alinux tools?

Thanks,
Kim

kimyx
answered 5 years ago

Let me try to answer your questions about ParallelCluster.

When using scheduler = awsbatch:
• at the moment the only supported base_os is alinux. This translates into:
•• master node running on standard EC2 instance with alinux (v1) AMI customized for ParallelCluster (at every pcluster release we pull the latest alinux ami and install all the tools needed by pcluster)
•• jobs are running in docker containers where the docker images are built by pcluster with the following Dockerfile: https://github.com/aws/aws-parallelcluster/blob/develop/cli/pcluster/resources/batch/docker/alinux/Dockerfile and use amazonlinux:latest as base Docker image
• unfortunately at the moment there is no support to customise the docker image that pcluster is using but the following alternatives are available:
•• the job installs the required dependencies before running (introduces a delay for each job)
•• the necessary dependencies are built on the master node and shared with the docker containers by using the shared file systems (EBS, RAID or EFS) (faster approach)

When using scheduler = slurm|torque|sge:
• no Docker containers are used in this setup. Master and compute nodes run one of the AMIs customized for pcluster, based on the chosen OS
• you can choose among the following OSs: alinux, centos7, ubuntu1604, ubuntu1404, centos6

In both cases you can customize the AMI used by pcluster by following this simple guide: https://aws-parallelcluster.readthedocs.io/en/latest/tutorials/02_ami_customization.html.
Be aware that while with traditional schedulers this AMI is used for both master and compute nodes, with awsbatch it is used only on the master node, since the jobs run in the Docker containers described above. That may still be enough if you decide to build your dependencies on the master node and then share them through the shared file system.

AWS
answered 5 years ago

After further experiments, I ended up using ParallelCluster with scheduler=sge and base_os=centos7. The centos7 OS provides the gcc, gfortran, openmpi versions needed for compatibility with some of the required libraries. The sge scheduler doesn't use Docker, and the development tools in the compute environment are the same as those available after ssh to the master node. The sge scheduler runs jobs as the user who submitted them, which solves permissions problems I was having on an EFS drive. My application seems to be running correctly and efficiently now.

In case it helps someone else, I'll summarize my configuration here. My goal is to run physical models of space weather and atmospheric processes that traditionally used HPC supercomputer clusters with MPI concurrency. These models are usually written in Fortran, C, and C++, and depend on underlying libraries (e.g., hdf, netcdf) that don't support the latest gcc versions.

Building the executables is a two-phase process. Dependencies that don't need to change often are compiled from source to shared libraries with the output going to an EFS volume that is mounted on all the relevant cluster machines. The main model code, usually under further development, is stored in a CodeCommit repo and is compiled and linked (subject to gmake rules) for each run. Jobs are then submitted using SGE qsub and OpenMPI mpirun. The model I'm testing with now is called TIEGCM and this name appears in the following.

Here's the .parallelcluster/config file, with some identifiers redacted:

[aws]
aws_region_name = myregion

[cluster tiegcm]
base_os = centos7
scheduler = sge
vpc_settings = public
key_name = myuser-key-pair-myregion
compute_instance_type = c5.large
initial_queue_size = 0
max_queue_size = 16
maintain_initial_size = true
scaling_settings = fast_down
# installs efs tools and mounts
post_install = s3://mybucket/pcluster/setup_tiegcm.sh
s3_read_resource = arn:aws:s3:::mybucket/*

[vpc public]
vpc_id = vpc-xxx
master_subnet_id = subnet-xxx
# for mounting to EFS
additional_sg = sg-xxx

[scaling fast_down]
scaledown_idletime = 3

[global]
update_check = true
sanity_check = true
cluster_template = tiegcm

[aliases]
ssh = ssh {CFN_USER}@{MASTER_IP} {ARGS}

The post_install script sets up some needed libraries, users, and the EFS mount:

#!/bin/bash

# install amazon-efs-utils if necessary
if ! yum -q list installed amazon-efs-utils &>/dev/null; then
  cd /scratch
  git clone https://github.com/aws/efs-utils
  cd efs-utils
  sudo make rpm
  sudo yum -y install ./build/amazon-efs-utils*rpm
fi

# install libcurl-devel if necessary; needed for building tiegcm
yum -q list installed libcurl-devel &>/dev/null && echo "libcurl-devel already installed" || yum -y install libcurl-devel

# mount to shared EFS drive for data input/output
cd /
mkdir -p tiegcm_efs
mount -t efs fs-xxx:/ tiegcm_efs
chmod go+rw /tiegcm_efs

# add a group for all tiegcm linux users
sudo groupadd -g 1005 tiegcm
sudo useradd -u 1002 -d /tiegcm_efs/home/myuser myuser
# make tiegcm the primary group for myuser
sudo usermod -g tiegcm myuser
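
As written, the mount step would be attempted again if the script were ever re-run on a node where the file system is already mounted. A minimal sketch of a guard, assuming the mountpoint utility is available on the node; is_mounted and mount_efs_once are hypothetical helpers, and fs-xxx stays the redacted placeholder from the script above:

```shell
# Hypothetical helper: succeed when the directory is already a mount point.
is_mounted() {
  mountpoint -q "$1"
}

# Hypothetical wrapper around the mount step from the script above.
# fs-xxx is the redacted file system id and stays a placeholder here.
mount_efs_once() {
  target="$1"
  mkdir -p "${target}"
  if is_mounted "${target}"; then
    echo "${target} already mounted"
  else
    mount -t efs fs-xxx:/ "${target}"
  fi
}

# In post_install this would be called as:
#   mount_efs_once /tiegcm_efs
```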

The cluster is created from one small EC2 I call tiegcm-manage, which has pcluster installed and configured. Only cluster admins have access to this EC2.

Jobs are submitted from the master EC2 of the cluster. Cluster users ssh in as themselves ("myuser") and submit jobs from there. Their home directories reside on the shared EFS drive so they are retained even if the cluster is rebuilt. They run a bash script that does the make process and then submits the job. Here are the commands to submit the job.

# submit_mpi.sh, generated by the make process
nprocs=16
export LD_LIBRARY_PATH="/tiegcm_efs/dependencies/v20190323_centos7/lib:/usr/lib:/usr/lib64"
...
/usr/lib64/openmpi/bin/mpirun -np ${nprocs} --mca btl_tcp_if_include eth0 ./tiegcm "${tgcminput}"

# main build script, submit job to SGE
qsub -pe mpi ${nprocs} -o ~/tiegcm_job_out.txt -e ~/tiegcm_job_err.txt -cwd submit_mpi.sh

After the job is submitted, it waits in the SGE queue until sufficient compute EC2s are started. The cluster is configured so that every compute node will terminate after 3 minutes of idle time, to avoid unnecessary charges. It takes about 5 minutes for a new compute fleet to spin up and then the job takes off with near 100% CPU usage on all the compute nodes.

The SGE command line tools have an overwhelming set of options, but so far I've needed only these:

# see which jobs are waiting or running
qstat -f 

# see which EC2s are assigned to jobs from myuser
qhost -u myuser

# kill a job that isn't acting right
qdel jobnum

I usually tail the out and err logs. There's some buffering, but they keep up to date pretty well.
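
To avoid copying the job number by hand for qstat or qdel, it can be taken from the message qsub prints on submission. A minimal sketch, assuming the usual SGE message format for a non-array job; parse_job_id is a hypothetical helper and the sample line is made up:

```shell
# Hypothetical helper: pull the numeric job id out of the message that
# SGE's qsub prints on success, for example:
#   Your job 42 ("submit_mpi.sh") has been submitted
parse_job_id() {
  sed -n 's/^Your job \([0-9][0-9]*\) .*/\1/p'
}

# On the master node this would be used roughly as:
#   job_id="$(qsub -pe mpi "${nprocs}" -cwd submit_mpi.sh | parse_job_id)"
#   qstat -j "${job_id}"
# Made-up sample input keeps the sketch runnable without SGE installed.
echo 'Your job 42 ("submit_mpi.sh") has been submitted' | parse_job_id
```

SGE's qsub also has a -terse option that should print only the job id, which would make the parsing unnecessary where it is supported.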

kimyx
answered 5 years ago

Glad to hear that you found a working solution that fits your needs and thanks for sharing your approach!

In the upcoming version of ParallelCluster we are going to release a series of enhancements for the Slurm scheduler that will further increase the robustness of the scaling process. Just mentioning it in case you want to try it out in place of SGE!

Edited by: francescodm-aws on Mar 25, 2019 1:40 AM

AWS
answered 5 years ago

Thanks for the info about slurm, I'll check it out.

kimyx
answered 5 years ago
