Unable to restore from AMI

I have been setting up a cluster similar to what is described here: https://aws.amazon.com/blogs/compute/running-ansys-fluent-on-amazon-ec2-c5n-with-elastic-fabric-adapter-efa/.

On my first attempt, I installed LibreOffice and touched a file so the image would differ from the base. I then generated an AMI of the environment and restored it successfully by setting custom_ami in the cluster section of the config file. The resulting environment had the expected differences.
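For readers following along, here is a minimal sketch of that config change in Python. It assumes the ParallelCluster v2 INI-style config and a section named "cluster default"; the AMI ID and the helper name set_custom_ami are placeholders of mine, not anything from the actual setup.

```python
import configparser
import io


def set_custom_ami(config_text, ami_id, section="cluster default"):
    """Return config_text with custom_ami set in the given cluster section.

    The section name and the AMI ID are illustrative assumptions.
    """
    # Interpolation is disabled so that '%' characters in values are kept as-is.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(config_text)
    if not parser.has_section(section):
        raise KeyError(f"no [{section}] section in config")
    parser.set(section, "custom_ami", ami_id)
    out = io.StringIO()
    parser.write(out)
    return out.getvalue()


# Hypothetical starting config; a real one has more settings.
EXAMPLE = """\
[cluster default]
base_os = alinux2
scheduler = slurm
"""

updated = set_custom_ami(EXAMPLE, "ami-0123456789abcdef0")
print(updated)
```

Note that configparser drops comments when it rewrites a file, so for a hand-maintained config it may be simpler to edit the custom_ami line directly.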

On my second attempt, I modified the setup significantly, including:

  • Added an ebs section and specified encryption
  • Installed some third-party software packages, including conda and MATLAB
  • Specified a cron job to initialize NICE DCV sessions on reboot

When I create an AMI of this environment and tell pcluster to use it, the resulting instance fails one of its status checks and is inaccessible.

Are there certain features that are incompatible with an AMI restore? Are there additional steps that are needed to make an AMI that is compatible with pcluster?

asked 3 years ago, 298 views
10 Answers
Accepted Answer

Hi David,

I understand that you're trying to create a new cluster by using, as custom_ami, an AMI taken from the head node of another running cluster. Please correct me if I have misunderstood anything here.

If that's the case, I have to confirm that it cannot work. You can't reuse an AMI taken from a running cluster instance as the base AMI for a new cluster.

The reason is that, while an instance bootstraps, ParallelCluster executes different configuration actions depending on whether it is the head node or a compute node of the cluster.
By using the head node AMI, you're trying to create a new cluster from an image on which those configuration steps have already been executed, so the AMI cannot work properly and cannot be used for compute nodes.

If you're using the "Modify an AWS ParallelCluster AMI" approach you should always start from the AMIs in this list: https://github.com/aws/aws-parallelcluster/blob/v2.10.0/amis.txt
See more details here: https://docs.aws.amazon.com/parallelcluster/latest/ug/tutorials_02_ami_customization.html#modify-an-aws-parallelcluster-ami

Let us know if it helps.

Edited by: enrico-aws on Dec 2, 2020 8:20 AM

AWS
answered 3 years ago

Hi @ProlucidDavid

"the resulting instance fails one of its status checks and is inaccessible"

From what you are describing, it seems that one of the modifications made to the instance is causing a problem at the operating-system level.
It's hard to say what the root cause is. It could be, for example:

  • Failure to boot the operating system
  • Failure to mount volumes correctly
  • File system issues
  • Incompatible drivers
  • Kernel panic

I'll link you some guides that could help in troubleshooting the root cause

Please note that you should be able to start an instance from the custom AMI you have created even outside of a cluster creation process.
So, before using the AMI to create a cluster, please make sure you are able to start an instance from it.
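As a rough sketch of that check (not an official procedure), the function below launches a single instance from the AMI, waits for the EC2 status checks, and then terminates it. It assumes a boto3 EC2 client is passed in and that the account has a default VPC, since no subnet is specified; the function name and the default instance type are my own choices.

```python
def ami_passes_status_checks(ec2, ami_id, instance_type="c5.large"):
    """Launch one instance from ami_id and report whether it passes status checks.

    `ec2` is expected to be a boto3 EC2 client, e.g. boto3.client("ec2").
    The instance is always terminated afterwards.
    """
    resp = ec2.run_instances(
        ImageId=ami_id,
        InstanceType=instance_type,
        MinCount=1,
        MaxCount=1,
    )
    instance_id = resp["Instances"][0]["InstanceId"]
    try:
        # Blocks until both system and instance status checks report "ok".
        ec2.get_waiter("instance_status_ok").wait(InstanceIds=[instance_id])
        return True
    except Exception:
        # Broad on purpose for a sketch: a waiter timeout lands here.
        return False
    finally:
        ec2.terminate_instances(InstanceIds=[instance_id])


# Usage (requires AWS credentials; the AMI ID is a placeholder):
#   import boto3
#   ok = ami_passes_status_checks(boto3.client("ec2"), "ami-0123456789abcdef0")
```

If this returns False, the problem is in the image itself rather than in the ParallelCluster bootstrap, which narrows down where to look.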

That said, it's not clear which process you followed to create the custom AMI, but it should be OK since you made it through on your first attempt. Please double-check that you have followed the official doc: https://docs.aws.amazon.com/parallelcluster/latest/ug/tutorials_02_ami_customization.html

AWS
answered 3 years ago

I never found the source of the failed AWS launch. However, by following the instructions at your last link (https://docs.aws.amazon.com/parallelcluster/latest/ug/tutorials_02_ami_customization.html), I was able to create a new template AMI that ParallelCluster can launch successfully.

One issue to be aware of: make sure the template AMI you start from matches your version of ParallelCluster. I spent several hours troubleshooting because of this mismatch.

answered 3 years ago

Hi David,

We introduced validation for the createami process as part of the 2.10 release: https://github.com/aws/aws-parallelcluster/releases

createami:

  • Add validation step to fail when using a base AMI created by a different version of ParallelCluster.
  • Add validation step for AMI creation process to fail if the selected OS and the base AMI OS are not consistent.

Could you confirm which version you are using by running the pcluster version command?

Thanks

AWS
answered 3 years ago

pcluster v2.9.1 was unable to launch an AMI that was based off of version v2.10.0. There wasn't good visibility on the cause of this behaviour (pcluster didn't report an incompatible image). It looked like the munge service was failing.

pcluster v2.10.0 successfully launched an AMI that was based off of version v2.10.0.
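For anyone who wants to catch this mismatch before launching, here is a small sketch. It assumes the official base AMI names embed the version in the form aws-parallelcluster-<version>-..., which is a naming-pattern assumption on my part; a custom AMI can be named anything, so the check only makes sense against the name of the base AMI you started from. The function names and the sample AMI name are made up.

```python
import re

# Assumed naming pattern for official base AMIs: aws-parallelcluster-<x.y.z>-...
_NAME_RE = re.compile(r"^aws-parallelcluster-(\d+\.\d+\.\d+)-")


def ami_pcluster_version(ami_name):
    """Return the x.y.z version embedded in the AMI name, or None if absent."""
    match = _NAME_RE.match(ami_name)
    return match.group(1) if match else None


def versions_match(cli_version, ami_name):
    """True only if the AMI name carries the same version as the pcluster CLI."""
    return ami_pcluster_version(ami_name) == cli_version.strip().lstrip("v")


# Hypothetical base AMI name used purely for illustration.
base_ami_name = "aws-parallelcluster-2.10.0-amzn2-hvm-x86_64-202011172133"
print(versions_match("2.9.1", base_ami_name))
print(versions_match("v2.10.0", base_ami_name))
```

The CLI version string would come from running pcluster version locally.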

answered 3 years ago

I am re-opening this forum post.

Previously I mentioned that, by following these steps, I could make a custom AMI with installed software that pcluster could launch: https://docs.aws.amazon.com/parallelcluster/latest/ug/tutorials_02_ami_customization.html

However, if I take an AMI of the resulting pcluster instance, I am unable to restore it. The steps I take are:

  1. Use pcluster to launch HPC using custom AMI
  2. Take an AMI of the master node using the web console
  3. Use the same configuration file that originally launched the HPC, but change the custom_ami tag from my successful base image to the AMI created in step 2.
answered 3 years ago

Regarding the failure you saw with version 2.9.1: unfortunately it's expected, because the validation steps that check the AMI version were only added in the 2.10.0 release.

AWS
answered 3 years ago

Hi Enrico,

You understand correctly. I really appreciate your followups since we can now confirm that this is expected behaviour.

For context (in case there are any devs reviewing this thread) we had two goals by investigating this:

  1. Develop a strategy to take images so we could restore the system if it ever failed.
  2. It would be easier to develop our base AMI by installing software on a running HPC. In this way, we can verify that the software works as expected when compute nodes are instantiated. We can still test on an HPC, but then we need to reinstall on a base AMI and take an image, so it's an extra setup step.

But that's ok for now. The information on this thread has given us a path forward. I appreciate all your help

answered 3 years ago

Regarding the version check, that's a great feature that's been added! Thanks for confirming the behaviour

Edited by: ProlucidDavid on Dec 3, 2020 7:44 AM

answered 3 years ago

For completeness I just want to mention another way to customize your cluster, using custom bootstrap actions: https://docs.aws.amazon.com/parallelcluster/latest/ug/pre_post_install.html

The great thing is that this approach removes the extra step of creating a custom AMI. However, it is a good choice only if the customization steps don't take too much time, or if they are only required on the head node.
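As a hedged illustration of the head-node-only case, a post-install script can look at the node type before doing any work. The sketch below assumes that ParallelCluster v2 writes a cfn_node_type entry (MasterServer or ComputeFleet) to /etc/parallelcluster/cfnconfig; the parsing helpers are my own and only handle simple KEY=value lines.

```python
def node_type(cfnconfig_text):
    """Return the cfn_node_type value from KEY=value text, or None if missing."""
    for line in cfnconfig_text.splitlines():
        key, sep, value = line.partition("=")
        if sep and key.strip() == "cfn_node_type":
            # Values may or may not be quoted, so strip both quote styles.
            return value.strip().strip("\"'")
    return None


def is_head_node(cfnconfig_text):
    """True when the config text marks this instance as the head node (assumed value)."""
    return node_type(cfnconfig_text) == "MasterServer"


# Sample text standing in for the contents of /etc/parallelcluster/cfnconfig.
SAMPLE = "stack_name=example\ncfn_node_type=MasterServer\n"
if is_head_node(SAMPLE):
    print("running head-node-only customization")
```

In a real post-install script the text would be read from the file on the instance, and the slow installation steps would sit inside the head-node branch.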

Anyway thanks for the feedback.

AWS
answered 3 years ago
