EMR: Terminated with errors Bootstrap failure

0

I am just trying to build a cluster with hadoop, hive and spark preinstalled on emr-5.23.0. Till today morning it was working fine and suddenly it started failing with the below error and the cluster getting terminated: Terminated with errors Bootstrap failure.
I tried to capture the errors before the cluster dies off:


master.log


[ec2-user@ip-00-000-000-00 ~]$ cat /mnt/var/log/bootstrap-actions/master.log
2019-04-16 19:08:42,443 INFO i-078302fec7edd117b: new instance started
2019-04-16 19:08:42,451 ERROR i-078302fec7edd117b: failed to start. bootstrap action 1 failed with non-zero exit code.
2019-04-16 19:09:25,805 INFO i-02fc7b0d9e6815b8d: new instance started
2019-04-16 19:09:27,826 INFO i-02fc7b0d9e6815b8d: all bootstrap actions complete and instance ready
2019-04-16 19:10:00,812 INFO i-0261bda9ec7074d13: new instance started
2019-04-16 19:10:03,482 INFO i-0261bda9ec7074d13: all bootstrap actions complete and instance ready


cloud-init


Apr 16 19:04:23 cloud-init[4713]: util.py[WARNING]: Running module yum-configure (<module 'cloudinit.config.cc_yum_configure' from '/usr/lib/python2.7/dist-packages/cloudinit/config/cc_yum_configure.pyc'>) failed
Apr 16 19:04:23 cloud-init[4713]: util.py[DEBUG]: Running module yum-configure (<module 'cloudinit.config.cc_yum_configure' from '/usr/lib/python2.7/dist-packages/cloudinit/config/cc_yum_configure.pyc'>) failed
Traceback (most recent call last):
File "/usr/lib/python2.7/dist-packages/cloudinit/stages.py", line 660, in _run_modules
cc.run(run_name, mod.handle, func_args, freq=freq)
File "/usr/lib/python2.7/dist-packages/cloudinit/cloud.py", line 63, in run
return self._runners.run(name, functor, args, freq, clear_on_fail)
File "/usr/lib/python2.7/dist-packages/cloudinit/helpers.py", line 197, in run
results = functor(args)
File "/usr/lib/python2.7/dist-packages/cloudinit/config/cc_yum_configure.py", line 53, in handle
raise errors[-1]
KeyError: 'regional'
Apr 16 19:04:23 cloud-init[4713]: stages.py[DEBUG]: Running module yum-add-repo (<module 'cloudinit.config.cc_yum_add_repo' from '/usr/lib/python2.7/dist-packages/cloudinit/config/cc_yum_add_repo.pyc'>) with frequency once-per-instance
Apr 16 19:04:23 cloud-init[4713]: util.py[DEBUG]: Writing to /var/lib/cloud/instances/i-078302fec7edd117b/sem/config_yum_add_repo - wb: [644] 20 bytes
Apr 16 19:04:23 cloud-init[4713]: helpers.py[DEBUG]: Running config-yum-add-repo using lock (<FileLock using file '/var/lib/cloud/instances/i-078302fec7edd117b/sem/config_yum_add_repo'>)
Apr 16 19:04:23 cloud-init[4713]: cc_yum_add_repo.py[INFO]: Skipping repo emr-platform, file /etc/yum.repos.d/emr-platform.repo already exists!
Apr 16 19:04:23 cloud-init[4713]: cc_yum_add_repo.py[INFO]: Skipping repo emr-apps, file /etc/yum.repos.d/emr-apps.repo already exists!
Apr 16 19:04:23 cloud-init[4713]: stages.py[DEBUG]: Running module package-update-upgrade-install (<module 'cloudinit.config.cc_package_update_upgrade_install' from '/usr/lib/python2.7/dist-packages/cloudinit/config/cc_package_update_upgrade_install.pyc'>) with frequency once-per-instance
Apr 16 19:04:23 cloud-init[4713]: util.py[DEBUG]: Writing to /var/lib/cloud/instances/i-078302fec7edd117b/sem/config_package_update_upgrade_install - wb: [644] 20 bytes
Apr 16 19:04:23 cloud-init[4713]: helpers.py[DEBUG]: Running config-package-update-upgrade-install using lock (<FileLock using file '/var/lib/cloud/instances/i-078302fec7edd117b/sem/config_package_update_upgrade_install'>)
Apr 16 19:04:23 cloud-init[4713]: amazon.py[DEBUG]: Upgrade level: security
Apr 16 19:04:23 cloud-init[4713]: util.py[DEBUG]: Running command ['yum', '-t', '-y', '--exclude=kernel', '--exclude=nvidia
', '--exclude=cudatoolkit', '--security', '--sec-severity=critical', '--sec-severity=important', 'upgrade'] with allowed return codes [0] (shell=False, capture=False)
Apr 16 19:05:24 cloud-init[4713]: util.py[WARNING]: Package upgrade failed
Apr 16 19:05:24 cloud-init[4713]: util.py[DEBUG]: Package upgrade failed
Traceback (most recent call last):
File "/usr/lib/python2.7/dist-packages/cloudinit/config/cc_package_update_upgrade_install.py", line 81, in handle
cloud.distro.upgrade_packages(upgrade_level, upgrade_exclude)
File "/usr/lib/python2.7/dist-packages/cloudinit/distros/amazon.py", line 50, in upgrade_packages
return self.package_command('upgrade', args=args)
File "/usr/lib/python2.7/dist-packages/cloudinit/distros/rhel.py", line 211, in package_command
util.subp(cmd, capture=False, pipe_cat=True, close_stdin=True)
File "/usr/lib/python2.7/dist-packages/cloudinit/util.py", line 1626, in subp
cmd=args)
ProcessExecutionError: Unexpected error while running command.
Command: ['yum', '-t', '-y', '--exclude=kernel', '--exclude=nvidia*', '--exclude=cudatoolkit', '--security', '--sec-severity=critical', '--sec-severity=important', 'upgrade']
Exit code: 1
Reason: -
Stdout: ''
Stderr: ''
Apr 16 19:05:24 cloud-init[4713]: cc_package_update_upgrade_install.py[WARNING]: 1 failed with exceptions, re-raising the last one
Apr 16 19:05:24 cloud-init[4713]: util.py[WARNING]: Running module package-update-upgrade-install (<module 'cloudinit.config.cc_package_update_upgrade_install' from '/usr/lib/python2.7/dist-packages/cloudinit/config/cc_package_update_upgrade_install.pyc'>) failed
Apr 16 19:05:24 cloud-init[4713]: util.py[DEBUG]: Running module package-update-upgrade-install (<module 'cloudinit.config.cc_package_update_upgrade_install' from '/usr/lib/python2.7/dist-packages/cloudinit/config/cc_package_update_upgrade_install.pyc'>) failed
Traceback (most recent call last):
File "/usr/lib/python2.7/dist-packages/cloudinit/stages.py", line 660, in _run_modules
cc.run(run_name, mod.handle, func_args, freq=freq)
File "/usr/lib/python2.7/dist-packages/cloudinit/cloud.py", line 63, in run
return self._runners.run(name, functor, args, freq, clear_on_fail)
File "/usr/lib/python2.7/dist-packages/cloudinit/helpers.py", line 197, in run
results = functor(args)
File "/usr/lib/python2.7/dist-packages/cloudinit/config/cc_package_update_upgrade_install.py", line 111, in handle
raise errors[-1]
ProcessExecutionError: Unexpected error while running command.
Command: ['yum', '-t', '-y', '--exclude=kernel', '--exclude=nvidia
', '--exclude=cudatoolkit', '--security', '--sec-severity=critical', '--sec-severity=important', 'upgrade']
Exit code: 1
Reason: -
Stdout: ''
Stderr: ''
Apr 16 19:05:24 cloud-init[4713]: stages.py[DEBUG]: Running module timezone (<module 'cloudinit.config.cc_timezone' from '/usr/lib/python2.7/dist-packages/cloudinit/config/cc_timezone.pyc'>) with frequency once-per-instance
Apr 16 19:05:24 cloud-init[4713]: util.py[DEBUG]: Writing to /var/lib/cloud/instances/i-078302fec7edd117b/sem/config_timezone - wb: [644] 20 bytes
Apr 16 19:05:24 cloud-init[4713]: helpers.py[DEBUG]: Running config-timezone using lock (<FileLock using file '/var/lib/cloud/instances/i-078302fec7edd117b/sem/config_timezone'>)
Apr 16 19:05:24 cloud-init[4713]: cc_timezone.py[DEBUG]: Skipping module named timezone, no 'timezone' specified
Apr 16 19:05:24 cloud-init[4713]: stages.py[DEBUG]: Running module puppet (<module 'cloudinit.config.cc_puppet' from '/usr/lib/python2.7/dist-packages/cloudinit/config/cc_puppet.pyc'>) with frequency once-per-instance
Apr 16 19:05:24 cloud-init[4713]: util.py[DEBUG]: Writing to /var/lib/cloud/instances/i-078302fec7edd117b/sem/config_puppet - wb: [644] 20 bytes
Apr 16 19:05:24 cloud-init[4713]: helpers.py[DEBUG]: Running config-puppet using lock (<FileLock using file '/var/lib/cloud/instances/i-078302fec7edd117b/sem/config_puppet'>)
Apr 16 19:05:24 cloud-init[4713]: cc_puppet.py[DEBUG]: Skipping module named puppet, no 'puppet' configuration found
Apr 16 19:05:24 cloud-init[4713]: stages.py[DEBUG]: Running module disable-ec2-metadata (<module 'cloudinit.config.cc_disable_ec2_metadata' from '/usr/lib/python2.7/dist-packages/cloudinit/config/cc_disable_ec2_metadata.pyc'>) with frequency always
Apr 16 19:05:24 cloud-init[4713]: helpers.py[DEBUG]: Running config-disable-ec2-metadata using lock (<cloudinit.helpers.DummyLock object at 0x7fb9c0cad810>)
Apr 16 19:05:24 cloud-init[4713]: cc_disable_ec2_metadata.py[DEBUG]: Skipping module named disable-ec2-metadata, disabling the ec2 route not enabled
Apr 16 19:05:24 cloud-init[4713]: stages.py[DEBUG]: Running module runcmd (<module 'cloudinit.config.cc_runcmd' from '/usr/lib/python2.7/dist-packages/cloudinit/config/cc_runcmd.pyc'>) with frequency once-per-instance
Apr 16 19:05:24 cloud-init[4713]: util.py[DEBUG]: Writing to /var/lib/cloud/instances/i-078302fec7edd117b/sem/config_runcmd - wb: [644] 20 bytes
Apr 16 19:05:24 cloud-init[4713]: helpers.py[DEBUG]: Running config-runcmd using lock (<FileLock using file '/var/lib/cloud/instances/i-078302fec7edd117b/sem/config_runcmd'>)
Apr 16 19:05:24 cloud-init[4713]: cc_runcmd.py[DEBUG]: Skipping module named runcmd, no 'runcmd' key in configuration
Apr 16 19:05:24 cloud-init[4713]: cloud-init[DEBUG]: Ran 10 modules with 2 failures
Apr 16 19:05:24 cloud-init[4713]: util.py[DEBUG]: Reading from /proc/uptime (quiet=False)
Apr 16 19:05:24 cloud-init[4713]: util.py[DEBUG]: Read 15 bytes from /proc/uptime
Apr 16 19:05:24 cloud-init[4713]: util.py[DEBUG]: cloud-init mode 'modules' took 140.880 seconds (140.88)
Apr 16 19:08:10 cloud-init[5562]: util.py[DEBUG]: Cloud-init v. 0.7.6 running 'modules:final' at Tue, 16 Apr 2019 19:08:10 +0000. Up 483.41 seconds.
Apr 16 19:08:10 cloud-init[5562]: stages.py[DEBUG]: Using distro class <class 'cloudinit.distros.amazon.Distro'>
Apr 16 19:08:10 cloud-init[5562]: stages.py[DEBUG]: Running module scripts-per-once (<module 'cloudinit.config.cc_scripts_per_once' from '/usr/lib/python2.7/dist-packages/cloudinit/config/cc_scripts_per_once.pyc'>) with frequency once
Apr 16 19:08:10 cloud-init[5562]: util.py[DEBUG]: Writing to /var/lib/cloud/sem/config_scripts_per_once.once - wb: [644] 20 bytes
Apr 16 19:08:10 cloud-init[5562]: helpers.py[DEBUG]: Running config-scripts-per-once using lock (<FileLock using file '/var/lib/cloud/sem/config_scripts_per_once.once'>)
Apr 16 19:08:10 cloud-init[5562]: stages.py[DEBUG]: Running module scripts-per-boot (<module 'cloudinit.config.cc_scripts_per_boot' from '/usr/lib/python2.7/dist-packages/cloudinit/config/cc_scripts_per_boot.pyc'>) with frequency always
Apr 16 19:08:10 cloud-init[5562]: helpers.py[DEBUG]: Running config-scripts-per-boot using lock (<cloudinit.helpers.DummyLock object at 0x7f35a40e08d0>)
Apr 16 19:08:10 cloud-init[5562]: stages.py[DEBUG]: Running module scripts-per-instance (<module 'cloudinit.config.cc_scripts_per_instance' from '/usr/lib/python2.7/dist-packages/cloudinit/config/cc_scripts_per_instance.pyc'>) with frequency once-per-instance
Apr 16 19:08:10 cloud-init[5562]: util.py[DEBUG]: Writing to /var/lib/cloud/instances/i-078302fec7edd117b/sem/config_scripts_per_instance - wb: [644] 20 bytes
Apr 16 19:08:10 cloud-init[5562]: helpers.py[DEBUG]: Running config-scripts-per-instance using lock (<FileLock using file '/var/lib/cloud/instances/i-078302fec7edd117b/sem/config_scripts_per_instance'>)
Apr 16 19:08:10 cloud-init[5562]: stages.py[DEBUG]: Running module scripts-user (<module 'cloudinit.config.cc_scripts_user' from '/usr/lib/python2.7/dist-packages/cloudinit/config/cc_scripts_user.pyc'>) with frequency once-per-instance
Apr 16 19:08:10 cloud-init[5562]: util.py[DEBUG]: Writing to /var/lib/cloud/instances/i-078302fec7edd117b/sem/config_scripts_user - wb: [644] 20 bytes
Apr 16 19:08:10 cloud-init[5562]: helpers.py[DEBUG]: Running config-scripts-user using lock (<FileLock using file '/var/lib/cloud/instances/i-078302fec7edd117b/sem/config_scripts_user'>)
Apr 16 19:08:10 cloud-init[5562]: stages.py[DEBUG]: Running module ssh-authkey-fingerprints (<module 'cloudinit.config.cc_ssh_authkey_fingerprints' from '/usr/lib/python2.7/dist-packages/cloudinit/config/cc_ssh_authkey_fingerprints.pyc'>) with frequency once-per-instance
Apr 16 19:08:10 cloud-init[5562]: util.py[DEBUG]: Writing to /var/lib/cloud/instances/i-078302fec7edd117b/sem/config_ssh_authkey_fingerprints - wb: [644] 20 bytes
Apr 16 19:08:10 cloud-init[5562]: helpers.py[DEBUG]: Running config-ssh-authkey-fingerprints using lock (<FileLock using file '/var/lib/cloud/instances/i-078302fec7edd117b/sem/config_ssh_authkey_fingerprints'>)
Apr 16 19:08:10 cloud-init[5562]: util.py[DEBUG]: Reading from /etc/ssh/sshd_config (quiet=False)
Apr 16 19:08:10 cloud-init[5562]: util.py[DEBUG]: Read 3935 bytes from /etc/ssh/sshd_config
Apr 16 19:08:10 cloud-init[5562]: util.py[DEBUG]: Reading from /home/ec2-user/.ssh/authorized_keys (quiet=False)
Apr 16 19:08:10 cloud-init[5562]: util.py[DEBUG]: Read 389 bytes from /home/ec2-user/.ssh/authorized_keys
Apr 16 19:08:10 cloud-init[5562]: stages.py[DEBUG]: Running module keys-to-console (<module 'cloudinit.config.cc_keys_to_console' from '/usr/lib/python2.7/dist-packages/cloudinit/config/cc_keys_to_console.pyc'>) with frequency once-per-instance
Apr 16 19:08:10 cloud-init[5562]: util.py[DEBUG]: Writing to /var/lib/cloud/instances/i-078302fec7edd117b/sem/config_keys_to_console - wb: [644] 20 bytes
Apr 16 19:08:10 cloud-init[5562]: helpers.py[DEBUG]: Running config-keys-to-console using lock (<FileLock using file '/var/lib/cloud/instances/i-078302fec7edd117b/sem/config_keys_to_console'>)
Apr 16 19:08:10 cloud-init[5562]: util.py[DEBUG]: Running command ['/usr/libexec/cloud-init/write-ssh-key-fingerprints', '', 'ssh-dss'] with allowed return codes [0] (shell=False, capture=True)
Apr 16 19:08:10 cloud-init[5562]: stages.py[DEBUG]: Running module phone-home (<module 'cloudinit.config.cc_phone_home' from '/usr/lib/python2.7/dist-packages/cloudinit/config/cc_phone_home.pyc'>) with frequency once-per-instance
Apr 16 19:08:10 cloud-init[5562]: util.py[DEBUG]: Writing to /var/lib/cloud/instances/i-078302fec7edd117b/sem/config_phone_home - wb: [644] 20 bytes
Apr 16 19:08:10 cloud-init[5562]: helpers.py[DEBUG]: Running config-phone-home using lock (<FileLock using file '/var/lib/cloud/instances/i-078302fec7edd117b/sem/config_phone_home'>)
Apr 16 19:08:10 cloud-init[5562]: cc_phone_home.py[DEBUG]: Skipping module named phone-home, no 'phone_home' configuration found
Apr 16 19:08:10 cloud-init[5562]: stages.py[DEBUG]: Running module final-message (<module 'cloudinit.config.cc_final_message' from '/usr/lib/python2.7/dist-packages/cloudinit/config/cc_final_message.pyc'>) with frequency always
Apr 16 19:08:10 cloud-init[5562]: helpers.py[DEBUG]: Running config-final-message using lock (<cloudinit.helpers.DummyLock object at 0x7f35a3d848d0>)
Apr 16 19:08:10 cloud-init[5562]: util.py[DEBUG]: Reading from /proc/uptime (quiet=False)
Apr 16 19:08:10 cloud-init[5562]: util.py[DEBUG]: Read 15 bytes from /proc/uptime
Apr 16 19:08:10 cloud-init[5562]: util.py[DEBUG]: Cloud-init v. 0.7.6 finished at Tue, 16 Apr 2019 19:08:10 +0000. Datasource DataSourceEc2. Up 483.59 seconds
Apr 16 19:08:10 cloud-init[5562]: util.py[DEBUG]: Writing to /var/lib/cloud/instance/boot-finished - wb: [644] 52 bytes
Apr 16 19:08:10 cloud-init[5562]: stages.py[DEBUG]: Running module power-state-change (<module 'cloudinit.config.cc_power_state_change' from '/usr/lib/python2.7/dist-packages/cloudinit/config/cc_power_state_change.pyc'>) with frequency once-per-instance
Apr 16 19:08:10 cloud-init[5562]: util.py[DEBUG]: Writing to /var/lib/cloud/instances/i-078302fec7edd117b/sem/config_power_state_change - wb: [644] 20 bytes
Apr 16 19:08:10 cloud-init[5562]: helpers.py[DEBUG]: Running config-power-state-change using lock (<FileLock using file '/var/lib/cloud/instances/i-078302fec7edd117b/sem/config_power_state_change'>)
Apr 16 19:08:10 cloud-init[5562]: cc_power_state_change.py[DEBUG]: no power_state provided. doing nothing
Apr 16 19:08:10 cloud-init[5562]: cloud-init[DEBUG]: Ran 9 modules with 0 failures
Apr 16 19:08:10 cloud-init[5562]: util.py[DEBUG]: Creating symbolic link from '/run/cloud-init/result.json' => '../../var/lib/cloud/data/result.json'
Apr 16 19:08:10 cloud-init[5562]: util.py[DEBUG]: Reading from /proc/uptime (quiet=False)
Apr 16 19:08:10 cloud-init[5562]: util.py[DEBUG]: Read 15 bytes from /proc/uptime
Apr 16 19:08:10 cloud-init[5562]: util.py[DEBUG]: cloud-init mode 'modules' took 0.182 seconds (0.18)


There is no additional bootstarp action added by me.

Edited by: BurnswickNYC on Apr 16, 2019 1:53 PM

asked 5 years ago3949 views
3 Answers
0

Hi BurnswickNYC,

Wow it really sad that no one on the EMR support or service teams have responded to your question at all yet.

What I find most helpful in such cluster spin-up situations is to specify path to an S3 bucket key prefix in the log-uri when you spin-up a cluster. Then EMR will write all the logs to that location before it terminates.

After the cluster fails to spin-up you can then look over all the logs at your leisure to find what went wrong.

I hopes this help you find some details for your cluster spin-up problem.
-Kurt

klarson
answered 5 years ago
0

Thanks for the reply, I shall try this out.

answered 5 years ago
0

Hey BurnswickNYC,

Looking at your failed cluster spin up again on EMR 5.23.0 it looks eerily similar to an EMR bug that I just work thru with EMR support in a support case. You may be hitting the same thing.

There a simple log search that can pattern match it.

Once you have your logs being written to S3 by the EMR logpusher, that even happens on failed cluster spinup, here the log search search you can do to see if you hit the same problem.

Either in the EMR console from the logs link or in S3 directly navigate to the log for the failed cluster that's in the path like:

s3://<bucket>/<prefix>/<cluster_id>/node/<node_id>/provision-node/reports/0/<some_uuid>/<node_private_DNS_name>/<timestamp>.yaml.gz

In that log file search for the string "failed: true". A pattern match to the problem I encountered will have 2 matches. The first occurrence of that string is in Puppet step "Exec[add-to-cluster-node-labels]: !ruby/object:Puppet::Resource::Status". You can see the Puppet step several log lines above the match. The second occurrence is in the very next Puppet step ""Exec[yarn rmadmin -refreshQueues]": !ruby/object:Puppet::Resource::Status". Again several log lines above the second match.

If you find this pattern you have the same EMR bug. There is no fix yet, not even in EMR 5.24.0, but there is a workaround.

To work around the bug you need to run a configuration classification for YARN config file /etc/hadoop/conf/yarn-site.xml that temporarily moves the YARN node labels for clusters on this EMR version from HDFS to node local disk in your "aws emr create-cluster ..." CLI command. You can pick anywhere on local disk, but here's what I did on recommendation of EMR support:

{"Classification":"yarn-site","Properties":{"yarn.node-labels.fs-store.root-dir":"file:///tmp/hadoop-yarn/nodelabels"}}

I hope this helps you (or others) out a bit,
-Kurt

klarson
answered 5 years ago

You are not logged in. Log in to post an answer.

A good answer clearly answers the question and provides constructive feedback and encourages professional growth in the question asker.

Guidelines for Answering Questions