Pedro, Shahad, thank you both very much for your detailed answers!
You both recommended looking into OS/Console logs to find hints towards the cause of the status check failure. However, the only accessible log I am aware of is the EC2 system log (retrieved from the console under EC2 Instance > Monitor and troubleshoot > Get system log). I compared the logs between a successful launch and a failed launch, and found that they began to differ towards the end. I think the message of interest would have to be this line: INFO: task swapper/0:1 blocked for more than 120 seconds.
I tried looking it up but I couldn't find much of anything related...
I will leave the relevant part of the system log here just in case:
(Failed instance)
[ 2.677190] EXT4-fs (nvme0n1p1): INFO: recovery required on readonly filesystem
[ 2.683335] EXT4-fs (nvme0n1p1): write access will be enabled during recovery
[ 243.644045] INFO: task swapper/0:1 blocked for more than 120 seconds.
[ 243.647869] Not tainted 5.19.0-1025-aws #26~22.04.1-Ubuntu
[ 243.651563] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 243.657350] task:swapper/0 state:D stack: 0 pid: 1 ppid: 0 flags:0x00004000
[ 243.663321] Call Trace:
[ 243.665900] <TASK>
[ 243.668302] __schedule+0x254/0x5a0
[ 243.671135] schedule+0x5d/0x100
[ 243.673935] io_schedule+0x46/0x80
[ 243.676736] blk_mq_get_tag+0x117/0x300
[ 243.679674] ? destroy_sched_domains_rcu+0x40/0x40
[ 243.683027] __blk_mq_alloc_requests+0xc4/0x1e0
[ 243.686320] blk_mq_get_new_requests+0xcc/0x190
[ 243.689612] blk_mq_submit_bio+0x1eb/0x450
[ 243.692614] __submit_bio+0xf6/0x190
[ 243.695496] submit_bio_noacct_nocheck+0xc2/0x120
[ 243.698744] submit_bio_noacct+0x209/0x560
[ 243.701745] submit_bio+0x40/0xf0
[ 243.704570] submit_bh_wbc+0x134/0x170
[ 243.707478] ll_rw_block+0xbc/0xd0
[ 243.710302] do_readahead.isra.0+0x126/0x1e0
[ 243.713836] jread+0xeb/0x100
[ 243.716574] do_one_pass+0xbb/0xb90
[ 243.719457] ? crypto_create_tfm_node+0x9a/0x120
[ 243.722640] ? crc_43+0x1e/0x1e
[ 243.725371] jbd2_journal_recover+0x8d/0x150
[ 243.728505] jbd2_journal_load+0x130/0x1f0
[ 243.731565] ext4_load_journal+0x271/0x5d0
[ 243.734597] __ext4_fill_super+0x2aa1/0x2e10
[ 243.737635] ? pointer+0x36f/0x500
[ 243.740451] ext4_fill_super+0xd3/0x280
[ 243.743433] ? ext4_fill_super+0xd3/0x280
[ 243.746410] get_tree_bdev+0x189/0x280
[ 243.749329] ? __ext4_fill_super+0x2e10/0x2e10
[ 243.752433] ext4_get_tree+0x15/0x20
[ 243.755273] vfs_get_tree+0x2a/0xd0
[ 243.758074] do_new_mount+0x184/0x2e0
[ 243.760943] path_mount+0x1f3/0x890
[ 243.763754] ? putname+0x5f/0x80
[ 243.766487] init_mount+0x5e/0x9f
[ 243.769237] do_mount_root+0x8d/0x124
[ 243.772127] mount_block_root+0xd8/0x1ea
[ 243.775065] mount_root+0x62/0x6e
[ 243.777816] prepare_namespace+0x13f/0x19e
[ 243.780823] kernel_init_freeable+0x120/0x139
[ 243.783902] ? rest_init+0xe0/0xe0
[ 243.786690] kernel_init+0x1b/0x170
[ 243.789479] ? rest_init+0xe0/0xe0
[ 243.792242] ret_from_fork+0x22/0x30
[ 243.795076] </TASK>
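For reference, this same system log can also be pulled without the console, via the CLI (the instance ID below is a placeholder; `--latest` requests the most recent output rather than the cached post-boot snapshot, and is only supported on Nitro-based instance types):

```shell
# Fetch the most recent serial console output for the failed instance.
# i-0123456789abcdef0 is a placeholder; substitute the real instance ID.
aws ec2 get-console-output \
  --instance-id i-0123456789abcdef0 \
  --latest \
  --output text
```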
Are there any other places specifically that I can look for logs? I can't connect to the machine because it is "unreachable" and I don't think there are any relevant logs on CloudWatch...
As for scripts, I was applying a userdata script within my LaunchConfiguration, but I tried removing it to see if anything would change. Unfortunately, it still seems to fail with high frequency. Other than that, I can't think of any other startup scripts. I do run scripts in the AppSpec section of CodeDeploy, but the first one begins in the AfterInstall phase, which it never reaches during a status check failure.
The custom AMI I'm using is based on Ubuntu 22.04 and also ran userdata to set itself up. Essentially all it does is change the timezone to Asia/Tokyo and install the following:
- docker
- docker-compose
- awscli
- python3.10-venv
- CodeDeploy Agent
- Amazon CloudWatch Agent (for metrics on Memory Usage, and Disk Usage too, though for some reason I never got Disk Usage to show up)
Thank you for confirming/explaining my questions 2 and 3 as well. I think I have a better understanding of how it works now, thanks!
That said, the LCH CodeDeploy creates has a 10-minute timeout, so if you see even the terminated instances sitting around for an hour, that implies there's a second lifecycle hook on the ASG doing something else, because the CodeDeploy Agent (running inside the instance) is the one in charge of sending record-lifecycle-action-heartbeat API calls to extend the hook for up to 1 hour for pending deployments. And if the instance is terminated, the agent couldn't be sending these calls.
I attempted to confirm this... and oddly enough my instance is still shown as attached to my ASG despite being terminated for over 10 minutes (currently at 40 minutes). Does this suggest my CodeDeploy Agent is still alive on the terminated instance? I'm not sure how important this detail is to solving my problem, though.
Once again, thanks for all the advice. I would appreciate any guidance as to next steps I can take (other logs to check? The meaning behind the blocked swapper task?). I suppose I could run some tests with a CodeDeploy-agent-only Ubuntu AMI to see if I still get status check errors...
Hello!
I understand that you are experiencing some issues with your Blue/Green Deployments using CodeDeploy and an EC2 AutoScaling Group. Let's address your questions one by one:
- The cause of the new instances failing the Status Check on launch can vary. It could be due to network connectivity issues, security group misconfigurations, or problems with the underlying infrastructure. To investigate further, you can check the CloudWatch logs and the EC2 instance console output for any error messages or clues about the failure.
- AutoScaling Groups are designed to replace unhealthy instances, but they may not always do so immediately after a failed status check. In some cases, the AutoScaling Group may wait for a grace period before replacing the instance. Additionally, if the instance is failing the status check consistently, it may not trigger an automatic replacement. You can adjust the health check settings in your AutoScaling Group to ensure that instances failing the status check are replaced promptly.
- Manually terminating an instance in an AutoScaling Group does not automatically trigger the creation of a new instance. This behavior is expected. When you terminate an instance, it is removed from the AutoScaling Group, and you would need to manually detach it before a new instance is created to maintain the desired capacity.
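For example, the health check type and grace period can be adjusted from the CLI; the ASG name and the 300-second value below are placeholders you would adapt:

```shell
# Use EC2 status checks for health and shorten the grace period so
# failing instances are replaced sooner. "my-asg" is a placeholder.
aws autoscaling update-auto-scaling-group \
  --auto-scaling-group-name my-asg \
  --health-check-type EC2 \
  --health-check-grace-period 300
```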
Based on the information you provided, your AutoScaling Group settings seem appropriate. However, it's worth reviewing your security group configurations and checking for any network-related issues that could be causing the status check failures.
I hope this helps
What could be the cause of the new instances failing the Status Check on launch?
- As Pedro mentioned, you'll want to look into OS and Console logs. You'll also want to examine any userdata or other startup scripts you're running and see if any might be entering into a race condition with each other, leading to a bricked instance.
- As you mentioned, instance status checks are usually a reachability issue and mean something isn't working right with the OS. It's not abnormal to see them fail briefly while the OS is booting, but for a solid hour is definitely weird.
- If the System status check is passing, then it's not likely an issue with the underlying hardware.
- Security Groups are applied to traffic entering/leaving the instance and won't affect status checks.
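If it helps, you can watch both checks from the CLI to confirm whether it's the System check (hardware-level) or the Instance check (OS-level) that's failing; the instance ID is a placeholder:

```shell
# Show both status checks, including instances not yet in the "running" state.
aws ec2 describe-instance-status \
  --instance-ids i-0123456789abcdef0 \
  --include-all-instances \
  --query 'InstanceStatuses[].{system:SystemStatus.Status,instance:InstanceStatus.Status}'
```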
It was my understanding that when an Instance is unhealthy, AutoScaling Groups would automatically delete and recreate a new Instance for me, but this does not happen when an Instance fails the Status Check. Why is that?
- When CodeDeploy is configured on the ASG, it creates a Launching LifeCycle Hook (LCH) to trigger the deployment and prevent the ASG from registering the instance to any configured ELBs until the deployment is done. During a LCH, ASGs don't perform healthchecks on the instance, since it's expected the instance might be rebooted or otherwise be unhealthy during bootstrapping.
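You can see the hook CodeDeploy created, and spot any additional hooks on the group, with something like this (ASG name is a placeholder):

```shell
# List all lifecycle hooks on the ASG. The hook CodeDeploy manages
# typically has "CodeDeploy-managed-automatic-launch" in its name.
aws autoscaling describe-lifecycle-hooks \
  --auto-scaling-group-name my-asg
```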
Rebooting the Instance seems to do nothing, and manually terminating the Instance does not trigger AutoScaling Group to create a new instance. The ASG shows it is still attached even after termination and I have to manually detach it before it creates a new Instance. Is this expected behavior?
- For the rebooting, this also makes it sound to me like some sort of bootstrapping that's running during startup is causing an issue, especially if you have multiple things happening concurrently (userdata scripts + CodeDeploy, multiple deployment groups, etc.) that could be racing each other as described above.
- Terminating the instance being ignored by the ASG is the same as above. AutoScaling detects terminations through Healthchecks, and since the LCH is running, the ASG isn't doing any healthchecks on that instance.
- That said, the LCH CodeDeploy creates has a 10-minute timeout, so if you see even the terminated instances sitting around for an hour, that implies there's a second lifecycle hook on the ASG doing something else, because the CodeDeploy Agent (running inside the instance) is the one in charge of sending record-lifecycle-action-heartbeat API calls to extend the hook for up to 1 hour for pending deployments. And if the instance is terminated, the agent couldn't be sending these calls.
- For these instances, you can send complete-lifecycle-action from the CLI with the ABANDON result to have the ASG replace them. This isn't a solution, just something you can do to help with troubleshooting
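A sketch of that call; the hook name, ASG name, and instance ID below are placeholders (you can get the real hook name from describe-lifecycle-hooks):

```shell
# Abandon the stuck lifecycle action so the ASG gives up on this
# instance and launches a replacement. All identifiers are placeholders.
aws autoscaling complete-lifecycle-action \
  --lifecycle-hook-name my-launch-hook \
  --auto-scaling-group-name my-asg \
  --lifecycle-action-result ABANDON \
  --instance-id i-0123456789abcdef0
```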
This seems key:
The ASG didn't terminate the instance if it still shows in its list. Are these Spot instances? If so, the Spot service might be reclaiming them. If not, check CloudTrail for TerminateInstances calls around the time of a failure to see where the terminate request came from.
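Something like this would surface which principal issued the terminate call; the time window below is a placeholder you'd set around the failure:

```shell
# Look up TerminateInstances events in CloudTrail around the failure
# window and show when they happened and who made the call.
aws cloudtrail lookup-events \
  --lookup-attributes AttributeKey=EventName,AttributeValue=TerminateInstances \
  --start-time 2024-01-01T00:00:00Z \
  --end-time 2024-01-01T06:00:00Z \
  --query 'Events[].{time:EventTime,user:Username}'
```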
Clarifying the 2nd LCH part: There would have to be a Terminating LCH on the ASG keeping the instance in terminating:wait, since after the CodeDeploy LCH timed out without a heartbeat, the launch would be failed.
As a side note: You should look into upgrading to launch templates, since launch configs aren't receiving any new updates (and most features launched in the past 5 years require a launch template): https://aws.amazon.com/blogs/compute/amazon-ec2-auto-scaling-will-no-longer-add-support-for-new-ec2-features-to-launch-configurations/
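A minimal launch template can be created from the CLI; the template name, AMI ID, and instance type below are placeholders, and you'd carry over your userdata and instance profile from the existing launch configuration:

```shell
# Create a basic launch template. All identifiers are placeholders.
aws ec2 create-launch-template \
  --launch-template-name my-template \
  --launch-template-data '{"ImageId":"ami-0123456789abcdef0","InstanceType":"t3.micro"}'
```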