CodeDeploy EC2 AutoScaling Group Instances failing Status Check

0

Hello. This is my first post, so apologies in advance for any mistakes in etiquette.

I am using CodeDeploy to run Blue/Green deployments with an EC2 AutoScaling group as part of a CodePipeline execution. The general flow is that CodeDeploy clones my existing AutoScaling group into a new one, and once the instances in the new group are healthy, it installs my application onto them and reroutes traffic to them when I choose to do so.

The problem is that, on occasion, the new EC2 instances launched by the cloned AutoScaling group will fail one of the EC2 Status Checks (namely the Instance Status check: "Instance reachability check failed"). When this happens, the entire deployment hangs until the 60-minute time limit causes CodeDeploy to fail.

This originally happened maybe once every 10 times, so I would just get annoyed, delete the AutoScaling group, and redo the CodeDeploy deployment instead of waiting for the timeout. But recently I've seen it happen as many as 5 times in a row with no clear cause (it succeeded on the 6th). So I have several questions:

  1. What could be the cause of the new instances failing the Status Check on launch?
  2. It was my understanding that when an Instance is unhealthy, the AutoScaling Group would automatically terminate it and create a new Instance for me, but this does not happen when an Instance fails the Status Check. Why is that?
  3. Rebooting the Instance seems to do nothing, and manually terminating the Instance does not trigger the AutoScaling Group to create a new one. The ASG shows the Instance as still attached even after termination, and I have to manually detach it before a new Instance is created. Is this expected behavior?

AutoScaling Group settings:

  • Minimum Capacity 1
  • Maximum Capacity 2
  • Desired Capacity 1
  • Instance type: t3.large
  • AMI: Custom AMI

Sorry for the long post. I would appreciate any advice or insight. Thank you.

asked 9 months ago · 861 views
3 Answers
0

Pedro, Shahad, thank you both very much for your detailed answers!

You both recommended looking into OS/Console logs to find hints towards the cause of the status check failure. However, the only accessible log I am aware of is the EC2 system log (retrieved from the console under EC2 Instance > Monitor and troubleshoot > Get system log). I compared the logs between a successful launch and a failed launch and found that they began to differ towards the end. I think the message of interest has to be this line: "INFO: task swapper/0:1 blocked for more than 120 seconds." I tried looking it up but couldn't find much that seemed related...
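(For reference, this is roughly how I've been pulling the same log without the web console. It's just a minimal boto3 sketch on my side; the instance ID is a placeholder, and since the output may come back base64-encoded depending on the SDK version, I decode defensively.)

import base64
import boto3

ec2 = boto3.client("ec2")

# Placeholder ID: replace with the failed instance's ID
# Latest=True returns the most recent output on Nitro-based types like t3
resp = ec2.get_console_output(InstanceId="i-0123456789abcdef0", Latest=True)
raw = resp.get("Output", "")

# The API documents the output as base64-encoded; decode it if possible,
# otherwise fall back to the raw string.
try:
    text = base64.b64decode(raw, validate=True).decode("utf-8", errors="replace")
except ValueError:
    text = raw

print(text)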

I will leave the relevant part of the system log here just in case:

(Failed instance)

[    2.677190] EXT4-fs (nvme0n1p1): INFO: recovery required on readonly filesystem
[    2.683335] EXT4-fs (nvme0n1p1): write access will be enabled during recovery
[  243.644045] INFO: task swapper/0:1 blocked for more than 120 seconds.
[  243.647869]       Not tainted 5.19.0-1025-aws #26~22.04.1-Ubuntu
[  243.651563] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  243.657350] task:swapper/0       state:D stack:    0 pid:    1 ppid:     0 flags:0x00004000
[  243.663321] Call Trace:
[  243.665900]  <TASK>
[  243.668302]  __schedule+0x254/0x5a0
[  243.671135]  schedule+0x5d/0x100
[  243.673935]  io_schedule+0x46/0x80
[  243.676736]  blk_mq_get_tag+0x117/0x300
[  243.679674]  ? destroy_sched_domains_rcu+0x40/0x40
[  243.683027]  __blk_mq_alloc_requests+0xc4/0x1e0
[  243.686320]  blk_mq_get_new_requests+0xcc/0x190
[  243.689612]  blk_mq_submit_bio+0x1eb/0x450
[  243.692614]  __submit_bio+0xf6/0x190
[  243.695496]  submit_bio_noacct_nocheck+0xc2/0x120
[  243.698744]  submit_bio_noacct+0x209/0x560
[  243.701745]  submit_bio+0x40/0xf0
[  243.704570]  submit_bh_wbc+0x134/0x170
[  243.707478]  ll_rw_block+0xbc/0xd0
[  243.710302]  do_readahead.isra.0+0x126/0x1e0
[  243.713836]  jread+0xeb/0x100
[  243.716574]  do_one_pass+0xbb/0xb90
[  243.719457]  ? crypto_create_tfm_node+0x9a/0x120
[  243.722640]  ? crc_43+0x1e/0x1e
[  243.725371]  jbd2_journal_recover+0x8d/0x150
[  243.728505]  jbd2_journal_load+0x130/0x1f0
[  243.731565]  ext4_load_journal+0x271/0x5d0
[  243.734597]  __ext4_fill_super+0x2aa1/0x2e10
[  243.737635]  ? pointer+0x36f/0x500
[  243.740451]  ext4_fill_super+0xd3/0x280
[  243.743433]  ? ext4_fill_super+0xd3/0x280
[  243.746410]  get_tree_bdev+0x189/0x280
[  243.749329]  ? __ext4_fill_super+0x2e10/0x2e10
[  243.752433]  ext4_get_tree+0x15/0x20
[  243.755273]  vfs_get_tree+0x2a/0xd0
[  243.758074]  do_new_mount+0x184/0x2e0
[  243.760943]  path_mount+0x1f3/0x890
[  243.763754]  ? putname+0x5f/0x80
[  243.766487]  init_mount+0x5e/0x9f
[  243.769237]  do_mount_root+0x8d/0x124
[  243.772127]  mount_block_root+0xd8/0x1ea
[  243.775065]  mount_root+0x62/0x6e
[  243.777816]  prepare_namespace+0x13f/0x19e
[  243.780823]  kernel_init_freeable+0x120/0x139
[  243.783902]  ? rest_init+0xe0/0xe0
[  243.786690]  kernel_init+0x1b/0x170
[  243.789479]  ? rest_init+0xe0/0xe0
[  243.792242]  ret_from_fork+0x22/0x30
[  243.795076]  </TASK>

Are there any other places specifically that I can look for logs? I can't connect to the machine because it is "unreachable" and I don't think there are any relevant logs on CloudWatch...

As for scripts, I was applying a userdata script within my LaunchConfiguration, but I tried removing it to see if anything would change. Unfortunately, it still seems to fail frequently. Other than that, I can't think of any other startup scripts I have. I do run scripts via the AppSpec hooks in CodeDeploy, but the first one runs in the AfterInstall phase, which is never reached during a status check failure.

The custom AMI I'm using is based on Ubuntu 22.04 and was also set up via userdata. Essentially all it does is change the timezone to Asia/Tokyo and install the following:

  • docker
  • docker-compose
  • awscli
  • python3.10-venv
  • CodeDeploy Agent
  • Amazon CloudWatch Agent (for memory usage metrics, and disk usage too, though for some reason I never got disk usage to show up)

Thank you for confirming/explaining my questions 2 and 3 as well. I think I have a better understanding of how it works now, thanks!

Regarding this part of one of the answers:

    That said, the LCH CodeDeploy creates has a 10 minute timeout, so if you see even the terminated instances sitting around for an hour, that implies there's a second lifecycle hook on the ASG doing something else; because the CodeDeploy Agent (running inside the instance) is the one in charge of sending record-lifecycle-action-heartbeat API calls to extend the hook for up to 1 hour for pending deployments. And if the instance is terminated, the agent couldn't be sending these calls.

I attempted to confirm this... and oddly enough my instance is still shown as attached to my ASG despite having been terminated for over 10 minutes (currently at 40 minutes). Does this suggest my CodeDeploy Agent is somehow still alive on the terminated instance? I'm not sure how important this information is to solving my problem, though.
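For the record, this is roughly how I checked it (a small boto3 sketch; the ASG name is a placeholder):

import boto3

autoscaling = boto3.client("autoscaling")

# Placeholder name: the green ASG that CodeDeploy cloned
resp = autoscaling.describe_auto_scaling_groups(
    AutoScalingGroupNames=["my-codedeploy-green-asg"]
)

for group in resp["AutoScalingGroups"]:
    for instance in group["Instances"]:
        # LifecycleState stays at e.g. Pending:Wait while a lifecycle hook holds the instance
        print(instance["InstanceId"], instance["LifecycleState"], instance["HealthStatus"])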

Once again, thanks for all the advice. I would appreciate any guidance on next steps I can take (other logs to check? the meaning behind the blocked task swapper?). I suppose I could run some tests with a CodeDeploy-agent-only Ubuntu AMI to see whether I still get status check errors...

answered 9 months ago
  • This seems key:

    my instance is still shown as attached to my ASG despite being terminated for over 10 minutes

    The ASG didn't terminate the instance if it still shows in its list. Are these Spot Instances? If so, the Spot service might be reclaiming them. If not, check CloudTrail for TerminateInstances calls around the time of a failure to see where the terminate request came from (a sketch of such a lookup follows these comments).

    Clarifying the 2nd LCH part: There would have to be a Terminating LCH on the ASG keeping the instance in terminating:wait, since after the CodeDeploy LCH timed out without a heartbeat, the launch would be failed.

  • As a side note: You should look into upgrading to launch templates, since launch configs aren't receiving any new updates (and most features launched in the past 5 years require a launch template): https://aws.amazon.com/blogs/compute/amazon-ec2-auto-scaling-will-no-longer-add-support-for-new-ec2-features-to-launch-configurations/
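A minimal sketch of that CloudTrail lookup with boto3 (the 24-hour window is just an example; adjust it to cover the failure):

from datetime import datetime, timedelta, timezone

import boto3

cloudtrail = boto3.client("cloudtrail")

# Look for TerminateInstances calls in the last 24 hours
end = datetime.now(timezone.utc)
start = end - timedelta(hours=24)

resp = cloudtrail.lookup_events(
    LookupAttributes=[{"AttributeKey": "EventName", "AttributeValue": "TerminateInstances"}],
    StartTime=start,
    EndTime=end,
)

for event in resp["Events"]:
    # The identity shows whether the call came from a person, AutoScaling, the Spot service, etc.
    print(event["EventTime"], event.get("Username"), event["EventName"])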

0

Hello!

I understand that you are experiencing some issues with your Blue Green Deployments using CodeDeploy and an EC2 AutoScaling Group. Let's address your questions one by one:

  1. The cause of the new instances failing the Status Check on launch can vary. It could be due to network connectivity issues, security group misconfigurations, or problems with the underlying infrastructure. To investigate further, you can check the CloudWatch logs and EC2 instance console output for any error messages or clues about the failure.

  2. AutoScaling Groups are designed to replace unhealthy instances, but they may not always do so immediately after a failed status check. In some cases the AutoScaling Group waits for a grace period before replacing the instance, and if the instance fails the status check consistently it may not trigger an automatic replacement. You can adjust the health check settings in your AutoScaling Group to ensure that instances failing the status check are replaced promptly (a rough sketch of that follows this list).

  3. Manually terminating an instance in an AutoScaling Group does not automatically trigger the creation of a new instance. This behavior is expected. When you terminate an instance, it is removed from the AutoScaling Group, and you would need to manually detach it before a new instance is created to maintain the desired capacity.
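As a rough sketch of adjusting those health check settings with boto3 (the group name, health check type, and grace period below are example values only, not a recommendation):

import boto3

autoscaling = boto3.client("autoscaling")

# Example values only: use your own group name and an appropriate grace period
autoscaling.update_auto_scaling_group(
    AutoScalingGroupName="my-codedeploy-green-asg",
    HealthCheckType="ELB",        # or "EC2" to rely on instance/system status checks
    HealthCheckGracePeriod=300,   # seconds after launch before health checks count
)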

Based on the information you provided, your AutoScaling Group settings seem appropriate. However, it's worth reviewing your security group configurations and checking for any network-related issues that could be causing the status check failures.

I hope this helps!

answered 9 months ago
0

What could be the cause of the new instances failing the Status Check on launch?

  • As Pedro mentioned, you'll want to look into OS and console logs. You'll also want to examine any userdata or other startup scripts you're running and see if any of them might be entering a race condition with each other, leading to a bricked instance.
  • As you mentioned, instance status checks are usually a reachability issue and mean something isn't working right inside the OS. It's not abnormal to see them fail briefly while the OS is booting, but failing for a solid hour is definitely unusual.
  • If the System status check is passing, then it's not likely to be an issue with the underlying hardware.
  • Security Groups are applied to traffic entering/leaving the instance and won't affect status checks.

It was my understanding that when an Instance is unhealthy, the AutoScaling Group would automatically terminate it and create a new Instance for me, but this does not happen when an Instance fails the Status Check. Why is that?

  • When CodeDeploy is configured on the ASG, it creates a Launching LifeCycle Hook (LCH) to trigger the deployment and prevent the ASG from registering the instance with any configured ELBs until the deployment is done. During an LCH, the ASG doesn't perform health checks on the instance, since it's expected that the instance might be rebooted or otherwise unhealthy during bootstrapping (see the sketch below for how to list these hooks).
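If you want to see that hook on the replacement ASG, something like this should list it (a boto3 sketch; the group name is a placeholder):

import boto3

autoscaling = boto3.client("autoscaling")

# Placeholder name: the ASG CodeDeploy cloned for the green fleet
resp = autoscaling.describe_lifecycle_hooks(
    AutoScalingGroupName="my-codedeploy-green-asg"
)

for hook in resp["LifecycleHooks"]:
    # The CodeDeploy-created hook fires on autoscaling:EC2_INSTANCE_LAUNCHING
    print(hook["LifecycleHookName"], hook["LifecycleTransition"], hook["HeartbeatTimeout"])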

Rebooting the Instance seems to do nothing, and manually terminating the Instance does not trigger the AutoScaling Group to create a new one. The ASG shows the Instance as still attached even after termination, and I have to manually detach it before a new Instance is created. Is this expected behavior?

  • For the rebooting, this also makes it sound to me like some sort of bootstrapping running during startup is causing the issue. If you have multiple things happening concurrently (userdata scripts + CodeDeploy, multiple deployment groups, etc.), they can race with each other during boot.
  • Terminating the instance being ignored by the ASG is the same as above. AutoScaling detects terminations through health checks, and since the LCH is running, the ASG isn't doing any health checks on that instance.
    • That said, the LCH CodeDeploy creates has a 10 minute timeout, so if you see even the terminated instances sitting around for an hour, that implies there's a second lifecycle hook on the ASG doing something else; because the CodeDeploy Agent (running inside the instance) is the one in charge of sending record-lifecycle-action-heartbeat API calls to extend the hook for up to 1 hour for pending deployments. And if the instance is terminated, the agent couldn't be sending these calls
  • For these instances, you can send complete-lifecycle-action from the CLI with the ABANDON result to have the ASG replace them. This isn't a solution, just something you can do to help with troubleshooting (a minimal sketch follows).
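A minimal sketch of that call with boto3 (hook name, group name, and instance ID are placeholders; get the real hook name from describe_lifecycle_hooks):

import boto3

autoscaling = boto3.client("autoscaling")

# Placeholders: use the real hook/group names and the stuck instance's ID
autoscaling.complete_lifecycle_action(
    LifecycleHookName="codedeploy-launch-hook-placeholder",
    AutoScalingGroupName="my-codedeploy-green-asg",
    LifecycleActionResult="ABANDON",   # abandon the launch so the ASG terminates and replaces the instance
    InstanceId="i-0123456789abcdef0",
)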
AWS
answered 9 months ago
