
On-Demand ASG Fallback Implementation / Verification


Hello AWS Team,

I am trying to verify a safe and conservative way to fall back from Spot to On-Demand instances in an Auto Scaling Group (ASG) when Spot capacity is unavailable. Our current approach is:

  1. Set up SNS notifications for INSTANCE_LAUNCH_ERROR.

  2. Trigger a Lambda function on this event to update the ASG and increase the On-Demand instance count. The Lambda filters events so that it only acts when the error is related to Spot capacity.
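For illustration, the Lambda in step 2 might look roughly like the sketch below. It is not our production code. It assumes the ASG uses a mixed instances policy, that the SNS message carries the `Event`, `AutoScalingGroupName`, `StatusMessage` and `Description` fields, and that Spot capacity failures can be recognised by the substrings in `SPOT_CAPACITY_MARKERS`. Those markers, the `step` and `ceiling` values, and the helper names are all hypothetical and should be checked against the messages your own group emits.

```python
import json

LAUNCH_ERROR_EVENT = "autoscaling:EC2_INSTANCE_LAUNCH_ERROR"

# Assumption: substrings that indicate a Spot capacity failure.
# Verify against the real StatusMessage values seen in your account.
SPOT_CAPACITY_MARKERS = (
    "InsufficientInstanceCapacity",
    "UnfulfillableCapacity",
    "no Spot capacity",
)


def is_spot_capacity_error(message: dict) -> bool:
    """Return True only for launch errors that look Spot-capacity related."""
    if message.get("Event") != LAUNCH_ERROR_EVENT:
        return False
    text = f"{message.get('StatusMessage', '')} {message.get('Description', '')}".lower()
    return any(marker.lower() in text for marker in SPOT_CAPACITY_MARKERS)


def bump_on_demand_base(client, asg_name: str, step: int = 1, ceiling: int = 4) -> int:
    """Raise OnDemandBaseCapacity by `step`, capped at `ceiling` (both hypothetical)."""
    groups = client.describe_auto_scaling_groups(AutoScalingGroupNames=[asg_name])
    group = groups["AutoScalingGroups"][0]
    distribution = group["MixedInstancesPolicy"]["InstancesDistribution"]
    current = distribution.get("OnDemandBaseCapacity", 0)
    target = min(current + step, ceiling)
    if target != current:
        # Only the instances distribution is sent; the launch template is left as-is.
        client.update_auto_scaling_group(
            AutoScalingGroupName=asg_name,
            MixedInstancesPolicy={
                "InstancesDistribution": {"OnDemandBaseCapacity": target}
            },
        )
    return target


def handler(event, context, client=None):
    """SNS-triggered entry point; `client` is injectable so the logic can be tested."""
    if client is None:
        import boto3  # imported lazily so the filter logic runs without AWS access

        client = boto3.client("autoscaling")
    results = {}
    for record in event.get("Records", []):
        message = json.loads(record["Sns"]["Message"])
        if is_spot_capacity_error(message):
            name = message["AutoScalingGroupName"]
            results[name] = bump_on_demand_base(client, name)
    return results
```

The cap on `OnDemandBaseCapacity` is there so that a burst of error notifications cannot keep raising the On-Demand count without limit. Lowering it again once Spot capacity returns is not covered by this sketch.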

This method relies on the assumption that the INSTANCE_LAUNCH_ERROR event is triggered immediately when a replacement Spot instance cannot be launched, before the existing Spot instance is terminated. That would give the new On-Demand instance time to start up before the Spot instance is terminated, just as Capacity Rebalancing attempts to do for Spot instances. All instances in this ASG start up in roughly under 2 minutes, so in theory there should be no downtime if we can react quickly to some signal from the ASG that more On-Demand instances are needed.

Our reasoning is this: when a rebalance recommendation or termination notice is received, an ASG with Capacity Rebalancing enabled attempts to launch a new Spot instance before the current one is terminated, so there should be an INSTANCE_LAUNCH_ERROR when no Spot capacity is available. However, we would like verification of this, as we cannot find this information in the AWS documentation.

Key Questions:

  1. Timing of the INSTANCE_LAUNCH_ERROR event

     ◦ If an ASG with Capacity Rebalancing enabled receives a rebalance recommendation and attempts to launch a replacement Spot instance, but no Spot capacity is available, does the INSTANCE_LAUNCH_ERROR event fire immediately, or only after the existing Spot instance is fully terminated?
     ◦ How long does it typically take to receive this event in such scenarios?
     ◦ Is there a way to configure the ASG to fire the INSTANCE_LAUNCH_ERROR event immediately if it doesn't by default?

  2. Recommended alternatives

     ◦ If the above logic does not work reliably, what alternative mechanisms would you suggest?
     ◦ Are there proven non-native solutions that effectively handle fallback from Spot to On-Demand when Spot capacity is unavailable?

I understand that expanding AZs and increasing instance diversity improves Spot availability, and we are already doing this. However, we need a reliable fallback mechanism when Spot instances cannot be acquired. This seems to be a very common situation for AWS customers.

I appreciate your insights on this and look forward to your recommendations.

Best,


Chris

2 Answers

Your approach to falling back from Spot to On-Demand instances in an Auto Scaling Group (ASG) is on the right track, but there are some important considerations and potential improvements.

Regarding the timing of the INSTANCE_LAUNCH_ERROR event:

  1. The ASG does not immediately create an INSTANCE_LAUNCH_ERROR event when it receives a capacity rebalance recommendation or spot termination notification. Instead, it continues attempting to launch new Spot Instances in other enabled Availability Zones.

  2. The ASG persists in trying to launch new instances across different Availability Zones until it successfully launches new instances or until the old instances are forcibly terminated. This process can take some time, and there's no immediate error event generated.

  3. There isn't a way to configure the ASG to fire the INSTANCE_LAUNCH_ERROR event immediately if it doesn't by default.

Given this behavior, your current approach may not be as reliable or immediate as you're hoping for. Here are some recommendations for a more robust solution:

  1. Implement Capacity Rebalancing: This feature proactively attempts to replace Spot Instances that receive a rebalance recommendation before they are interrupted.

  2. Use lifecycle hooks: Develop custom actions to gracefully handle instance replacements. This can give you more control over the process.

  3. Monitor rebalance recommendations and interruption notices using Amazon EventBridge: This can trigger checkpoints for your workload or handle interruptions more gracefully.

  4. Consider using a mixed instances policy in your ASG: This allows you to specify both Spot and On-Demand instances, giving you more flexibility.

  5. Implement automated scaling policies: Use CloudWatch alarms to monitor your traffic patterns and set up auto scaling policies to automatically adjust the mix of Spot and On-Demand instances based on availability and demand.

  6. Use weighted target group routing with an Application Load Balancer: This can help you gradually shift traffic between Spot and On-Demand instances during scaling operations.

While there isn't a native AWS solution that perfectly fits your requirements, combining these strategies can create a more reliable and responsive system for handling Spot capacity issues and falling back to On-Demand instances when necessary.
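As a rough illustration of recommendation 3, the sketch below builds an EventBridge event pattern that matches both EC2 rebalance recommendations and Spot interruption warnings, and classifies an incoming event. The function names and the rule name `spot-signals` are hypothetical. What you do with the classified event (for example, raising On-Demand capacity) is left to your own logic.

```python
import json

REBALANCE = "EC2 Instance Rebalance Recommendation"
INTERRUPTION = "EC2 Spot Instance Interruption Warning"


def spot_signal_pattern() -> str:
    """Event pattern (as JSON) matching both Spot signals emitted by EC2."""
    return json.dumps({"source": ["aws.ec2"], "detail-type": [REBALANCE, INTERRUPTION]})


def classify(event: dict):
    """Return (kind, instance_id) for a Spot signal event, or None otherwise."""
    if event.get("source") != "aws.ec2":
        return None
    kind = {REBALANCE: "rebalance", INTERRUPTION: "interruption"}.get(
        event.get("detail-type")
    )
    if kind is None:
        return None
    return kind, event.get("detail", {}).get("instance-id")


def create_rule(events_client, rule_name: str = "spot-signals"):
    """Register the pattern; expects a boto3 'events' client (rule name is hypothetical)."""
    return events_client.put_rule(
        Name=rule_name,
        EventPattern=spot_signal_pattern(),
        State="ENABLED",
    )
```

A rule like this still needs a target, such as a Lambda function, attached before anything reacts to the events. Because rebalance recommendations arrive ahead of the interruption itself, they may be a more timely signal than waiting for a launch error.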

Remember to thoroughly test any implementation in a non-production environment before deploying to production, as the behavior can be complex and may vary depending on your specific configuration and workload.

Sources
Sequence of events when an asg receives a capacity rebalance recommendation or spot termination notification | AWS re:Post
Automating EC2 Instance Scaling for Variable Traffic with Zero Downtime | AWS re:Post

answered 10 months ago
EXPERT
reviewed 10 months ago

Refer to https://medium.com/@ranshn/using-preferred-instance-types-availability-zones-or-purchase-options-with-ec2-auto-scaling-groups-1a5997f3cb0d. Your use case matches point 4 under "So when would you want to use this approach?"

AWS
answered 10 months ago
