
Gracefully scale down AWS ASG


I have an AWS Auto Scaling group (ASG) that runs self-hosted GitHub runners. During a fixed time frame each day the ASG is underutilized, so I want to scale it down during that window to save cost and scale it up again afterwards.

I plan to use something like scheduled scaling. Scaling up is fine, but when scaling down, scheduled scaling may terminate nodes that are still running GitHub Actions jobs, which causes those jobs to fail.

Is there any way to wait for running jobs to finish (while cordoning the node in the meantime, obviously) before the nodes are terminated?

asked 2 years ago, 503 views
2 Answers

1. Implement Lifecycle Hooks:

Lifecycle hooks can pause the termination of an instance in the ASG, giving you time to check whether any GitHub Actions jobs are running before termination proceeds. Configure a lifecycle hook for the autoscaling:EC2_INSTANCE_TERMINATING transition. When the instance is selected for termination, it enters the Terminating:Wait state and stays there until the lifecycle action is completed or the heartbeat timeout expires.
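As a rough sketch of this step, the hook could be created with boto3's `put_lifecycle_hook`. The hook name, the 900-second heartbeat timeout, and the helper names are assumptions made for illustration; they are not values from the original answer.

```python
def build_termination_hook_params(asg_name, hook_name="wait-for-runner-jobs",
                                  heartbeat_timeout=900):
    """Build keyword arguments for the Auto Scaling put_lifecycle_hook call."""
    return {
        "AutoScalingGroupName": asg_name,
        "LifecycleHookName": hook_name,  # hypothetical hook name
        "LifecycleTransition": "autoscaling:EC2_INSTANCE_TERMINATING",
        "HeartbeatTimeout": heartbeat_timeout,  # seconds before DefaultResult applies
        "DefaultResult": "CONTINUE",
    }


def create_termination_hook(asg_name):
    """Create the hook; needs AWS credentials and the boto3 package."""
    import boto3  # imported lazily so the builder above works without AWS access

    client = boto3.client("autoscaling")
    client.put_lifecycle_hook(**build_termination_hook_params(asg_name))
```

Calling `create_termination_hook("github-runners")` (a made-up group name) would attach the hook. For a terminating instance, the instance is terminated once the timeout expires whichever default result is chosen, so the timeout should comfortably exceed the interval between heartbeats.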

2. Monitor GitHub Runner Jobs:

You need a mechanism to check whether the GitHub runner on the instance is currently running a job.

If it is, treat the instance as busy and postpone its termination by keeping the lifecycle hook in its wait state (for example, by recording a heartbeat).
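One possible way to do the check is to look the runner up in the response of GitHub's "list self-hosted runners" endpoint, which returns a `runners` array whose entries carry `name` and `busy` fields. The sketch below assumes each runner was registered under its EC2 instance ID as its name; that naming convention is an assumption, not something the setup above guarantees.

```python
def find_runner(runners_payload, instance_id):
    """Return the runner entry whose name equals the instance ID, or None."""
    for runner in runners_payload.get("runners", []):
        if runner.get("name") == instance_id:
            return runner
    return None


def instance_is_busy(runners_payload, instance_id):
    """True only if a matching runner exists and reports busy."""
    runner = find_runner(runners_payload, instance_id)
    return bool(runner and runner.get("busy"))


# Example payload shaped like the list-runners response (values are made up).
sample = {
    "total_count": 2,
    "runners": [
        {"id": 11, "name": "i-0aaa", "status": "online", "busy": True},
        {"id": 12, "name": "i-0bbb", "status": "online", "busy": False},
    ],
}
print(instance_is_busy(sample, "i-0aaa"))
```

An instance with no matching runner is treated as idle here; depending on how runners are registered, you may prefer to treat that case as an error instead.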

3. Cordon the Node:

Before allowing the instance to terminate, you can "cordon" the node by disabling new jobs from being assigned to it.

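Note that GitHub's REST API has no documented call that pauses a runner, as far as I know. Common workarounds are to strip the runner's custom labels so workflows no longer match it, or to register runners as ephemeral so each one takes a single job and then deregisters.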
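One way to stop new jobs from landing on a runner is to remove its custom labels through the REST API's `DELETE /repos/{owner}/{repo}/actions/runners/{runner_id}/labels` endpoint. This only helps if your workflows select runners by a custom label in `runs-on`, since the default labels such as `self-hosted` remain. The function name and the sample values below are hypothetical.

```python
import urllib.request


def build_remove_labels_request(repo, runner_id, token):
    """Build (but do not send) the request that removes all custom labels."""
    url = f"https://api.github.com/repos/{repo}/actions/runners/{runner_id}/labels"
    return urllib.request.Request(
        url,
        method="DELETE",
        headers={
            "Authorization": f"Bearer {token}",
            "Accept": "application/vnd.github+json",
        },
    )


# "my-org/my-repo", 42 and the token are placeholder values.
request = build_remove_labels_request("my-org/my-repo", 42, "example-token")
print(request.get_method(), request.full_url)
```

Sending it with `urllib.request.urlopen(request)` requires a token that is allowed to administer the repository's runners.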

4. Complete Lifecycle Hook:

Once all running jobs on the instance are finished, you can complete the lifecycle action, allowing the instance to terminate gracefully.

Example Workflow:

Create a Lambda function that is triggered by the lifecycle hook when the instance is scheduled for termination.

Check if the instance is running any GitHub jobs by querying the GitHub API.

If jobs are running, postpone termination and cordon the node.

If no jobs are running, complete the lifecycle action and terminate the instance.

Example Code Snippet:

Here’s an example of how you might configure the Lambda function:

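import json
import os
import urllib.request  # standard library; "requests" is not bundled with the Lambda runtime

import boto3

# Assumed configuration, read from Lambda environment variables.
# In production, keep the token in Secrets Manager or SSM Parameter Store.
GITHUB_API_TOKEN = os.environ["GITHUB_API_TOKEN"]
GITHUB_REPO = os.environ["GITHUB_REPO"]  # "owner/repo"
RUNNER_ID = os.environ["RUNNER_ID"]      # simplification: a real setup maps the instance to its runner

asg_client = boto3.client("autoscaling")

def runner_is_busy():
    # The "get a self-hosted runner" endpoint returns a boolean "busy" field
    url = f"https://api.github.com/repos/{GITHUB_REPO}/actions/runners/{RUNNER_ID}"
    request = urllib.request.Request(url, headers={
        "Authorization": f"Bearer {GITHUB_API_TOKEN}",
        "Accept": "application/vnd.github+json",
    })
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.load(response)["busy"]

def lambda_handler(event, context):
    # Assumes the hook is delivered through EventBridge, which puts the hook fields under "detail"
    detail = event["detail"]
    action = {
        "LifecycleHookName": detail["LifecycleHookName"],
        "AutoScalingGroupName": detail["AutoScalingGroupName"],
        "LifecycleActionToken": detail["LifecycleActionToken"],
    }
    if runner_is_busy():
        # Runner is busy: extend the wait. The function must be invoked again later to re-check.
        print("Runner is busy, postponing termination")
        asg_client.record_lifecycle_action_heartbeat(**action)
    else:
        # Runner is idle: let the termination proceed
        print("Runner is idle, proceeding with termination")
        asg_client.complete_lifecycle_action(LifecycleActionResult="CONTINUE", **action)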
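The snippet above performs a single check, so something has to repeat it until the runner is idle. The loop below is a minimal sketch of that repetition. The three callables are placeholders you would wire to the GitHub busy check, `record_lifecycle_action_heartbeat`, and `complete_lifecycle_action`; the polling interval and check limit are arbitrary assumptions. A Lambda invocation is capped at 15 minutes, so for long jobs this kind of loop may fit better in a script on the instance or in a Step Functions workflow.

```python
import time


def wait_until_idle(is_busy, send_heartbeat, complete,
                    poll_seconds=60, max_checks=30, sleep=time.sleep):
    """Poll until the runner is idle, then complete the lifecycle action.

    Returns True if the action was completed, False if max_checks ran out
    (in which case the hook's own timeout decides what happens next).
    """
    for _ in range(max_checks):
        if not is_busy():
            complete()
            return True
        send_heartbeat()  # keep the instance in its wait state
        sleep(poll_seconds)
    return False
```

Keep in mind that a lifecycle hook cannot hold an instance indefinitely: the total wait is bounded by 48 hours or 100 times the heartbeat timeout, whichever is smaller.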

Additional Considerations:

Scheduled Scaling: You can still use scheduled scaling to define the time frames for scaling up and down. The lifecycle hook process above ensures that each scale-down waits for running jobs instead of interrupting them.
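A scheduled scale-down and scale-up pair could look roughly like this with boto3's `put_scheduled_update_group_action`. The group name, action names, cron expressions, and capacities are invented for illustration and would need to match your own quiet window.

```python
def build_scheduled_action(asg_name, action_name, cron, desired,
                           min_size, max_size, time_zone="UTC"):
    """Build keyword arguments for put_scheduled_update_group_action."""
    return {
        "AutoScalingGroupName": asg_name,
        "ScheduledActionName": action_name,
        "Recurrence": cron,  # cron format: minute hour day-of-month month day-of-week
        "MinSize": min_size,
        "MaxSize": max_size,
        "DesiredCapacity": desired,
        "TimeZone": time_zone,
    }


# Hypothetical schedule: shrink at 20:00, grow again at 06:00, on weekdays.
SCALE_DOWN = build_scheduled_action("github-runners", "nightly-scale-down",
                                    "0 20 * * 1-5", desired=1, min_size=1, max_size=10)
SCALE_UP = build_scheduled_action("github-runners", "morning-scale-up",
                                  "0 6 * * 1-5", desired=6, min_size=2, max_size=10)


def apply_schedule():
    """Register both actions; needs AWS credentials and the boto3 package."""
    import boto3  # imported lazily so the builders work without AWS access

    client = boto3.client("autoscaling")
    for action in (SCALE_DOWN, SCALE_UP):
        client.put_scheduled_update_group_action(**action)
```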

Error Handling: Implement error handling and retries for API calls and lifecycle hook completions.
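A small retry helper with exponential backoff is one way to approach this. It is a generic sketch, not an AWS or GitHub feature; the attempt count and delays are assumptions, and a real version would probably catch narrower exception types.

```python
import time


def call_with_retries(func, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call func(), retrying on any exception with exponential backoff.

    Re-raises the last exception once all attempts are used up.
    """
    for attempt in range(attempts):
        try:
            return func()
        except Exception:
            if attempt == attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)  # waits 1s, 2s, 4s, ... by default
```

For example, the GitHub status check and the lifecycle completion call could each be wrapped as `call_with_retries(lambda: ...)`.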

https://docs.aws.amazon.com/autoscaling/ec2/userguide/ec2-auto-scaling-scaling-cooldowns.html

answered 2 years ago

Hello,

These steps may help:

Use Lifecycle Hooks with Scheduled Scaling

  1. Set Up a Lifecycle Hook:
  • Create a lifecycle hook for the "Terminating" state in your ASG. This will pause instance termination, allowing you to check if any GitHub Actions jobs are running.

  2. Create a Lambda Function:
  • Trigger this function via the lifecycle hook.
  • The function should check if any jobs are running. If so, keep the instance in a "Wait" state.
  • Cordon the instance so no new jobs start.
  • Once jobs are completed, signal the lifecycle hook to proceed with termination.

  3. Combine with Scheduled Scaling:
  • Use scheduled scaling to adjust your ASG size based on your usage time frames, ensuring cost savings without disrupting jobs.
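To trigger the Lambda function from the lifecycle hook, one common route is an EventBridge rule that matches the termination lifecycle event. The sketch below shows an event pattern for such a rule and a helper that pulls the fields the function needs out of the event. The group name and helper names are hypothetical, and the sample event is abbreviated.

```python
def build_terminate_hook_event_pattern(asg_name):
    """EventBridge pattern matching terminate lifecycle actions for one ASG."""
    return {
        "source": ["aws.autoscaling"],
        "detail-type": ["EC2 Instance-terminate Lifecycle Action"],
        "detail": {"AutoScalingGroupName": [asg_name]},
    }


def extract_lifecycle_action(event):
    """Pick out the fields needed to heartbeat or complete the lifecycle action."""
    detail = event["detail"]
    return {
        "LifecycleHookName": detail["LifecycleHookName"],
        "AutoScalingGroupName": detail["AutoScalingGroupName"],
        "LifecycleActionToken": detail["LifecycleActionToken"],
        "InstanceId": detail["EC2InstanceId"],
    }


# Abbreviated sample event with made-up values.
sample_event = {
    "source": "aws.autoscaling",
    "detail-type": "EC2 Instance-terminate Lifecycle Action",
    "detail": {
        "LifecycleHookName": "wait-for-runner-jobs",
        "AutoScalingGroupName": "github-runners",
        "LifecycleActionToken": "example-token",
        "EC2InstanceId": "i-0aaa",
    },
}
print(extract_lifecycle_action(sample_event)["InstanceId"])
```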

https://docs.aws.amazon.com/autoscaling/ec2/userguide/lifecycle-hooks-overview.html

answered 2 years ago
