
Airflow MWAA autoscaling rate too fast


We migrated our Airflow DAGs from an environment running 2.5 to a new environment running 2.10.3.

After this, we noticed the rate of worker autoscaling drastically increased.

[Chart: number of workers every 5 minutes, old vs. new environment]

The orange line represents the worker count every 5 mins in our old environment, the blue is our new environment.

This seemingly led to jobs being 'lost', i.e. a job failing without any obvious reason and with no error messages in any Airflow log group. For example, when we trigger an ECS container to run a script from Airflow, the container runs successfully in ECS with a complete log trail, but the task is marked as failed in Airflow for no apparent reason: the Airflow task log shows the job starting, then the logs stop and the task is marked as failed.

Potentially this is happening due to a known issue documented by AWS here: https://docs.aws.amazon.com/mwaa/latest/userguide/t-apache-airflow-202.html#troubleshooting-tasks-202 Step 3 there describes a similar issue, and the suggested resolution is to either increase the minimum worker count or turn off autoscaling.

Obviously we don't want to implement either of those suggestions long term for cost reasons. My question is: why has the rate of worker autoscaling changed so drastically since our migration to a new, upgraded MWAA environment?

Are there any configuration options we can use to reduce the rate of worker autoscaling, i.e. have workers hang around for longer after their tasks complete before being terminated, e.g. some sort of 'graceful shutdown' period?

2 Answers

This is indeed a known issue with MWAA. It can occur when tasks are deleted mid-execution due to rapid autoscaling, especially when a brief period of high task load is followed by a lull in activity.

Please note that the autoscaling problem referred to by the agent is no longer relevant, as MWAA autoscaling has since been improved:

If workers pick up new tasks while downscaling, Amazon MWAA keeps the Fargate resource and does not remove the worker. For more information, see How Amazon MWAA auto scaling works. 

https://docs.aws.amazon.com/mwaa/latest/userguide/mwaa-autoscaling.html

The other points are still valid.

AWS
EXPERT
answered a year ago

The drastic change in worker autoscaling rate after migrating to a new MWAA environment running Airflow 2.10.3 could be due to several factors:

  1. Different default configurations: The new Airflow version might have different default settings for autoscaling, which could lead to more aggressive scaling behavior.

  2. Changes in workload patterns: If your DAGs or tasks have changed, it might trigger different scaling behaviors.

  3. Improved autoscaling algorithms: Newer versions of MWAA might have more responsive autoscaling algorithms.

To address the issue of jobs being 'lost' or failing without apparent reason, this is indeed a known issue with MWAA. It can occur when tasks are deleted mid-execution due to rapid autoscaling, especially when there's a brief period of high task load followed by a lull in activity.

While increasing the minimum worker count or disabling autoscaling are potential solutions, they're not ideal for cost reasons as you mentioned. Here are some alternatives to consider:

  1. Adjust autoscaling parameters: MWAA scales workers based on the RunningTasks and QueuedTasks CloudWatch metrics. Raising the minimum worker count slightly, or tuning per-worker task concurrency, can make scaling behavior less aggressive.

  2. Stagger DAG schedules: By spreading out your workload more evenly, you can reduce sudden spikes in demand that trigger rapid scaling.

  3. Optimize DAGs: Simplify complex DAGs and reduce the number of tasks where possible to minimize the load on workers.

  4. Use deferrable operators: These can help reduce the need for constant worker availability.

  5. Monitor and tune performance: Keep an eye on CPU utilization and memory usage. If these are consistently high, it might explain the frequent scaling.

  6. Adjust task queuing timeout: If tasks are failing due to queuing timeouts, you might need to increase the scheduler.task_queued_timeout value.
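
As a hedged sketch of points 1 and 6, both the worker bounds and the queued-task timeout can be set through the MWAA UpdateEnvironment API (boto3 `mwaa` client). The environment name and values below are placeholders, not recommendations:

```python
def build_update_kwargs(env_name, min_workers, max_workers, queued_timeout_s):
    """Assemble UpdateEnvironment parameters.

    A MinWorkers above 1 keeps a small warm pool so downscaling is less
    aggressive; a larger scheduler.task_queued_timeout gives queued tasks
    more time before the scheduler fails them.
    """
    return {
        "Name": env_name,
        "MinWorkers": min_workers,
        "MaxWorkers": max_workers,
        "AirflowConfigurationOptions": {
            "scheduler.task_queued_timeout": str(queued_timeout_s),
        },
    }


def apply_update(kwargs):
    # Local import so the sketch can be read/run without AWS configured.
    import boto3

    # Note: updating an MWAA environment triggers a rolling update and
    # can take 10-30 minutes to complete.
    boto3.client("mwaa").update_environment(**kwargs)


kwargs = build_update_kwargs("my-mwaa-env", 2, 10, 900)
# apply_update(kwargs)  # uncomment to actually apply the change
```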

Unfortunately, MWAA doesn't currently offer a built-in 'graceful shutdown' period for workers. The service aims to optimize cost by removing unnecessary workers quickly. However, by implementing some of the above strategies, you should be able to achieve a more stable and cost-effective autoscaling behavior.
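
The schedule-staggering suggestion (point 2) can be sketched as a small helper that spaces DAG start minutes apart instead of firing everything at the top of the hour; the DAG names and 5-minute spacing here are purely illustrative:

```python
def staggered_crons(dag_ids, interval_min=5):
    """Return {dag_id: cron expression} with start minutes spaced
    interval_min apart, so hourly DAGs don't all trigger a demand
    spike (and a scale-up/scale-down cycle) at minute 0."""
    return {
        dag_id: f"{(i * interval_min) % 60} * * * *"
        for i, dag_id in enumerate(dag_ids)
    }


schedules = staggered_crons(["extract", "transform", "load"])
# -> {"extract": "0 * * * *", "transform": "5 * * * *", "load": "10 * * * *"}
```

Each DAG would then pass its assigned expression as its `schedule`, smoothing the worker-count sawtooth that drives rapid autoscaling.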
Sources
Troubleshooting: DAGs, Operators, Connections, and other issues in Apache Airflow v2 - Amazon Managed Workflows for Apache Airflow
Optimize cost and performance for Amazon MWAA | AWS Big Data Blog
Troubleshoot Was the task killed externally error in MWAA | AWS re:Post

answered a year ago
