"This is indeed a known issue with MWAA. It can occur when tasks are deleted mid-execution due to rapid autoscaling, especially when there's a brief period of high task load followed by a lull in activity."
Please note that the autoscaling problem referred to by the agent is no longer relevant, as MWAA auto scaling has since been improved:
If workers pick up new tasks while downscaling, Amazon MWAA keeps the Fargate resource and does not remove the worker. For more information, see How Amazon MWAA auto scaling works.
https://docs.aws.amazon.com/mwaa/latest/userguide/mwaa-autoscaling.html
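The linked page describes how MWAA derives the desired worker count from the RunningTasks and QueuedTasks CloudWatch metrics. A rough, illustrative sketch of that calculation (the real logic is internal to the service; the function and parameter names here are assumptions):

```python
import math

def desired_workers(running_tasks: int, queued_tasks: int,
                    tasks_per_worker: int,
                    min_workers: int, max_workers: int) -> int:
    """Illustrative only: approximate how MWAA might size the worker
    fleet from the RunningTasks and QueuedTasks metrics."""
    needed = math.ceil((running_tasks + queued_tasks) / tasks_per_worker)
    # Clamp to the environment's configured min/max worker range.
    return max(min_workers, min(needed, max_workers))

# Example: 18 running + 7 queued tasks, 5 tasks per worker, range 1..10
print(desired_workers(18, 7, 5, 1, 10))  # -> 5
```

The point of the sketch: a burst of queued tasks scales the fleet up quickly, and when the queue empties the desired count drops just as quickly, which is the downscaling behavior discussed below.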
The other points are still valid.
The drastic change in worker autoscaling rate after migrating to a new MWAA environment running Airflow 2.10.3 could be due to several factors:
- Different default configurations: The new Airflow version might ship different default settings for autoscaling, which could lead to more aggressive scaling behavior.
- Changes in workload patterns: If your DAGs or tasks have changed, they might trigger different scaling behavior.
- Improved autoscaling algorithms: Newer versions of MWAA might have more responsive autoscaling algorithms.
As for jobs being 'lost' or failing without apparent reason: this is indeed a known issue with MWAA. It can occur when tasks are deleted mid-execution due to rapid autoscaling, especially when a brief period of high task load is followed by a lull in activity.
While increasing the minimum worker count or disabling autoscaling are potential solutions, they're not ideal for cost reasons as you mentioned. Here are some alternatives to consider:
- Adjust autoscaling parameters: MWAA uses the RunningTasks and QueuedTasks metrics for worker autoscaling. You might be able to adjust these parameters to make scaling less aggressive.
- Stagger DAG schedules: By spreading your workload out more evenly, you can reduce the sudden spikes in demand that trigger rapid scaling.
- Optimize DAGs: Simplify complex DAGs and reduce the number of tasks where possible to minimize the load on workers.
- Use deferrable operators: These can help reduce the need for constant worker availability.
- Monitor and tune performance: Keep an eye on CPU utilization and memory usage. If these are consistently high, it might explain the frequent scaling.
- Adjust the task queuing timeout: If tasks are failing due to queuing timeouts, you might need to increase the scheduler.task_queued_timeout value.
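If you go the tuning route, the first and last suggestions above map to Airflow configuration options you can set on the MWAA environment. The values below are examples, not recommendations; confirm the current option names and formats against the MWAA configuration reference:

```
# Pin Celery worker concurrency (max,min) so each worker takes a
# predictable number of tasks, making scaling calculations steadier
celery.worker_autoscale = 10,10

# Give queued tasks longer before the scheduler fails them (seconds;
# Airflow default is 600)
scheduler.task_queued_timeout = 900
```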
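One hedged way to implement the staggering suggestion: derive a deterministic minute offset from each DAG's ID so that hourly schedules spread across the hour instead of all firing at minute 0. The helper below is a sketch (the DAG IDs and the offset scheme are illustrative, not an MWAA feature):

```python
import zlib

def staggered_cron(dag_id: str, base_cron: str = "0 * * * *") -> str:
    """Replace the minute field of an hourly cron expression with a
    deterministic per-DAG offset (0-59), spreading load across the hour."""
    minute = zlib.crc32(dag_id.encode()) % 60
    _, rest = base_cron.split(" ", 1)
    return f"{minute} {rest}"

# Each DAG keeps a stable schedule, but they no longer all start at :00.
print(staggered_cron("etl_orders"))
print(staggered_cron("etl_billing"))
```

Because the offset is derived from the DAG ID, redeploying the environment doesn't reshuffle schedules, and no central registry of offsets is needed.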
Unfortunately, MWAA doesn't currently offer a built-in 'graceful shutdown' period for workers. The service aims to optimize cost by removing unnecessary workers quickly. However, by implementing some of the above strategies, you should be able to achieve a more stable and cost-effective autoscaling behavior.
Sources
Troubleshooting: DAGs, Operators, Connections, and other issues in Apache Airflow v2 - Amazon Managed Workflows for Apache Airflow
Optimize cost and performance for Amazon MWAA | AWS Big Data Blog
Troubleshoot the "Was the task killed externally" error in MWAA | AWS re:Post
