Tasks that fail while still queued usually do so for one of two reasons:
- The task_queued_timeout value is reached. By default, this is set to 600 seconds.
- The task fails due to high resource utilization on the worker.
To troubleshoot this issue, you can take the following steps:
- Check the scheduler logs in Amazon CloudWatch. Look for two log lines referencing the task: one when it was queued and another when it stopped. Compare the timestamps to see whether the difference is greater than or equal to the task_queued_timeout value.
- If the task_queued_timeout is being reached, you can try:
  - Increasing the task_queued_timeout value
  - Upgrading to a higher environment class
  - Spreading the load of DAGs and tasks
  - Checking if the scheduler is overloaded
- If the task_queued_timeout isn't being reached, examine the worker logs for more information.
- Check the CPUUtilization and MemoryUtilization metrics in CloudWatch. If they are consistently high or show spikes, you may need to:
  - Decrease the celery.worker_autoscale value so that each worker runs fewer tasks concurrently
  - Use a higher MWAA instance class
  - Rewrite DAGs to offload compute workload from MWAA to other compute platforms
- Consider increasing the number of worker slots, which you've already attempted by adjusting the celery.worker_autoscale setting.
- Monitor the scheduler and worker logs closely for any additional insights.
- Set up alerts for task failures to help you identify issues more quickly.
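The timestamp comparison from the first step can be sketched in a few lines of Python. This is only an illustration: the function name `queued_timeout_reached` and the sample timestamps are hypothetical, the timestamps are assumed to have been copied by hand from the two scheduler log lines in ISO 8601 form, and the 600-second value is the default mentioned above, which your environment may override.

```python
from datetime import datetime, timedelta

# Default mentioned in this answer; replace with your environment's configured value.
TASK_QUEUED_TIMEOUT = timedelta(seconds=600)


def queued_timeout_reached(queued_at: str, stopped_at: str,
                           timeout: timedelta = TASK_QUEUED_TIMEOUT) -> bool:
    """Return True if the gap between the two log timestamps is >= the timeout."""
    queued = datetime.fromisoformat(queued_at)
    stopped = datetime.fromisoformat(stopped_at)
    return stopped - queued >= timeout


# Hypothetical timestamps taken from the "queued" and "stopped" log lines.
print(queued_timeout_reached("2024-05-01T10:00:00", "2024-05-01T10:10:02"))  # gap of 602 s
print(queued_timeout_reached("2024-05-01T10:00:00", "2024-05-01T10:03:30"))  # gap of 210 s
```

If the first case applies to your task, the timeout is the likely cause; if the second applies, the worker logs are the better place to look next.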
The fact that tasks occasionally run successfully without changes suggests that the issue might be related to intermittent resource constraints or timing issues. Continue to monitor the environment closely and consider reaching out to AWS support if the problem persists after trying these troubleshooting steps.
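To tell intermittent resource pressure apart from sustained pressure, you could classify the CPUUtilization or MemoryUtilization datapoints roughly as follows. This is a sketch under stated assumptions: the datapoints are assumed to be percentages exported from CloudWatch by hand, and the function name, the 80% threshold, and the "half of all samples" cutoff are hypothetical choices, not values recommended by AWS.

```python
def classify_utilization(samples, threshold=80.0, sustained_share=0.5):
    """Classify utilization percentages as 'normal', 'spikes', or 'consistently high'.

    threshold and sustained_share are illustrative assumptions; tune them
    to what counts as high load in your environment.
    """
    if not samples:
        return "no data"
    above = [s for s in samples if s >= threshold]
    if not above:
        return "normal"
    # A large share of samples above the threshold suggests sustained load.
    if len(above) / len(samples) >= sustained_share:
        return "consistently high"
    return "spikes"


# Hypothetical datapoints for a worker over six periods.
print(classify_utilization([20, 30, 95, 25, 22, 18]))
print(classify_utilization([90, 95, 85, 88]))
```

A "spikes" result would be consistent with tasks that sometimes succeed without any changes, while "consistently high" points toward a larger environment class or offloading work.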
Sources
Troubleshoot Was the task killed externally error in MWAA | AWS re:Post
Performance tuning for Apache Airflow on Amazon MWAA - Amazon Managed Workflows for Apache Airflow
