
MWAA Tasks failing while queued (state mismatch)


Recently, without any changes to my MWAA environment or the tasks themselves, tasks have started failing with high probability on their first run, with no logs generated in the worker or task log streams. All I can see in the event log is:

Executor CeleryExecutor(parallelism=12) reported that the task instance <TaskInstance:task__2024-11-01T00:00:00+00:00 [queued]> finished with state failed, but the task instance's state attribute is queued. Learn more: https://airflow.apache.org/docs/apache-airflow/stable/troubleshooting.html#task-state-changed-externally

However, the strange thing is that they occasionally run successfully, without any changes to the config or the task itself. For now I have each task set to a ridiculously large retry count so that eventually they all pass.

I tried playing around with configuration values like celery.worker_autoscale = 5,0, and setting a minimum of 2, 3, or 4 workers, but that hasn't helped. I also tried setting the maximum worker count equal to the minimum in case the autoscaling itself is the problem.

I'm running Airflow 2.10.1 with 3 minimum & maximum workers. The error persists even when I create new environments with only 1 DAG.

(screenshot: MWAA event log)

1 Answer

The issue you're experiencing, with tasks failing while queued, has two common causes:

  1. The task_queued_timeout value (under the scheduler section of the Airflow configuration) is reached. By default, this is set to 600 seconds.
  2. The task fails due to high resource utilization on the worker.
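For reference, a minimal sketch of raising that timeout on MWAA via a configuration override (the environment name and the 1200-second value below are illustrative assumptions, not from the original post):

```python
# Build the MWAA configuration override that raises the queued-task
# timeout. In Airflow 2.6+ (including 2.10.1) the option is
# scheduler.task_queued_timeout; the default is 600 seconds.
OPTION = "scheduler.task_queued_timeout"

def timeout_overrides(seconds: int) -> dict:
    """AirflowConfigurationOptions payload for an MWAA environment update."""
    return {OPTION: str(seconds)}

overrides = timeout_overrides(1200)
print(overrides)

# Applying it triggers an environment update (workers are recycled,
# so schedule it outside peak hours):
#   import boto3
#   boto3.client("mwaa").update_environment(
#       Name="my-mwaa-env",                    # hypothetical name
#       AirflowConfigurationOptions=overrides,
#   )
```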

To troubleshoot this issue, you can take the following steps:

  1. Check the scheduler logs in Amazon CloudWatch. Look for two log lines referencing the task - one when it was queued and another when it stopped. Compare the timestamps to see if the difference is greater than or equal to the task_queued_timeout value.

  2. If the task_queued_timeout is being reached, you can try:

    • Increasing the task_queued_timeout value
    • Upgrading to a higher environment class
    • Spreading the load of DAGs and tasks
    • Checking if the scheduler is overloaded
  3. If the task_queued_timeout isn't being reached, examine the worker logs for more information.

  4. Check the CPUUtilization and MemoryUtilization metrics in CloudWatch. If they are consistently high or show spikes, you may need to:

    • Decrease the celery.worker_autoscale value
    • Use a higher MWAA instance class
    • Rewrite DAGs to offload compute workload from MWAA to other compute platforms
  5. Consider increasing the number of worker slots, which you've already attempted by adjusting the celery.worker_autoscale setting. Note that the format is max,min concurrent tasks per worker, so a value of 5,0 allows each worker to scale down to zero concurrent tasks.

  6. Monitor the scheduler and worker logs closely for any additional insights.

  7. Set up alerts for task failures to help you identify issues more quickly.
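Step 1's timestamp comparison boils down to a simple check; a sketch with illustrative timestamps in the ISO 8601 format the scheduler logs use:

```python
from datetime import datetime

TASK_QUEUED_TIMEOUT = 600  # Airflow default, in seconds

def exceeded_timeout(queued_at: str, stopped_at: str,
                     timeout: int = TASK_QUEUED_TIMEOUT) -> bool:
    """True if the gap between the 'queued' and 'stopped' scheduler
    log lines is at or past task_queued_timeout."""
    delta = (datetime.fromisoformat(stopped_at)
             - datetime.fromisoformat(queued_at))
    return delta.total_seconds() >= timeout

# Illustrative timestamps copied from two CloudWatch scheduler log lines:
print(exceeded_timeout("2024-11-01T00:00:05+00:00",
                       "2024-11-01T00:10:06+00:00"))  # → True (601 s >= 600 s)
```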

The fact that tasks occasionally run successfully without changes suggests that the issue might be related to intermittent resource constraints or timing issues. Continue to monitor the environment closely and consider reaching out to AWS support if the problem persists after trying these troubleshooting steps.
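As a rough way to act on step 4, you can pull CPUUtilization datapoints (for example via boto3's cloudwatch get_metric_statistics) and flag sustained load; the 80% threshold and 50% fraction below are arbitrary assumptions you should tune to your environment:

```python
def sustained_high(datapoints, threshold=80.0, fraction=0.5):
    """Flag 'consistently high' utilization: at least `fraction` of
    the sampled datapoints are at or above `threshold` percent."""
    if not datapoints:
        return False
    high = sum(1 for d in datapoints if d >= threshold)
    return high / len(datapoints) >= fraction

# Illustrative worker CPUUtilization samples (percent):
print(sustained_high([85.0, 91.2, 78.5, 96.0]))  # → True (3 of 4 >= 80%)
```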
Sources
Troubleshoot Was the task killed externally error in MWAA | AWS re:Post
Performance tuning for Apache Airflow on Amazon MWAA - Amazon Managed Workflows for Apache Airflow

answered a year ago
