How do I troubleshoot the "Was the task killed externally" error in Amazon MWAA?


I want to troubleshoot the "Was the task killed externally" error in Amazon Managed Workflows for Apache Airflow (Amazon MWAA).

Short description

The "Was the task killed externally" error occurs when the state of a task differs between the Apache Airflow metadata database and the task initiator. The following are common causes of the error:

  • The task_queued_timeout value is reached. The default value is 600 seconds. For earlier versions of Apache Airflow, view the task_adoption_timeout value. For more information, see task_queued_timeout on the Apache Airflow website.
  • The task failed because of high resource utilization on the worker.

Resolution

Check your scheduler logs

Complete the following steps:

  1. Open the Amazon CloudWatch console.

  2. In the navigation pane, select Logs.

  3. Select Log groups.

  4. Choose the log group that you want to view.

  5. Select Search all log streams.

  6. To search the time period of your task failure, update the time interval. Also, filter the search with your task ID:

    "example-dag-name.example-task-name manual__example-time-202X-XX-XXTXX:XX:XX.758774+00:00"

    Note: Replace example-dag-name with your directed acyclic graph (DAG) name, example-task-name with your task name, and example-time with the time period that you want to use.

  7. Identify two log lines in the search results that reference your task:
    Example of your queued task:

    [2024-01-17T11:19:07.487+0000] scheduler_job_runner.py:713 INFO - Setting external_id for <TaskInstance: dag_name.task_name manual__202X-XX-XXTXX:XX:XX.758774+00:00 [queued]> to 8b49b168-992d-4db6-bdc7-a143d55720c8

    Example of your stopped task:

    [2024-01-17T11:30:18.936+0000] scheduler_job_runner.py:771 ERROR - Executor reports task instance <TaskInstance: dag_name.task_name manual__202X-XX-XXTXX:XX:XX.758774+00:00 [queued]> finished (failed) although the task says it's queued. (Info: None) Was the task killed externally?

  8. Further troubleshoot based on the following scenarios.
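The console search in the preceding steps can also be approximated programmatically. The following is a minimal sketch that only builds the request parameters for the CloudWatch Logs FilterLogEvents API. The function name, the log group name, and the run ID are hypothetical placeholders, so replace them with the values from your own environment.

```python
from datetime import datetime, timezone


def build_scheduler_log_search(log_group, dag_id, task_id, run_id, start, end):
    """Build keyword arguments for a CloudWatch Logs FilterLogEvents request.

    log_group is a placeholder: pass the scheduler log group of your environment.
    start and end are timezone-aware datetimes that bound the task failure.
    """
    # Same search term as in the console step: "<dag>.<task> <run id>"
    search_term = f"{dag_id}.{task_id} {run_id}"
    return {
        "logGroupName": log_group,
        # Double quotes make the filter match the whole phrase,
        # including the dots, colons, and plus sign.
        "filterPattern": f'"{search_term}"',
        # The API expects milliseconds since the Unix epoch.
        "startTime": int(start.timestamp() * 1000),
        "endTime": int(end.timestamp() * 1000),
    }


params = build_scheduler_log_search(
    log_group="example-scheduler-log-group",      # hypothetical name
    dag_id="example-dag-name",
    task_id="example-task-name",
    run_id="manual__2024-01-17T11:19:00+00:00",   # hypothetical run ID
    start=datetime(2024, 1, 17, 11, 0, tzinfo=timezone.utc),
    end=datetime(2024, 1, 17, 12, 0, tzinfo=timezone.utc),
)
print(params["filterPattern"])
```

If boto3 is installed and credentials are configured, these parameters could be passed as boto3.client("logs").filter_log_events(**params).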

Task failed because of a task_queued_timeout

Compare the timestamps of when your task was queued and when your task stopped. If the difference is greater than or equal to the task_queued_timeout value, then your task was queued too long.
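As a worked example, the following sketch compares the two timestamps from the example scheduler log lines earlier in this article. It assumes the default task_queued_timeout of 600 seconds; the function and constant names are illustrative only.

```python
from datetime import datetime

# Timestamp layout used in the scheduler log lines, for example
# 2024-01-17T11:19:07.487+0000
LOG_TIME_FORMAT = "%Y-%m-%dT%H:%M:%S.%f%z"

# Default value; change this if your environment overrides it.
TASK_QUEUED_TIMEOUT_SECONDS = 600


def queued_seconds(queued_at, failed_at):
    """Return the seconds between the queued log line and the failed log line."""
    queued = datetime.strptime(queued_at, LOG_TIME_FORMAT)
    failed = datetime.strptime(failed_at, LOG_TIME_FORMAT)
    return (failed - queued).total_seconds()


elapsed = queued_seconds(
    "2024-01-17T11:19:07.487+0000",  # "Setting external_id ..." line
    "2024-01-17T11:30:18.936+0000",  # "Was the task killed externally?" line
)
timed_out = elapsed >= TASK_QUEUED_TIMEOUT_SECONDS
print(f"{elapsed:.3f} seconds in queue, timed out: {timed_out}")
```

For the example log lines, the task spent about 671 seconds in the queue, which exceeds the 600-second default.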

To resolve this issue, take the following actions:

  • Increase the task_queued_timeout value so that tasks can wait longer in the queue without a timeout.
  • Upgrade to a higher environment class to increase the number of Celery worker slots in each worker container. The number of concurrent tasks that can run on the environment is maxWorkers * celery.worker_autoscale.
  • Spread the load of DAGs and tasks over time. Avoid running many DAGs at the same time.
  • Check that your scheduler isn't overloaded. If you have an overloaded scheduler, then tasks might not be scheduled on time.
  • If worker slots are sufficient, then increase the scheduler count to direct more resources towards tasks that need to be scheduled.
    Note: An increase in the scheduler count might affect metadata database utilization and parsing times.
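To illustrate the maxWorkers * celery.worker_autoscale formula, the following sketch estimates how many tasks can run at the same time. It assumes that celery.worker_autoscale is written in the "max,min" form and that the first number is the per-worker limit. The function name and the sample values are hypothetical.

```python
def max_concurrent_tasks(max_workers, worker_autoscale):
    """Estimate the environment-wide task concurrency.

    worker_autoscale is assumed to be a "max,min" string such as "5,5",
    where the first number caps the tasks that one worker runs at a time.
    """
    max_tasks_per_worker = int(worker_autoscale.split(",")[0])
    return max_workers * max_tasks_per_worker


# Hypothetical environment: 10 workers, each limited to 5 concurrent tasks.
capacity = max_concurrent_tasks(max_workers=10, worker_autoscale="5,5")
print(capacity)
```

If more tasks than this are queued at once, the surplus tasks wait for a free slot and can reach the task_queued_timeout value.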

If the value of task_queued_timeout isn't reached, then check your worker logs.

Complete the following steps:

  1. Access your Apache Airflow UI.
  2. Choose a DAG.
  3. Select Graph.
  4. Choose a task run.
  5. Select Instance details. Then, note the external_executor_id value of your task.
  6. Open the Amazon CloudWatch console.
  7. In the navigation pane, choose Log groups.
  8. Choose the log group that you want to view.
  9. Select Search all log streams.
  10. To search the time period of your task failure, update the time interval.
  11. Filter the search with the external_executor_id value to view log lines that are related to your task on the worker.
  12. Identify error messages that are related to your task. For more information about the errors, choose the log stream's name.

Task failed because of high CPU or memory utilization

If you receive the following error message, then your worker has resource utilization issues, such as high CPU or memory usage. As a result, the worker process that runs on the worker container fails and exits prematurely.

"[2023-07-26 13:00:49,356: ERROR/MainProcess] Task handler raised error: WorkerLostError('Worker exited prematurely: signal 15 (SIGTERM) Job: 1049.')"

To troubleshoot the preceding error message, check the CPUUtilization and MemoryUtilization metrics. If the metrics are constantly high or have spikes, then your Amazon MWAA workers are overloaded.
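The following sketch shows one way to apply the "constantly high or have spikes" check to utilization values that you already retrieved, for example as percentages exported from CloudWatch. The 80 percent threshold, the 0.8 share, and the function name are assumptions for illustration, not values that Amazon MWAA defines.

```python
def assess_utilization(datapoints, high=80.0, sustained_share=0.8):
    """Classify a series of CPU or memory utilization percentages.

    high and sustained_share are illustrative thresholds: the series counts
    as constantly high when at least sustained_share of the datapoints are
    at or above high.
    """
    if not datapoints:
        return "no data"
    above = sum(1 for value in datapoints if value >= high)
    if above / len(datapoints) >= sustained_share:
        return "constantly high"
    if above > 0:
        return "spikes"
    return "normal"


# Hypothetical samples, one value per metric period.
print(assess_utilization([90.0, 95.0, 85.0, 92.0]))
print(assess_utilization([30.0, 95.0, 40.0, 35.0]))
print(assess_utilization([20.0, 30.0]))
```

Either of the first two results suggests that the workers are overloaded and that the actions in this section apply.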

To resolve this issue, take the following actions:

  • Decrease the celery.worker_autoscale value to reduce the number of tasks that run concurrently on your worker.
  • Use a higher Amazon MWAA environment class for more RAM and vCPUs.
  • Rewrite your DAGs to offload the compute workload from Amazon MWAA to other compute platforms.
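As a sketch of the first action, the following function lowers celery.worker_autoscale through an Amazon MWAA client object. It assumes that the update call expects the full set of configuration options, so it reads the current options first and changes only one key. The function name, the environment name, and the "3,3" value are hypothetical.

```python
def lower_worker_autoscale(mwaa_client, environment_name, new_value):
    """Set celery.worker_autoscale while keeping the other options unchanged.

    mwaa_client is expected to behave like boto3.client("mwaa").
    new_value is a "max,min" string, for example "3,3".
    """
    current = mwaa_client.get_environment(Name=environment_name)
    # Copy the existing options so that the update doesn't drop them
    # (assumption: the update call takes the complete options map).
    options = dict(current["Environment"].get("AirflowConfigurationOptions", {}))
    options["celery.worker_autoscale"] = new_value
    mwaa_client.update_environment(
        Name=environment_name,
        AirflowConfigurationOptions=options,
    )
    return options


# Example call (requires boto3 and AWS credentials):
# import boto3
# lower_worker_autoscale(boto3.client("mwaa"), "example-environment", "3,3")
```

An environment update can take some time to complete, so plan the change for a period with few running tasks.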

Related information

Best practices on the Apache Airflow website

AWS OFFICIAL