How do I troubleshoot the "Was the task killed externally" error in Amazon MWAA?

I want to troubleshoot the "Was the task killed externally" error in Amazon Managed Workflows for Apache Airflow (Amazon MWAA).

Short description

The "Was the task killed externally" error occurs when the state of a task in the Airflow metadata database differs from the state that the executor reports for that task. The following are causes of the error:

  • The task_queued_timeout value is reached. The default value is 600 seconds. For earlier versions of Apache Airflow, view the task_adoption_timeout value. For more information, see task_queued_timeout_check_interval on the Apache Airflow website.
  • The task failed because of high resource utilization on the worker.

Resolution

Check your scheduler logs

Complete the following steps:

  1. Open the Amazon CloudWatch console.

  2. In the navigation pane, choose Logs.

  3. Choose Log groups.

  4. Choose the log group that you want to view.

  5. Choose Search all log streams.

  6. To search the time period of your task failure, update the time interval. Also, filter the search with your task ID:

    "example-dag-name.example-task-name manual__example-time-202X-XX-XXTXX:XX:XX.758774+00:00"

    Note: Replace example-dag-name with your directed acyclic graph (DAG) name, example-task-name with your task name, and example-time with the time period that you want to use.

  7. Identify two log lines in the search results that reference your task:

    The following is an example of your queued task:

    [2024-01-17T11:19:07.487+0000] scheduler_job_runner.py:713 INFO - Setting external_id for <TaskInstance: dag_name.task_name manual__202X-XX-XXTXX:XX:XX.758774+00:00 [queued]> to 8b49b168-992d-4db6-bdc7-a143d55720c8

    The following is an example of your stopped task:

    [2024-01-17T11:30:18.936+0000] scheduler_job_runner.py:771 ERROR - Executor reports task instance <TaskInstance: dag_name.task_name manual__202X-XX-XXTXX:XX:XX.758774+00:00 [queued]> finished (failed) although the task says it's queued. (Info: None) Was the task killed externally?
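
As a rough illustration of step 7, the following Python sketch scans scheduler log lines that you already exported from CloudWatch and picks out the two messages for one task. The function name find_task_events is hypothetical, and the sketch assumes that the log lines are available as plain strings. It isn't part of Amazon MWAA or Apache Airflow.

```python
import re

# Matches ANSI color sequences such as "\x1b[34m" that can appear in scheduler logs.
ANSI_ESCAPE = re.compile(r"\x1b\[[0-9;]*m")

# Message fragments taken from the two example log lines above.
QUEUED_MARKER = "Setting external_id for"
KILLED_MARKER = "Was the task killed externally?"


def find_task_events(lines, task_key):
    """Return the first queued line and the first stopped line for one task.

    task_key is the "dag_name.task_name run_id" text used as the search filter.
    """
    events = {"queued": None, "stopped": None}
    for raw in lines:
        line = ANSI_ESCAPE.sub("", raw)  # strip color codes before matching
        if task_key not in line:
            continue
        if QUEUED_MARKER in line and events["queued"] is None:
            events["queued"] = line
        elif KILLED_MARKER in line and events["stopped"] is None:
            events["stopped"] = line
    return events
```

If either value in the returned dictionary is None, then the search window probably doesn't cover both events, and you can widen the time interval.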

You can troubleshoot further for the following scenarios.

Task failed because of a task_queued_timeout

Compare the timestamp of when your task was queued with the timestamp of when your task stopped. If the difference is greater than or equal to the task_queued_timeout value, then your task was queued for too long and the scheduler failed it.
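
The timestamp comparison can be sketched in Python as follows. The helper names queued_duration and exceeded_queue_timeout are hypothetical. The timestamps come from the two example log lines in this article, and the 600-second default is the value stated in the short description.

```python
from datetime import datetime, timedelta

# Airflow scheduler timestamps look like 2024-01-17T11:19:07.487+0000.
LOG_TIME_FORMAT = "%Y-%m-%dT%H:%M:%S.%f%z"


def queued_duration(queued_at: str, stopped_at: str) -> timedelta:
    """Return the time between the queued log line and the stopped log line."""
    start = datetime.strptime(queued_at, LOG_TIME_FORMAT)
    end = datetime.strptime(stopped_at, LOG_TIME_FORMAT)
    return end - start


def exceeded_queue_timeout(queued_at, stopped_at, task_queued_timeout=600.0):
    """Return True if the queue time is at least task_queued_timeout seconds."""
    waited = queued_duration(queued_at, stopped_at).total_seconds()
    return waited >= task_queued_timeout


waited = queued_duration(
    "2024-01-17T11:19:07.487+0000",
    "2024-01-17T11:30:18.936+0000",
)
print(waited.total_seconds())  # 671.449
```

In this example, the task waited about 671 seconds, which is longer than the 600-second default, so the timeout is the likely cause.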

To resolve this issue, take the following actions:

  • Increase the task_queued_timeout value so that the tasks can wait longer in the queue without a timeout.
  • Upgrade to a higher environment class to increase the number of Celery worker slots in each worker container. The number of concurrent tasks that can run on the environment is maxWorkers multiplied by the maximum value of celery.worker_autoscale.
  • Spread the load of DAGs and tasks. Avoid starting many DAGs at the same time.
  • Check that your scheduler isn't overloaded. If you have an overloaded scheduler, then tasks might not be scheduled on time.
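
To make the capacity formula concrete, the following Python sketch estimates the number of task slots for an environment. The function name max_concurrent_tasks and the example values (10 workers, celery.worker_autoscale set to "5,5") are hypothetical. The sketch assumes that celery.worker_autoscale uses the "max,min" form.

```python
def max_concurrent_tasks(max_workers: int, worker_autoscale: str) -> int:
    """Estimate task slots as maxWorkers * the maximum in celery.worker_autoscale.

    worker_autoscale is assumed to be a "max,min" string, for example "5,5".
    """
    max_per_worker = int(worker_autoscale.split(",")[0].strip())
    return max_workers * max_per_worker


# Hypothetical environment: 10 workers, celery.worker_autoscale set to "5,5".
print(max_concurrent_tasks(10, "5,5"))  # 50
```

If the number of tasks that you expect to run at the same time is regularly higher than this estimate, then tasks are likely to wait in the queue.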

Note: An increase in the scheduler count might affect metadata database utilization and parsing times. More schedulers improve high availability (HA), but don't add more resources for task scheduling.

If the value of task_queued_timeout isn't reached, then check your worker logs.

To check your worker logs, complete the following steps:

  1. Access your Apache Airflow UI.
  2. Choose a DAG.
  3. Select Graph.
  4. Choose a task run.
  5. Choose Instance details. Then, note the external_executor_id value of your task.
  6. Open the Amazon CloudWatch console.
  7. In the navigation pane, choose Logs.
  8. Choose Log groups.
  9. Choose the log group that you want to view.
  10. Choose Search all log streams.
  11. To search the time period of your task failure, update the time interval.
  12. Filter the search with the external_executor_id value to view log lines that are related to your task on the worker.
  13. Identify error messages that are related to your task. For more information about the errors, choose the log stream's name.

Task failed because of high CPU or memory utilization

If you receive the following error message, then your worker has resource utilization issues, such as high CPU or memory usage. As a result, the worker process that runs in the worker container fails and exits prematurely.

"[2023-07-26 13:00:49,356: ERROR/MainProcess] Task handler raised error: WorkerLostError('Worker exited prematurely: signal 15 (SIGTERM) Job: 1049.')"

To troubleshoot the previous error message, check the CPUUtilization and MemoryUtilization metrics. If the metrics are constantly high or have spikes, then your Amazon MWAA workers are overloaded.
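
One way to reason about "constantly high" versus "spikes" is sketched below in Python. The function classify_utilization and both thresholds (80 percent sustained, 95 percent spike) are illustrative assumptions and aren't values defined by Amazon MWAA. The input is a list of utilization percentages that you already retrieved for CPUUtilization or MemoryUtilization.

```python
from statistics import mean


def classify_utilization(samples, sustained_threshold=80.0, spike_threshold=95.0):
    """Label a series of utilization percentages.

    Both thresholds are illustrative assumptions. Tune them for your workload.
    """
    if not samples:
        return "no data"
    if mean(samples) >= sustained_threshold:
        return "constantly high"  # the average itself is near the limit
    if max(samples) >= spike_threshold:
        return "spikes"  # mostly fine, but at least one sample hit the limit
    return "normal"


print(classify_utilization([20.0, 30.0, 99.0, 25.0]))  # spikes
```

Either a "constantly high" or a "spikes" result around the time of the WorkerLostError points to overloaded workers.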

To resolve overloaded workers, take the following actions:

  • Decrease the celery.worker_autoscale value to reduce the number of tasks that run concurrently on your worker.
  • Use a higher Amazon MWAA environment class for more RAM and vCPUs.
  • Rewrite your DAGs to offload the compute workload from Amazon MWAA to other compute platforms.

Related information

Best practices on the Apache Airflow website