
How do I resolve tasks that are stuck in the running state in my Amazon MWAA environment?


I want to troubleshoot tasks that are stuck in the running state or are zombie tasks for my Amazon Managed Workflows for Apache Airflow (Amazon MWAA) environment.

Short description

When you run Directed Acyclic Graphs (DAGs) in Amazon MWAA, tasks might get stuck in the running state or become zombie tasks. This can happen even when all child tasks and their work are complete, the task logs contain no errors, and the logs show that the work succeeded or failed. The issue might affect individual tasks or entire DAGs, and can require you to manually mark them as successful or failed so that the pipeline run can continue.

This issue might occur for the following reasons:

  • There's high resource use on the workers.
  • There are database connection issues between the worker and scheduler containers.
  • There are scheduler resource constraints.
  • There's improper configuration of Apache Airflow parameters.
  • You don't follow DAG writing best practices.
    Note: For more information, see Best practices on the Apache Airflow website.

Resolutions

Monitor and optimize worker resources

Tasks get stuck in the running state when the workers experience high CPU or memory use. When workers become overwhelmed, they lose connectivity to the metadata database and fail to properly update the task status.

To check worker resource use, complete the following steps:

  1. Use the Amazon CloudWatch console to check the resource use for base workers and additional workers.
  2. Select Maximum for the statistic, and then choose a period of 1 minute or 5 minutes for a detailed view.
  3. Look for consistent peaks above 80%. These peaks indicate resource constraints.
    Note: Make sure that memory and CPU use stays below the 90% threshold. For a scripted version of this check, see the sketch after these steps.
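If you prefer to script the check, the following sketch pulls the Maximum statistic at a 1-minute period with boto3 and flags data points above 80%. The namespace, metric name, dimension names, and environment name shown here are assumptions; copy the exact values from the worker metric that you see in the CloudWatch console for your environment.

```python
# Hypothetical sketch: list worker CPU peaks above 80% for the last 3 hours.
import boto3
from datetime import datetime, timedelta

cloudwatch = boto3.client("cloudwatch")

response = cloudwatch.get_metric_statistics(
    Namespace="AmazonMWAA",                                      # assumed namespace
    MetricName="CPUUtilization",                                 # assumed metric name
    Dimensions=[
        {"Name": "Environment", "Value": "MyMwaaEnvironment"},   # hypothetical name
        {"Name": "Function", "Value": "Worker"},                 # assumed dimension
    ],
    StartTime=datetime.utcnow() - timedelta(hours=3),
    EndTime=datetime.utcnow(),
    Period=60,                 # 1-minute period, as recommended in the steps above
    Statistics=["Maximum"],
)

# Flag data points above the 80% threshold from step 3.
for point in sorted(response["Datapoints"], key=lambda p: p["Timestamp"]):
    if point["Maximum"] > 80:
        print(f"{point['Timestamp']}: worker CPU peaked at {point['Maximum']:.1f}%")
```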

To optimize worker resources, take the following actions:

  • Access the dag_processor_manager.log in the DAG processing log group. If you have high DAG parse times, then you must refactor the DAG. For more information, see dag_processor_manager_log_location on the Apache Airflow website.
  • Review and refactor the code submitted to the scheduler and workers. For more information, see Reducing DAG complexity on the Apache Airflow website.
  • Set celery.worker_autoscale to a value lower than the default concurrency for your environment class.
  • Identify the timeframes when the workload is the heaviest and stagger task execution. Use depends_on_past or other dependencies to add slight delays between task groups.
  • Set the worker, scheduler, and web server logs to the WARNING level instead of INFO. At the INFO level, Amazon MWAA sends a large volume of detailed logs to CloudWatch, and the excess logging consumes CPU and other resources. For one way to apply these settings and log levels, see the sketch after this list.
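Amazon MWAA applies Apache Airflow configuration options and log levels through the environment settings rather than airflow.cfg. The following sketch shows one way to update them with the boto3 MWAA client; the environment name, the worker_autoscale value, the log groups that are changed, and the LoggingConfiguration field names are examples or assumptions, so adjust them to your environment class and workload and confirm the field names in the boto3 documentation.

```python
# Hypothetical sketch: lower worker concurrency and reduce log verbosity.
import boto3

mwaa = boto3.client("mwaa")

mwaa.update_environment(
    Name="MyMwaaEnvironment",  # hypothetical environment name
    AirflowConfigurationOptions={
        # Example only: a value lower than the class default
        # (for example, "3,3" on an mw1.small environment that defaults to "5,5").
        "celery.worker_autoscale": "3,3",
    },
    LoggingConfiguration={
        # WARNING instead of INFO reduces the volume of logs sent to CloudWatch.
        "WorkerLogs": {"Enabled": True, "LogLevel": "WARNING"},
        "SchedulerLogs": {"Enabled": True, "LogLevel": "WARNING"},
        "WebserverLogs": {"Enabled": True, "LogLevel": "WARNING"},
    },
)
```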

Resolve database connection issues

Tasks can also become stuck in the running state when workers lose their connection to the metadata database, which prevents status updates. When this issue occurs, you receive database-related errors such as the following:

  • "psycopg2.OperationalError: SSL connection has been closed unexpectedly"
  • "sqlalchemy.exc.OperationalError: (psycopg2.errors.ConnectionException) Timed-out waiting to acquire database connection"
  • SIGTERM/SIGKILL errors.

To resolve database connection issues, take the following actions:

  • To create a new Python interpreter for each task, isolate database connectivity, and prevent connection issues across tasks, set core.execute_tasks_new_python_interpreter to True.
    Note: For more information, see Setting configuration options on the Apache Airflow website.
  • To minimize database load, clean up the PostgreSQL metadata database regularly.
    Note: Avoid passing large data volumes through XCom. Instead, use Amazon Simple Storage Service (Amazon S3) to store and pass large payloads (see the sketch after this list).
  • Use variable calls only when necessary and according to DAG writing best practices.
    Note: Each call for a variable opens a database connection.
  • To tune database performance, adjust core.min_serialized_dag_update_interval, scheduler.dag_dir_list_interval, scheduler.min_file_process_interval, and scheduler.scheduler_idle_sleep_time.
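As a sketch of the XCom guidance above, the following DAG stores a large payload in Amazon S3 with S3Hook and passes only the object key through XCom, so the metadata database never carries the data itself. It assumes Apache Airflow 2.4 or later with the Amazon provider package (included with Amazon MWAA); the bucket name and object key are placeholders.

```python
# Hypothetical sketch: pass an S3 key through XCom instead of a large payload.
from airflow.decorators import dag, task
from airflow.providers.amazon.aws.hooks.s3 import S3Hook
from pendulum import datetime


@dag(schedule=None, start_date=datetime(2024, 1, 1), catchup=False)
def s3_instead_of_xcom():
    @task
    def extract() -> str:
        import json

        # Stand-in for a large query result that should not go through XCom.
        payload = json.dumps({"rows": list(range(100_000))})
        key = "staging/extract/payload.json"
        S3Hook().load_string(
            string_data=payload,
            key=key,
            bucket_name="my-airflow-staging-bucket",  # hypothetical bucket
            replace=True,
        )
        # Only the small S3 key travels through XCom and the metadata database.
        return key

    @task
    def transform(key: str) -> None:
        import json

        raw = S3Hook().read_key(key=key, bucket_name="my-airflow-staging-bucket")
        rows = json.loads(raw)["rows"]
        print(f"Loaded {len(rows)} rows from Amazon S3 instead of XCom")

    transform(extract())


s3_instead_of_xcom()
```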

Optimize scheduler performance

Scheduler overload can cause tasks to be stuck in the running state when the scheduler can't properly track task progress.

To resolve this issue, take the following actions:

  • To reduce scheduler load, set scheduler.schedule_after_task_execution to False. Also tune scheduler.parsing_processes, which controls how many processes parse DAG files in parallel.
  • To monitor scheduler health, check for any drops in the SchedulerHeartbeat CloudWatch metric (a sample alarm is sketched after this list).
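One way to watch for heartbeat drops is a CloudWatch alarm on the SchedulerHeartbeat metric. The following sketch creates such an alarm with boto3; the alarm name, threshold, period, namespace, dimension names, and environment name are assumptions, so confirm the exact metric details in your CloudWatch console before you use it.

```python
# Hypothetical sketch: alarm when the scheduler heartbeat drops off.
import boto3

cloudwatch = boto3.client("cloudwatch")

cloudwatch.put_metric_alarm(
    AlarmName="mwaa-scheduler-heartbeat-low",                    # hypothetical name
    Namespace="AmazonMWAA",                                      # assumed namespace
    MetricName="SchedulerHeartbeat",                             # metric named in this article
    Dimensions=[
        {"Name": "Environment", "Value": "MyMwaaEnvironment"},   # hypothetical name
    ],
    Statistic="Sum",
    Period=300,                       # 5-minute evaluation window (example)
    EvaluationPeriods=2,              # two consecutive windows with no heartbeats
    Threshold=1,
    ComparisonOperator="LessThanThreshold",
    TreatMissingData="breaching",     # missing data also indicates a stalled scheduler
)
```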

Follow Apache Airflow best practices

Complex DAGs can contribute to tasks being stuck in the running state.

To prevent these issues, take the following actions:

  • Avoid heavy data processing in Apache Airflow workers. Instead, offload compute-intensive operations to purpose-built services such as Amazon EMR, AWS Glue, or Amazon Elastic Kubernetes Service (Amazon EKS).
  • Move imports and logic inside task functions when possible (see the sketch after this list).
    Note: Top-level code runs on every DAG parsing cycle and consumes resources.
  • Create an .airflowignore file to exclude unnecessary DAGs from processing.
  • Avoid calling Apache Airflow variables in top-level code. Use variable caching when variables are needed at parse time.
  • Implement deferrable operators for long-running external processes.
    Note: These operators release worker slots while waiting for external processes, and prevent workers from remaining occupied during idle waiting periods.
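The following sketch illustrates the top-level code and variable guidance: the heavy import and the Variable.get call both move inside the task callable, so they run only at execution time and not on every parse cycle. It assumes Apache Airflow 2.4 or later; the DAG name, variable name, and library are examples, and pandas is only available if it's listed in your requirements.txt.

```python
# Hypothetical sketch: keep top-level DAG code light.
from airflow.decorators import dag, task
from pendulum import datetime


@dag(schedule=None, start_date=datetime(2024, 1, 1), catchup=False)
def light_top_level():
    @task
    def crunch_numbers() -> float:
        # Imported here, so the scheduler never pays for it while parsing the DAG file.
        import pandas as pd                # example heavy import; assumes pandas is installed
        from airflow.models import Variable

        # Variable.get opens a database connection; calling it inside the task
        # keeps that connection out of the parse loop.
        sample_size = int(Variable.get("sample_size", default_var=1000))

        df = pd.DataFrame({"value": range(sample_size)})
        return float(df["value"].mean())

    crunch_numbers()


light_top_level()
```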

Upgrade environment class

If you still experience issues after you optimize your environment, then upgrade your environment class. Each larger environment class doubles the compute resources that are available for your environment to run tasks.
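As an example, the following sketch moves an environment up one class with the boto3 MWAA client; the environment name and target class are placeholders, so pick the next size up from your current class.

```python
# Hypothetical sketch: upgrade the environment class.
import boto3

boto3.client("mwaa").update_environment(
    Name="MyMwaaEnvironment",       # hypothetical environment name
    EnvironmentClass="mw1.medium",  # example: one size up from mw1.small
)
```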

Related information

Performance tuning for Apache Airflow on Amazon MWAA

Configuring the Amazon MWAA environment class

Configuration reference on the Apache Airflow website