How do I resolve the "psycopg2.OperationalError: SSL connection has been closed unexpectedly error" in Amazon MWAA?
I want to troubleshoot the "psycopg2.OperationalError: SSL connection has been closed unexpectedly" error in Amazon Managed Workflows for Apache Airflow (Amazon MWAA).
Short description
When you run your directed acyclic graph (DAG) and a task fails, you might receive the following error messages in the task log:
- "psycopg2.OperationalError: SSL connection has been closed unexpectedly".
- "sqlalchemy.exc.OperationalError: (psycopg2.errors.ConnectionException) Timed-out waiting to acquire database connection".
The preceding error messages occur when the worker overloads or loses connection to the metadata database, or the task tries to reuse a closed database connection.
To resolve the issue, take the following actions:
- Reduce the CPU and memory utilization.
- Check your workers' configurations.
- Use deferrable operators.
- Remove top-level code.
- Don't use Apache Airflow variables.
- Simplify your DAG.
- Create an .airflowignore file and add it to the main DAG folder in Amazon Simple Storage Service (Amazon S3).
Resolution
Reduce the CPU and memory utilization
An overloaded scheduler, worker, or web server might not be able to maintain active database connections. The default idle connection timeout for the Amazon Relational Database Service (Amazon RDS) for Apache Airflow metadata is 30 minutes. If a task or process remains idle or unresponsive for longer than this time period, then the database might close the connection. For more information, see IdleClientTimeout.
To check the CPU and memory utilization metrics, complete the following steps:
- Open the Amazon CloudWatch console.
- Choose the AWS/MWAA namespace.
- On the All metrics tab, choose Cluster.
- Choose CPUUtilization, and then choose Graph all search results.
- Choose the Graphed metrics tab to view performance metrics.
- Repeat steps 1–3, and then choose MemoryUtilization.
- Choose Graph all search results, and then choose the Graphed metrics tab to view performance metrics.
Note: You must select Maximum for Statistic and 1 minute for Period.
For more information, see Container, queue, and database metrics for Amazon MWAA.
If your workers' CPU and memory utilization are consistently above 90%, then the workers are overloaded or the tasks are too heavy. Heavy CPU and memory use might cause issues with task status reports, metadata database communication, and new tasks.
To reduce the workers' CPU and memory utilization, take the following actions:
- Because workers require more resources during the initial phases of task creation, spread out your task schedules.
- Narrow your imports to only tasks that require them, and remove global imports that aren't required.
- Use Apache Airflow and Amazon MWAA as your main tools. Schedule your remaining tasks on services such as Amazon Elastic Kubernetes Service (Amazon EKS), Amazon Elastic Container Service (Amazon ECS), or AWS Glue.
Note: Don't process large volumes of data on workers because they have limited resources. Use services that are designed to process large volumes of data. - Reduce the data that's passed between tasks. When possible, chunk the data so that downstream data doesn't change because of overloaded Amazon MWAA workloads.
Note: If you must pass large volumes of data between tasks, then store the data externally and import it as needed. Some operators use XComs. For more information, see XComs on the Apache Airflow website. - Regularly clean the metadata database.
Note: High CPU utilization or low free memory might cause queries to fail or increase latency that results in an SSL exception error.
Note: If you can't reduce the CPU, then use a larger environment class with more resources for the workers or further tune your workers.
Check your workers' configurations
Note: If you receive errors when you run AWS Command Line Interface (AWS CLI) commands, then see Troubleshooting errors for the AWS CLI. Also, make sure that you're using the most recent AWS CLI version.
Each Apache Airflow version has default values for worker configurations. However, you might need to customize the configurations based on your requirements.
Set your Apache Airflow configurations
Use the Amazon MWAA console, AWS CLI update-environment command, or an UpdateEnvironment API call to customize the following configurations:
- The celery.worker_autoscale configuration controls the maximum and minimum number of tasks that can run concurrently on a worker. Reduce the maximum database connections from the worker. Then, allocate more resources for each task for a smaller number of tasks with the same worker resources.
- Set core.execute_tasks_new_python_interpreter to True to create a new Python interpreter for each task and isolate database connectivity.
Note: Set the configuration to True only if the "psycopg2.OperationalError" error recurs. - Use the auto scaling feature in Amazon MWAA to increase the maximum worker count.
Note: Set both the max-workers and celery.worker_autoscale values to accommodate your workload.
For more information, see Using Apache Airflow configuration options on Amazon MWAA.
Check your worker, scheduler, and web server logs
In production environments, verbose logging uses unnecessary resources and increases CPU utilization. It's a best practice to set worker, scheduler, and web server logs to at least WARNING. In most cases, INFO logs aren't necessary.
Use deferrable operators
Use deferrable operators when your tasks submit jobs to external systems or services and wait for events to occur, such as an EMR cluster creation. Deferrable operators release worker slots and increase workload capacity. For more information, see Deferrable operators & triggers on the Apache Airflow website.
Remove top-level code
Remove top-level code that uses unnecessary resources from the scheduler. Also, adjust the dag_processor.min_file_process_interval configuration for your use case. For more information, see min_file_process_interval and Best practices on the Apache Airflow website.
Don't use Apache Airflow variables
Don't use Apache Airflow variables in top-level Python code. When you retrieve variables, you must query the database and the query occurs on each parsing loop.
Note: If you must use variables, then use variable caching. For more information, see use_cache on the Apache Airflow website.
Simplify your DAG
Simplify your DAG, and test your DAG's code for the specific Python version of your Amazon MWAA environment.
Use dynamic task mapping when necessary. For more information, see Dynamic task mapping on the Apache Airflow website.
Create an .airflowignore file and add it to the main DAG folder in Amazon S3
Apache Airflow scheduler syncs the content of the DAG folder from Amazon S3 to each worker. The scheduler also periodically parses the files in Amazon S3 to update the UI. When it parses, the scheduler runs the files and generates the DAGs.
Use the .airflowignore file to specify the folders and files from the DAG folder in Amazon S3. You must use regex and glob patterns. Disregard the non-DAG files during the parsing process to reduce the scheduler load on the scheduler and improve the DAG efficiency. For more information, see .airfowignore on the Apache Airflow website.
- Topics
- Application Integration
- Language
- English

Relevant content
- asked 5 months ago