The symptoms you describe (orphaned watchtower threads accumulating in worker processes when using a deferrable KubernetesPodOperator) look like a resource cleanup issue in how CloudWatch log handlers are managed during the deferrable task lifecycle.
While I don't have specific documentation confirming this exact issue with MWAA 3.0.6 and deferrable operators, your analysis points to a legitimate concern about handler lifecycle management. The fact that celery.worker_max_tasks_per_child=50 doesn't fully resolve the issue because forked processes inherit the leaked state is particularly telling.
For your immediate situation, here are some approaches to consider:
Configuration Adjustments:
Lowering celery.worker_autoscale is the usual first adjustment, but it has no room left here: for mw1.small environments the default is 5,0 (maximum, minimum task concurrency), and you have already set it to 1,1, which is the minimum. What remains is worker count. Check whether min_workers: 10 and max_workers: 25 are appropriately sized for your workload, since more workers means more processes that can accumulate these threads.
Monitoring and Reactive Measures:
Since you've identified that MemoryUtilization climbs to ~97% before OOM restarts, you should set up CloudWatch alarms on the MemoryUtilization metric for your AdditionalWorker containers. Monitor this metric with a 1-minute period and appropriate thresholds to get early warning before OOM events occur. You can also track the correlation with ActiveConnectionCount as you've observed.
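As a rough illustration of the alarm described above, here is a minimal boto3 sketch. The AWS/MWAA namespace and the Cluster/Environment dimension names are my assumptions about how MWAA publishes container metrics, so verify them against the metric as it appears in your CloudWatch console. The alarm name, the 85% threshold, the 3-of-5 datapoint rule, and the SNS topic ARN are all hypothetical placeholders.

```python
def build_memory_alarm(environment_name, sns_topic_arn, threshold_percent=85.0):
    """Build keyword arguments for CloudWatch put_metric_alarm.

    Namespace and dimension names are assumptions; confirm them in the
    CloudWatch console before relying on this alarm.
    """
    return {
        "AlarmName": f"mwaa-{environment_name}-additional-worker-memory",
        "Namespace": "AWS/MWAA",            # assumed namespace
        "MetricName": "MemoryUtilization",
        "Dimensions": [
            {"Name": "Cluster", "Value": "AdditionalWorker"},    # assumed dimension
            {"Name": "Environment", "Value": environment_name},  # assumed dimension
        ],
        "Statistic": "Maximum",
        "Period": 60,                        # 1-minute period
        "EvaluationPeriods": 5,
        "DatapointsToAlarm": 3,              # alarm on 3 of the last 5 minutes
        "Threshold": threshold_percent,      # well below the ~97% OOM point
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        "TreatMissingData": "notBreaching",
        "AlarmActions": [sns_topic_arn],
    }


def create_memory_alarm(environment_name, sns_topic_arn, threshold_percent=85.0):
    """Create or update the alarm. Requires boto3 and AWS credentials."""
    import boto3  # imported here so the builder above works without boto3

    cloudwatch = boto3.client("cloudwatch")
    cloudwatch.put_metric_alarm(
        **build_memory_alarm(environment_name, sns_topic_arn, threshold_percent)
    )
```

Calling create_memory_alarm("my-environment", "arn:aws:sns:...") with your own environment name and topic ARN would create the alarm. Pick a threshold low enough to leave time to react before the sawtooth peaks.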
Escalation Path:
Given that the creation and lifecycle of the CloudWatchLogHandler are managed by MWAA's infrastructure layer (not user-configurable code), and that the problem appears specific to the interaction between deferrable operators and the logging subsystem, this warrants escalation to AWS Support. When opening a case, include:
- Your py-spy thread dumps showing the watchtower thread accumulation
- CloudWatch metrics showing the memory sawtooth pattern correlated with ActiveConnectionCount
- Confirmation that the issue only manifests with deferrable=True on KubernetesPodOperator
- Details about your attempted mitigations
The fact that core.execute_tasks_new_python_interpreter: True (the Airflow 3 default) doesn't prevent the leak suggests this may be a deeper integration issue between MWAA's logging infrastructure and the deferrable task execution model that requires a platform-level fix.
In the interim, if manual SIGTERM to leaky processes is effective, you might consider implementing automated monitoring that detects high thread counts or memory usage and triggers worker recycling, though this is clearly a workaround rather than a solution.
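The automated recycling idea could look roughly like the sketch below. It assumes a Linux host and that the script runs somewhere with access to the worker's /proc filesystem and permission to signal its processes (for example, wherever you ran py-spy from). The function names, the 200-thread limit, and the way you obtain the candidate PIDs are all hypothetical, and it defaults to a dry run so nothing is signalled until you opt in.

```python
import os
import signal


def parse_thread_count(status_text):
    """Return the value of the 'Threads:' field from /proc/<pid>/status text."""
    for line in status_text.splitlines():
        if line.startswith("Threads:"):
            return int(line.split()[1])
    return None


def select_leaky_pids(thread_counts, max_threads):
    """Return the PIDs whose thread count exceeds max_threads, sorted."""
    return sorted(pid for pid, count in thread_counts.items() if count > max_threads)


def read_thread_counts(pids):
    """Read current thread counts for the given PIDs, skipping exited ones."""
    counts = {}
    for pid in pids:
        try:
            with open(f"/proc/{pid}/status") as handle:
                count = parse_thread_count(handle.read())
        except OSError:
            continue  # process exited or is not readable
        if count is not None:
            counts[pid] = count
    return counts


def recycle_leaky_workers(pids, max_threads=200, dry_run=True):
    """Send SIGTERM to processes over the thread limit; return the PIDs selected."""
    leaky = select_leaky_pids(read_thread_counts(pids), max_threads)
    for pid in leaky:
        if dry_run:
            continue
        try:
            os.kill(pid, signal.SIGTERM)  # same signal as the manual workaround
        except OSError:
            pass  # process went away between the check and the signal
    return leaky
```

Run it on a schedule with dry_run=True first and compare the selected PIDs against your py-spy dumps before letting it send signals. Thread count is used as the trigger here because it tracks the watchtower accumulation directly; a memory-based trigger would work along the same lines.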
Sources
Performance tuning for Apache Airflow on Amazon MWAA - Amazon Managed Workflows for Apache Airflow
Container, queue, and database metrics for Amazon MWAA - Amazon Managed Workflows for Apache Airflow
