How can I figure out what is causing my MWAA environment to roll back?

I have a new MWAA cluster and I'm deploying the same code that works in another cluster, but this one just doesn't work. It starts with empty values for plugins.zip, requirements.txt, and startup.sh. When I fill those in, it accepts the change request but quickly goes into a rollback state (where it stays for hours) and eventually becomes unavailable.

I've been looking through all the logs, but I'm not seeing anything that clearly states why the cluster is rolling back.

How can I find out what is causing the cluster to roll back? Is it in the logs, or is there something else I can look at?

asked 21 days ago, 43 views
1 Answer
  1. Are you looking in the MWAA logs in your S3 bucket? The logs are organized into subfolders by dag_id, task_id, execution_date, and try_number, and the S3 bucket location can be found in the MWAA environment configuration. You can also view the logs through the MWAA web interface. Using the CLI: aws s3 ls s3://<your-mwaa-logs-bucket>/logs/ --recursive
  2. You can also configure MWAA to send log notifications to Amazon SNS or Amazon CloudWatch to receive real-time notifications when new logs are generated.
  3. You can review the MWAA cluster events. They record the actions performed on the cluster, such as starting, stopping, updating, or deleting it, which can help you diagnose issues with your environment and track its status.
  4. Another option is to contact AWS Support and talk to a support agent using the chat option. https://docs.aws.amazon.com/awssupport/latest/user/joining-a-live-chat-session.html
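
One way to look at the status information from point 3 programmatically is the MWAA GetEnvironment API, which returns the environment status and, when the last update failed, an error code and message. The following is a minimal sketch rather than a definitive diagnostic: it assumes boto3 is installed and credentials are configured, and the helper name describe_last_update and the environment name "my-mwaa-env" are made up for illustration. The client is passed in so the function can be tried against a stub.

```python
def describe_last_update(mwaa_client, env_name):
    """Summarize the environment status and the last update's error, if any.

    mwaa_client is expected to behave like boto3.client("mwaa").
    Fields missing from the response come back as None.
    """
    env = mwaa_client.get_environment(Name=env_name)["Environment"]
    last_update = env.get("LastUpdate", {})
    error = last_update.get("Error", {})
    return {
        "status": env.get("Status"),
        "update_status": last_update.get("Status"),
        "error_code": error.get("ErrorCode"),
        "error_message": error.get("ErrorMessage"),
    }


# Example usage (hypothetical environment name, requires AWS credentials):
#   import boto3
#   print(describe_last_update(boto3.client("mwaa"), "my-mwaa-env"))
```

If error_message is empty, the failure details may only be available in the logs or through AWS Support.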
AWS
Kraj
answered 21 days ago
  • I have no logs in S3 at all. Maybe because it has never started properly, or maybe because I've configured everything to go to CloudWatch? Where would I find these cluster events?

  • Absolutely! Here's a blog on how to check the cluster metrics. https://aws.amazon.com/blogs/compute/introducing-container-database-and-queue-utilization-metrics-for-the-amazon-mwaa-environment/

    1. Please check whether the workers have enough memory for the tasks you are running with Airflow, for example tasks that hold a lot of data in memory. To do this, look at the Maximum statistic of the BaseWorker Memory Utilization metric over a one-minute period.
    2. Please check the DagBag import timeout. You can raise core.dagbag_import_timeout to a higher value, say 120 seconds, and see if that resolves the issue. You can always dial it down later. https://docs.aws.amazon.com/mwaa/latest/userguide/best-practices-tuning.html
    3. Also check that the environment is set up correctly. If installing your specified requirements, plugins, or dependencies files fails, the environment initiates a rollback to the previous stable version. In this case it is a new cluster, so there is nothing to roll back to.
    4. To mitigate these issues, make sure your DAGs and requirements work using the aws-mwaa-local-runner utility and, ideally, test in a staging Amazon MWAA environment. The local runner lets you check them against the MWAA image on your own machine. https://github.com/aws/aws-mwaa-local-runner
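
Since the environment here is configured to log to CloudWatch, a requirements install failure like the one described in point 3 may show up there. The sketch below searches one log group for matching lines using the CloudWatch Logs FilterLogEvents API. It rests on several assumptions that should be checked against your own account: that logging is enabled for the chosen component, that the log group is named airflow-<environment name>-<component>, and that the install output lands in streams whose names start with requirements_install. The function name find_install_errors is hypothetical.

```python
def find_install_errors(logs_client, env_name, component="Scheduler",
                        stream_prefix="requirements_install", pattern="ERROR"):
    """Collect log messages matching `pattern` from one MWAA log group.

    logs_client is expected to behave like boto3.client("logs").
    The group name and stream prefix follow an assumed naming convention.
    """
    kwargs = {
        "logGroupName": f"airflow-{env_name}-{component}",
        "logStreamNamePrefix": stream_prefix,
        "filterPattern": pattern,
    }
    messages = []
    while True:
        page = logs_client.filter_log_events(**kwargs)
        messages.extend(event["message"] for event in page.get("events", []))
        token = page.get("nextToken")
        if not token:
            return messages
        # Keep paging until the service stops returning a token.
        kwargs["nextToken"] = token


# Example usage (hypothetical environment name, requires AWS credentials):
#   import boto3
#   for line in find_install_errors(boto3.client("logs"), "my-mwaa-env"):
#       print(line)
```

If the log group does not exist, the real client raises an error, which itself suggests that logging is not enabled for that component or that the environment never got far enough to create it.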
