The scheduler does not appear to be running


Hello, I am trying to add my first DAG, but I am getting the following error: The scheduler does not appear to be running. Last heartbeat was received 3 weeks ago.

The DAGs list may not update, and new tasks will not be scheduled.

I am trying to verify that my dependencies were installed properly, as the documentation suggests. I am looking in the "Airflow Scheduler Log Group", but nothing appears under log streams, so I cannot search for "requirements_install_ip". Can someone help me?


2 Answers

We have experienced this problem multiple times in recent weeks. Tracing the error to the scheduler logs, we find that the scheduler is unable to write to the PostgreSQL metadata store. (On a local installation this would be fixed by running airflow db upgrade, but we are unable to do this on MWAA.)

psycopg2.errors.UndefinedColumn: column dag.has_import_errors does not exist

[...]

sqlalchemy.exc.InternalError: (psycopg2.errors.InFailedSqlTransaction) current transaction is aborted, commands ignored until end of transaction block

All steps recommended by @Arun were taken and in our case:

  • Dependency problems and failures in individual DAGs are ruled out: our production cluster went down but staging did not, and the two were identical apart from environment variable mappings
  • Updating the environment caused the webserver to break as well, with another error indicating a failure to communicate with internal metadata storage: sqlalchemy.exc.ProgrammingError: (psycopg2.errors.UndefinedTable) relation "session" does not exist

My theory is that this occurs because schedulers and webservers restore themselves at different rates, which leads to conflicts in database access.

In our case, the solution was to delete the cluster and start it again. As our infrastructure deployments are handled by Terraform, this was simple, but it led to ~2 hours of downtime on a prod cluster. Given that this is a managed service, we are distressed that issues like this are commonplace and seem to be impossible to predict or test for.

REMEMBER TO RETRIEVE YOUR ENVIRONMENT AND CONNECTION VARIABLES PRIOR TO DELETING/RESTARTING YOUR CLUSTER.

Rich M
answered 5 months ago

Hello,  

There could be many reasons for this issue. In general, many users have been able to work around it by following some or all of the steps below:

First, as mentioned in [1] (an extract follows), consider checking the network settings and other dependencies.

——————————

If the scheduler doesn't appear to be running, or the last "heart beat" was received several hours ago, your DAGs may not appear in Apache Airflow, and new tasks will not be scheduled.

We recommend the following steps:

  1. Confirm that your VPC security group allows inbound access to port 5432. This port is needed to connect to the Amazon Aurora PostgreSQL metadata database for your environment. After this rule is added, give Amazon MWAA a few minutes, and the error should disappear. To learn more, see [2].

  2. If the scheduler is not running, it might be due to a number of factors such as dependency installation failures [3], or an overloaded scheduler [4]. Confirm that your DAGs, plugins, and requirements are working correctly by viewing the corresponding log groups in CloudWatch Logs. To learn more, see Monitoring and metrics for Amazon Managed Workflows for Apache Airflow (MWAA) [5].

——————————
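The two checks in the extract above can also be scripted. The following is a minimal sketch, not an official procedure. It assumes the security group data comes from the EC2 describe_security_groups call, that the scheduler log group follows the airflow-<environment name>-Scheduler naming pattern, and that the environment name "MyEnv" and the group ID are placeholders. The function names are hypothetical. Note that the port check ignores the traffic source, so a matching rule still has to allow the right peers.

```python
def allows_inbound_port(security_group, port=5432):
    """Return True if any inbound rule of the group covers `port` over TCP.

    `security_group` is one entry of the "SecurityGroups" list returned by
    the EC2 describe_security_groups call. The source of the traffic is
    NOT checked here (assumption: you verify that separately).
    """
    for rule in security_group.get("IpPermissions", []):
        protocol = rule.get("IpProtocol")
        if protocol == "-1":  # "-1" means all protocols and all ports
            return True
        if protocol not in ("tcp", "6"):
            continue
        if rule.get("FromPort", -1) <= port <= rule.get("ToPort", -1):
            return True
    return False


def find_install_streams(logs_client, environment_name,
                         prefix="requirements_install"):
    """List scheduler log streams whose names start with `prefix`.

    `logs_client` is a CloudWatch Logs client, for example
    boto3.client("logs"). The log group name pattern is an assumption.
    """
    kwargs = {
        "logGroupName": f"airflow-{environment_name}-Scheduler",
        "logStreamNamePrefix": prefix,
    }
    names = []
    while True:
        page = logs_client.describe_log_streams(**kwargs)
        names.extend(s["logStreamName"] for s in page.get("logStreams", []))
        token = page.get("nextToken")
        if not token:
            return names
        kwargs["nextToken"] = token  # fetch the next page of results


# Possible usage (requires boto3 and AWS credentials; IDs are placeholders):
#   import boto3
#   groups = boto3.client("ec2").describe_security_groups(
#       GroupIds=["sg-0123456789abcdef0"])["SecurityGroups"]
#   print([allows_inbound_port(g) for g in groups])
#   print(find_install_streams(boto3.client("logs"), "MyEnv"))
```

If find_install_streams returns an empty list, one possible explanation (an assumption worth checking, not a diagnosis) is that scheduler logging is not enabled for the environment, in which case no streams would be written at all.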

Second, consider refreshing the environment by following the steps below:

——————————

Refreshing the MWAA Environment

  {Amazon MWAA console > Environments > [Environment-name] > Edit }

   Note: Clicking the Edit button and then the Next button will update the MWAA environment.

——————————

Finally, if the issue persists, please reach out to support engineering with the following information:

——————————

MWAA environment ARN

Logs of MWAA Environment (DAG processing, web server, scheduler, task and worker logs)

Sample DAG

Requirements and plugin files

——————————
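Part of the information listed above can be gathered programmatically. The sketch below is one possible approach, assuming the response shape of the MWAA get_environment API call ("Environment" containing "Arn" and "LoggingConfiguration"). The function name summarize_environment and the sample values are hypothetical. It returns the environment ARN together with the enabled flag and log group ARN for each log type.

```python
def summarize_environment(response):
    """Extract the ARN and per-log-type settings from a get_environment response.

    `response` is the dict returned by the MWAA get_environment call, for
    example boto3.client("mwaa").get_environment(Name="MyEnv").
    """
    env = response["Environment"]
    log_groups = {}
    for log_type, settings in env.get("LoggingConfiguration", {}).items():
        log_groups[log_type] = {
            "enabled": settings.get("Enabled", False),
            "log_group_arn": settings.get("CloudWatchLogGroupArn"),
        }
    return {"arn": env.get("Arn"), "log_groups": log_groups}


# Made-up sample response, for illustration only.
sample = {
    "Environment": {
        "Arn": "arn:aws:airflow:us-east-1:123456789012:environment/MyEnv",
        "LoggingConfiguration": {
            "SchedulerLogs": {
                "Enabled": True,
                "LogLevel": "INFO",
                "CloudWatchLogGroupArn": "arn:aws:logs:us-east-1:123456789012:"
                                         "log-group:airflow-MyEnv-Scheduler:*",
            },
            "WorkerLogs": {"Enabled": False, "LogLevel": "WARNING"},
        },
    }
}

summary = summarize_environment(sample)
for log_type, info in summary["log_groups"].items():
    print(log_type, "enabled" if info["enabled"] else "disabled")
```

A log type reported as disabled would not produce streams in CloudWatch Logs, so it may be worth enabling before collecting logs for a support case.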

————————

Reference:

[1] https://docs.aws.amazon.com/mwaa/latest/userguide/t-apache-airflow-11012.html#error-scheduler-11012

[2] https://docs.aws.amazon.com/mwaa/latest/userguide/vpc-security.html

[3] https://docs.aws.amazon.com/mwaa/latest/userguide/best-practices-dependencies.html

[4] https://docs.aws.amazon.com/mwaa/latest/userguide/best-practices-tuning.html

[5] https://docs.aws.amazon.com/mwaa/latest/userguide/cw-metrics.html

================  

Have a nice day!

SUPPORT ENGINEER
Arun
answered 5 months ago
