Glue job failure not shown in the metric aws.glue.glue_driver_aggregate_num_failed_tasks


When a Glue job fails, I expect the aws.glue.glue_driver_aggregate_num_failed_tasks metric to increase, but the count does not change. I went through the answers to earlier questions on this topic. They suggest adding a sleep so that the job runs for at least 30 seconds, but that did not help.

KG
asked 7 months ago, 296 views
1 Answer

Hello, I understand that you are looking for clarification on the values you have observed for the "glue.driver.aggregate.numFailedTasks" metric. This metric is not the best one for tracking failed Glue jobs, because it only counts failed Spark tasks. Any failed operation that was not performed as a Spark task is not reported in it. Also note that the overall job fails only if the same Spark task fails 4 times. In addition, Glue publishes metric data to CloudWatch every 30 seconds. A job run with an execution time of less than 30 seconds would not generate any metric data, so an alarm on "glue.driver.aggregate.numFailedTasks" would miss it.

User errors are also not counted by this metric. It only considers Spark-related task failures in the script. To see which tasks failed and why, check the CloudWatch logs linked from the job runs tab in the Glue console.
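If you would rather check run outcomes from a script than from the console, the sketch below lists the runs of a job that ended in FAILED or TIMEOUT, together with the error message Glue recorded. It is only an illustration and not part of the original workaround. It assumes that glue_client is a boto3 Glue client (boto3.client("glue")) and that the job name is your own; the function name failed_job_runs is made up for this example.

```python
def failed_job_runs(glue_client, job_name):
    """Return (run_id, state, error_message) for runs that failed or timed out.

    glue_client is assumed to be boto3.client("glue"); it is passed in so the
    function can also be exercised with a stub.
    """
    failed = []
    kwargs = {"JobName": job_name}
    while True:
        page = glue_client.get_job_runs(**kwargs)
        for run in page.get("JobRuns", []):
            if run.get("JobRunState") in ("FAILED", "TIMEOUT"):
                failed.append(
                    (run["Id"], run["JobRunState"], run.get("ErrorMessage", ""))
                )
        # Follow pagination until Glue stops returning a token.
        token = page.get("NextToken")
        if not token:
            return failed
        kwargs["NextToken"] = token
```

This reports the job run state itself, so it also covers failures that never show up as failed Spark tasks.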

As a workaround, you can be notified of a Glue job failure through an EventBridge rule (documented below) that uses the event pattern shown at the end of this answer. The rule must be created in the same region as your Glue resources: [+] Automating AWS Glue with CloudWatch Events - https://docs.aws.amazon.com/glue/latest/dg/automating-awsglue-with-cloudwatch-events.html

The events can then be sent to an SNS topic to alert you on job run failures. [+] Glue SNS Notifications - https://repost.aws/knowledge-center/glue-sns-notification-state

Sample Event Pattern:
{
  "source": ["aws.glue"],
  "detail-type": ["Glue Job State Change"],
  "detail": {
    "state": ["FAILED", "TIMEOUT"]
  }
}
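As a rough sketch of how the pattern above could be wired up from code rather than the console, the snippet below creates the rule and attaches an SNS topic as its target. It assumes that events_client is a boto3 EventBridge client (boto3.client("events")) created in the same region as the Glue job. The function name, the rule name, the target Id "glue-failure-sns" and the topic ARN are placeholders chosen for this example, not values from the answer.

```python
import json

# Event pattern from the answer: match Glue job runs ending in FAILED or TIMEOUT.
GLUE_FAILURE_PATTERN = {
    "source": ["aws.glue"],
    "detail-type": ["Glue Job State Change"],
    "detail": {"state": ["FAILED", "TIMEOUT"]},
}


def create_failure_rule(events_client, rule_name, topic_arn):
    """Create the EventBridge rule and point it at an SNS topic.

    events_client is assumed to be boto3.client("events"); rule_name and
    topic_arn are supplied by the caller.
    """
    events_client.put_rule(
        Name=rule_name,
        EventPattern=json.dumps(GLUE_FAILURE_PATTERN),
        State="ENABLED",
    )
    events_client.put_targets(
        Rule=rule_name,
        # "glue-failure-sns" is an arbitrary target Id chosen for this sketch.
        Targets=[{"Id": "glue-failure-sns", "Arn": topic_arn}],
    )
    return rule_name
```

Note that the SNS topic's access policy must allow events.amazonaws.com to publish to it, otherwise the rule will match but no notification is delivered.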

If you still face issues with this metric, please raise a support case with the AWS Premium Support team for more specific troubleshooting.

answered 7 months ago
EXPERT
reviewed a month ago
