Hello, I understand that you are looking for clarification on the values you have observed for the "glue.driver.aggregate.numFailedTasks" metric. This metric is not the best way to track failed Glue jobs, for three reasons. First, the overall job fails only if the same Spark task fails 4 times, so a job can report failed tasks and still succeed. Second, the metric tracks only failed Spark tasks; any failed operation that was not performed as a Spark task is not reported. Third, Glue publishes metric data to CloudWatch every 30 seconds, so a job run that finishes in less than 30 seconds generates no metric data and would be missed by an alarm on this metric.
Also, failures caused by user errors are not counted by this metric; it covers only the Spark-related failures in the script. To see which tasks failed and why, check the CloudWatch logs under the job runs tab in the Glue console.
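To illustrate why a nonzero failed-task count does not by itself mean a failed job, here is a minimal sketch of the retry rule described above. It is a toy model, not Glue's or Spark's actual code: the function name, the task ids, and the assumption that each failed attempt adds one to the count are all hypothetical.

```python
# Toy model of the retry rule described above. Not Glue's metric pipeline.
MAX_TASK_FAILURES = 4  # per the answer: the job fails once the same task fails 4 times


def summarize(task_failures):
    """Return (total failed attempts, whether the job would fail).

    task_failures maps a hypothetical task id to its number of failed attempts.
    Assumption: every failed attempt counts once toward the failed-task total.
    """
    total_failed = sum(task_failures.values())
    job_failed = any(n >= MAX_TASK_FAILURES for n in task_failures.values())
    return total_failed, job_failed


# Five failed attempts spread over two tasks: the job still succeeds.
print(summarize({"task-0": 3, "task-1": 2}))  # (5, False)
# Four failed attempts on a single task: the job fails.
print(summarize({"task-0": 4}))  # (4, True)
```

Under these assumptions, an alarm on a simple threshold of this count would fire for the first run and not distinguish it from the second.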
As a workaround, you can be notified of a Glue job failure by using an EventBridge rule (documented below) with the event pattern shown below. The rule must be created in the same region as your Glue resources: [+] Automating AWS Glue with CloudWatch Events - https://docs.aws.amazon.com/glue/latest/dg/automating-awsglue-with-cloudwatch-events.html
The events can then be sent to an SNS topic to alert you on job run failures. [+] Glue SNS Notifications - https://repost.aws/knowledge-center/glue-sns-notification-state
Sample Event Pattern:
{
  "source": ["aws.glue"],
  "detail-type": ["Glue Job State Change"],
  "detail": {
    "state": ["FAILED", "TIMEOUT"]
  }
}
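If you prefer to create the rule from a script rather than the console, the following is a minimal sketch using boto3. The rule name, target id, topic ARN, and region are placeholders I have assumed for illustration; substitute your own values. boto3 is imported only inside create_rule, so the pattern can be inspected without AWS credentials.

```python
import json

# Event pattern from the answer above.
EVENT_PATTERN = {
    "source": ["aws.glue"],
    "detail-type": ["Glue Job State Change"],
    "detail": {"state": ["FAILED", "TIMEOUT"]},
}


def rule_request(rule_name):
    """Build the keyword arguments for the EventBridge put_rule call."""
    return {
        "Name": rule_name,
        "EventPattern": json.dumps(EVENT_PATTERN),
        "State": "ENABLED",
    }


def create_rule(rule_name, topic_arn, region):
    """Create the rule and attach an SNS topic as its target.

    Requires boto3 and AWS credentials. The region should be the one
    that holds your Glue resources.
    """
    import boto3  # imported lazily so the rest of the file runs without boto3

    events = boto3.client("events", region_name=region)
    events.put_rule(**rule_request(rule_name))
    # "sns-alert" is an arbitrary, hypothetical target id.
    events.put_targets(Rule=rule_name, Targets=[{"Id": "sns-alert", "Arn": topic_arn}])


if __name__ == "__main__":
    # Print the pattern that would be sent; no AWS call is made here.
    print(rule_request("glue-job-failure")["EventPattern"])
```

Note that the SNS topic's access policy must allow events.amazonaws.com to publish to it, otherwise the rule will match events but no notification will be delivered.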
If you still face issues with this metric, please open a support case with AWS Premium Support for more specific troubleshooting.