We are using something similar to the following Lambda function to collect custom Glue metrics, as described in this article:
https://medium.com/@ettefette/metrics-for-aws-glue-jobs-as-you-know-them-from-lambda-functions-e5e1873c615c
However, the number of failed tasks shown in the Glue console differs from what CloudWatch reports when we query Count('Glue glue.driver.aggregate.numFailedTasks').
import boto3


def handler(event, context):
    job_name = event["detail"]["jobName"]
    job_run_id = event["detail"]["jobRunId"]
    cloudwatch = boto3.client("cloudwatch", region_name="eu-central-1")
    # Only react to Glue job state change events.
    if event["detail-type"] == "Glue Job State Change":
        job_status = event["detail"]["state"]
        if job_status not in ["SUCCEEDED", "FAILED", "TIMEOUT", "STOPPED"]:
            raise AttributeError("Job state is not supported.")
        # 1.0 for a successful run, 0.0 for any other final state.
        if job_status == "SUCCEEDED":
            metric_value = 1.0
        else:
            metric_value = 0.0
        # Publish one data point per job run to the "Glue" namespace.
        cloudwatch.put_metric_data(
            MetricData=[
                {
                    "MetricName": "JobStatus",
                    "Dimensions": [
                        {"Name": "JobName", "Value": job_name},
                        {"Name": "JobRunId", "Value": job_run_id},
                        {"Name": "JobStatus", "Value": job_status},
                    ],
                    "Unit": "None",
                    "Value": metric_value,
                }
            ],
            Namespace="Glue",
        )
=======================
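For reference, here is a minimal sketch of how the CloudWatch side of the comparison could be queried, requesting both Sum and SampleCount so the two figures can be checked against the console. The helper names (build_failed_tasks_query, fetch_failed_tasks), the job name and the run ID are hypothetical. The "Type" = "count" dimension and the one-minute period are assumptions about how the built-in Glue metric is published, so adjust them to match what the CloudWatch console shows for the metric.

```python
from datetime import datetime, timedelta, timezone


def build_failed_tasks_query(job_name, job_run_id, start, end, period=60):
    """Build keyword arguments for CloudWatch get_metric_statistics.

    Hypothetical helper: the dimension set (JobName, JobRunId, Type) is an
    assumption and should be checked against the metric in the console.
    """
    return {
        "Namespace": "Glue",
        "MetricName": "glue.driver.aggregate.numFailedTasks",
        "Dimensions": [
            {"Name": "JobName", "Value": job_name},
            {"Name": "JobRunId", "Value": job_run_id},
            {"Name": "Type", "Value": "count"},
        ],
        "StartTime": start,
        "EndTime": end,
        "Period": period,
        # Sum adds up the reported values; SampleCount only counts data points.
        "Statistics": ["Sum", "SampleCount"],
    }


def fetch_failed_tasks(params, region_name="eu-central-1"):
    """Run the query and total both statistics (needs AWS credentials)."""
    import boto3  # imported here so building the query works without boto3

    cloudwatch = boto3.client("cloudwatch", region_name=region_name)
    datapoints = cloudwatch.get_metric_statistics(**params)["Datapoints"]
    return {
        "sum": sum(d["Sum"] for d in datapoints),
        "sample_count": sum(d["SampleCount"] for d in datapoints),
    }


# Example parameters for the last hour; job name and run ID are placeholders.
end = datetime.now(timezone.utc)
params = build_failed_tasks_query(
    "my-glue-job", "jr_example", end - timedelta(hours=1), end
)
```

Calling fetch_failed_tasks(params) returns both totals. If they differ, the statistic selected in CloudWatch may be part of the discrepancy.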
Any ideas?