CW Metric Filter to group Glue crawler logs by crawl id

Hi! I read through https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/FilterAndPatternSyntax.html for CW metric filter syntax and don't think this is possible, but wanted to ask anyway in case anyone else has the same use case. Our Glue Crawler runs drop logs into CloudWatch in the following format:

2023-03-31T09:00:40.477-07:00	[1bd316e6-9020-456c-9c3e-7c96f80b6b6a] BENCHMARK : Running Start Crawl for Crawler staging_table
2023-03-31T09:00:40.825-07:00	[1bd316e6-9020-456c-9c3e-7c96f80b6b6a] INFO : The crawl is running by consuming Amazon S3 events.
2023-03-31T09:00:41.323-07:00	[1bd316e6-9020-456c-9c3e-7c96f80b6b6a] INFO : The number of messages in the SQS queue arn:aws:sqs:us-west-2:xxxxxxxxx:staging-crawler-queue is 8
2023-03-31T09:00:41.617-07:00	[1bd316e6-9020-456c-9c3e-7c96f80b6b6a] INFO : The number of unique events received is 2 for the target with database: staging
2023-03-31T09:02:48.853-07:00	[1bd316e6-9020-456c-9c3e-7c96f80b6b6a] BENCHMARK : Classification complete, writing results to database staging
2023-03-31T09:02:48.880-07:00	[1bd316e6-9020-456c-9c3e-7c96f80b6b6a] INFO : Crawler configured with Configuration {"Version":1.0,"CrawlerOutput":{"Partitions":{"AddOrUpdateBehavior":"InheritFromTable"}},"Grouping":{"TableGroupingPolicy":"CombineCompatibleSchemas"}} and SchemaChangePolicy {"UpdateBehavior":"LOG","DeleteBehavior":"LOG"}. Note that values in the Configuration override values in the SchemaChangePolicy for S3 Targets.
2023-03-31T09:08:29.205-07:00	[1bd316e6-9020-456c-9c3e-7c96f80b6b6a] INFO : Some files do not match the schema detected. Remove or exclude the following files from the crawler (truncated to first 200 files): staging-xxxxxxxxxxxx-us-west-2-prod/xxx/organization_id=xxxx/title_id=xxxxxxxx/land_date=2022-12-07/land_hour=17/abcdefghij.gz
2023-03-31T09:08:38.075-07:00	[1bd316e6-9020-456c-9c3e-7c96f80b6b6a] INFO : Discovered schema changes for Table staging_table in database staging
2023-03-31T09:08:54.188-07:00	[1bd316e6-9020-456c-9c3e-7c96f80b6b6a] INFO : Created partitions with values [[xxx, xxxxxx, abcde, 15], [yyy, yyyyyy, xyz, 16]] for table staging_table in database staging
2023-03-31T09:09:15.901-07:00	[1bd316e6-9020-456c-9c3e-7c96f80b6b6a] BENCHMARK : Finished writing to Catalog
2023-03-31T09:09:15.945-07:00	[1bd316e6-9020-456c-9c3e-7c96f80b6b6a] INFO : Run Summary For PARTITION:
2023-03-31T09:09:15.945-07:00	[1bd316e6-9020-456c-9c3e-7c96f80b6b6a] INFO : ADD: 2
2023-03-31T09:10:24.005-07:00	[1bd316e6-9020-456c-9c3e-7c96f80b6b6a] BENCHMARK : Crawler has finished running and is in state READY

Is it possible to create a metric filter that can group by the crawl id, i.e. the 1bd316e6-9020-456c-9c3e-7c96f80b6b6a value in the logs above? Given that each log line has a timestamp, is it possible to group these logs by the crawl id, extract the duration between the first and last occurrences, and have it reported as a metric with the crawl id as a dimension? Glue does not publish any CloudWatch metrics for the crawler, and this is one option I'm exploring so we can monitor and visualize our crawl times over time.
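For context, the kind of grouping I'm after could be sketched outside of metric filters, e.g. in a Python script or Lambda subscribed to the log group (this is a hypothetical fallback I'm considering, not an existing implementation; the function and variable names are my own):

```python
import re
from datetime import datetime

# Matches the crawler log format shown above:
#   <ISO-8601 timestamp with offset> \t [<36-char crawl id>] LEVEL : message
LINE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d{3}[+-]\d{2}:\d{2})\s+"
    r"\[(?P<crawl_id>[0-9a-f-]{36})\]"
)

def crawl_durations(lines):
    """Return {crawl_id: seconds between first and last log line for that crawl}."""
    first, last = {}, {}
    for line in lines:
        m = LINE_RE.match(line)
        if not m:
            continue
        ts = datetime.fromisoformat(m.group("ts"))
        cid = m.group("crawl_id")
        first.setdefault(cid, ts)  # keep the earliest timestamp per crawl
        last[cid] = ts             # overwrite so we end with the latest
    return {cid: (last[cid] - first[cid]).total_seconds() for cid in first}
```

The resulting per-crawl durations could then be pushed to CloudWatch with `PutMetricData`, using the crawl id (or just the crawler name) as a dimension. That's obviously more moving parts than a metric filter, which is why I'm asking whether the filter syntax alone can express this.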
