Does the crawler metric LastRuntimeSeconds have a limitation?

Hi everyone, I need a Bash script that gives me the maximum runtime of each crawler. It seems that 'LastRuntimeSeconds' only gives the latest runtime. How can I get the maximum runtime for each crawler? I want to use Bash or Python, not the AWS console. (I used aws glue get-crawler-metrics with --query 'CrawlerMetricsList[*].LastRuntimeSeconds', but it returned only one runtime.) Thank you.

Asked 1 year ago · 229 views
1 Answer
To get the maximum runtime of each AWS Glue crawler over a period, you can query Amazon CloudWatch, assuming Glue publishes a runtime metric there for each crawler run. Querying that metric with the Maximum statistic gives the longest run per crawler. Here's how to do this with the AWS CLI and a Python script.

Using AWS CloudWatch Metrics

1. List all Crawlers:

  • First, list all the crawlers in your AWS Glue.

2. Get CloudWatch Metrics:

  • For each crawler, query the CloudWatch metrics to get the maximum runtime over a specified period.

Step-by-Step Guide

Step 1: List All Crawlers

You can list all your AWS Glue crawlers using the aws glue list-crawlers command.

aws glue list-crawlers --query 'CrawlerNames' --output text

Step 2: Query CloudWatch Metrics

You can then use the CloudWatch get-metric-statistics command to query the Glue Crawler Metrics for the maximum runtime.
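Before querying statistics, it can help to confirm that CloudWatch actually lists a metric with that name for the crawler. The following is a minimal sketch, assuming the 'Glue' namespace, the 'CrawlerRunTime' metric name, and the 'CrawlerName' dimension used in this answer (none of these are verified here). The helper name metric_is_listed is hypothetical. Keep in mind that list_metrics omits metrics that have not reported data in the past two weeks.

```python
def metric_is_listed(cloudwatch, crawler_name,
                     namespace='Glue', metric_name='CrawlerRunTime'):
    """Return True if CloudWatch lists the metric for this crawler.

    The namespace, metric name, and dimension name are assumptions taken
    from this answer, not verified values.
    """
    response = cloudwatch.list_metrics(
        Namespace=namespace,
        MetricName=metric_name,
        Dimensions=[{'Name': 'CrawlerName', 'Value': crawler_name}],
    )
    # An empty 'Metrics' list means nothing matching was found
    return bool(response.get('Metrics'))


# Usage (requires boto3 and AWS credentials; 'my-crawler' is a placeholder):
#   import boto3
#   print(metric_is_listed(boto3.client('cloudwatch'), 'my-crawler'))
```

If this returns False for every crawler, get-metric-statistics will most likely return empty Datapoints as well.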

Example Python Script

Here’s a Python script that accomplishes this:

import boto3
from datetime import datetime, timedelta, timezone


def get_max_runtime(crawler_name, cloudwatch, start_time, end_time):
    """Return the largest daily Maximum for the crawler, or None if there is no data."""
    response = cloudwatch.get_metric_statistics(
        # Namespace and metric name are as given in this answer; if Datapoints
        # comes back empty, check that this metric exists in your account.
        Namespace='Glue',
        MetricName='CrawlerRunTime',
        Dimensions=[{'Name': 'CrawlerName', 'Value': crawler_name}],
        StartTime=start_time,
        EndTime=end_time,
        Period=86400,  # One day in seconds
        Statistics=['Maximum'],
    )
    datapoints = response.get('Datapoints', [])
    # One datapoint per day; take the largest of the daily maxima
    return max((dp['Maximum'] for dp in datapoints), default=None)


def list_crawler_names(glue):
    """Collect all crawler names, following NextToken across pages."""
    names = []
    kwargs = {}
    while True:
        page = glue.list_crawlers(**kwargs)
        names.extend(page['CrawlerNames'])
        token = page.get('NextToken')
        if not token:
            return names
        kwargs['NextToken'] = token


def main():
    glue = boto3.client('glue')
    cloudwatch = boto3.client('cloudwatch')

    # Define the time period for the metrics (timezone-aware UTC)
    end_time = datetime.now(timezone.utc)
    start_time = end_time - timedelta(days=30)  # Last 30 days

    for crawler in list_crawler_names(glue):
        max_runtime = get_max_runtime(crawler, cloudwatch, start_time, end_time)
        if max_runtime is None:
            print(f"Crawler: {crawler}, no datapoints in the selected period")
        else:
            print(f"Crawler: {crawler}, Max Runtime: {max_runtime} seconds")


if __name__ == "__main__":
    main()

Explanation

1. AWS SDK Initialization:

  • Initialize the AWS Glue and CloudWatch clients using boto3.

2. Get Crawler Names:

  • List all crawlers using glue.list_crawlers().

3. Get Maximum Runtime for Each Crawler:

  • For each crawler, query the CloudWatch CrawlerRunTime metric.
  • Specify the StartTime and EndTime to define the period for which you want to get the metrics.
  • Use the Maximum statistic to get the maximum runtime.

4. Output:

  • Print the maximum runtime for each crawler.
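The StartTime, EndTime, and Period values interact: a single get_metric_statistics call returns at most 1,440 datapoints, so a long window needs a large enough period. The following is a small sketch of building the window with that check. The helper name build_window and its parameters are hypothetical, and the multiple-of-60 rule applies to standard-resolution metrics.

```python
from datetime import datetime, timedelta, timezone

MAX_DATAPOINTS = 1440  # per-call limit of get_metric_statistics


def build_window(days, period=86400, now=None):
    """Return (start_time, end_time, period) covering the last `days` days."""
    if period % 60 != 0:
        raise ValueError("period must be a multiple of 60 seconds")
    end_time = now or datetime.now(timezone.utc)
    start_time = end_time - timedelta(days=days)
    # Ceiling division: number of period-sized buckets in the window
    datapoints = -(-(days * 86400) // period)
    if datapoints > MAX_DATAPOINTS:
        raise ValueError("window would exceed 1,440 datapoints; use a larger period")
    return start_time, end_time, period
```

With the defaults, a 30-day window yields 30 daily datapoints, which is well within the limit.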

Running the Script

Make sure boto3 is installed and your AWS credentials are configured with permission to call glue:ListCrawlers and cloudwatch:GetMetricStatistics. Then run the script:

pip install boto3
python script_name.py

This script will provide the maximum runtime for each Glue crawler over the last 30 days. You can adjust the start_time and end_time variables to modify the time range as needed.

EXPERT
Answered 1 year ago
EXPERT
Reviewed 1 year ago
  • Thank you. However, the response contains 'Datapoints': [], so the output shows 'None seconds' for every crawler. Do you have any suggestions?
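If CloudWatch returns no datapoints, one alternative worth trying is the Glue crawl history. The ListCrawls API (list_crawls in boto3) returns past crawls per crawler, each with a StartTime and EndTime, so the maximum runtime can be computed directly. The following is a minimal sketch, assuming crawl history is available for your crawlers and that your boto3 version includes list_crawls. The helper names and the sample records are hypothetical.

```python
from datetime import datetime, timezone

# Made-up sample records in the shape of the 'Crawls' entries (illustration only)
EXAMPLE_CRAWLS = [
    {'StartTime': datetime(2024, 1, 1, 10, 0, tzinfo=timezone.utc),
     'EndTime': datetime(2024, 1, 1, 10, 5, tzinfo=timezone.utc)},
    {'StartTime': datetime(2024, 1, 2, 11, 0, tzinfo=timezone.utc),
     'EndTime': datetime(2024, 1, 2, 11, 20, tzinfo=timezone.utc)},
    {'StartTime': datetime(2024, 1, 3, 12, 0, tzinfo=timezone.utc)},  # still running
]


def max_crawl_seconds(crawls):
    """Longest duration in seconds among crawls that have both timestamps."""
    durations = [
        (c['EndTime'] - c['StartTime']).total_seconds()
        for c in crawls
        if c.get('StartTime') and c.get('EndTime')
    ]
    return max(durations, default=None)


def fetch_crawls(glue, crawler_name):
    """Collect the crawl history for one crawler, following NextToken."""
    crawls, kwargs = [], {'CrawlerName': crawler_name}
    while True:
        page = glue.list_crawls(**kwargs)
        crawls.extend(page.get('Crawls', []))
        token = page.get('NextToken')
        if not token:
            return crawls
        kwargs['NextToken'] = token


# Usage (requires boto3 and AWS credentials):
#   import boto3
#   glue = boto3.client('glue')
#   for name in glue.list_crawlers()['CrawlerNames']:
#       print(name, max_crawl_seconds(fetch_crawls(glue, name)))
```

Crawls without an EndTime (for example, ones still running) are skipped, and a crawler with no finished crawls yields None.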
