Glue job limit 2000


We are utilizing AWS Glue as our ETL engine to transfer data from on-premises and cloud databases to an S3 bucket. Currently, we have a Lambda function that triggers a Glue job, but each time we invoke the Lambda function, it runs a separate Glue job for each individual table. Given that we have a large number of tables and Glue only supports a maximum of 2000 concurrent jobs, this has led to issues.

How can we resolve this? Is it possible to configure a single Glue job to process all the tables within the database?

2 Answers

The issue you're facing with AWS Glue job limits is a common challenge when dealing with a large number of tables. While the default limit for concurrent job runs per account is actually 30 (not 2000 as mentioned), there are several strategies you can employ to optimize your ETL process and work within the service quotas:

  1. Job Queuing: Enable job queuing for your Glue jobs. This feature automatically queues job runs that exceed the concurrent job limit, putting them in a "WAITING" state until resources become available. To enable this in the AWS Glue Studio console, go to the Job details tab of your job and select "Enable job runs to be queued to run later when they cannot run immediately due to service quotas."

  2. Batch Processing: Instead of triggering a separate Glue job for each table, consider modifying your approach to process multiple tables within a single Glue job. You can create a more dynamic ETL script that iterates through a list of tables and processes them in batches. This approach can significantly reduce the number of concurrent jobs needed.

  3. Workflow Orchestration: Utilize AWS Glue Workflows to orchestrate your ETL process. A workflow can manage the execution of multiple jobs, crawlers, and triggers in a coordinated manner, helping you stay within service quotas while processing all your tables.

  4. Increase Quotas: If you consistently need more concurrent runs, consider requesting a quota increase from AWS support. You can review your current quotas in the Service Quotas console and request increases for limits such as "Max concurrent job runs per account."

  5. Optimize Job Resource Allocation: Ensure you're efficiently using resources within each job. You might be able to process multiple tables within a single job run by allocating more Data Processing Units (DPUs) to the job.

  6. Implement a Backoff Strategy: In your Lambda function, implement a backoff strategy that spaces out job submissions so you avoid overwhelming the system and hitting concurrency limits; a minimal sketch follows this list.
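
As a minimal sketch of item 6 (the function name and retry settings below are illustrative choices, not part of any AWS API), the Lambda function could wrap start_job_run in a retry loop that backs off whenever Glue reports the concurrency quota has been reached:

import json
import random
import time

import boto3

glue = boto3.client('glue')

def start_job_with_backoff(job_name, arguments, max_attempts=5):
    """Start a Glue job run, retrying with exponential backoff when the concurrency quota is hit."""
    for attempt in range(max_attempts):
        try:
            return glue.start_job_run(JobName=job_name, Arguments=arguments)
        except glue.exceptions.ConcurrentRunsExceededException:
            # Quota reached: wait exponentially longer each attempt, plus jitter, then retry
            time.sleep((2 ** attempt) + random.random())
    raise RuntimeError(f"Could not start {job_name} after {max_attempts} attempts")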

By implementing these strategies, particularly focusing on batch processing and workflow orchestration, you should be able to process all your tables efficiently while staying within AWS Glue's service quotas. Remember to monitor your job runs closely and adjust your approach as needed based on performance and quota utilization.

Sources
Glue queue: Max concurrent runs exceeded | AWS re:Post
Introducing job queuing to scale your AWS Glue workloads | AWS Big Data Blog
Converting ETL processes to AWS Glue in AWS Schema Conversion Tool - AWS Schema Conversion Tool

answered 22 days ago
reviewed by an AWS Expert 22 days ago

Yes, you can definitely configure a single AWS Glue job to process multiple tables, which would be a more efficient approach. Here are several ways to implement this:

  1. Catalog Listing Approach
import sys
import boto3
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

# Initialize context and job
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
args = getResolvedOptions(sys.argv, ['JOB_NAME'])
job.init(args['JOB_NAME'], args)

# List all tables in the database via the Glue Data Catalog API
database = "your_database_name"
glue_client = boto3.client('glue')
paginator = glue_client.get_paginator('get_tables')
table_names = [
    table['Name']
    for page in paginator.paginate(DatabaseName=database)
    for table in page['TableList']
]

# Process each table
for table_name in table_names:
    # Read data from source
    dynamic_frame = glueContext.create_dynamic_frame.from_catalog(
        database=database,
        table_name=table_name
    )
    
    # Your transformation logic here
    
    # Write to S3
    glueContext.write_dynamic_frame.from_options(
        frame=dynamic_frame,
        connection_type="s3",
        connection_options={
            "path": f"s3://your-bucket/{table.name}/"
        },
        format="parquet"
    )

job.commit()
  2. Parameter-Based Approach
import sys
import json
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

# Initialize context and job
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)

# Get job parameters and initialize the job
args = getResolvedOptions(sys.argv, ['JOB_NAME', 'table_list'])
job.init(args['JOB_NAME'], args)
table_list = json.loads(args['table_list'])

# Process each table from the parameter
for table in table_list:
    dynamic_frame = glueContext.create_dynamic_frame.from_catalog(
        database="your_database",
        table_name=table
    )
    
    # Your transformation logic here
    
    # Write to S3
    glueContext.write_dynamic_frame.from_options(
        frame=dynamic_frame,
        connection_type="s3",
        connection_options={
            "path": f"s3://your-bucket/{table}/"
        },
        format="parquet"
    )

job.commit()
  3. Configuration File Approach
import sys
import boto3
import json
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

def read_config_from_s3(bucket, key):
    s3 = boto3.client('s3')
    response = s3.get_object(Bucket=bucket, Key=key)
    return json.loads(response['Body'].read().decode('utf-8'))

# Initialize context and job
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
args = getResolvedOptions(sys.argv, ['JOB_NAME'])
job.init(args['JOB_NAME'], args)

# Read configuration
config = read_config_from_s3('your-config-bucket', 'config.json')

# Process tables based on configuration
for table_config in config['tables']:
    dynamic_frame = glueContext.create_dynamic_frame.from_catalog(
        database=table_config['database'],
        table_name=table_config['table_name']
    )
    
    # Apply transformations based on config
    if 'transformations' in table_config:
        # Apply custom transformations
        pass
    
    # Write to S3
    glueContext.write_dynamic_frame.from_options(
        frame=dynamic_frame,
        connection_type="s3",
        connection_options={
            "path": table_config['target_path']
        },
        format=table_config.get('format', 'parquet')
    )

job.commit()
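
For reference, a config.json containing the fields the script above reads might look like the following; the bucket, database, and table names are placeholders, not values from any real environment:

{
    "tables": [
        {
            "database": "sales_db",
            "table_name": "orders",
            "target_path": "s3://your-bucket/orders/",
            "format": "parquet"
        },
        {
            "database": "sales_db",
            "table_name": "customers",
            "target_path": "s3://your-bucket/customers/"
        }
    ]
}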
  4. Lambda Modification
import json

import boto3

def lambda_handler(event, context):
    glue = boto3.client('glue')
    
    # Start single Glue job with table list
    response = glue.start_job_run(
        JobName='your_multi_table_job',
        Arguments={
            '--table_list': json.dumps(['table1', 'table2', 'table3'])
        }
    )
    
    return {
        'statusCode': 200,
        'body': json.dumps(f"Started job run: {response['JobRunId']}")
    }

This approach will:

  • Reduce the number of concurrent Glue jobs
  • Improve resource utilization
  • Make the process more manageable
  • Allow for better error handling and monitoring
  • Provide more flexibility in how tables are processed

Remember to properly size your Glue job resources based on the total data volume you're processing, as you're now handling multiple tables in a single job.
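
If pushing every table through a single run makes that run too long, a middle ground is to have the Lambda function split the table list into a few batches and start one run of the multi-table job per batch. The sketch below assumes a hypothetical job name, a batch size you would tune yourself, and a job whose maximum concurrency setting allows several parallel runs; it is an illustration rather than a drop-in implementation.

import json

import boto3

glue = boto3.client('glue')

def start_batched_runs(job_name, tables, batch_size=50):
    """Split the table list into batches and start one Glue job run per batch."""
    run_ids = []
    for i in range(0, len(tables), batch_size):
        batch = tables[i:i + batch_size]
        response = glue.start_job_run(
            JobName=job_name,
            Arguments={'--table_list': json.dumps(batch)}
        )
        run_ids.append(response['JobRunId'])
    return run_ids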

answered by an AWS Expert 22 days ago
