How to schedule Glue notebook to run with job bookmark enabled?

I have written an ETL job in AWS Glue using an interactive notebook and I want to enable the job bookmark to avoid reprocessing already processed data. The source data are in an S3 bucket, a Glue Data Catalog table has been created by a crawler, and the transformed data are written to a target S3 bucket.

This is how the code in the notebook looks. If I run the cells manually for every run and enable the job bookmark from the cell magic as shown below, the job bookmark works as expected.

%%configure
{
  "JOB_NAME": "etl_job",
  "job-bookmark-option": "job-bookmark-enable"
} 

import sys
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from awsglue.job import Job
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ['JOB_NAME'])

sc = SparkContext.getOrCreate()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

glue_df = glueContext.create_dynamic_frame.from_catalog(
    database="my-database", 
    table_name="interim_data",
    transformation_ctx = "datasource0"
)

# Convert DynamicFrame to DataFrame
spark_df = glue_df.toDF()

....

DyF = DynamicFrame.fromDF(spark_df, glueContext, "etl_convert")
s3output = glueContext.getSink(
  path="s3://target_bucket/clean/",
  connection_type="s3",
  updateBehavior="UPDATE_IN_DATABASE",
  partitionKeys=[],
  compression="snappy",
  enableUpdateCatalog=True,
  transformation_ctx="s3output_final_step",
)
s3output.setCatalogInfo(
  catalogDatabase="my-database", catalogTableName="clean_data"
)
s3output.setFormat("glueparquet")
s3output.writeFrame(DyF)

job.commit()

However, if I save the notebook, close it, and then run the job from the console, the job bookmark is not enabled. I have even tried running the job with the bookmark parameters set in the console, and that still doesn't work.

Ideally, I would like to schedule the job to run once or twice a week, but I am not sure how to do that while keeping the job bookmark enabled. I have seen from this link: https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-glue-arguments.html that I can pass parameters from the AWS CLI.

Should I write a Lambda function that runs on a schedule with EventBridge and have it execute a command like this one?

$ aws glue start-job-run --job-name "CSV to Parquet" --arguments '{"--job-bookmark-option":"job-bookmark-enable"}'

2 Answers

To run the saved notebook job on a schedule with job bookmarking enabled, use the AWS CLI or SDK to start the job run and pass the --job-bookmark-option parameter with the value job-bookmark-enable. Simply saving the notebook and running the job from the console does not carry over the job bookmarking option that was enabled when running the notebook interactively.
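The SDK route the answer mentions can be sketched in Python with boto3. This is a hedged example, not the asker's exact setup: the job name "etl_job" is taken from the question's %%configure cell, and the helper names are illustrative.

```python
def bookmark_arguments():
    """Run arguments that turn on the Glue job bookmark for a single run."""
    return {"--job-bookmark-option": "job-bookmark-enable"}

def start_job_with_bookmark(job_name="etl_job"):
    # boto3 is imported lazily so the pure helper above can be used
    # (and tested) without AWS credentials being configured.
    import boto3
    glue = boto3.client("glue")
    run = glue.start_job_run(JobName=job_name, Arguments=bookmark_arguments())
    return run["JobRunId"]
```

The same arguments map is what the CLI's --arguments flag carries as JSON.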

AWS EXPERT
answered 2 months ago

Yes, you can write a Lambda function, triggered on a schedule by EventBridge, that starts a Glue job run with the job bookmark enabled.

To enable job bookmarking when starting a Glue job run via the CLI, you need to pass the --job-bookmark-option parameter with the value job-bookmark-enable, as shown in your example command. This will tell Glue to track state and prevent reprocessing of old data each time the job runs.

So your proposed approach of using EventBridge to trigger a Lambda function on a schedule, which then starts the job run, would work to run your ETL job periodically while taking advantage of Glue job bookmarks.
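A minimal sketch of that Lambda handler, assuming a job named "etl_job" and the standard boto3 Glue client (both the job name and the returned payload shape are assumptions, not from the answer):

```python
def start_bookmarked_run(glue_client, job_name="etl_job"):
    """Start a Glue job run with the job bookmark enabled; returns the run id."""
    run = glue_client.start_job_run(
        JobName=job_name,
        Arguments={"--job-bookmark-option": "job-bookmark-enable"},
    )
    return run["JobRunId"]

def lambda_handler(event, context):
    # boto3 ships with the Lambda Python runtime; it is imported here so
    # the pure helper above stays testable without AWS credentials.
    import boto3
    run_id = start_bookmarked_run(boto3.client("glue"))
    return {"statusCode": 200, "jobRunId": run_id}
```

EventBridge would then invoke this handler on a rate or cron schedule, for example rate(7 days) for a weekly run.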

AWS EXPERT
answered 2 months ago
