Can we have one job that loads multiple tables, with separate bookmarks storing the last executed file for each?

0

I need to load multiple tables from S3 into an RDS database. Can I create just one job and pass the table names as parameters?
Since I have a large number of tables to load, it would be tedious to create one job per table.

For example, there are two S3 paths, s3://my_bucket/table_A and s3://my_bucket/table_B, with Parquet files generated every hour. I need to load the data from S3 into the table_A and table_B tables respectively, and also track the last processed file for each. I know this is possible, but will the job bookmark save the last executed file for both?

Any other way to achieve this?

4 Answers
0

Hello,

In order to achieve this use case, here is one option:

It involves leveraging Job Parameters + Job Bookmarks + a little bit of coding:

Leverage Job Parameters to programmatically pass different arguments to the job (in this case, the S3 paths or prefixes that point to each dataset to be read). The arguments are then retrieved in the script via getResolvedOptions (https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-crawler-pyspark-extensions-get-resolved-options.html).
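For illustration, retrieving such parameters could look like the following sketch (the parameter names source_path and target_table are assumptions, not something defined by Glue):

import sys
from awsglue.utils import getResolvedOptions

# Hypothetical parameter names -- pass them to the job run as
# --source_path and --target_table.
args = getResolvedOptions(sys.argv, ["JOB_NAME", "source_path", "target_table"])

source_path = args["source_path"]      # e.g. s3://my_bucket/table_A
target_table = args["target_table"]    # e.g. table_A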

In terms of Job Bookmarks, this shouldn't be a problem as long as each datasource being read (i.e., each S3 path containing one of the datasets in question) has its own dedicated transformation_ctx property. That is, the value of transformation_ctx for each datasource read has to be unique. This can be achieved with the same approach as in the point above: use Job Parameters, retrieve them as arguments, and use them later as the value of transformation_ctx (or re-use the same arguments that you'll pass for the S3 paths themselves). The most important things are: (1) each datasource has to have a unique transformation_ctx; and (2) subsequent job runs should use the same transformation_ctx value for each datasource, respectively. That way, the Bookmarks will be able to keep track of the already-processed data from each datasource, i.e., from each S3 path.

See Tracking Processed Data Using Job Bookmarks (https://docs.aws.amazon.com/glue/latest/dg/monitor-continuations.html) for more details about transformation_ctx and Job Bookmarks in general. Also make sure the script includes the lines highlighted in bold in the sample script at https://docs.aws.amazon.com/glue/latest/dg/monitor-continuations.html#monitor-continuations-script
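Putting those two points together, here is a minimal sketch of what such a script could look like. The parameter names (source_path, target_table), the Glue connection name, and the database name are assumptions for illustration, not part of this answer; adjust them to your own setup:

import sys
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions

args = getResolvedOptions(sys.argv, ["JOB_NAME", "source_path", "target_table"])

sc = SparkContext()
glue_context = GlueContext(sc)

# job.init() / job.commit() are the lines the bookmark mechanism depends on.
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# A unique, stable transformation_ctx per dataset lets the bookmark keep a
# separate state for each S3 path across job runs.
dyf = glue_context.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": [args["source_path"]]},
    format="parquet",
    transformation_ctx="read_" + args["target_table"],
)

# Hypothetical write to RDS (Postgres) via a pre-created Glue connection.
glue_context.write_dynamic_frame.from_jdbc_conf(
    frame=dyf,
    catalog_connection="my-postgres-connection",  # assumed connection name
    connection_options={"dbtable": args["target_table"], "database": "mydb"},
    transformation_ctx="write_" + args["target_table"],
)

job.commit()

Running the same script with --target_table table_A and then with --target_table table_B keeps two independent bookmark states, one per transformation_ctx.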

AWS
SUPPORT ENGINEER
answered 2 years ago
  • Yes, I am able to achieve this, but it leads to a serial run of the table loads in the target. If any of the earlier table loads fails, the entire load stops and the remaining tables are not run. Is there any other way to have just one script for all of the tables, with the bookmark state saved too?

  • This worked, but what if one of the table loads fails and I have to rewind the bookmark? There is no way to rewind the bookmark programmatically. Is there any other way of rewinding the bookmark for specific tables so that they can be read again in the next run?

0

Hello,

Please provide the following details:

  1. The method that you are using to migrate from S3 to RDS.
  2. The RDS engine that you are migrating to.

Looking forward to your response.

AWS
SUPPORT ENGINEER
answered 2 years ago
  • Using the Glue service, I am reading the files and migrating them to Postgres. But the same script needs to be run for hundreds of tables; only the S3 path and table name need to be dynamic. Otherwise I will have to create that many jobs. I was thinking of creating multiple workflows and passing run parameters, but the job bookmarks will not be stored for all tables at once. Is there any other way to do this?

0

Though this was asked 6 months back, I'm still answering, because I followed similar steps for ETLing data from MySQL via DMS to S3.

  1. Create individual jobs per table. For me, the code for all the jobs was the same [read the source (bookmarked), select the latest version of each record, read the target, identify the S3 file to be rewritten, write to the target]; the only things that differed were the source table name and the target S3 bucket, which I parameterised, storing the list of tables in a JSON file in an S3 bucket named Config.
  2. The jobs were divided into 2 parts: one controller Python script, and the per-table PySpark Glue jobs, which were essentially the same code (a minimal controller sketch follows the list of benefits below).

This approach gives several benefits:

  1. If any individual job fails, you get the option to rewind and rerun just that job.
  2. The Python controller script starts the Glue jobs one after the other without waiting, so your jobs run in parallel.
  3. The controller script, however, waits for all jobs to finish, so the total run time of the Python job equals the run time of the longest PySpark Glue job.
  4. If you need to customise the job for any specific table, you can do that.
  5. Workers and worker types for each job can be tuned according to need, instead of over- or under-powering the jobs.
  6. Finally, each of these jobs also writes runtime metrics (rows read and written, time taken, bookmarks, success status) to another S3 file, which is then visualized in QuickSight for ETL monitoring.
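The controller part of this pattern could look roughly like the sketch below. This is a minimal illustration, not the author's actual code: the job names, the argument name, and the polling interval are assumptions (in the original setup they come from the JSON config file instead of being hard-coded).

import time
import boto3

glue = boto3.client("glue")

# Hypothetical per-table job names; in practice these would come from the
# JSON config file stored in S3.
table_jobs = ["table-1-etl-job", "table-2-etl-job"]

# Kick off all jobs without waiting, so they run in parallel.
run_ids = {}
for job_name in table_jobs:
    response = glue.start_job_run(
        JobName=job_name,
        # Per-table arguments (e.g. the config file location) would go here.
        Arguments={"--config_file": "s3://config-bucket/etl-config.json"},  # assumed
    )
    run_ids[job_name] = response["JobRunId"]

# Then wait until every run reaches a terminal state.
terminal_states = {"SUCCEEDED", "FAILED", "STOPPED", "TIMEOUT", "ERROR"}
pending = dict(run_ids)
while pending:
    for job_name, run_id in list(pending.items()):
        state = glue.get_job_run(JobName=job_name, RunId=run_id)["JobRun"]["JobRunState"]
        if state in terminal_states:
            print(f"{job_name}: {state}")
            del pending[job_name]
    if pending:
        time.sleep(30)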
answered a year ago
  • Hello, I'm very interested in this approach. Would you be willing to provide a boilerplate for the two scripts and a sample JSON of tables? I'm newer to PySpark code but am aware of being able to pass parameters and of job bookmark concepts. Kindest regards

0

Refer to https://github.com/aws-samples/aws-big-data-blog-dmscdc-walkthrough

This will give you the basic boilerplate for how to work with the controller and the individual table-level jobs. The blog is very well documented and helpful. It uses DynamoDB for checkpointing, which I have replaced with Glue Bookmarks in my version.

Below is sample content for the JSON config file:

{
    "etl-job-configs": {
        "master-etl-job": [
            {
                "etl-job-names": "table-1-etl-job,table-2-etl-job",
                "source_bucket": "source-bucket-name",
                "target_bucket": "target-bucket-name",
                "crawler_names": "crawler-name",
                "run_crawler": 1
            }
        ],
        "table-1-etl-job-config": [
            {
                "source_bucket": "source-bucket-name",
                "target_bucket": "target-bucket-name",
                "table_name": "table-1",
                "primary_key": "id",
                "input_partition_key": "modify_time",
                "target_partition_key": "create_time",
                "reset_bookmark": 0,
                "skip_records": ""
            }
        ],
        "table-2-etl-job-config": [
            {...}
        ]
    }
}

Repeat for the number of tables that you have. The controller and the table ETL jobs read this same config file.

  1. The first element is for the controller, which gets the list of jobs to run.
  2. The source and target buckets can be configured at the master level or overridden at the individual job level.
  3. You can also add options to run crawlers, reset bookmarks, or skip specific records as required.
  4. This helps in altering the job parameters without touching the job code.
  5. Pass this config file name as a parameter to the Glue jobs (see the sketch below).
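For illustration, a table-level job could load its own section of that config roughly as follows; this is a minimal sketch, and the parameter names (config_bucket, config_key, job_config_name) are assumptions rather than part of the original answer:

import json
import sys

import boto3
from awsglue.utils import getResolvedOptions

# The config location and the job's section name are passed as job parameters.
args = getResolvedOptions(
    sys.argv, ["JOB_NAME", "config_bucket", "config_key", "job_config_name"]
)

s3 = boto3.client("s3")
obj = s3.get_object(Bucket=args["config_bucket"], Key=args["config_key"])
config = json.loads(obj["Body"].read())

# e.g. job_config_name = "table-1-etl-job-config"
job_config = config["etl-job-configs"][args["job_config_name"]][0]

table_name = job_config["table_name"]
source_bucket = job_config["source_bucket"]
reset_bookmark = bool(job_config["reset_bookmark"])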

Finally, check out the functions and their usage for Boto3 Glue and S3 at:

  1. https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/glue.html
  2. https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/s3.html
answered a year ago
  • Thanks so much. : ) Really appreciate it!
