Can we have one job for loading multiple tables, with different bookmarks storing the last executed file?
I need to write multiple tables from S3 to an RDS database. Can I create just one job and pass the table names as parameters?
Since I have a number of tables to load, it would be hectic to create one job per table.
For example, there are two S3 paths, s3://my_bucket/table_A and s3://my_bucket/table_B, with Parquet files generated every hour. I need to store the data from S3 in the table_A and table_B tables respectively, and also save the last run file for both. I know this is possible, but will the job bookmark save the last executed file for both?
Is there any other way to achieve this?
To achieve this use case, here is one option. It involves Job Parameters + Job Bookmarks + a little bit of coding:
Leverage Job Parameters to programmatically pass different arguments to the job (in this case, the S3 paths or prefixes that point to each dataset to be read). The arguments can then be retrieved in the script via getResolvedOptions (https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-crawler-pyspark-extensions-get-resolved-options.html).
In terms of Job Bookmarks, this shouldn't be a problem as long as each data source being read (i.e., each S3 path containing a dataset) has its own dedicated transformation_ctx property; that is, the value of transformation_ctx for each data source read has to be unique. This can be achieved with the same approach as in the point above: use Job Parameters and retrieve them as arguments to be used later as the value of transformation_ctx (or re-use the same arguments you pass for the S3 paths themselves). The most important things are: (1) each data source must have a unique transformation_ctx; (2) subsequent job runs must use the same transformation_ctx value for each data source. That way, the bookmarks can keep track of the already-processed data from each data source, i.e., from each S3 path.
See Tracking Processed Data Using Job Bookmarks (https://docs.aws.amazon.com/glue/latest/dg/monitor-continuations.html) for more details about transformation_ctx and Job Bookmarks in general. Also make sure the script includes the bookmark-related lines (job.init(...) and job.commit()) highlighted in bold in the sample script at https://docs.aws.amazon.com/glue/latest/dg/monitor-continuations.html#monitor-continuations-script
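Putting the two points above together, a minimal sketch of the read side could look like the following. The parameter names (--source_path, --table_name) and the make_read_ctx helper are assumptions chosen for illustration, not fixed Glue names; the Glue-specific calls are shown as comments since awsglue is only available inside a Glue job:

```python
def make_read_ctx(table_name):
    # Derive a stable, unique transformation_ctx from the table parameter.
    # Re-using the same value on every run lets the bookmark track this
    # data source across runs; a different table yields a different context.
    return f"read_{table_name}"

# Inside the Glue job script (awsglue only exists on Glue workers):
#
# import sys
# from awsglue.utils import getResolvedOptions
# from awsglue.context import GlueContext
# from pyspark.context import SparkContext
#
# args = getResolvedOptions(sys.argv, ["JOB_NAME", "source_path", "table_name"])
# glue_context = GlueContext(SparkContext.getOrCreate())
# frame = glue_context.create_dynamic_frame.from_options(
#     connection_type="s3",
#     connection_options={"paths": [args["source_path"]]},
#     format="parquet",
#     transformation_ctx=make_read_ctx(args["table_name"]),  # unique per table
# )
```

Because the context name is derived from the table parameter, running the same job again for table_A reuses "read_table_A" and the bookmark resumes where it left off, while table_B's bookmark is tracked separately under "read_table_B".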
Yes, I am able to achieve this, but it leads to a serial run of the table loads in the target. If any table load fails, the entire load stops and the remaining tables are not processed. Is there any other way to have just one script for all of the tables, with the bookmark state saved too?
This worked, but what if one of the table loads fails and I have to rewind the bookmark? There is no way to rewind the bookmark programmatically per table. Is there any other way of rewinding the bookmark for specific tables so that they can be read again in the next run?
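For reference, the Glue API does expose ResetJobBookmark (boto3's glue.reset_job_bookmark), though it rewinds the bookmark for the whole job rather than for a single transformation_ctx, so the next run would replay every table — which is exactly why a per-table rewind remains awkward. A minimal sketch, with the job name as a placeholder:

```python
def rewind_job_bookmark(job_name, run_id=None):
    # Reset the job's bookmark so the next run re-reads already-processed data.
    # NOTE: this resets the bookmark for the ENTIRE job, not per data source.
    import boto3  # imported lazily so the helper can be defined without the SDK
    glue = boto3.client("glue")
    kwargs = {"JobName": job_name}
    if run_id is not None:
        kwargs["RunId"] = run_id  # rewind to the bookmark state as of that run
    return glue.reset_job_bookmark(**kwargs)
```

If per-table replay is a hard requirement, one run (and therefore one bookmark scope) per table avoids the problem entirely, at the cost of managing multiple runs.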
Please provide the following details:
1. The method you are using to migrate from S3 to RDS. 2. The RDS engine you are migrating to.
Looking forward to your response.
I am using the Glue service to read the files and migrate them to Postgres, but the same script needs to run for hundreds of tables; only the S3 path and table name need to be dynamic. Otherwise I would have to create that many jobs. I was thinking of creating multiple workflows and passing run parameters, but the job bookmarks will not be stored for all tables at once. Is there any other way to do this?
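One way around both the serial-run and the job-count problems is to keep a single parameterized job, raise its MaxConcurrentRuns, and start one job run per table from a small driver script. Each run gets its own arguments, each table keeps its own bookmark state via its transformation_ctx, and one failed run does not stop the others. A sketch, assuming hypothetical --source_path/--table_name parameter names (not fixed Glue names):

```python
def build_run_arguments(source_path, table_name):
    # Arguments for one job run; keys must match the getResolvedOptions names
    # that the job script expects.
    return {"--source_path": source_path, "--table_name": table_name}

def start_table_loads(job_name, tables):
    # tables: iterable of (source_path, table_name) pairs.
    # The job's MaxConcurrentRuns must be >= the number of parallel runs.
    import boto3  # lazy import so build_run_arguments works without the SDK
    glue = boto3.client("glue")
    run_ids = {}
    for source_path, table_name in tables:
        response = glue.start_job_run(
            JobName=job_name,
            Arguments=build_run_arguments(source_path, table_name),
        )
        run_ids[table_name] = response["JobRunId"]
    return run_ids
```

For example, start_table_loads("my-loader-job", [("s3://my_bucket/table_A", "table_A"), ("s3://my_bucket/table_B", "table_B")]) would launch two independent runs, each with its own bookmark scope per transformation_ctx.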