1 Answer
This could be a fairly complex situation, and there is probably not enough detail in your question to formulate one perfect solution. The complexity seems to center on calling this API.
So instead, here are some questions and overall guidelines that might help you frame possible solutions:
1. I would start with understanding what triggers the read from the S3 data lake. Will it run on a schedule, or after a previous job completes (the one that puts data into the S3 data lake)? If this is an incremental process where the S3 source is continually updated, you should also consider using Glue job bookmarks (see the first sketch after this list).
2. If you put the API call in a Glue job, would you need to call the API for every row in a DataFrame? If so, that will be very slow due to row-by-row processing. If you only call the API once for the entire job, you can do that directly inside the Glue job.
3. If you do need to call the API for every row, batching is probably better, and you could use Lambda to make the calls (see the Lambda sketch after this list). You then have to deal with integrating the results of the API calls back into the S3 (source) dataset.
4. This is where a data lake format like Iceberg can help: your Lambda can update the matching records in the Iceberg table with the results of the API call (see the Iceberg MERGE sketch after this list).
5. You will also need a way to signal that all API calls have completed so that you can proceed with writing to the destination.
6. In a similar process I have used DynamoDB as a "batch log": when a new batch arrives (like your S3 source), I create batch entries in DynamoDB, which drive the enrichment functions (API calls via Lambda). When the batch completes, I emit an EventBridge custom event that tells a Glue workflow to proceed. You can orchestrate this "batch log" with Step Functions (the Lambda sketch below also covers the batch log and the custom event).
7. As for the destination format, it mostly depends on how the data will be consumed. For analytic workloads, people typically use a columnar format such as ORC or Parquet rather than a row-based one. For hierarchical data you can use JSON, which is what AWS does for metric streams written to S3, for example.
8. Going back to points 1 and 6: you could use Glue Workflows, which can be triggered by EventBridge events (see the last sketch after this list). This lets you automate everything and keep it event-driven.
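
To make some of these points more concrete, here are a few rough sketches. Everything named in them (databases, tables, buckets, the API endpoint, ARNs) is a placeholder, so treat them as starting points rather than drop-in code.

First, a minimal Glue (PySpark) job for points 1, 2 and 7: it reads incrementally from the data lake using job bookmarks and writes Parquet to a destination prefix. Bookmarks also have to be enabled on the job itself (the `--job-bookmark-enable` job parameter).

```python
# Rough sketch of a Glue (PySpark) job that reads incrementally from the
# data lake using job bookmarks and writes Parquet to a destination prefix.
# Database, table, and bucket names are placeholders -- adjust to your setup.
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])

sc = SparkContext()
glue_context = GlueContext(sc)
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# transformation_ctx is what lets job bookmarks track what has already been read.
source = glue_context.create_dynamic_frame.from_catalog(
    database="my_datalake_db",          # placeholder
    table_name="raw_events",            # placeholder
    transformation_ctx="source",
)

# ... any transformations / joins with enrichment results go here ...

glue_context.write_dynamic_frame.from_options(
    frame=source,
    connection_type="s3",
    connection_options={"path": "s3://my-destination-bucket/curated/"},  # placeholder
    format="parquet",
    transformation_ctx="sink",
)

job.commit()  # commits the bookmark state
```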
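
Next, a sketch for points 3, 5 and 6: a Lambda that enriches one batch via the external API, marks the batch complete in the DynamoDB batch log, and emits a custom EventBridge event once every batch in the run is done. The API endpoint, table name, key schema, and event shape are all assumptions.

```python
# Rough sketch of a batch-enrichment Lambda. The orchestrator (e.g. Step
# Functions) is assumed to pass in run_id, batch_id, and the records to enrich.
# Table name, key schema, API endpoint, and event names are placeholders.
import json
import os
import urllib.request

import boto3
from boto3.dynamodb.conditions import Attr, Key

dynamodb = boto3.resource("dynamodb")
events = boto3.client("events")

BATCH_TABLE = os.environ.get("BATCH_TABLE", "enrichment-batch-log")  # placeholder
API_URL = os.environ.get("API_URL", "https://example.com/enrich")    # placeholder


def call_enrichment_api(records):
    """Call the external API once for a whole batch (placeholder endpoint)."""
    req = urllib.request.Request(
        API_URL,
        data=json.dumps({"records": records}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())


def handler(event, context):
    run_id = event["run_id"]
    batch_id = event["batch_id"]
    records = event["records"]

    enriched = call_enrichment_api(records)
    # ... write `enriched` back to the source (e.g. the Iceberg MERGE below) ...

    # Mark this batch as done in the batch log
    # (table assumed to have run_id as partition key and batch_id as sort key).
    table = dynamodb.Table(BATCH_TABLE)
    table.update_item(
        Key={"run_id": run_id, "batch_id": batch_id},
        UpdateExpression="SET #s = :done",
        ExpressionAttributeNames={"#s": "status"},
        ExpressionAttributeValues={":done": "COMPLETED"},
    )

    # If no batch in this run is still pending, tell the pipeline to proceed.
    pending = table.query(
        KeyConditionExpression=Key("run_id").eq(run_id),
        FilterExpression=Attr("status").ne("COMPLETED"),
    )["Items"]

    if not pending:
        events.put_events(
            Entries=[{
                "Source": "my.enrichment.pipeline",           # placeholder
                "DetailType": "EnrichmentBatchRunCompleted",   # placeholder
                "Detail": json.dumps({"run_id": run_id}),
            }]
        )

    return {"batch_id": batch_id, "records_enriched": len(records)}
```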
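
For point 4, if the source lives in an Iceberg table, the write-back can be a single MERGE from a Spark session that is already configured for Iceberg (for example a Glue 4.0 job with the `--datalake-formats iceberg` job parameter and the usual catalog settings). Here `spark` and `enriched_df` are assumed to already exist, and the catalog, table, and column names are placeholders.

```python
# Rough sketch: fold the enrichment results back into the Iceberg table.
# `enriched_df` is assumed to be a Spark DataFrame of API results keyed by
# the same record_id as the target table.
enriched_df.createOrReplaceTempView("enriched_updates")

spark.sql("""
    MERGE INTO glue_catalog.my_datalake_db.customer_events t   -- placeholder names
    USING enriched_updates u
    ON t.record_id = u.record_id
    WHEN MATCHED THEN
        UPDATE SET t.enrichment_status = u.enrichment_status,
                   t.enriched_payload  = u.enriched_payload
""")
```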
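
Finally, for point 8, the usual event-driven pattern is an EVENT trigger at the start of the Glue workflow plus an EventBridge rule whose target is the workflow. A rough sketch of that wiring with boto3 follows; all names and ARNs are placeholders, the IAM role must allow EventBridge to notify Glue, and you should double-check the trigger/rule state against the Glue event-driven workflow documentation.

```python
# Rough sketch: start a Glue workflow when the custom "run completed" event fires.
import json

import boto3

glue = boto3.client("glue")
events = boto3.client("events")

# An EVENT trigger as the entry point of the workflow (names are placeholders).
glue.create_trigger(
    Name="start-on-enrichment-complete",
    WorkflowName="curate-datalake-workflow",
    Type="EVENT",
    Actions=[{"JobName": "write-curated-parquet"}],
)

# Rule matching the custom event emitted by the Lambda sketch above.
events.put_rule(
    Name="enrichment-run-completed",
    EventPattern=json.dumps({
        "source": ["my.enrichment.pipeline"],
        "detail-type": ["EnrichmentBatchRunCompleted"],
    }),
)

# The rule's target is the Glue workflow itself (placeholder ARNs).
events.put_targets(
    Rule="enrichment-run-completed",
    Targets=[{
        "Id": "glue-workflow",
        "Arn": "arn:aws:glue:us-east-1:123456789012:workflow/curate-datalake-workflow",
        "RoleArn": "arn:aws:iam::123456789012:role/eventbridge-to-glue",
    }],
)
```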
answered a year ago