Read data from S3 Data Lake & enrich the data through API


Hello all, I have a requirement to read data from an S3 data lake (Parquet format) and then filter/enrich the data through an API. I have been looking at a couple of options. One is to read the data with a Glue job, call the API to filter it, and insert the result into a database. The other is to read the data with Glue, write it to the application's S3 bucket, and then have a batch job read it from there and process it by calling the API. First, what is the best solution in this scenario? Second, if I write the data to another S3 bucket, which file format should I use (Parquet, CSV, or JSON) so that a Java batch can read and process it efficiently? What should my architecture look like for this case?

asked a year ago · 548 views
1 Answer

This could be a very complex situation, and there is probably not enough detail in your question to formulate ONE PERFECT SOLUTION. The complexity seems to be around calling this API.

So instead, here are some questions and overall guidelines that might help you frame possible solutions:

  1. I would start by understanding what triggers the read from the S3 data lake. Will this run on a schedule? Or when a previous job (one that puts data into the S3 data lake) completes? You would also need to consider using Glue job bookmarks if this will be an incremental process, where the S3 source is continually updated.
  2. If you put the API call in a Glue job, would you need to call this API for every row in a dataframe? If so, that is going to be very slow due to row-by-row processing. If you are calling the API just once for the entire job, then you could do that as part of a Glue job.
  3. If you need to call the API for every row in a dataframe, then batching is probably better, and you then could think about using Lambda to call the API. Now you have to deal with the challenge of integrating the result of the API call back into the S3 dataset (source).
  4. This is where a table format like Iceberg could be helpful, as your Lambda could update the record in the Iceberg table with the results of the API call.
  5. You will also need to consider how to signal that all API calls have completed so that you can proceed to write to the destination.
  6. In a process similar to this I have used DynamoDB as a "batch log", meaning that when a new batch arrives, like your S3 source, I create batches in DynamoDB which are used to call enrichment functions (API calls via Lambda). Then when the batch is completed, I emit an EventBridge custom event that tells a Glue workflow to proceed. You can orchestrate this "batch log" via Step Functions.
  7. As for the format of your destination, most of the time you need to consider how that data will be consumed. Typically, people put data in ORC or Parquet, both columnar formats, for analytic workloads. For hierarchical data, you can use JSON, which is what AWS does for metric streams that they write to S3, for example.
  8. Going back to points 1 and 6 - you could use Glue Workflows, which can be triggered based on EventBridge events - this will help you automate everything and be event driven.
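
To make points 2, 3, 5 and 6 a bit more concrete, here is a minimal sketch in plain Python of batching rows for the API and tracking completion in a "batch log". Everything in it is an assumption for illustration: `BatchLog` is an in-memory stand-in for a DynamoDB table, `fake_enrichment_api` is a placeholder for your real API, and the row fields (`id`, `amount`) are made up. In a real setup the loop body would more likely be a Lambda invocation per batch, and the final check would be what triggers the EventBridge event.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterable, Iterator, List

Row = dict


def chunked(rows: Iterable[Row], size: int) -> Iterator[List[Row]]:
    """Yield successive lists of at most `size` rows."""
    batch: List[Row] = []
    for row in rows:
        batch.append(row)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:  # last, possibly smaller, batch
        yield batch


@dataclass
class BatchLog:
    """In-memory stand-in for a DynamoDB 'batch log' table (hypothetical schema)."""
    status: Dict[int, str] = field(default_factory=dict)

    def register(self, batch_id: int) -> None:
        self.status[batch_id] = "PENDING"

    def mark_done(self, batch_id: int) -> None:
        self.status[batch_id] = "DONE"

    def all_done(self) -> bool:
        # True only if at least one batch exists and every batch is DONE.
        return bool(self.status) and all(s == "DONE" for s in self.status.values())


def enrich_in_batches(
    rows: Iterable[Row],
    batch_size: int,
    call_api: Callable[[List[Row]], List[Row]],
    log: BatchLog,
) -> List[Row]:
    """Call the API once per batch instead of once per row, logging progress."""
    batches = list(chunked(rows, batch_size))
    for batch_id in range(len(batches)):
        log.register(batch_id)

    enriched: List[Row] = []
    for batch_id, batch in enumerate(batches):
        enriched.extend(call_api(batch))
        log.mark_done(batch_id)
    return enriched


def fake_enrichment_api(batch: List[Row]) -> List[Row]:
    # Placeholder for the real API: keeps rows with a positive amount and tags them.
    return [{**row, "enriched": True} for row in batch if row["amount"] > 0]


# Made-up sample data standing in for rows read from the S3 source.
rows = [{"id": i, "amount": i - 2} for i in range(7)]
log = BatchLog()
result = enrich_in_batches(rows, batch_size=3, call_api=fake_enrichment_api, log=log)

if log.all_done():
    # This is the point where you would emit the event that lets the workflow proceed.
    print(f"{len(result)} rows kept across {len(log.status)} batches")
```

The design choice here is simply that the API is called per batch rather than per row, and that "are we finished?" is answered from the log rather than from the caller's own state, which is what lets separate workers process batches independently.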
answered a year ago
