Glue output to Stream?


I am relatively new to AWS and have been researching using Glue for a specific use case.

What I would like to do is use Glue to rip apart a file into its component records and then push those records individually onto a queue or stream (I'm thinking EventBridge with Lambda resolvers that tackle the different record types published upstream). Based on the documentation I'm seeing, it looks like while Glue can now consume a stream of data, it doesn't seem to have the ability to output processed data to a stream.

I had considered creating a Lambda to rip apart the file and then publish the records onto the stream, but the files can be large and might exceed the limitations of a Lambda (time/size), so I was thinking Glue would be a better solution (not to mention it includes many ETL functions for data cleansing, profiling, etc. that I'd like to take advantage of). If not Glue, is there a more appropriate solution to my problem?

Any suggestions would be welcomed.

TIA -Steve

1 Answer

Ultimately, you can solve this in multiple ways, but let me start by addressing your question directly:

  • Glue would not be ideal for your task of streaming the data into a Kinesis stream or queue. Glue runs jobs across multiple Spark executors, so it consumes significant compute capacity for what is essentially a fan-out task.
  • Lambda has a 15-minute execution limit, which would not work for large workloads. One option would be to orchestrate multiple Lambda functions, in parallel or one after another, to process the file(s).
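Whichever compute option you end up with, the record-publishing step itself is small. Here is a rough sketch of what each Lambda (or any worker) might do once it has parsed records; the stream name and the `record_type` field are assumptions for illustration, not something from your question:

```python
def batched(records, size=500):
    """Group records into batches; Kinesis PutRecords accepts at most 500 per call."""
    batch = []
    for rec in records:
        batch.append(rec)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch

def publish(records, stream_name="my-record-stream"):
    """Push parsed records onto a Kinesis stream, one PutRecords call per batch."""
    import json
    import boto3  # imported lazily so batched() above stays testable offline
    kinesis = boto3.client("kinesis")
    for batch in batched(records):
        kinesis.put_records(
            StreamName=stream_name,
            Records=[
                {
                    "Data": json.dumps(rec).encode("utf-8"),
                    # Partitioning by record type keeps each type ordered per shard;
                    # "record_type" is a hypothetical field name.
                    "PartitionKey": str(rec.get("record_type", "default")),
                }
                for rec in batch
            ],
        )
```

Batching matters here: PutRecords is billed and throttled per call and per shard, so sending 500 records at a time is far cheaper than one call per record.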

There are some possible solutions I can think of that you can build upon:

  • When reading objects from S3 you don't have to read the whole object at once; you can use HTTP range requests to read only part of the object. To do that, specify the range you want when calling get_object() (see the boto3 documentation for get_object()). You could orchestrate the Lambdas from Step Functions until all the data has been read.
  • If your team already has an EC2 instance running in the same region, you could run AWS CLI commands to download the files to EC2, split them, and upload the parts back to S3. Each part can then be processed by its own Lambda.
  • AWS Batch is another option to consider: it can spin up an EC2 instance for just the time needed to run your job, which avoids all the splitting and orchestration. Your code can be in the language of your choice; it reads the files and writes the records into a queue or Kinesis.
  • Alternatively, that same always-on EC2 instance could run all the logic itself, reading from S3 and writing into a queue or Kinesis stream, again avoiding the splits and the orchestration.
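The range-request idea in the first bullet can be sketched roughly as follows; the bucket/key are placeholders and the chunk size is an assumption you would tune to your file format:

```python
def byte_ranges(size, chunk):
    """Inclusive (start, end) byte-range pairs covering `size` bytes in `chunk`-sized pieces."""
    return [(start, min(start + chunk, size) - 1) for start in range(0, size, chunk)]

def iter_object_chunks(bucket, key, chunk=5 * 1024 * 1024):
    """Yield an S3 object's bytes chunk by chunk via HTTP Range requests,
    so no single invocation has to hold the whole file in memory."""
    import boto3  # imported lazily so byte_ranges() above stays testable offline
    s3 = boto3.client("s3")
    size = s3.head_object(Bucket=bucket, Key=key)["ContentLength"]
    for start, end in byte_ranges(size, chunk):
        resp = s3.get_object(Bucket=bucket, Key=key, Range=f"bytes={start}-{end}")
        yield resp["Body"].read()
```

In a Step Functions orchestration, each Lambda invocation would handle one (start, end) pair from the state input instead of looping, and the state machine would iterate until the ranges are exhausted. Note that range boundaries won't align with record boundaries, so your parser needs to handle records that straddle a chunk edge.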
answered 2 months ago
