Glue, Step Functions or Kinesis? Some guidance on when to choose what


We currently run into the problem that some data analytics workloads exceed the 15-minute timeout of a Lambda. It is a multi-step process with some steps that are parallelizable and some that are not. The main driver of the runtime is a list of regular expressions that need to run sequentially over each of the provided data items. This step is parallelizable (by splitting the input data), but the next step needs all of the input data. Input and output data are stored in a PostgreSQL RDS.

Now to the question: what is the right scalable approach to tackle this problem? Is it having one 'main' Lambda that spawns synchronized child Lambdas to speed up the parallelizable steps? The compilation of the regular expressions takes some time, so spawning multiple child Lambdas will most likely not be feasible. Or is it building a data pipeline using AWS Glue? Or with Step Functions? Or even building a data stream using Kinesis? I think EMR is a bit too big for this use case.

Some metrics of the use case:

  • size of the regular expression list: ~10 MB in RAM
  • compilation time of all regular expressions: around 5–10 seconds
  • runtime of a single item going through all steps: about 100 ms, but some requests need to process >30k items, which is why we run into the timeout
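To make the parallelizable step concrete, it is roughly equivalent to this (a simplified sketch, not our actual code; the pattern and item handling are placeholders):

```python
import re

def run_regex_step(patterns, items):
    """The ~100 ms/item step: every pattern is applied to every item."""
    compiled = [re.compile(p) for p in patterns]   # the one-off 5-10 s compilation
    results = []
    for item in items:                             # items could be split across workers
        results.append([p.findall(item) for p in compiled])
    return results                                 # the next step needs all results together
```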

Any help in choosing the right architecture will be greatly appreciated.

Alexander

3 Answers

I am going to divide my answer into two parts: first answering the question in the title, and then suggesting possible ways to solve the problem you currently have.

1 Use cases

1.1 Kinesis: This is a family of tools (Kinesis Data Streams (KDS), Kinesis Data Firehose (KDF) and Kinesis Data Analytics (KDA)). You usually work with Kinesis when you need to process data and get responses in real time (<1 sec). (Documentation: https://docs.aws.amazon.com/kinesis/index.html)

1.2 Glue: It is designed especially for ETL or batch data processing (from minutes to hours, depending on the amount of data). It is the tool for building ETLs in a serverless way. AWS also added the ability to do data transformations on a data stream using Spark Streaming (see https://docs.aws.amazon.com/glue/latest/dg/add-job-streaming.html).

1.3 Step Functions: This is a tool that lets you coordinate actions between different types of resources. It is based on state machines, and you can pass information from one step to another in a limited way (if you want to pass a lot of information, you can put it in S3 and read that file in the next step). (Here you can check which services you can interact with: https://docs.aws.amazon.com/step-functions/latest/dg/connect-to-services.html)
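As an illustration of the "put it in S3" workaround just mentioned, one state can write its full output to S3 and hand only the object key to the next state. This is a minimal sketch with placeholder bucket and key names, not a complete workflow:

```python
import json
import boto3

s3 = boto3.client("s3")
BUCKET = "my-workflow-bucket"                       # placeholder

def producer_handler(event, context):
    """First state: write the large payload to S3, return only a pointer to it."""
    key = f"intermediate/{context.aws_request_id}.json"
    s3.put_object(Bucket=BUCKET, Key=key, Body=json.dumps(event["items"]).encode())
    return {"result_key": key}

def consumer_handler(event, context):
    """Next state: follow the pointer and load the full payload from S3."""
    obj = s3.get_object(Bucket=BUCKET, Key=event["result_key"])
    items = json.loads(obj["Body"].read())
    return {"item_count": len(items)}
```

Only the small pointer travels through the state machine's input/output payload; the bulky data stays in S3.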

2 Possible solutions

2.1 Your pipeline does not seem to need to run in real time, because it can wait for all the data to be processed and analyze it together afterwards (so Kinesis can be discarded).

2.2 A batch solution that is quick to build and easy to follow:

A) Develop a container that carries out the whole regex process on the data, run it with AWS Batch, and save the result to S3 (you will have much more than 15 minutes to execute it). A rough sketch is included after this list.

B) Develop a Lambda function to do the calculation on the processed items (less than 15 minutes), or launch another container with AWS Batch, and save the data to S3.

C) Load the data into RDS from S3 with Lambda or AWS Batch; the approach depends on the database engine, for example https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/AuroraMySQL.Integrating.LoadFromS3.html (see also the sketch after this list).

D) All of these tools can be coordinated from a Step Functions state machine, and you can start the state machine based on an event of your choice, either when a file is placed in S3 (event-based) or at a certain time of day (schedule expression), using CloudWatch Events.
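To illustrate steps A and C, the script baked into the AWS Batch container could look roughly like the sketch below. The bucket, table, host and credential values are placeholders, and the import call assumes an RDS PostgreSQL instance with the aws_s3 extension enabled (the link above covers the Aurora MySQL equivalent), so treat this as a sketch rather than a drop-in script:

```python
import csv
import io
import re
import boto3
import psycopg2                                   # assumed to be packaged into the container image

s3 = boto3.client("s3")
BUCKET = "my-analytics-bucket"                    # placeholder

def run_regex_step(patterns, items):
    """Step A: the long-running, parallelizable regex pass (compile once per container)."""
    compiled = [re.compile(p) for p in patterns]
    return [(item, sum(len(p.findall(item)) for p in compiled)) for item in items]

def write_results_to_s3(rows, key):
    """Step A/B: persist intermediate results as CSV in S3."""
    buf = io.StringIO()
    csv.writer(buf).writerows(rows)
    s3.put_object(Bucket=BUCKET, Key=key, Body=buf.getvalue().encode())

def load_into_rds(key):
    """Step C: import the CSV from S3 into RDS PostgreSQL via the aws_s3 extension."""
    conn = psycopg2.connect(host="my-db-host", dbname="analytics",
                            user="etl_user", password="***")        # placeholders
    with conn, conn.cursor() as cur:
        cur.execute("""
            SELECT aws_s3.table_import_from_s3(
                'regex_results', 'item, match_count', '(format csv)',
                aws_commons.create_s3_uri(%s, %s, 'eu-central-1'))
        """, (BUCKET, key))
    conn.close()
```

If enabling the aws_s3 extension is not an option, a plain COPY ... FROM STDIN over the database connection from the container achieves the same result.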

2.3 Alternatively, do the whole process in Glue, using Python or Scala, but you will need to work with Spark and use the Glue Data Catalog.
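For orientation, a Glue (PySpark) job for option 2.3 might start out along the lines of the sketch below; the catalog database, table, column and S3 path names are placeholders, and the regex transformation is only indicated, not the full multi-step logic:

```python
import sys
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from awsglue.job import Job
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read the source table registered in the Glue Data Catalog (placeholder names).
source = glue_context.create_dynamic_frame.from_catalog(
    database="analytics_db", table_name="raw_items")

# Switch to a Spark DataFrame and apply the regex logic, e.g. with rlike/regexp_extract.
df = source.toDF()
df = df.filter(df["payload"].rlike(r"some-pattern"))      # placeholder transformation

# Write the processed result back out, here to S3 as CSV.
out = DynamicFrame.fromDF(df, glue_context, "out")
glue_context.write_dynamic_frame.from_options(
    frame=out, connection_type="s3",
    connection_options={"path": "s3://my-analytics-bucket/processed/"},
    format="csv")

job.commit()
```

Glue provisions and tears down the Spark cluster for you, so the 15-minute Lambda limit no longer applies.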

I hope I've helped

EXPERT, answered 2 years ago; reviewed by AWS EXPERT 2 years ago

Step Functions is your friend for orchestrating the workflow.

What I would do is the following flow:

  1. A Lambda that creates a list of files to process and returns that list as output (sketched below).
  2. Use a Map state to create dynamic parallel executions over the list: https://aws.amazon.com/blogs/aws/new-step-functions-support-for-dynamic-parallelism
  3. Each parallel execution invokes your Lambda, which reads the file from S3, applies the regexes, and writes the result back to a "processed" path in S3.
  4. Once the parallel processing is done, Step Functions joins your workflow and you can have a Lambda that does the rest of the work on the files.
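A rough sketch of the Lambda handlers behind steps 1 and 3, assuming the input files sit under a known prefix in one bucket (bucket, prefix and the pattern list are placeholders):

```python
import re
import boto3

s3 = boto3.client("s3")
BUCKET = "my-analytics-bucket"                      # placeholder

def load_patterns():
    """Placeholder: in practice the ~10 MB regex list would be baked into the package or fetched from S3."""
    return [r"ERROR \d+", r"user=[\w.]+"]

def list_files_handler(event, context):
    """Step 1: build the list of files; the Map state fans out over it."""
    resp = s3.list_objects_v2(Bucket=BUCKET, Prefix="incoming/")
    return {"files": [obj["Key"] for obj in resp.get("Contents", [])]}

def process_file_handler(event, context):
    """Step 3: one Map iteration -- read a file, run the regexes, write the result back."""
    key = event["key"]                              # passed in by the Map state
    body = s3.get_object(Bucket=BUCKET, Key=key)["Body"].read().decode()
    compiled = [re.compile(p) for p in load_patterns()]
    kept = [line for line in body.splitlines() if any(p.search(line) for p in compiled)]
    out_key = key.replace("incoming/", "processed/")
    s3.put_object(Bucket=BUCKET, Key=out_key, Body="\n".join(kept).encode())
    return {"processed_key": out_key}
```

The Map state's MaxConcurrency setting lets you cap how many iterations run at once, and the compiled regexes can be moved to module scope so warm invocations skip the 5-10 second compilation.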
triha (AWS), answered 2 years ago

You did not specify much about the source of the data. Whether your source data consumption is real-time or batch, AWS Glue would be the right solution for you. In order to use Glue you should be familiar with Python or Scala, along with Spark. As per the Glue FAQ:

Q: What is AWS Glue? AWS Glue is a serverless data integration service that makes it easy to discover, prepare, and combine data for analytics, machine learning, and application development. AWS Glue provides all the capabilities needed for data integration, so you can start analyzing your data and putting it to use in minutes instead of months. AWS Glue provides both visual and code-based interfaces to make data integration easier. Users can easily find and access data using the AWS Glue Data Catalog. Data engineers and ETL (extract, transform, and load) developers can visually create, run, and monitor ETL workflows with a few clicks in AWS Glue Studio. Data analysts and data scientists can use AWS Glue DataBrew to visually enrich, clean, and normalize data without writing code. With AWS Glue Elastic Views, application developers can use familiar Structured Query Language (SQL) to combine and replicate data across different data stores.

Kunal_G (AWS), answered 2 years ago
