
Need help designing a solution to read Step Functions Distributed Map (distributed mode) results


Hello everyone,

We have a use case where we need to create a workflow that takes a CSV file from a user via a tool, reads the data from the file, processes the records, makes some internal API calls, and returns the result in another output file.

We are trying to use the Step Functions Distributed Map mode for this, which can read the data from S3 directly, run the processing, and store the result in an output file in S3.

Now, what would be the best design to read that output and create our own final file with the results? Below are sample records from the file that was created.

I thought of using Lambda to read this file (generated by the Distributed Map run), but I am not sure whether it will be able to read that file within 15 minutes.

```json
[
  {
    "ExecutionArn": "arn:aws:states:us-east-1:12345:execution:ChunkProcessor/Map:1",
    "Input": "{\"email\":\"1234\"}",
    "InputDetails": {
      "Included": true
    },
    "Name": "1",
    "Output": "{\"email\":\"12233\",\"preferences\":{\"email\":{\"opt_in\":\"OK to Contact\"},\"mail\":{\"opt_in\":\"OK to Contact\"},\"phone\":{\"opt_in\":\"OK to Contact\"}},\"statusType\":{\"code\":200,\"status\":\"Success\",\"text\":\"Service invoked successfully.\"}}",
    "OutputDetails": {
      "Included": true
    },
    "RedriveCount": 0,
    "RedriveStatus": "NOT_REDRIVABLE",
    "RedriveStatusReason": "Execution is SUCCEEDED and cannot be redriven",
    "StartDate": "2025-01-23T20:08:33.382Z",
    "StateMachineArn": "arn:aws:states:us-east-1:123456:stateMachine:ChunkProcessor/Map",
    "Status": "SUCCEEDED",
    "StopDate": "2025-01-23T20:08:34.526Z"
  },
  {
    "ExecutionArn": "arn:aws:states:us-east-1:123456:execution:ChunkProcessor/Map:2",
    "Input": "{\"email\":\"3232\"}",
    "InputDetails": {
      "Included": true
    },
    "Name": "2",
    "Output": "{\"email\":\"3232\",\"preferences\":{\"email\":{\"opt_in\":\"OK to Contact\"},\"mail\":{\"opt_in\":\"OK to Contact\"},\"phone\":{\"opt_in\":\"OK to Contact\"}},\"statusType\":{\"code\":200,\"status\":\"Success\",\"text\":\"Service invoked successfully.\"}}",
    "OutputDetails": {
      "Included": true
    },
    "RedriveCount": 0,
    "RedriveStatus": "NOT_REDRIVABLE",
    "RedriveStatusReason": "Execution is SUCCEEDED and cannot be redriven",
    "StartDate": "2025-01-23T20:08:33.376Z",
    "StateMachineArn": "arn:aws:states:us-east-1:123456:stateMachine:ChunkProcessor/Map",
    "Status": "SUCCEEDED",
    "StopDate": "2025-01-23T20:08:34.532Z"
  }
]
```

  • Could you share roughly how much data the input and output are likely to be? That would help in understanding the scale and, with it, the possible limitations.

asked 10 months ago · 194 views
1 Answer

To design a solution for reading the results of a Step Functions Distributed Map state and creating a final output file, you have a few options:

  1. Use AWS Lambda: While Lambda has a 15-minute execution limit, it can be a good option if your result file is not too large. You can create a Lambda function that reads the output files from S3, processes the data, and generates your final output file. If the processing time is close to or exceeds 15 minutes, consider breaking the task into smaller chunks or using a different approach (a minimal Lambda sketch follows this list).

  2. Use AWS Glue: For larger datasets or more complex processing, AWS Glue might be a better choice. Glue is a fully managed extract, transform, and load (ETL) service that can handle large-scale data processing jobs. You can create a Glue job to read the Distributed Map output file, process the data, and generate your final output file.

  3. Use Amazon EMR: For very large datasets or complex processing requirements, Amazon EMR (Elastic MapReduce) could be a suitable option. EMR allows you to process vast amounts of data using a cluster of EC2 instances.

  4. Implement a custom solution using EC2: If you need more control over the processing environment or if your processing requirements are unique, you could set up an EC2 instance to read the output file, process the data, and generate the final output.
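For the Lambda option, here is a minimal sketch, assuming the Distributed Map ResultWriter writes its result files (for example SUCCEEDED_0.json) under a known S3 prefix, and that the bucket, prefix, and output key are passed in the Lambda event. The event shape, bucket/prefix names, and the CSV columns chosen below are placeholders for illustration only; adjust them to your actual configuration and schema.

```python
import csv
import io
import json

import boto3  # available by default in the AWS Lambda Python runtime

s3 = boto3.client("s3")


def lambda_handler(event, context):
    """Read Distributed Map result files from S3 and write one combined CSV."""
    bucket = event["result_bucket"]            # placeholder: bucket used by ResultWriter
    prefix = event["result_prefix"]            # placeholder: prefix of the map-run results
    out_key = event.get("output_key", "final/output.csv")

    rows = []
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            # Only the SUCCEEDED_*.json files contain successful execution results.
            if "SUCCEEDED" not in obj["Key"]:
                continue
            body = s3.get_object(Bucket=bucket, Key=obj["Key"])["Body"].read()
            for record in json.loads(body):
                # "Output" is itself a JSON-encoded string, so parse it a second time.
                output = json.loads(record["Output"])
                rows.append({
                    "email": output.get("email"),
                    "status": output.get("statusType", {}).get("status"),
                })

        # Stop early if we are getting close to the 15-minute Lambda limit.
        if context.get_remaining_time_in_millis() < 60_000:
            raise RuntimeError("Not enough time left; consider chunking, Glue, or EMR")

    # Write the aggregated rows back to S3 as the final CSV.
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=["email", "status"])
    writer.writeheader()
    writer.writerows(rows)
    s3.put_object(Bucket=bucket, Key=out_key, Body=buf.getvalue().encode("utf-8"))
    return {"rows_written": len(rows), "output_key": out_key}
```

The remaining-time check is the main safeguard against the concern raised in the question: if the function cannot finish within the limit, it fails fast so you know to move to one of the other options rather than silently truncating the output.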

Regardless of the method you choose, here's a general approach to processing the Distributed Map output (a sketch of the parsing step follows the list):

  1. Read the output file from S3.
  2. Parse the JSON data for each execution result.
  3. Extract the relevant information from each result (e.g., the "Output" field).
  4. Process and aggregate the data as needed for your use case.
  5. Generate your final output file and store it in S3 or another appropriate location.
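One detail worth calling out for steps 2 and 3: as in the sample records above, the "Input" and "Output" fields are JSON documents stored as strings, so they need a second json.loads before you can read individual fields. Below is a small, self-contained sketch of the parse-and-flatten step; the column names are taken from the sample output and should be adapted to your real schema.

```python
import csv
import json


def flatten_result(record: dict) -> dict:
    """Turn one Distributed Map execution record into a flat CSV row."""
    # "Output" is a JSON string inside the JSON record, so decode it again.
    output = json.loads(record["Output"])
    prefs = output.get("preferences", {})
    return {
        "name": record["Name"],
        "status": record["Status"],
        "email": output.get("email"),
        "email_opt_in": prefs.get("email", {}).get("opt_in"),
        "mail_opt_in": prefs.get("mail", {}).get("opt_in"),
        "phone_opt_in": prefs.get("phone", {}).get("opt_in"),
        "api_status_code": output.get("statusType", {}).get("code"),
    }


def write_final_csv(records: list[dict], path: str) -> None:
    """Aggregate all successful execution records into a single CSV file."""
    rows = [flatten_result(r) for r in records if r.get("Status") == "SUCCEEDED"]
    if not rows:
        return
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)


# Example usage with a locally downloaded result file:
# with open("SUCCEEDED_0.json") as f:
#     write_final_csv(json.load(f), "final_output.csv")
```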

When designing your solution, consider factors such as the size of your input data, the complexity of your processing requirements, the frequency of executions, and your performance needs. Also, ensure that your chosen method has the necessary permissions to access the S3 buckets and any other required AWS resources.

Remember to implement error handling and logging in your processing logic to handle any issues that may arise during the data processing stage.
Sources
Using Map state in Distributed mode for large-scale parallel workloads in Step Functions - AWS Step Functions
Copying large-scale CSV data using Distributed Map in Step Functions - AWS Step Functions

answered 10 months ago
EXPERT
reviewed 10 months ago
