Why is Step Functions' Distributed Map slow?

I am working with Step Functions and adding a Distributed Map, as I have a high transaction volume and can split the work across parallel invocations. It's also beneficial for the parallel work to be an Express step function, as it's short-lived with multiple steps, whilst the overall job is long-running and so more suitable as a Standard step function.

In setting this up, I discovered that invoking the Distributed Map seems to add a 7+ second overhead to the call (with 100 elements in the input). I created a very simple step function to eliminate any overhead from my actual steps, and the delay is demonstrable in even the most trivial example.

Below is the example, which ignores the step function input, generates an array of 100 integers, and uses that array as the input to the Distributed Map. The map merely contains a no-op Pass state.

Example Step Function Overview

{
  "Comment": "A description of my state machine",
  "StartAt": "Generate Array",
  "States": {
    "Generate Array": {
      "Type": "Pass",
      "Parameters": {
        "array.$": "States.ArrayRange(1,100,1)"
      },
      "Next": "Map"
    },
    "Map": {
      "Type": "Map",
      "ItemProcessor": {
        "ProcessorConfig": {
          "Mode": "DISTRIBUTED",
          "ExecutionType": "EXPRESS"
        },
        "StartAt": "Work",
        "States": {
          "Work": {
            "Type": "Pass",
            "End": true
          }
        }
      },
      "End": true,
      "ItemsPath": "$.array",
      "Label": "Map",
      "MaxConcurrency": 1000
    }
  }
}

Note in the output that the time between MapRunStarted and MapRunSucceeded is 7+ seconds.

Example Output

When looking at the Map Run output, whilst the timestamps are in whole seconds, for all intents and purposes the work is all complete within a second of the MapRunStarted timestamp, yet it takes another 6+ seconds before the main step function continues with the results.

Map Run Output

It doesn't matter whether the Distributed Map is configured as Express or Standard; this overhead is always there. Nor does it matter whether the step function is defined via the console (as I did for the example) or via the CDK (which is what I use for my application code).

If I configure the Map as inline, it finishes very quickly, but of course this incurs a state transition for each element of the array, which would (depending on the complexity of the subtask) lead to higher billing. Hence the reason for wanting to use an Express Distributed Map.
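For comparison, a minimal sketch of the inline variant is shown here. It is a reconstruction from the description above rather than the exact definition that was run, and it assumes the only changes are switching "Mode" to "INLINE" and dropping "ExecutionType", "Label" and "MaxConcurrency" (the first two apply to distributed mode; leaving out the last falls back to the default concurrency).

```json
{
  "Comment": "Inline Map variant of the example (reconstruction, for comparison only)",
  "StartAt": "Generate Array",
  "States": {
    "Generate Array": {
      "Type": "Pass",
      "Parameters": {
        "array.$": "States.ArrayRange(1,100,1)"
      },
      "Next": "Map"
    },
    "Map": {
      "Type": "Map",
      "ItemProcessor": {
        "ProcessorConfig": {
          "Mode": "INLINE"
        },
        "StartAt": "Work",
        "States": {
          "Work": {
            "Type": "Pass",
            "End": true
          }
        }
      },
      "End": true,
      "ItemsPath": "$.array"
    }
  }
}
```

Here every iteration runs inside the parent execution, so each "Work" state counts as a state transition of the parent Standard workflow.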

By playing around with the number of array elements and the max concurrency, it's possible to see that there may be an implicit overhead to processing each element (spinning up another workflow invocation or waiting for one to complete) beyond the actual run time of the mapped states. Changing to 1 element incurs ~1800ms in the map, which is very poor compared to an inline map (no measurable duration).

Further, if the example step function is changed so that there is a 1 second Wait instead of the Pass state, with 100 elements and max concurrency 50, the total run time is still on the order of ~7 seconds (in effect, the latter 50 concurrent waits took 1 second, as would be expected). There just seems to be quite a performance overhead, although the overhead might reduce depending on how sustained the utilisation is. From my tests, it's not clear when that kicks in, as there isn't much improvement per iteration for 1000 elements vs 100.
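The Wait variant described here might look roughly like the following. This is a sketch reconstructed from the description, assuming the only changes to the original example are replacing the Pass state with a Wait state of 1 second and lowering "MaxConcurrency" to 50; the comment text is illustrative.

```json
{
  "Comment": "Distributed Map with a 1 second Wait per item (reconstruction)",
  "StartAt": "Generate Array",
  "States": {
    "Generate Array": {
      "Type": "Pass",
      "Parameters": {
        "array.$": "States.ArrayRange(1,100,1)"
      },
      "Next": "Map"
    },
    "Map": {
      "Type": "Map",
      "ItemProcessor": {
        "ProcessorConfig": {
          "Mode": "DISTRIBUTED",
          "ExecutionType": "EXPRESS"
        },
        "StartAt": "Work",
        "States": {
          "Work": {
            "Type": "Wait",
            "Seconds": 1,
            "End": true
          }
        }
      },
      "End": true,
      "ItemsPath": "$.array",
      "Label": "Map",
      "MaxConcurrency": 50
    }
  }
}
```

With 100 items and at most 50 running at once, the waits alone cannot finish in under about 2 seconds, so most of the ~7 seconds is still time outside the mapped state itself.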

I'm assuming that a Distributed Map is merely a repackaging of calling a step function from within a step function, as it eliminates the state transitions associated with an inline map. It just doesn't seem like a very performant alternative, but maybe that is the metaphorical and literal price to pay.

Whilst the overhead isn't a killer in my current use case, it makes me wary of using Distributed Maps without further understanding of why this delay happens.

asked 6 months ago, 132 views