Step Functions: how to manage States.Runtime history events quota

0

Some of my Step Functions executions can be (very) long-running, so the risk of reaching the maximum number of history events (25,000) is not negligible, and I am required to manage and track my state machines' failures.

How can I find out the current history length and, if necessary, terminate the current execution and start another one with the current state data?

Of course I could add a task that, via Lambda, retrieves the execution history and handles this, but that would be expensive in terms of Lambda execution time, data transferred, and "wasted" history events. So I wonder whether I can achieve the same from inside the state machine: something like a Choice state that, if the history length is greater than, say, 24,950, stops the current execution and starts another one.

Any hints?

3 Answers
1
Accepted Answer

With regard to Express Workflows, I was referring only to what you do inside the loop, not to the entire state machine.

You are correct that using the nested route just delays the issue. It is a good solution if you have a max number of iterations, which is not the case in your state machine.

Given your situation, I would say the best approach is to count the number of iterations. Measure how many history events each loop iteration emits, and start a new execution shortly before you reach the limit.
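As a rough illustration of that arithmetic, here is a minimal Python sketch. The per-loop event count (9), the start-up event count (2), and the safety reserve (50) are made-up numbers, not measurements of the asker's state machine; the real values would have to be read from an actual execution history.

```python
# Hard quota on history events for a single Standard workflow execution.
HISTORY_LIMIT = 25_000


def max_safe_iterations(events_per_loop: int,
                        startup_events: int = 0,
                        reserve: int = 50) -> int:
    """Return how many full loop iterations fit under the history quota.

    events_per_loop: events emitted by one pass through the loop (measured).
    startup_events:  events emitted before the loop starts.
    reserve:         headroom kept for the states that hand over to a new
                     execution (hypothetical value, tune as needed).
    """
    if events_per_loop <= 0:
        raise ValueError("events_per_loop must be positive")
    budget = HISTORY_LIMIT - startup_events - reserve
    return max(budget // events_per_loop, 0)


if __name__ == "__main__":
    # Illustrative numbers only.
    print(max_safe_iterations(events_per_loop=9, startup_events=2))
```

The result can then serve as the iteration threshold checked by a Choice state inside the loop.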

Another approach might be to use EventBridge Scheduler. When you need to start a session, you create a recurring schedule that invokes your state machine. Each execution runs a single iteration and exits; once the work is done, it deletes the schedule. This works only if your Wait state waits in whole-minute increments.
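A minimal sketch of what creating such a schedule could look like from Python. The helper name, the schedule name, the ARNs, and the payload shape are all placeholders chosen for illustration; the dictionary keys follow the `create_schedule` request of the boto3 `scheduler` client.

```python
import json


def build_schedule_request(name: str,
                           state_machine_arn: str,
                           role_arn: str,
                           payload: dict,
                           schedule_expression: str = "rate(5 minutes)") -> dict:
    """Build keyword arguments for the EventBridge Scheduler create_schedule call.

    All argument values are supplied by the caller; nothing here is specific
    to the asker's setup.
    """
    return {
        "Name": name,
        "ScheduleExpression": schedule_expression,
        # Fire exactly on schedule rather than within a flexible window.
        "FlexibleTimeWindow": {"Mode": "OFF"},
        "Target": {
            # Using the state machine ARN as the target starts an execution.
            "Arn": state_machine_arn,
            # Role that Scheduler assumes; it needs states:StartExecution.
            "RoleArn": role_arn,
            # The execution input must be passed as a JSON string.
            "Input": json.dumps(payload),
        },
    }


# Usage (requires boto3 and AWS credentials, so it is left commented out):
#   import boto3
#   scheduler = boto3.client("scheduler")
#   scheduler.create_schedule(**build_schedule_request(...))
#   ...and when the session ends:
#   scheduler.delete_schedule(Name="device-session-1234")
```

Because each scheduled execution performs one iteration and ends, no single execution accumulates a long history.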

AWS
EXPERT
Uri
answered a year ago
EXPERT
reviewed a year ago
  • Thanks. I think I'll modify my step function so that:

    • each state will increment a counter by the number of history events it generates (2 for the Sleep state, 5 for the Lambda invocation...);
    • update the choice state to check the value of the counter; and
    • add a state for starting a new execution with current state data before stopping the current execution.
  • Just note that not every state produces the same number of history entries: some states add 2 entries, others add more.
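The plan in the comments above could be sketched as follows. This Python snippet only generates an Amazon States Language fragment; the state names, the `session` and `eventCount` input fields, and the 24,000 threshold are hypothetical. It assumes the `States.MathAdd` intrinsic function and the `arn:aws:states:::states:startExecution` service integration are available in your region. Note that a counting state itself adds history events, so its increment should cover those too.

```python
import json

# Hypothetical threshold, kept below the 25,000 quota to leave headroom.
HISTORY_BUDGET = 24_000


def counting_state(increment: int, next_state: str) -> dict:
    """Pass state that adds `increment` to the running event counter."""
    return {
        "Type": "Pass",
        "Parameters": {
            "session.$": "$.session",  # hypothetical field carrying the loop data
            "eventCount.$": f"States.MathAdd($.eventCount, {increment})",
        },
        "Next": next_state,
    }


def continue_as_new_states(loop_state: str, budget: int = HISTORY_BUDGET) -> dict:
    """States that restart the workflow once the counter reaches `budget`."""
    return {
        "CountSleep": counting_state(2, "CheckHistoryBudget"),
        "CheckHistoryBudget": {
            "Type": "Choice",
            "Choices": [{
                "Variable": "$.eventCount",
                "NumericGreaterThanEquals": budget,
                "Next": "StartNewExecution",
            }],
            "Default": loop_state,  # otherwise keep looping
        },
        "StartNewExecution": {
            "Type": "Task",
            "Resource": "arn:aws:states:::states:startExecution",
            "Parameters": {
                # Context object: ARN of the state machine currently running.
                "StateMachineArn.$": "$$.StateMachine.Id",
                # Hand over the session data and reset the counter.
                "Input": {"session.$": "$.session", "eventCount": 0},
            },
            "End": True,  # the current execution finishes after the hand-over
        },
    }


if __name__ == "__main__":
    print(json.dumps(continue_as_new_states("Sleep"), indent=2))
```

The execution role would also need permission to call `states:StartExecution` on its own state machine.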

0

There is no simple way to get the number of events in the execution history. I would recommend starting with nested workflows. I am not familiar with your state machine, but I assume it has some sort of loop. In that case, either use the new Distributed Map state, which runs each iteration as a child execution within a Map Run, with its own history limit, or invoke a nested workflow, which also has its own limit. Furthermore, if appropriate for your use case, you can use Express Workflows for the nested ones; these have no history limit at all (they save the history to CloudWatch Logs).

AWS
EXPERT
Uri
answered a year ago
0

Yes, my state machine (see image) is essentially a big loop that monitors IoT devices' status updates. The idea is that it keeps looping for as long as the device is being used, so there is one execution for each device usage "session", which can last up to 24 hours.

[Image: Step Function scheme]

I'm afraid that Express Workflows are not suitable for my case (mainly because an execution can last at most five minutes).

I think that nesting workflows would mitigate but not solve the problem as it would slow down (but not stop) the growth of the execution history.

I'm not necessarily looking for a "simple" way to get the number of events in the execution history: I'm "just" looking for an "efficient" one.

Another quick and (very) dirty way to solve the problem would be to manually increase a counter each time a state is traversed, but I'd like to use a cleaner approach (the info I'm looking for must already be stored somewhere since it triggers the execution failure).
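For comparison, the Lambda route mentioned in the question need not download the whole history. A minimal sketch, assuming history event ids are assigned sequentially so that the newest event's `id` approximates the number of events so far (worth verifying against a real execution): ask `GetExecutionHistory` for a single event in reverse order. The function name is hypothetical, and the client is passed in so the logic can be exercised without AWS access.

```python
def current_history_length(sfn_client, execution_arn: str) -> int:
    """Approximate the history length from the newest event's id.

    sfn_client is expected to behave like boto3.client("stepfunctions").
    Assumption: event ids increase by one per event, starting at 1.
    """
    response = sfn_client.get_execution_history(
        executionArn=execution_arn,
        maxResults=1,        # fetch only one event...
        reverseOrder=True,   # ...the most recent one
    )
    events = response.get("events", [])
    return events[0]["id"] if events else 0


# Usage inside a Lambda handler (requires boto3, so left commented out).
# The execution ARN could be passed in from the state machine's context
# object via "$$.Execution.Id":
#   import boto3
#   length = current_history_length(boto3.client("stepfunctions"),
#                                   event["executionArn"])
```

This keeps the transferred data small, but each check still costs the history events of a Lambda task, so it does not remove the "wasted transitions" concern.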

cionzo
answered a year ago
