Step Functions: how to manage States.Runtime history events quota

0

Some of my Step Functions state machines can have (very) long-running executions, so the risk of reaching the maximum number of history events (25,000) is not negligible, and I am required to manage and track my state machines' failures.

How can I find out the current history length and, if necessary, terminate the current execution and start a new one with the current machine data?

Of course I can add a task and, via Lambda, retrieve the execution history and handle it there. But since that would be expensive in terms of Lambda execution time, data transferred, and "wasted" history events, I wonder whether I can achieve the same from inside the state machine: something like a Choice state that, if the history length is greater than, say, 24,950, stops this execution and starts another one.
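For reference, the Lambda-based check described above does not necessarily have to download the whole history. The following is a minimal sketch, assuming that history event ids are assigned sequentially so that the newest id approximates the event count. `latest_event_id` and `should_restart` are hypothetical helper names, and `sfn_client` is expected to be a boto3 Step Functions client (`boto3.client("stepfunctions")`).

```python
from typing import Any

HISTORY_EVENT_QUOTA = 25_000  # quota cited in the question


def latest_event_id(sfn_client: Any, execution_arn: str) -> int:
    """Return the id of the newest history event of one execution.

    Assumption: event ids are sequential, so the newest id approximates
    the number of events recorded so far.
    """
    response = sfn_client.get_execution_history(
        executionArn=execution_arn,
        maxResults=1,                # fetch a single event only
        reverseOrder=True,           # newest event first
        includeExecutionData=False,  # skip input/output payloads
    )
    events = response.get("events", [])
    return events[0]["id"] if events else 0


def should_restart(event_count: int, margin: int = 50) -> bool:
    """True once the execution is within `margin` events of the quota."""
    return event_count >= HISTORY_EVENT_QUOTA - margin
```

Requesting one event in reverse order keeps the transferred data small, but the call itself still costs a task invocation and its history events, which is the overhead the question is trying to avoid.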

Any hints?

cionzo
asked 1 year ago · 941 views
3 Answers
1
Accepted Answer

With regard to Express Workflows, I was referring only to what you do inside the loop, not to the entire state machine.

You are correct that using the nested route just delays the issue. It is a good solution if you have a max number of iterations, which is not the case in your state machine.

Given your situation, I would say the best approach is to count the number of iterations. Measure how many history events each loop iteration emits, and start a new execution of the state machine when the count approaches the limit.
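The iteration budget suggested here is simple arithmetic. A minimal sketch, assuming you have measured the number of history events per loop iteration from a real execution; `max_iterations`, `fixed_events` and `safety_margin` are hypothetical names chosen for illustration.

```python
HISTORY_EVENT_QUOTA = 25_000  # quota cited in the question


def max_iterations(events_per_iteration: int,
                   fixed_events: int = 0,
                   safety_margin: int = 50) -> int:
    """Number of loop iterations that fit in one execution's history.

    events_per_iteration: measured events emitted by one pass of the loop.
    fixed_events: events emitted once, outside the loop (start, hand-over).
    safety_margin: events kept in reserve before the quota is reached.
    """
    if events_per_iteration <= 0:
        raise ValueError("events_per_iteration must be positive")
    budget = HISTORY_EVENT_QUOTA - fixed_events - safety_margin
    return max(budget // events_per_iteration, 0)


# Example: 12 events per loop, 10 one-off events, default margin of 50.
print(max_iterations(12, fixed_events=10))  # 2078
```

The margin has to be large enough to cover the states that hand the work over to the next execution.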

Another approach might be to use EventBridge Scheduler. When you need to start a session, you create a recurring schedule that invokes your state machine. Each execution runs a single iteration and exits. When the session is done, the state machine deletes the schedule. This works only if your Wait state waits in increments of whole minutes.
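A minimal sketch of the scheduling request, assuming the boto3 EventBridge Scheduler client (`boto3.client("scheduler")`) and its `create_schedule` call. `build_schedule_request` is a hypothetical helper, and the names and ARNs passed to it are placeholders. The function only builds the arguments, so it can be inspected without AWS credentials.

```python
import json
from typing import Any


def build_schedule_request(name: str,
                           state_machine_arn: str,
                           role_arn: str,
                           every_minutes: int,
                           payload: dict[str, Any]) -> dict[str, Any]:
    """Build keyword arguments for a recurring schedule targeting a state machine."""
    if every_minutes < 1:
        raise ValueError("every_minutes must be at least 1")
    return {
        "Name": name,
        # Rate expressions are expressed in whole minutes, hours or days.
        "ScheduleExpression": f"rate({every_minutes} minutes)",
        "FlexibleTimeWindow": {"Mode": "OFF"},
        "Target": {
            "Arn": state_machine_arn,      # state machine to start
            "RoleArn": role_arn,           # role the scheduler assumes
            "Input": json.dumps(payload),  # execution input as a JSON string
        },
    }
```

The result could then be passed as `scheduler.create_schedule(**request)`, and the state machine (or a Lambda it calls) would remove the schedule with `delete_schedule(Name=...)` once the session ends. Each scheduled execution starts with a fresh history, so the quota is no longer a concern.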

Uri (AWS, Expert)
answered 1 year ago
reviewed by an Expert 1 year ago
  • Thanks. I think I'll modify my step function so that:

    • each state will increment a counter by the number of history events that state produces (2 for the Sleep state, 5 for the Lambda invocation, and so on);
    • update the choice state to check the value of the counter; and
    • add a state for starting a new execution with current state data before stopping the current execution.
  • Just note that not every state produces the same number of log entries. Some states might have 2 entries, some might have more.
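The counter plan from the comments above could look roughly like the following. This is a sketch, not the poster's actual state machine: it builds an Amazon States Language definition as a Python dict, uses the `States.MathAdd` intrinsic function in a Pass state to bump the counter, and uses the `arn:aws:states:::states:startExecution` service integration for the hand-over. The state names, the `$.counter.value` and `$.session` fields, and the `build_definition` helper are all hypothetical. The device-polling states and the normal exit condition are left out.

```python
import json
from typing import Any


def build_definition(state_machine_arn: str,
                     events_per_loop: int,
                     event_limit: int = 24_950) -> dict[str, Any]:
    """Loop skeleton that restarts itself before the history quota is hit.

    events_per_loop must include the events of the counting state itself.
    The execution input is assumed to contain {"counter": {"value": 0}}.
    """
    return {
        "StartAt": "CheckBudget",
        "States": {
            "CheckBudget": {
                "Type": "Choice",
                "Choices": [{
                    "Variable": "$.counter.value",
                    "NumericGreaterThanEquals": event_limit,
                    "Next": "StartNewExecution",
                }],
                "Default": "Sleep",
            },
            "Sleep": {"Type": "Wait", "Seconds": 60, "Next": "CountLoop"},
            "CountLoop": {
                "Type": "Pass",
                # Add this iteration's events; other input fields are kept.
                "Parameters": {
                    "value.$": f"States.MathAdd($.counter.value, {events_per_loop})"
                },
                "ResultPath": "$.counter",
                "Next": "CheckBudget",
            },
            "StartNewExecution": {
                "Type": "Task",
                "Resource": "arn:aws:states:::states:startExecution",
                "Parameters": {
                    "StateMachineArn": state_machine_arn,
                    # Carry the session data over with a reset counter.
                    "Input": {"counter": {"value": 0}, "session.$": "$.session"},
                },
                "End": True,
            },
        },
    }


definition = build_definition(
    "arn:aws:states:us-east-1:123456789012:stateMachine:Monitor", 6)
print(json.dumps(definition, indent=2))
```

Counting once per loop instead of once per state keeps the counting overhead to a single Pass state, at the cost of having to re-measure `events_per_loop` whenever the loop body changes.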

0

There is no simple way to get the number of events in the execution history. What I would recommend is to start with nested workflows. I am not familiar with your state machine, but I assume it has some sort of loop. In that case, either use the new Distributed Map state, which runs each iteration as a child workflow execution with its own history limit, or invoke a nested workflow, which also has its own limit. Furthermore, if appropriate for your use case, you can use Express Workflows for the nested ones, which are not subject to this history limit (they save their history to CloudWatch Logs).

Uri (AWS, Expert)
answered 1 year ago
0

Yes, my state machine (see image) is essentially a big loop for monitoring IoT devices' status updates. The idea is that it keeps looping for as long as the device is being used, so there is basically one execution for each device usage "session", which can last up to 24 hours.

[Image: Step Function scheme]

I'm afraid that Express Workflows are not suitable for my case, mainly because their executions can last at most five minutes.

I think that nesting workflows would mitigate but not solve the problem as it would slow down (but not stop) the growth of the execution history.

I'm not necessarily looking for a "simple" way to get the number of events in the execution history: I'm "just" looking for an "efficient" one.

Another quick and (very) dirty way to solve the problem would be to manually increase a counter each time a state is traversed, but I'd like to use a cleaner approach (the info I'm looking for must already be stored somewhere since it triggers the execution failure).

cionzo
answered 1 year ago
