Step Functions Wait state waits longer than the requested wait time

I have a Step Functions Wait state to which I pass the wait time in seconds. It is supposed to wait for 180 seconds, but in a few cases it waited for more than 1 hour. I also tested the state using the Test State feature, and there it waited for exactly 180 seconds with the same input data. I can provide the exact state machine definition if needed. Any idea why this might have happened?

asked 14 days ago · 82 views
2 Answers
Accepted Answer

Hello,

It is possible that your Step Functions executions were throttled by the 'StateTransition' quota.

[+] https://docs.aws.amazon.com/step-functions/latest/dg/service-quotas.html#service-limits-api-state-throttling
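As a rough illustration of why a large burst can be delayed: state transitions are throttled with a token bucket scheme, so transitions beyond the bucket size are admitted only as tokens refill. The sketch below estimates a lower bound on how long a burst takes to be admitted. The function name and the quota numbers are assumptions for illustration; the real bucket size and refill rate depend on the region and are listed on the quotas page linked above.

```python
import math


def min_drain_seconds(burst: int, bucket_size: int, refill_per_second: int) -> int:
    """Lower bound on the seconds needed to admit `burst` simultaneous transitions."""
    if burst <= bucket_size:
        return 0  # the whole burst fits in the bucket at once
    # The remaining transitions are admitted only as tokens refill.
    return math.ceil((burst - bucket_size) / refill_per_second)


# Hypothetical quota values, NOT taken from this thread or a specific region.
print(min_drain_seconds(burst=180_000, bucket_size=5_000, refill_per_second=1_500))
```

This is only a lower bound: each execution makes several transitions after leaving the Wait state, and throttled transitions are retried, so the observed delay can be much longer.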

To answer your question fully, we need details that are non-public, such as the execution history and the state machine definition, in order to troubleshoot further. Please open a support case with AWS using the following link:

[+] https://console.aws.amazon.com/support/home#/case/create

AWS
answered 13 days ago
  • Yes, this is the case. I checked the CloudWatch metrics and saw ExecutionThrottled.
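
For anyone who wants to run the same check from code, here is a minimal sketch that builds the parameters for a CloudWatch `get_metric_statistics` query on the `ExecutionThrottled` metric in the `AWS/States` namespace. The helper name, the ARN, the time window, and the UTC time zone are placeholders I made up; they are not from the original poster's setup.

```python
from datetime import datetime, timedelta, timezone


def throttle_metric_query(state_machine_arn, end_time, lookback_hours=3):
    """Build keyword arguments for CloudWatch get_metric_statistics."""
    return {
        "Namespace": "AWS/States",
        "MetricName": "ExecutionThrottled",
        "Dimensions": [{"Name": "StateMachineArn", "Value": state_machine_arn}],
        "StartTime": end_time - timedelta(hours=lookback_hours),
        "EndTime": end_time,
        "Period": 60,  # one data point per minute
        "Statistics": ["Sum"],
    }


# Hypothetical ARN and time window (time zone assumed to be UTC).
params = throttle_metric_query(
    "arn:aws:states:REGION:ACCOUNT_ID:stateMachine:MyStateMachine",
    end_time=datetime(2025, 5, 1, 22, 30, tzinfo=timezone.utc),
)

# With boto3 installed and credentials configured, the query could be run as:
#   import boto3
#   response = boto3.client("cloudwatch").get_metric_statistics(**params)
#   for point in sorted(response["Datapoints"], key=lambda p: p["Timestamp"]):
#       print(point["Timestamp"], point["Sum"])
```

Non-zero sums around the time the Wait states were due to exit would point to state transition throttling.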

The behavior you're describing, where a Step Functions Wait state is waiting significantly longer than the specified time, is unusual and not expected. Typically, the Wait state should pause the execution for the exact amount of time specified, whether it's in seconds or until a specific timestamp.

However, there are a few potential reasons why this might occur:

  1. Input Data Issues: Ensure that the input data for the Wait state is correct and consistent. If you're using a SecondsPath or TimestampPath to specify the wait time dynamically, make sure the path is correct and the data it's pointing to is valid.

  2. State Machine Definition: Double-check your state machine definition to ensure there are no unintended loops or transitions that could cause the execution to stay in a waiting state longer than expected.

  3. Service Limits: While unlikely, it's possible that you might be hitting service limits or experiencing service-side issues. Step Functions has a maximum execution time of 1 year for Standard workflows, so this shouldn't be the cause of your issue.

  4. Execution Role Permissions: Ensure that the IAM role associated with your state machine has the necessary permissions to execute all steps in your workflow.

  5. CloudWatch Logs: If you have enabled logging for your Step Functions, check the CloudWatch Logs for any error messages or unexpected behavior.

  6. Stuck Execution: In rare cases, executions can get stuck. Implementing a state machine-level timeout could help prevent this.

Since you mentioned that the Test State feature worked correctly, it's likely that the issue is related to how the state machine is being executed in your actual workflow rather than a problem with the Wait state itself.

To further diagnose the issue, you should:

  1. Enable detailed CloudWatch logging for your Step Functions if you haven't already.
  2. Monitor the execution closely using the Step Functions console or AWS CLI.
  3. Implement state machine-level timeouts to prevent executions from running indefinitely.
  4. If the problem persists, you may need to contact AWS Support for further investigation, as this behavior is not typical for Step Functions Wait states.
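
To make the timeout suggestion concrete, here is a minimal sketch of a state machine definition with a top-level `TimeoutSeconds`, written as a Python dict and printed as Amazon States Language JSON. The state names `ComputeWaitTime`, `WaitOp`, and `StartProcess` follow the workflow described in the comments of this thread; the Choice state name `CheckWaitTime`, the `$.waitTime` path, the Resource ARNs, and the one-hour timeout are assumptions for illustration.

```python
import json

# Assumptions: the Resource ARNs, the "CheckWaitTime" name, the "$.waitTime"
# path, and the 3600-second timeout are hypothetical placeholders.
definition = {
    "StartAt": "ComputeWaitTime",
    "TimeoutSeconds": 3600,  # the whole execution fails if it runs longer
    "States": {
        "ComputeWaitTime": {
            "Type": "Task",
            "Resource": "arn:aws:states:REGION:ACCOUNT_ID:activity:ComputeWaitTime",
            "Next": "CheckWaitTime",
        },
        "CheckWaitTime": {
            "Type": "Choice",
            "Choices": [
                {"Variable": "$.waitTime", "NumericEquals": 0, "Next": "StartProcess"}
            ],
            "Default": "WaitOp",
        },
        "WaitOp": {
            "Type": "Wait",
            "SecondsPath": "$.waitTime",  # wait time read from the state input
            "Next": "ComputeWaitTime",
        },
        "StartProcess": {
            "Type": "Task",
            "Resource": "arn:aws:states:REGION:ACCOUNT_ID:activity:StartProcess",
            "End": True,
        },
    },
}

print(json.dumps(definition, indent=2))
```

A timeout like this does not shorten a delayed wait; it only ensures that an execution that overruns fails with a `States.Timeout` error instead of lingering unnoticed.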

Remember, the Wait state in Step Functions considers time values up to seconds and truncates milliseconds, so precision beyond seconds is not guaranteed. However, this should not account for the significant delay you're experiencing.
Sources
Wait workflow state - AWS Step Functions
Best practices for Step Functions - AWS Step Functions
Troubleshooting issues in Step Functions - AWS Step Functions

answered 14 days ago
  • I have re-checked the input, the output, and the overall workflow. Everything seems fine. I have questions about two points:

    1. Service Limits: I triggered two workflows with about 180,000 executions in total, all of which were waiting at that wait state. About 5,000 of them got stuck there. Could this have happened because of a default service limit?
    2. Stuck Execution: You mentioned that in rare cases executions can get stuck. How can I check whether this was one of those rare cases?
  • Could you elaborate a bit more on how you are running those executions? Do those 180k executions run simultaneously? If not, what is the interval between them? Which region are you using?

  • Do those 180k executions run simultaneously?

    Yes, those executions run simultaneously. The sequence is as follows.

    1. The execution starts at a task state, 'ComputeWaitTime', which computes the wait time. It returns the time in seconds until 9:00 pm.
    2. It then goes to a choice state: if waitTime is 0, it proceeds to the task state 'StartProcess'; otherwise it proceeds to the wait state 'WaitOp'.
    3. In WaitOp, it is supposed to wait for the given time. The execution in question was triggered at 8:57 pm, so the wait time was 180 seconds. I confirmed that the input contained the same value.
    4. When the wait time is over, it goes back to ComputeWaitTime.
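
    Step 1 above can be sketched roughly as follows. This is not the actual ComputeWaitTime implementation, which is not shown in this thread; the function name, the use of naive local datetimes, and rounding up to whole seconds are all assumptions.

```python
import math
from datetime import datetime, time


def seconds_until(now: datetime, target: time = time(21, 0)) -> int:
    """Whole seconds from `now` until `target` on the same day, never negative."""
    target_dt = datetime.combine(now.date(), target)
    remaining = (target_dt - now).total_seconds()
    # Round up so a fractional remainder does not produce a zero-length wait
    # that loops straight back through ComputeWaitTime.
    return max(0, math.ceil(remaining))


# An execution triggered at 8:57 pm gets a 180-second wait.
print(seconds_until(datetime(2025, 5, 1, 20, 57, 0)))
```

    Once the clock is at or past 9:00 pm the function returns 0, which is the value the choice state uses to route to StartProcess.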

    In my scenario, I triggered those 180,000 executions between 8:51 pm and 8:59 pm, so all of them were waiting at the wait state with wait times ranging from 60 to 540 seconds. When the time reached 9:00 pm, most of them exited the wait state and moved to the next state. However, a few of them did not exit the wait state, including the one I mentioned above. In the Events of this execution, I saw this sequence:

    1. WaitStateEntered (WaitOp) at May 1, 2025, 20:57:02.651, Started After = 00:00:27.491
    2. WaitStateExited (WaitOp) at May 1, 2025, 22:01:16.243, Started After = 01:04:41.083
    3. TaskStateEntered (ComputeWaitTime) at May 1, 2025, 22:01:16.243, Started After = 01:04:41.083

    The WaitStateExited event should have occurred at May 1, 2025, 21:00:00.000, but it did not. Note: ComputeWaitTime is my own service task, for which I have a limited number of workers at a time. Could this be the issue?
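
    To quantify the delay from the event history, a small sketch like the one below computes how far the Wait state overshot the requested time. The timestamps are the ones from the events quoted above; the function name and the timestamp format string are my own assumptions about how the console output would be parsed.

```python
from datetime import datetime, timedelta

# Format of the timestamps as they appear in the quoted event history.
FMT = "%b %d, %Y, %H:%M:%S.%f"


def wait_overshoot(entered: str, exited: str, requested_seconds: int) -> timedelta:
    """Time spent in the Wait state beyond the requested duration."""
    entered_at = datetime.strptime(entered, FMT)
    exited_at = datetime.strptime(exited, FMT)
    return (exited_at - entered_at) - timedelta(seconds=requested_seconds)


overshoot = wait_overshoot(
    "May 1, 2025, 20:57:02.651",  # WaitStateEntered (WaitOp)
    "May 1, 2025, 22:01:16.243",  # WaitStateExited (WaitOp)
    requested_seconds=180,
)
print(overshoot)
```

    Running the same calculation over all affected executions would show whether the extra delay grows over time, which is what a throttled backlog would look like.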
