AWS State machine Stuck in Running State

0

I had a parent state machine that started a child state machine using a Distributed Map state. The child state machine completed, but the parent kept running for a day and I had to abort it. This issue only started to appear recently and occurs very infrequently.

I am using this integration to start the child state machine: arn:aws:states:::states:startExecution.sync

[Image: Parent state machine events]

[Image: Child state machine events, end section]

asked a year ago, 296 views
2 Answers
0
Accepted Answer

The scenario you shared can occur if the execution role used by the parent execution does not have permission to call DescribeExecution on the child. The startExecution.sync service integration implements two complementary mechanisms to report completion from the child back to the parent. The primary mechanism polls the DescribeExecution API action at regular intervals, which requires the execution role to have that permission. A secondary mechanism, meant to reduce latency, uses an EventBridge managed rule that captures Execution Status Change events from child executions. Due to timing factors, the latter is not 100% reliable, which is why the former remains required. A workflow whose execution role lacks the DescribeExecution permission will still work fine most of the time. But in the rare cases where the timing is just right, the event does not complete the parent, and with no polling to fall back on you can see the behavior you described.

Can you check to confirm that your execution role has these permissions?
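
For reference, here is a minimal sketch of the policy the parent's execution role would typically need for startExecution.sync, written as a small Python helper that only builds the JSON document. The region, account ID, and child state machine name are placeholders, the function name is made up for illustration, and the `aws` partition is assumed. The statement layout follows my understanding of the documented IAM policy for this integration, so please compare it against the current Step Functions documentation before relying on it.

```python
import json


def build_sync_execution_policy(region, account_id, child_name):
    """Build an IAM policy document for a parent that runs a child via .sync.

    All three arguments are placeholders supplied by the caller.
    """
    states_prefix = f"arn:aws:states:{region}:{account_id}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Lets the parent start the child state machine.
                "Effect": "Allow",
                "Action": ["states:StartExecution"],
                "Resource": [f"{states_prefix}:stateMachine:{child_name}"],
            },
            {
                # Lets the parent poll (and stop) child executions.
                # DescribeExecution is the permission discussed in this answer.
                "Effect": "Allow",
                "Action": ["states:DescribeExecution", "states:StopExecution"],
                "Resource": [f"{states_prefix}:execution:{child_name}:*"],
            },
            {
                # Lets Step Functions manage the EventBridge rule used by the
                # secondary, event-based completion mechanism.
                "Effect": "Allow",
                "Action": [
                    "events:PutTargets",
                    "events:PutRule",
                    "events:DescribeRule",
                ],
                "Resource": [
                    f"arn:aws:events:{region}:{account_id}:rule/"
                    "StepFunctionsGetEventsForStepFunctionsExecutionRule"
                ],
            },
        ],
    }


if __name__ == "__main__":
    # Example values only; substitute your own region, account, and name.
    policy = build_sync_execution_policy(
        "us-east-1", "123456789012", "ChildStateMachine"
    )
    print(json.dumps(policy, indent=2))
```

Note that DescribeExecution is granted on the execution ARN pattern rather than on the state machine ARN, so a role scoped only to the state machine resource would still fail the polling call.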

AWS
answered a year ago
EXPERT
reviewed a year ago
  • I only have StartExecution permissions for the child state machine. I will update my role to include the DescribeExecution permission as well. Thanks for the help.

-1

The issue you're experiencing with your parent state machine getting stuck in a running state, despite the child state machine completing, is likely related to a known problem in AWS Step Functions. This can occur when using the distributed map state to start child executions, especially with the .sync integration pattern.

When using arn:aws:states:::states:startExecution.sync to start a child state machine, the parent state machine waits for the child execution to complete before proceeding. However, in some cases, there can be a delay or failure in detecting the completion of the child execution, causing the parent to remain in a running state.

To address this issue, you can try the following approaches:

  1. Implement timeouts: Add a task-level timeout to the state that starts the child execution. This will prevent the parent from getting stuck indefinitely if there's an issue with detecting the child's completion.

  2. Use CloudWatch metrics: Monitor the ActivityScheduleTime CloudWatch metric and set up an alarm to alert you when this time increases abnormally. This can help you detect potential issues earlier.

  3. State machine-level timeout: Define a timeout at the state machine level by specifying a TimeoutSeconds field at the beginning of the state machine definition, outside of the States field. This will ensure that the entire execution times out if it runs for too long.

  4. Error handling: Implement robust error handling in your state machine, including Catch and Retry mechanisms, to gracefully handle potential failures in child execution detection.

  5. Check for service limits: Ensure that you're not hitting any Step Functions service limits, such as the maximum number of open executions (1,000,000 per AWS account per region).
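
As a rough illustration of items 1, 3, and 4, the sketch below builds a parent definition in Amazon States Language as a Python dictionary. It sets TimeoutSeconds both on the task that starts the child and at the top level of the definition, and catches States.Timeout on the task. The state names, the error name on the Fail state, and the helper function are hypothetical, and the example uses a single Task state rather than a distributed map to keep it short.

```python
import json


def build_parent_definition(child_arn, task_timeout_s, execution_timeout_s):
    """Build a parent ASL definition with task-level and execution-level timeouts.

    child_arn is a placeholder for the child state machine ARN.
    """
    if task_timeout_s >= execution_timeout_s:
        # Otherwise the whole execution would time out before the task does.
        raise ValueError("task timeout should be shorter than execution timeout")
    return {
        "Comment": "Parent that runs a child synchronously with timeouts",
        "StartAt": "RunChild",
        # Execution-level timeout: sits outside the States field.
        "TimeoutSeconds": execution_timeout_s,
        "States": {
            "RunChild": {
                "Type": "Task",
                "Resource": "arn:aws:states:::states:startExecution.sync",
                "Parameters": {
                    "StateMachineArn": child_arn,
                    "Input.$": "$",
                },
                # Task-level timeout: bounds how long the parent waits
                # for the child's completion to be reported.
                "TimeoutSeconds": task_timeout_s,
                "Catch": [
                    {"ErrorEquals": ["States.Timeout"], "Next": "ChildTimedOut"}
                ],
                "End": True,
            },
            "ChildTimedOut": {
                "Type": "Fail",
                "Error": "ChildCompletionNotDetected",  # hypothetical name
                "Cause": "Child did not report completion before the timeout",
            },
        },
    }


if __name__ == "__main__":
    definition = build_parent_definition(
        "arn:aws:states:us-east-1:123456789012:stateMachine:ChildStateMachine",
        task_timeout_s=3600,
        execution_timeout_s=7200,
    )
    print(json.dumps(definition, indent=2))
```

A timeout only bounds how long the parent stays stuck. It does not address why the child's completion went undetected.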

If the issue persists, you may want to consider alternative patterns for orchestrating your workflows or reach out to AWS support for further assistance, as this could be related to an underlying service issue that AWS needs to address.
Sources
Troubleshooting issues in Step Functions - AWS Step Functions
State machine structure in Amazon States Language for Step Functions workflows - AWS Step Functions

answered a year ago
