How can I use a SAW runbook to troubleshoot my AWS Batch job stuck in the RUNNABLE status?


My AWS Batch job is stuck in the RUNNABLE status. I want to use an AWS Support Automation Workflow (SAW) runbook to troubleshoot this.

Short Description

To help you troubleshoot, remediate, manage, and reduce costs on your AWS resources, AWS Support maintains a subset of the AWS-provided predefined runbooks. These runbooks are prefixed with AWSSupport- or AWSPremiumSupport-.

To troubleshoot an AWS Batch job that's stuck in the RUNNABLE status, use the AWSSupport-TroubleshootAWSBatchJob SAW runbook. This automated workflow verifies the job and its underlying infrastructure configuration.

For more information, see AWS Support Automation Workflows (SAW).

Resolution

The AWSSupport-TroubleshootAWSBatchJob runbook checks for the following conditions, each of which can cause an AWS Batch job to get stuck in the RUNNABLE status:

  • Your compute environment is in an INVALID or DISABLED state.
  • The compute environment's Max vCPU parameter isn't large enough to accommodate the job volume in the job queue.
  • Your jobs require more vCPUs or memory resources than what your compute environment's instance types can provide.
  • Your compute environment isn't configured to use GPU instances, but your job requires them.
  • The EC2 Auto Scaling group for the compute environment failed to launch instances.
  • The EC2 Auto Scaling group launched instances successfully, but the instances can't join the underlying Amazon ECS cluster.
  • An AWS Identity and Access Management (IAM) permissions issue is blocking actions that the job or compute environment requires.
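
To make the first two conditions concrete, here is a minimal Python sketch that inspects one compute environment description. It is not the runbook's implementation. It assumes a dictionary shaped like one entry of the AWS Batch DescribeComputeEnvironments response (the state, status, and computeResources.maxvCpus fields); the function name and the required_vcpus argument are hypothetical.

```python
from typing import Any, Dict, List


def compute_environment_findings(env: Dict[str, Any], required_vcpus: int) -> List[str]:
    """Return findings for one compute environment description (illustrative only)."""
    findings: List[str] = []
    # "status" reports health (for example VALID or INVALID).
    if env.get("status") == "INVALID":
        findings.append("compute environment status is INVALID")
    # "state" reports whether the environment accepts jobs (ENABLED or DISABLED).
    if env.get("state") == "DISABLED":
        findings.append("compute environment state is DISABLED")
    # maxvCpus caps how far the environment can scale out.
    max_vcpus = env.get("computeResources", {}).get("maxvCpus")
    if max_vcpus is not None and max_vcpus < required_vcpus:
        findings.append(
            f"maxvCpus ({max_vcpus}) is below the {required_vcpus} vCPUs required"
        )
    return findings


# Hypothetical description: enabled, but INVALID and too small for 8 vCPUs.
example = {"state": "ENABLED", "status": "INVALID", "computeResources": {"maxvCpus": 4}}
findings = compute_environment_findings(example, required_vcpus=8)
print(findings)
```

The remaining conditions (GPU requirements, Auto Scaling launch failures, ECS registration, and IAM permissions) need data from other services, which is why the runbook is the more practical tool.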

Prerequisites

Before you begin, make sure that your IAM user or role has the correct permissions. These permissions include AWS Systems Manager permissions and the following permissions for specific services:

  • cloudtrail:LookupEvents
  • iam:GetInstanceProfile
  • iam:GetRole
  • iam:ListRoles
  • iam:PassRole
  • iam:SimulateCustomPolicy
  • iam:SimulatePrincipalPolicy
  • sts:GetCallerIdentity
  • ecs:DescribeClusters
  • ecs:DescribeContainerInstances
  • ecs:ListContainerInstances
  • ssm:GetAutomationExecution
  • ssm:StartAutomationExecution
  • ssm:DescribeAutomationStepExecutions
  • ssm:DescribeAutomationExecutions
  • ec2:DescribeIamInstanceProfileAssociations
  • ec2:DescribeInstanceAttribute
  • ec2:DescribeInstances
  • ec2:DescribeInstanceTypeOfferings
  • ec2:DescribeInstanceTypes
  • ec2:DescribeNetworkAcls
  • ec2:DescribeRouteTables
  • ec2:DescribeSecurityGroups
  • ec2:DescribeSubnets
  • ec2:DescribeVpcEndpoints
  • ec2:DescribeVpcs
  • ec2:DescribeSpotFleetInstances
  • ec2:DescribeSpotFleetRequests
  • ec2:DescribeSpotFleetRequestHistory
  • batch:DescribeJobs
  • batch:DescribeJobQueues
  • batch:DescribeComputeEnvironments
  • batch:ListJobs
  • autoscaling:DescribeAutoScalingGroups
  • autoscaling:DescribeScalingActivities
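
One way to grant the permissions listed above is a single identity-based policy. The following Python sketch only assembles the policy JSON from that list. The Sid value and the "Resource": "*" scope are assumptions for illustration; in practice you might scope iam:PassRole to specific role ARNs, and the policy doesn't cover any other Systems Manager permissions that your setup requires.

```python
import json
from typing import Any, Dict, List

# Actions copied from the prerequisites list above.
RUNBOOK_ACTIONS: List[str] = [
    "cloudtrail:LookupEvents",
    "iam:GetInstanceProfile", "iam:GetRole", "iam:ListRoles", "iam:PassRole",
    "iam:SimulateCustomPolicy", "iam:SimulatePrincipalPolicy",
    "sts:GetCallerIdentity",
    "ecs:DescribeClusters", "ecs:DescribeContainerInstances",
    "ecs:ListContainerInstances",
    "ssm:GetAutomationExecution", "ssm:StartAutomationExecution",
    "ssm:DescribeAutomationStepExecutions", "ssm:DescribeAutomationExecutions",
    "ec2:DescribeIamInstanceProfileAssociations", "ec2:DescribeInstanceAttribute",
    "ec2:DescribeInstances", "ec2:DescribeInstanceTypeOfferings",
    "ec2:DescribeInstanceTypes", "ec2:DescribeNetworkAcls",
    "ec2:DescribeRouteTables", "ec2:DescribeSecurityGroups",
    "ec2:DescribeSubnets", "ec2:DescribeVpcEndpoints", "ec2:DescribeVpcs",
    "ec2:DescribeSpotFleetInstances", "ec2:DescribeSpotFleetRequests",
    "ec2:DescribeSpotFleetRequestHistory",
    "batch:DescribeJobs", "batch:DescribeJobQueues",
    "batch:DescribeComputeEnvironments", "batch:ListJobs",
    "autoscaling:DescribeAutoScalingGroups",
    "autoscaling:DescribeScalingActivities",
]


def build_policy(actions: List[str]) -> Dict[str, Any]:
    """Build an IAM policy document that allows the given actions."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "TroubleshootBatchJobRunbook",  # hypothetical statement ID
                "Effect": "Allow",
                "Action": sorted(actions),
                "Resource": "*",  # assumption: narrow this where possible
            }
        ],
    }


policy = build_policy(RUNBOOK_ACTIONS)
print(json.dumps(policy, indent=2))
```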

Run the AWSSupport-TroubleshootAWSBatchJob automation

  1. Open the AWSSupport-TroubleshootAWSBatchJob runbook. Note: Make sure that the AWS Region for the runbook matches the Region where the RUNNABLE job resides.
  2. Choose Execute automation.
  3. For Input parameters, enter the following information:
     • JobId: The ID of the AWS Batch job that's stuck in the RUNNABLE status.
     • (Optional) AutomationAssumeRole: The Amazon Resource Name (ARN) of the IAM role that allows Systems Manager Automation to perform the actions on your behalf. If you don't specify a role, then Systems Manager Automation uses the permissions of the user that starts this runbook.
  4. Choose Execute. This initiates the automation workflow.
  5. After the automation finishes, review the results in the Outputs section.

The runbook output provides troubleshooting steps, findings, and recommendations.
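
The console steps above can also be scripted. The following Python sketch is one possible approach, not an official procedure. It expects a Systems Manager client such as boto3.client("ssm") and calls its start_automation_execution and get_automation_execution operations. The helper names, the polling interval, and the set of statuses treated as final are assumptions.

```python
import time
from typing import Any, Dict, Optional

DOCUMENT_NAME = "AWSSupport-TroubleshootAWSBatchJob"

# Assumption: statuses treated as final for polling purposes.
TERMINAL_STATUSES = {
    "Success", "Failed", "Cancelled", "TimedOut",
    "CompletedWithSuccess", "CompletedWithFailure",
}


def build_start_request(job_id: str, assume_role_arn: Optional[str] = None) -> Dict[str, Any]:
    """Build keyword arguments for start_automation_execution."""
    # Automation parameters are passed as lists of strings.
    parameters = {"JobId": [job_id]}
    if assume_role_arn:
        parameters["AutomationAssumeRole"] = [assume_role_arn]
    return {"DocumentName": DOCUMENT_NAME, "Parameters": parameters}


def run_runbook(ssm: Any, job_id: str, assume_role_arn: Optional[str] = None,
                poll_seconds: float = 10.0) -> Dict[str, Any]:
    """Start the runbook and poll until the execution reaches a final status."""
    request = build_start_request(job_id, assume_role_arn)
    execution_id = ssm.start_automation_execution(**request)["AutomationExecutionId"]
    while True:
        execution = ssm.get_automation_execution(
            AutomationExecutionId=execution_id
        )["AutomationExecution"]
        if execution["AutomationExecutionStatus"] in TERMINAL_STATUSES:
            return execution  # runbook results are under execution["Outputs"]
        time.sleep(poll_seconds)


# Usage (requires boto3 and credentials; use the Region where the job resides):
#   import boto3
#   ssm = boto3.client("ssm", region_name="eu-west-3")
#   result = run_runbook(ssm, "<your-job-id>")
#   print(result.get("Outputs"))
```

Passing the client in as an argument keeps the Region choice explicit, which matters because the runbook must run in the same Region as the job.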

Example outputs for the AWSSupport-TroubleshootAWSBatchJob runbook

If the AWS Batch job doesn't exist in the current Region, then you get the following output:

#########################
EXECUTION RESULT SUMMARY
#########################
Here is the summary of the execution of this runbook:

[ERROR]: Job with ID "00000000-1111-2222-3333-444444444444" does not exist in this region (eu-west-3). Please verify that you are running this automation in the same region of the Job.
For details on how to review the job information, refer to the following documentation https://docs.aws.amazon.com/batch/latest/userguide/review-job-info.html



#######################
RUNBOOK EXECUTION LOGS
#######################

+++++++++++++++++++++++++++++++++
STEP:PreflightPermissionChecks
+++++++++++++++++++++++++++++++++
[INFO]: The IAM Identity used to execute the runbook has all required permissions, proceeding further for next steps in execution.

++++++++++++++++++++++++++++++
STEP:AWSBatchJobEvaluation
++++++++++++++++++++++++++++++
[ERROR]: Job with ID "30710dec-f7c1-48c1-ab4d-c8772687e6f0" does not exist in this region (eu-west-3). Please verify that you are running this automation in the same region of the Job.
For details on how to review the job information, refer to the following documentation https://docs.aws.amazon.com/batch/latest/userguide/review-job-info.html

If the underlying Auto Scaling group of the compute environment failed to launch instances, then you get the following output:

#########################
EXECUTION RESULT SUMMARY
#########################
Here is the summary of the execution of this runbook:

[ERROR]: Auto Scaling Group "AWSBatch-ComputeEnvironment-mHLZX9A7V7O13M95-asg-67af9df8-d688-3dbc-9838-753582669314" failed with error message:
 Instance became unhealthy while waiting for instance to be in InService state. Termination Reason: Client.InternalError: Client error on launch.
Please see the following link for instructions how to troubleshoot and remediate this issue https://docs.aws.amazon.com/autoscaling/ec2/userguide/ts-as-instancelaunchfailure.html#ts-as-instancelaunchfailure-12


#######################
RUNBOOK EXECUTION LOGS
#######################

+++++++++++++++++++++++++++++++++
STEP:PreflightPermissionChecks
+++++++++++++++++++++++++++++++++
[INFO]: The IAM Identity used to execute the runbook has all required permissions, proceeding further for next steps in execution.

++++++++++++++++++++++++++++++
STEP:AWSBatchJobEvaluation
++++++++++++++++++++++++++++++
[INFO]: Job with ID "30710dec-f7c1-48c1-ab4d-c8772687e6f0" exists and is in RUNNABLE status, proceeding further for next steps in execution.

++++++++++++++++++++++++++++++++++++++++++
STEP:BatchComputeEnvironmentEvaluation
++++++++++++++++++++++++++++++++++++++++++

[INFO]: Reviewing Compute Environment "ComputeEnvironment-mHLZX9A7V7O13M95":
[WARNING]: The automation detected that you are using BEST_FIT allocation strategy for your Compute Environment "ComputeEnvironment-mHLZX9A7V7O13M95".
In general, we recommend the BEST_FIT strategy only when you want the lowest cost for your instance, and you are willing to trade cost for throughput and availability.
To favor availability, consider using BEST_FIT_PROGRESSIVE for on-demand and SPOT_CAPACITY_OPTIMIZED for spot. For more information see https://docs.aws.amazon.com/batch/latest/userguide/allocation-strategies.html
[INFO]: Compute Environment: "ComputeEnvironment-mHLZX9A7V7O13M95" meets resource requirements to run the Job: "30710dec-f7c1-48c1-ab4d-c8772687e6f0".
Therefore, proceeding with the next checks...


++++++++++++++++++++++++++++++++++
STEP:UnderlyingInfraEvaluation
++++++++++++++++++++++++++++++++++
[ERROR]: Auto Scaling Group "AWSBatch-ComputeEnvironment-mHLZX9A7V7O13M95-asg-67af9df8-d688-3dbc-9838-753582669314" failed with error message:
 Instance became unhealthy while waiting for instance to be in InService state. Termination Reason: Client.InternalError: Client error on launch.
Please see the following link for instructions how to troubleshoot and remediate this issue https://docs.aws.amazon.com/autoscaling/ec2/userguide/ts-as-instancelaunchfailure.html#ts-as-instancelaunchfailure-12
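
If you collect runbook output from many executions, it can help to pull out only the warnings and errors for each step. The following Python sketch does that for text in the layout shown in the examples above. The parsing rules are inferred from those examples, not from a documented format, and the function name is hypothetical. Only the first line of each message is captured.

```python
import re
from typing import Dict, List, Tuple

STEP_RE = re.compile(r"^STEP:(\S+)\s*$")
LEVEL_RE = re.compile(r"^\[(INFO|WARNING|ERROR)\]:\s*(.*)$")


def findings_by_step(log_text: str,
                     levels: Tuple[str, ...] = ("WARNING", "ERROR")
                     ) -> Dict[str, List[Tuple[str, str]]]:
    """Group (level, message) pairs by the STEP header that precedes them."""
    findings: Dict[str, List[Tuple[str, str]]] = {}
    step = "SUMMARY"  # messages that appear before any STEP header
    for raw_line in log_text.splitlines():
        line = raw_line.strip()
        step_match = STEP_RE.match(line)
        if step_match:
            step = step_match.group(1)
            continue
        level_match = LEVEL_RE.match(line)
        if level_match and level_match.group(1) in levels:
            findings.setdefault(step, []).append(
                (level_match.group(1), level_match.group(2))
            )
    return findings


# Excerpt from the first example output above.
sample = """\
+++++++++++++++++++++++++++++++++
STEP:PreflightPermissionChecks
+++++++++++++++++++++++++++++++++
[INFO]: The IAM Identity used to execute the runbook has all required permissions, proceeding further for next steps in execution.

++++++++++++++++++++++++++++++
STEP:AWSBatchJobEvaluation
++++++++++++++++++++++++++++++
[ERROR]: Job with ID "30710dec-f7c1-48c1-ab4d-c8772687e6f0" does not exist in this region (eu-west-3). Please verify that you are running this automation in the same region of the Job.
For details on how to review the job information, refer to the following documentation https://docs.aws.amazon.com/batch/latest/userguide/review-job-info.html
"""

findings = findings_by_step(sample)
for step_name, messages in findings.items():
    for level, message in messages:
        print(f"{step_name}: [{level}] {message}")
```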

AWS OFFICIAL, updated 9 months ago