Using SAW to diagnose common issues in your AWS environment

12 minute read
Content level: Intermediate

This article demonstrates how AWS Support Automation Workflows (SAW) can automate troubleshooting to diagnose common issues in your AWS environment before you contact AWS Support.

Introduction

Manually troubleshooting and resolving system issues in your AWS environment can be repetitive and error-prone. AWS Support offers AWS Support Automation Workflows (SAW), a feature that provides self-service diagnosis and remediation. Built on AWS Systems Manager, SAW offers a curated collection of automation runbooks that simplify the troubleshooting process and provide resolution steps. You can use runbooks to quickly troubleshoot connectivity issues, diagnose permission errors, and reset Amazon Elastic Compute Cloud (Amazon EC2) access. Automation runbooks are typically prefixed with AWSSupport or AWSPremiumSupport. They are available for a range of AWS services, including Amazon EC2, Amazon Simple Storage Service (Amazon S3), Amazon Elastic Kubernetes Service (Amazon EKS), and Amazon Elastic Container Service (Amazon ECS).

In this article, you learn how to use SAW runbooks to diagnose common issues in your AWS environment before you contact AWS Support. The streamlined troubleshooting that these runbooks provide speeds up root cause analysis and remediation of system problems.

Solution overview

AWS Systems Manager is the operations hub for your AWS applications and resources and is a secure end-to-end management solution. SAW runbooks are automation runbooks that are built on top of AWS Systems Manager Automation. These runbooks help you troubleshoot common issues with your AWS resources, proactively monitor and identify network issues, and collect and analyze logs.

Common use cases

AWS developed SAW runbooks based on customer experience. When you contact AWS Support, AWS Support engineers resolve the issue and document the resolution. After observing trends in recurring issues, AWS built custom tools that promote self-service to better support you.

SAW draws on experience, best practices, and lessons learned over the years to eliminate repetitive, time-consuming, and manual customer tasks, which makes it a powerful tool for a variety of troubleshooting issues. You can use SAW to troubleshoot SSH connectivity issues, analyze Amazon EC2 disk usage, diagnose Amazon S3 issues, and collect essential logs on Amazon EKS or Amazon ECS environments. These runbooks can be particularly useful in situations where there is no SSH access to the EC2 instances, or when the instances are in a private subnet with a Systems Manager virtual private cloud (VPC) endpoint turned on. A few other use cases include the following:

  • Diagnose, troubleshoot, and provide remediation: Use AWSPremiumSupport-TroubleshootEC2DiskUsage to investigate and potentially remediate issues with EC2 instance disk usage. You can also use this runbook to automate the extension of volumes, partitions, and file systems at the operating system level.
  • Turn on automatic management analysis and configuration update: Use AWSSupport-EnableVPCFlowLogs to configure Amazon VPC Flow Logs for multiple subnets, network interfaces, and VPCs in your AWS account.
  • Cost optimization and operational review: Use AWSPremiumSupport-PostgreSQLWorkloadReview to capture multiple snapshots of your Amazon Relational Database Service (Amazon RDS) for PostgreSQL database usage statistics.
  • Log collection for diagnostic purposes: Use AWSSupport-CollectEKSInstanceLogs to collect operating system level log files from Amazon EKS nodes for troubleshooting cluster issues.

To see other runbooks that can help with your use cases, see AWS Support Automation Workflows (SAW).

Solution walkthrough

Meet the prerequisites

View Systems Manager documents

To view a particular document, search the Systems Manager document store through either free text search or a filter-based search. For more information, see Searching for SSM documents.

To find all SAW runbooks that are managed by AWS Support, enter AWSSupport in the search box.
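The same search can be sketched with the AWS CLI. The following is a minimal example, not an official procedure: the function name list_saw_runbooks is hypothetical, and it assumes that your AWS CLI credentials and default Region are already configured. It relies on the Name filter of aws ssm list-documents, which matches documents by name prefix.

```shell
# List Systems Manager documents whose names start with a prefix.
# $1: name prefix (default "AWSSupport-"); pass "AWSPremiumSupport-"
#     to list the premium runbooks instead.
# list_saw_runbooks is a hypothetical helper name, not part of the AWS CLI.
list_saw_runbooks() {
    prefix="${1:-AWSSupport-}"
    aws ssm list-documents \
        --filters "Key=Name,Values=${prefix}" \
        --query 'DocumentIdentifiers[].Name' \
        --output text | tr '\t' '\n' | sort
}
```

For example, list_saw_runbooks "AWSPremiumSupport-" prints one runbook name per line, sorted alphabetically.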

Run automations

Use the Systems Manager console to run an automation. For more information, see Running a simple automation (console).

For example, you can use the AWSSupport-ListEC2Resources Automation runbook to see information about your EC2 instances and related resources. For more information, see How do I use the AWSSupport-ListEC2Resources Automation runbook to get information for all the EC2 resources in my account?

To run an automation with the AWS Command Line Interface (AWS CLI), see Running a simple automation (command line).

If you receive errors when you run AWS CLI commands, make sure that you’re using the most recent version of the AWS CLI.

For example, run the following command to list your EC2 resources in all AWS Regions:

aws ssm start-automation-execution --document-name "AWSSupport-ListEC2Resources" --parameters '{"RegionsToQuery":["All"]}'

The output looks like the following:

{
    "AutomationExecutionId": "6053b7c6-7ec7-4b9b-b52c-04ddd912ede1"
}

To retrieve the status of the automation, run the following command:

aws ssm describe-automation-executions \
    --filters "Key=ExecutionId,Values=6053b7c6-7ec7-4b9b-b52c-04ddd912ede1"

Replace 6053b7c6-7ec7-4b9b-b52c-04ddd912ede1 in the command with your automation execution ID.

The output looks like the following:

{
    "AutomationExecutionMetadataList": [
        {
            "AutomationExecutionId": "6053b7c6-7ec7-4b9b-b52c-04ddd912ede1",
            "DocumentName": "AWSSupport-ListEC2Resources",
            "DocumentVersion": "7",
            "AutomationExecutionStatus": "InProgress",
            "ExecutionStartTime": "2024-03-13T09:53:45.685000+00:00",
            "ExecutedBy": "arn:aws:sts::123456789012:assumed-role/Administrator/Admin",
            "LogFile": "",
            "Outputs": {},
            "Mode": "Auto",
            "CurrentStepName": "listVolumes",
            "CurrentAction": "aws:executeScript",
            "Targets": [],
            "ResolvedTargets": {
                "ParameterValues": [],
                "Truncated": false
            },
            "AutomationType": "Local"
        }
    ]
}
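Because the status in this output is InProgress, you might want to wait until the automation finishes before you read its results. The following sketch polls aws ssm get-automation-execution until the status leaves the running states. The function name wait_for_automation and the 10-second default interval are assumptions for illustration, and the list of running states covers only the common ones.

```shell
# Poll an automation until its status is no longer Pending, InProgress,
# Waiting, or Cancelling, then print the final status.
# $1: automation execution ID; $2: seconds between polls (default 10).
# wait_for_automation is a hypothetical helper name.
wait_for_automation() {
    execution_id="$1"
    interval="${2:-10}"
    while :; do
        status="$(aws ssm get-automation-execution \
            --automation-execution-id "$execution_id" \
            --query 'AutomationExecution.AutomationExecutionStatus' \
            --output text)" || return 1
        case "$status" in
            Pending|InProgress|Waiting|Cancelling) sleep "$interval" ;;
            *) break ;;
        esac
    done
    echo "$status"
    # Succeed only when the automation itself succeeded.
    [ "$status" = "Success" ]
}
```

The function returns a zero exit code only for the Success status, so you can chain it with other commands, for example wait_for_automation "$id" && echo "done".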

Use cases

Troubleshoot SSH connectivity issues

The AWSSupport-TroubleshootSSH runbook resolves typical Linux SSH issues. This SAW runbook installs the Amazon EC2Rescue tool for Linux (ec2rl) and tries to fix common issues that prevent a remote connection to the Linux machine through SSH. It also supports your EC2 instance if the instance isn’t managed by Systems Manager.

When you can’t connect to your EC2 instance, run the AWSSupport-TroubleshootSSH automation to diagnose the issue and fix it, regardless of whether you installed SSM Agent in your EC2 instance. For more information, see I'm receiving errors when trying to connect to my EC2 instance using SSH. How can I use the AWSSupport-TroubleshootSSH automation workflow to troubleshoot SSH connection issues?

Recover Amazon EC2 access and password

The AWSSupport-ResetAccess runbook aims to resolve access issues that are caused by missing Amazon EC2 management permissions on Windows or Linux. It uses the EC2Rescue tool on the specified EC2 instance to turn on password decryption through the Amazon EC2 console (Windows) or to generate and add a new SSH key pair (Linux).

If you lost your key pair for your EC2 Windows instance, then this automation creates a password-enabled AMI that you can use to launch a new EC2 instance with a new key pair.

Note: This automation stops the instance. Data on attached instance store volumes doesn't persist when an instance stops, so make sure that you have no data that you need on those volumes before you run this automation. The public IP address also changes if you aren't using an Elastic IP address.

Run this automation to recover access to your EC2 instance. When you run the automation, for SubnetId, select the subnet ID of your problematic EC2 instance, or keep the default value CreateNewVPC.

After this step is complete, you can find the instructions for the next steps in the Outputs section. This runbook creates a backup AMI and a password-enabled AMI. Use the password-enabled AMI to launch a new EC2 Windows instance with a new key pair.

For more information, see Reset passwords and SSH keys on EC2 instances.

You can use the same runbook to recover SSH access to an EC2 Linux instance. In this case, the runbook generates a new SSH private key and stores it as a secure string in AWS Systems Manager Parameter Store with the name /ec2rl/openssh/<instance ID>/key. Decrypt the parameter value to get the private key.
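As a rough sketch of the decryption step, the following function reads the parameter with aws ssm get-parameter and the --with-decryption option, and then saves the key to a file that only the owner can read. The function name fetch_reset_key and the default file name are assumptions for illustration. The example also assumes that your credentials are allowed to read and decrypt the parameter.

```shell
# Save the SSH private key that the runbook stored in Parameter Store.
# $1: instance ID; $2: output file (default "<instance ID>.pem").
# fetch_reset_key is a hypothetical helper name.
fetch_reset_key() {
    instance_id="$1"
    key_file="${2:-${instance_id}.pem}"
    (
        umask 077   # new files are created without group or other access
        aws ssm get-parameter \
            --name "/ec2rl/openssh/${instance_id}/key" \
            --with-decryption \
            --query 'Parameter.Value' \
            --output text > "$key_file"
    ) || return 1
    # Tighten permissions in case the file already existed.
    chmod 600 "$key_file" && echo "$key_file"
}
```

You can then pass the saved file to ssh with the -i option. Consider deleting the parameter after you retrieve the key if you no longer need it there.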

Analyze EC2 disk usage and extend the free disk space

If you’re an AWS Business, Enterprise On-Ramp, or Enterprise Support plan customer, then you can also use the AWSPremiumSupport-TroubleshootEC2DiskUsage runbook to analyze your Amazon Elastic Block Store (Amazon EBS) volume usage and automate the extension of partitions and file systems at the operating system level. It's available in AWS Regions, AWS China Regions, and AWS GovCloud (US) Regions, and supports both Windows and Linux instances. For more information, see How do I automatically evaluate and remediate the increasing volume on an Amazon EC2 instance when free disk space is low?

For example, if you’re running out of free disk space in your Linux file system and you want to expand it, then you must increase the size of your EBS volume, and then run multiple Linux command line tools to extend your file system. If you need to expand the disk space on a large group of EC2 instances, then this work can take a significant amount of time and effort and create an operational burden.

You can use this automation runbook with the aws ssm start-automation-execution command or the StartAutomationExecution API to run the automation in batches. The runbook diagnoses disk usage and can remediate issues by extending the volume and file system.
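A batch run can be sketched as a loop that starts one automation for each instance ID. This is an illustration under assumptions rather than a complete procedure: the function name run_disk_usage_batch is hypothetical, and the InstanceId parameter name is an assumption that you should check against the runbook reference. Other parameters that the runbook might need, such as an IAM role or the volume expansion settings, are omitted here.

```shell
# Start one disk usage automation per instance ID and print
# "<instance ID><TAB><automation execution ID>" for each one that started.
# run_disk_usage_batch is a hypothetical helper name, and InstanceId is an
# assumed parameter name. Verify it in the runbook reference.
run_disk_usage_batch() {
    for instance_id in "$@"; do
        execution_id="$(aws ssm start-automation-execution \
            --document-name "AWSPremiumSupport-TroubleshootEC2DiskUsage" \
            --parameters "InstanceId=${instance_id}" \
            --query 'AutomationExecutionId' \
            --output text)" || {
                echo "failed to start automation for ${instance_id}" >&2
                continue
            }
        printf '%s\t%s\n' "$instance_id" "$execution_id"
    done
}
```

A failed start for one instance is reported on standard error and doesn't stop the loop, so the remaining instances are still processed.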

Collect essential logs from Amazon EKS or Amazon ECS nodes

When an instance that’s added to an EKS or ECS cluster encounters an issue, you can collect important system level information by running the EKS logs collector or ECS logs collector scripts. This process helps you collect logs and details, such as kubelet logs, ECS agent logs, and system configurations. However, running the script and collecting logs can be difficult when the EC2 instance doesn’t provide SSH access or is restricted by a private network configuration. The AWSSupport-CollectEKSInstanceLogs and AWSSupport-CollectECSInstanceLogs runbooks can assist with this problem. These runbooks automate the log collection process. They run the log collector scripts and store the result on the EC2 instance. They also provide an option to upload the log bundle to an S3 bucket. This option is useful when you need to review multiple instances at one time in a central place.

Run this automation for an Amazon EKS node.

Run this automation for an Amazon ECS node.
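If you prefer the AWS CLI, the two runbooks can be started with one small wrapper, sketched below. The function name collect_node_logs is hypothetical. The parameter names EKSInstanceId, ECSInstanceId, and LogDestination are assumptions that you should confirm in the runbook reference before you rely on them.

```shell
# Start log collection on an Amazon EKS or Amazon ECS node.
# $1: "eks" or "ecs"; $2: EC2 instance ID; $3: optional S3 bucket name
#     that receives the log bundle.
# collect_node_logs is a hypothetical helper name. EKSInstanceId,
# ECSInstanceId, and LogDestination are assumed parameter names.
collect_node_logs() {
    case "$1" in
        eks) document="AWSSupport-CollectEKSInstanceLogs"; id_param="EKSInstanceId" ;;
        ecs) document="AWSSupport-CollectECSInstanceLogs"; id_param="ECSInstanceId" ;;
        *)   echo "usage: collect_node_logs eks|ecs INSTANCE_ID [BUCKET]" >&2
             return 2 ;;
    esac
    parameters="${id_param}=$2"
    if [ -n "${3:-}" ]; then
        parameters="${parameters},LogDestination=$3"
    fi
    aws ssm start-automation-execution \
        --document-name "$document" \
        --parameters "$parameters" \
        --query 'AutomationExecutionId' \
        --output text
}
```

The function prints the automation execution ID, which you can use to check the status and outputs of the run.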

Troubleshoot EKS cluster

The AWSSupport-TroubleshootEKSWorkerNode runbook helps you quickly identify possible causes for why a Kubernetes worker node fails to join your EKS cluster.

Run this automation to identify and troubleshoot common causes that prevent worker nodes from joining a cluster. The runbook shows the possible errors and causes in the Outputs section.

If you have the AWS Business, Enterprise On-Ramp, or Enterprise Support plan, then you have access to the AWSPremiumSupport-TroubleshootEKSCluster runbook. For more information, see How can I troubleshoot errors in my Amazon EKS environment after I create a cluster?
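The Outputs section is also available from the AWS CLI. The following sketch prints the status, failure message, and outputs of a finished automation in one call. The function name show_automation_result is hypothetical, and the keys inside Outputs depend on the runbook that you ran.

```shell
# Print the status, failure message (if any), and outputs of an automation.
# $1: automation execution ID.
# show_automation_result is a hypothetical helper name.
show_automation_result() {
    aws ssm get-automation-execution \
        --automation-execution-id "$1" \
        --query 'AutomationExecution.{Status: AutomationExecutionStatus, Failure: FailureMessage, Outputs: Outputs}' \
        --output json
}
```

The JSON output is convenient to save next to a support case or to inspect with a JSON tool of your choice.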

Troubleshoot Amazon ECS issues

The AWSSupport-TroubleshootECSTaskFailedToStart runbook can help you troubleshoot why an Amazon ECS task in an ECS cluster doesn’t start. By automatically reviewing the configuration and testing connectivity, the runbook streamlines the analysis of common issues that can prevent a task from starting. Also, it provides you actionable guidance to fix the problem. Run this automation to troubleshoot why an Amazon ECS task fails to start.

For instance, if you have an Amazon ECS task in a public subnet that fails to start with a connectivity error, then you typically check the route table, network settings, and network configuration of your VPC and subnets.

To simplify this analysis, run this automation with the input parameters ClusterName and TaskId.

In this case, the output mentions that you didn’t turn on the auto-assign public IP option when you launched your Amazon ECS task. You can also see the remediation steps in the output.
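To find task IDs to pass to the runbook, one option is to list the stopped tasks in the cluster first. The following sketch does this with aws ecs list-tasks and starts one automation for each task with the ClusterName and TaskId parameters. The function name troubleshoot_stopped_tasks is hypothetical. Note that the list can include tasks that stopped for reasons other than a failed start, and that Amazon ECS lists stopped tasks only for a limited time.

```shell
# Start the troubleshooting runbook for every stopped task in a cluster and
# print "<task ID><TAB><automation execution ID>" for each one that started.
# $1: Amazon ECS cluster name.
# troubleshoot_stopped_tasks is a hypothetical helper name.
troubleshoot_stopped_tasks() {
    cluster="$1"
    task_arns="$(aws ecs list-tasks \
        --cluster "$cluster" \
        --desired-status STOPPED \
        --query 'taskArns[]' \
        --output text)" || return 1
    for task_arn in $task_arns; do
        task_id="${task_arn##*/}"   # the task ID is the last segment of the ARN
        execution_id="$(aws ssm start-automation-execution \
            --document-name "AWSSupport-TroubleshootECSTaskFailedToStart" \
            --parameters "ClusterName=${cluster},TaskId=${task_id}" \
            --query 'AutomationExecutionId' \
            --output text)" || continue
        printf '%s\t%s\n' "$task_id" "$execution_id"
    done
}
```

For a single known task, the loop isn't needed: run the aws ssm start-automation-execution command from the function body with your cluster name and task ID.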

Conclusion

This article shares several SAW use cases and demonstrates how SAW runbooks can simplify your troubleshooting process. To learn more, explore the following resources:

AWS Support engineers and Technical Account Managers (TAMs) can help you with general guidance, best practices, troubleshooting, and operational support on AWS. To learn more about our plans and offerings, see AWS Support.


About the author

Eason Cao

Eason is a Senior Cloud Support Engineer with over 5 years of industry experience specializing in AWS container solutions. As a subject matter expert in container services at AWS, he is dedicated to assisting customers in overcoming cloud environment challenges and optimizing distributed systems.
