Troubleshooting managed node issues in Systems Manager with SAW
This article introduces an example architecture to monitor and automatically analyze AWS Systems Manager managed node issues using an AWS Support Automation Workflow (SAW) runbook.
Introduction
AWS Support engineers often see customers reporting issues related to their Amazon Elastic Compute Cloud (Amazon EC2) instances that aren't registered as a managed node in Systems Manager. Checking security groups, network settings, and permissions to resolve these issues can be time-consuming.
AWS Support Engineering created SAW to assist with troubleshooting, diagnosis, and remediation of common issues with AWS resources. The SAW framework helps you reduce the time that you take to troubleshoot by removing the typical manual operations that you otherwise need.
In this article, you will learn how to use SAW to automate your troubleshooting process. You will also learn how to configure your architecture with SAW to monitor and automatically analyze managed node issues in Systems Manager.
Solution overview
The first part of this solution presents information on how to use a SAW runbook to troubleshoot an issue with your EC2 instance that’s not registering as a managed node in Systems Manager. The second part shows how you can configure your architecture to automate this troubleshooting process and accelerate issue resolution.
Part 1 - Identify the root cause with SAW
To determine why Systems Manager doesn't show an Amazon EC2 instance as a managed node, complete the following steps:
1. Use the AWSSupport-TroubleshootManagedInstance runbook. For more information, see How can I troubleshoot why Systems Manager doesn’t show an Amazon EC2 instance as a managed instance?
2. After the automation completes, review the Outputs section for detailed results.
For example, if the issue is caused because the AWS Identity and Access Management (IAM) instance profile doesn’t have the required permissions, then the Outputs section shows the following details:
3. Fix any issues that you identify from the results.
For example, to fix the preceding issue, make sure that you add the required permissions to the IAM instance profile. Then, check if the EC2 instance is registered as a managed node in Systems Manager. To do this, run the AWS Command Line Interface (AWS CLI) command describe-instance-information:
aws ssm describe-instance-information --filters "Key=InstanceIds,Values=example-instance"
Replace example-instance in the preceding command with the instance ID of the EC2 instance.
Note: If you receive errors when running AWS CLI commands, make sure that you’re using the most recent version of the AWS CLI. If the command runs successfully and returns details of your instance, then the instance is registered as a managed node in Systems Manager.
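The same registration check can be scripted. The following is a minimal Python sketch, not part of the original walkthrough. It assumes that you pass in a boto3 Systems Manager client (for example, boto3.client("ssm")). The function name is_managed_node is hypothetical.

```python
def is_managed_node(ssm_client, instance_id):
    """Return True if Systems Manager lists the instance as a managed node.

    ssm_client is assumed to be a boto3 SSM client, or any object that
    exposes the same describe_instance_information method.
    """
    response = ssm_client.describe_instance_information(
        Filters=[{"Key": "InstanceIds", "Values": [instance_id]}]
    )
    # An unregistered instance produces an empty InstanceInformationList.
    return any(
        info.get("InstanceId") == instance_id
        for info in response.get("InstanceInformationList", [])
    )
```

An empty result only means that the instance isn't registered yet. It doesn't explain why, which is what the runbook output is for.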
Part 2 - Automate issue detection with SAW
You can configure your architecture to use SAW to automatically detect an issue with Systems Manager managed nodes and determine the root cause.
Make sure that you meet the following prerequisites:
- You installed and configured the AWS SAM CLI on a local development workstation.
- You activated the notification settings.
You can activate your notification settings in one of the following ways:
- Subscribe your email address to an Amazon Simple Notification Service (Amazon SNS) topic that you created.
- Build a workflow in Slack with webhooks.
If you’re building a workflow in Slack with webhooks, complete the following steps to set up your custom variables:
- Choose Edit next to Starts when an app or service sends a web request.
- Under Set up variables, choose Add Variable.
- For Key, enter main. For Data type, enter text. Then, choose Save.
- Choose Add Variable.
- For Key, enter thread. For Data type, enter text. Then, choose Save.
- Choose Edit next to Send this message to. On the Send a message page, for Send this message to, select the Slack channel where you want to receive the notifications. Then, choose Insert a variable. For Message text, select main. Then, choose Save.
- Choose Edit next to Send this message to. On the Send a message page, for Send this message to, select Message thread. Then, choose Insert a variable. For Message text, select thread.
To see the sample code of the walkthrough, see AWS SAW Monitoring And Automatic Analysis Architecture on the GitHub website.
The following diagram illustrates the high-level architecture of the suggested solution.
The architecture includes the following components:
Monitoring: Amazon EventBridge detects the launch of your EC2 instance. In EventBridge, an event pattern matches the EC2 instance running state, and then the rule starts the AWS Step Functions state machine.
# Event pattern
{
  "detail-type": ["EC2 Instance State-change Notification"],
  "source": ["aws.ec2"],
  "detail": {
    "state": ["running"]
  }
}
After you launch the EC2 instance, the Systems Manager Agent might take up to 5 minutes to start. Because of this, the state machine waits for a few minutes before performing the analysis.
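To see which events the preceding pattern selects, you can evaluate it locally. The following Python sketch is a simplified matcher written for illustration. It handles only exact-value matching on nested keys and isn't EventBridge's actual matching engine. The matches function and the sample events are hypothetical.

```python
EVENT_PATTERN = {
    "detail-type": ["EC2 Instance State-change Notification"],
    "source": ["aws.ec2"],
    "detail": {"state": ["running"]},
}


def matches(pattern, event):
    """Simplified check: every pattern key must exist in the event, and
    each leaf value must be one of the values listed in the pattern."""
    for key, expected in pattern.items():
        if key not in event:
            return False
        actual = event[key]
        if isinstance(expected, dict):
            # Nested pattern: recurse into the nested event object.
            if not isinstance(actual, dict) or not matches(expected, actual):
                return False
        elif actual not in expected:
            return False
    return True


# Hypothetical sample events, trimmed to the fields that the pattern inspects.
running_event = {
    "detail-type": "EC2 Instance State-change Notification",
    "source": "aws.ec2",
    "detail": {"instance-id": "i-1234567890abcdef0", "state": "running"},
}
stopped_event = {**running_event, "detail": {**running_event["detail"], "state": "stopped"}}
```

With this sketch, matches(EVENT_PATTERN, running_event) is true and matches(EVENT_PATTERN, stopped_event) is false, so only instances that enter the running state start the state machine.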
Analysis: Step Functions completes the following steps:
- Run the SAW runbook AWSSupport-TroubleshootManagedInstance when the EC2 instance isn’t registered as a managed node.
- To check if the SAW analysis is complete, call the DescribeAutomationExecutions API on a regular basis.
- Invoke an AWS Lambda function after the SAW analysis is completed.
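The first two analysis steps can be sketched in Python as follows. This is an illustration of the start-then-poll loop, not the state machine definition from the sample repository. It assumes a boto3 Systems Manager client is passed in. The function names, the polling interval, and the set of statuses treated as final are assumptions.

```python
import time

RUNBOOK_NAME = "AWSSupport-TroubleshootManagedInstance"
# Assumption: these statuses are treated as final. The API defines more.
TERMINAL_STATUSES = {"Success", "Failed", "TimedOut", "Cancelled"}


def start_analysis(ssm_client, instance_id):
    """Start the SAW runbook for one instance and return the execution ID."""
    response = ssm_client.start_automation_execution(
        DocumentName=RUNBOOK_NAME,
        Parameters={"InstanceId": [instance_id]},
    )
    return response["AutomationExecutionId"]


def wait_for_analysis(ssm_client, execution_id, poll_seconds=15,
                      max_polls=40, sleep=time.sleep):
    """Poll DescribeAutomationExecutions until the execution finishes."""
    for _ in range(max_polls):
        response = ssm_client.describe_automation_executions(
            Filters=[{"Key": "ExecutionId", "Values": [execution_id]}]
        )
        executions = response.get("AutomationExecutionMetadataList", [])
        if executions:
            status = executions[0]["AutomationExecutionStatus"]
            if status in TERMINAL_STATUSES:
                return status
        # Not finished yet: wait before the next poll.
        sleep(poll_seconds)
    raise TimeoutError(f"Automation {execution_id} did not finish in time")
```

In Step Functions, the same loop is typically expressed with Wait and Choice states rather than a blocking sleep, so that no compute runs while the automation is in progress.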
Notification: Lambda formats the strings for notifications. Then, it sends notifications through Slack or email based on your configuration.
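As an illustration of the notification step, the following Python sketch builds a webhook request body that uses the main and thread variables from the Slack workflow setup. It isn't the Lambda function from the sample repository. The function names, the message wording, and the findings parameter are assumptions.

```python
import json
import urllib.request


def build_slack_payload(instance_id, status, findings):
    """Build the JSON body for the Slack workflow webhook.

    The keys "main" and "thread" correspond to the workflow variables
    configured in the prerequisites. The message wording is illustrative.
    """
    main = f"SAW analysis for {instance_id} finished with status {status}"
    thread = "\n".join(f"- {line}" for line in findings) or "No findings reported."
    return {"main": main, "thread": thread}


def post_to_slack(webhook_url, payload):
    """Send the payload to the webhook URL and return the HTTP status code."""
    request = urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return response.status
```

Keeping the formatting separate from the HTTP call makes it possible to reuse the same strings for the email path.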
Solution walkthrough
This section covers the solution walkthrough for automatically detecting the managed node issues with SAW. To see the sample code of the walkthrough, see aws-samples on the GitHub website.
1. Run the following commands to register a SlackWebHookUrl in AWS Secrets Manager:
$ export SLACK_WEB_HOOK_URL="YOUR_SLACK_WEB_HOOK_URL"
$ export SECRET_NAME="YOUR_SECRET_NAME"
$ aws secretsmanager create-secret --name "${SECRET_NAME}" --secret-string "${SLACK_WEB_HOOK_URL}"
2. Clone the repository to a local development workstation:
$ git clone https://github.com/aws-samples/introducing-monitoring-and-automatic-analysis-architecture-using-aws-saw.git
$ cd introducing-monitoring-and-automatic-analysis-architecture-using-aws-saw/
3. Build and deploy the Lambda function, EventBridge rule, Step Functions state machine, and the related IAM roles that are defined in the AWS SAM template template.yaml.
Note: Enter parameters in the deployment wizard, such as the Amazon SNS topic’s ARN, SECRET_NAME in Secrets Manager, or both. If you encrypted the SNS topic with AWS Key Management Service (AWS KMS), then specify the AWS KMS key’s ARN as a parameter.
$ sam build
$ sam deploy --guided
Configuring SAM deploy
======================
Looking for config file [samconfig.toml] : Found
Reading default arguments : Success
Setting default arguments for 'sam deploy'
=========================================
Stack Name [sam-app]:
AWS Region [ap-northeast-1]:
Parameter SecretsManagerNameForSlackWebHookUrl [SLACK_WEB_HOOK_URL]:
Parameter TopicArn [arn:aws:sns:ap-northeast-1:<ACCOUNT_ID>:kms-topic]:
Parameter TopicKmsKeyArn [arn:aws:kms:ap-northeast-1:<ACCOUNT_ID>:key/<ID>]:
・・・
4. Test the architecture. To do this, launch an EC2 instance with an IAM instance profile that has no Systems Manager permissions. Systems Manager can’t register this EC2 instance as a managed node because of lack of permissions.
5. Check that you receive the SAW analysis results in Slack or by email, based on your configuration, as shown in the following images. In this example, the results show that the issue is caused by insufficient IAM instance profile permissions. To resolve your issues, use the documentation links that are displayed in the analysis results.
Cleanup
Complete the following steps to delete the resources that you created for this tutorial:
- Terminate the EC2 instance that you launched for this tutorial.
- Delete the secrets that you created in Secrets Manager.
- To clean up the sample walkthrough, use the AWS SAM CLI to remove the AWS CloudFormation stack by running the following command:
$ sam delete
Conclusion
In this article, you learned how to use SAW to troubleshoot issues with EC2 instances that Systems Manager doesn’t register as managed nodes. You can use the example architecture introduced in this article to monitor your EC2 instances and automatically invoke SAW runbooks when these instances aren’t properly registered. This prevents you from losing visibility into your infrastructure and assists with proactive troubleshooting.
The techniques discussed in this article can help you maintain an accurate view of your EC2 instances in Systems Manager through automated issue detection and analysis.
To learn more, see Using AWS Support self-service runbooks and AWS Support Automation Workflows (SAW).
AWS Support engineers and Technical Account Managers (TAMs) can help you with general guidance, best practices, troubleshooting, and operational support on AWS. To learn more about our plans and offerings, see AWS Support.
About the author
Toshihiro Furuno
Toshihiro Furuno is a Senior Cloud Support Engineer on the AWS Deployment Support team. He is passionate about helping customers to use containers and Continuous Integration and Continuous Delivery (CI/CD). In his spare time, he enjoys playing with his sons.