Support Automation Workflow (SAW) Runbook: Troubleshoot Amazon CloudWatch Agent

8 minute read
Content level: Intermediate

How can I use the AWSSupport-TroubleshootCloudWatchAgent automation runbook to troubleshoot the Amazon CloudWatch Agent, or to collect the required support logs from the instance?

In this article, I show you how to use AWSSupport-TroubleshootCloudWatchAgent, an AWS Systems Manager automation runbook, to troubleshoot common issues with the Amazon CloudWatch Agent running on an Amazon EC2 instance. You can also use this runbook to collect, analyze, and upload logs and other configuration files to an Amazon Simple Storage Service (Amazon S3) bucket of your choice.

Learn more about Support Automation Workflows >>

How it works

The AWSSupport-TroubleshootCloudWatchAgent runbook contains a series of basic and advanced checks. The basic checks are compatible with all Linux and Windows operating systems. The advanced checks support the same operating systems but require that the Amazon EC2 instance is managed by AWS Systems Manager.

The basic checks include the following troubleshooting steps:

  • Checks the selected Amazon EC2 instance for an attached AWS Identity and Access Management (IAM) instance profile.
  • Verifies the required permissions policies are applied to the IAM instance profile.
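
A similar permission check can be approximated outside the runbook with the IAM policy simulator. The sketch below is an assumption about how such a check might look, not the runbook's actual implementation. It takes a response shaped like the one returned by the IAM SimulatePrincipalPolicy API and lists the actions that are not allowed. The names `ASSUMED_AGENT_ACTIONS`, `denied_actions`, and `simulate_for_role` are hypothetical, and the action list is only an illustrative subset; the exact set the runbook verifies is not stated in this article.

```python
# Illustrative subset of actions the CloudWatch agent typically needs.
# This list is an assumption, not the set checked by the runbook.
ASSUMED_AGENT_ACTIONS = [
    "cloudwatch:PutMetricData",
    "logs:CreateLogGroup",
    "logs:CreateLogStream",
    "logs:PutLogEvents",
]


def denied_actions(simulation_response):
    """Return the action names whose simulated decision is not 'allowed'.

    `simulation_response` is expected to follow the shape of the IAM
    SimulatePrincipalPolicy response (an 'EvaluationResults' list).
    """
    return [
        result["EvalActionName"]
        for result in simulation_response.get("EvaluationResults", [])
        if result.get("EvalDecision") != "allowed"
    ]


def simulate_for_role(role_arn, actions=ASSUMED_AGENT_ACTIONS):
    """Call the policy simulator for a role (needs boto3 and credentials)."""
    import boto3  # imported lazily so denied_actions works without it

    iam = boto3.client("iam")
    return iam.simulate_principal_policy(
        PolicySourceArn=role_arn, ActionNames=list(actions)
    )


# Example with a hand-written response instead of a live API call.
sample_response = {
    "EvaluationResults": [
        {"EvalActionName": "cloudwatch:PutMetricData", "EvalDecision": "allowed"},
        {"EvalActionName": "logs:PutLogEvents", "EvalDecision": "implicitDeny"},
    ]
}
print(denied_actions(sample_response))
```

With the hand-written sample, only `logs:PutLogEvents` is reported, because its decision is `implicitDeny`. Against a real account, the result of `simulate_for_role` for the role behind the instance profile could be passed to `denied_actions` in the same way.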

The advanced checks extend the basic checks with the following additional features:

  • Checks the status of the Amazon CloudWatch Agent process on your Amazon EC2 instance.
  • Analyzes the Amazon CloudWatch Agent logs for any errors, and provides targeted guidance to resolve any indicated issues.
  • Bundles and zips any relevant configuration files and the Amazon CloudWatch Agent logs on the instance.

Note 1: The optional runbook input parameter S3UploadBucket lets you upload these logs directly from your Amazon EC2 instance to an Amazon S3 bucket. If you specify this parameter, the runbook cleans up the created zip file after the upload.

  • Performs connection tests on the instance to the required service endpoints.

Note 2: This test targets the Amazon CloudWatch Logs and Amazon CloudWatch endpoints by default. If your CloudWatch Agent configuration file uses the append_dimensions option, we recommend that you set the runbook input parameter CheckEC2Endpoint to true. This ensures that a connection test to the Amazon EC2 endpoint is also performed, because the agent requires that endpoint when append_dimensions is used. Additionally, if the runbook input parameter RunVpcReachabilityAnalyzer is set to true and no endpoint connection failures are detected in this step, the automation skips the AnalyzeAWSEndpointReachabilityFromEC2 step so that you do not incur any costs.
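
To make the endpoint selection in Note 2 concrete, here is a minimal sketch of how the list of endpoints could be assembled and tested from an instance. It is not the runbook's own code. The function names `required_endpoints` and `can_connect` are hypothetical, and the hostnames assume the standard public endpoint naming used in commercial AWS Regions (for example `logs.us-east-1.amazonaws.com`); private VPC endpoints or other partitions may use different names.

```python
import socket


def required_endpoints(region, check_ec2_endpoint=False):
    """Build the list of endpoint hostnames to test for a Region.

    Assumes the standard public endpoint naming for commercial Regions.
    """
    hosts = [
        f"monitoring.{region}.amazonaws.com",  # Amazon CloudWatch
        f"logs.{region}.amazonaws.com",        # Amazon CloudWatch Logs
    ]
    if check_ec2_endpoint:
        # Needed when the agent configuration uses append_dimensions.
        hosts.append(f"ec2.{region}.amazonaws.com")
    return hosts


def can_connect(host, port=443, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures, refused connections, and timeouts.
        return False
```

Running `can_connect(host)` on the instance for each hostname returned by `required_endpoints("us-east-1", check_ec2_endpoint=True)` gives a rough equivalent of the connection test. A TCP connection on port 443 only shows that the endpoint is reachable; it does not validate credentials or permissions.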

When the runbook input parameter RunVpcReachabilityAnalyzer is set to true, the automation runs the AnalyzeAWSEndpointReachabilityFromEC2 step if one of two criteria is satisfied:

  • The extended checks determine that there is a network issue.
  • The specified instance ID is not a managed instance.

If either of these occurs, the runbook can help you troubleshoot DNS issues, troubleshoot Amazon Virtual Private Cloud (Amazon VPC) endpoint connectivity, or verify reachability from your Amazon EC2 instance to outbound networking resources.
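
The conditions under which the AnalyzeAWSEndpointReachabilityFromEC2 step runs, as described in Note 2 and in the RunVpcReachabilityAnalyzer parameter description, can be sketched as a small decision function. This is only an illustration of the documented behavior; the function name and its arguments are hypothetical and do not come from the runbook.

```python
def should_run_reachability_analyzer(
    run_vpc_reachability_analyzer,
    is_managed_instance,
    endpoint_failures,
):
    """Decide whether the reachability analysis step would run.

    run_vpc_reachability_analyzer: value of the RunVpcReachabilityAnalyzer input.
    is_managed_instance: True if the instance is managed by Systems Manager.
    endpoint_failures: endpoints whose connection test failed (may be empty).
    """
    if not run_vpc_reachability_analyzer:
        return False
    # Unmanaged instances cannot run the extended checks, so no
    # on-instance connection test result is available for them.
    if not is_managed_instance:
        return True
    # Managed instance: run only when a connection test failed.
    return len(endpoint_failures) > 0
```

For example, a managed instance with no failed connection tests skips the step even when RunVpcReachabilityAnalyzer is true, which is how the runbook avoids unnecessary Reachability Analyzer costs.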

Note 3: AnalyzeAWSEndpointReachabilityFromEC2 does not directly perform any connectivity checks to the required endpoints. Instead, it uses Reachability Analyzer, which has an associated cost. Because the step does not test connectivity directly, it may not provide an accurate assessment of reachability to the specified AWS endpoints if you have a complex Amazon VPC routing setup that includes peering connections or resources such as transit gateways. For pricing details of Reachability Analyzer, see Amazon VPC Pricing.

Required IAM permissions

The AutomationAssumeRole parameter requires the following actions to successfully use the runbook:

  • iam:SimulatePrincipalPolicy
  • iam:GetContextKeysForPrincipalPolicy
  • iam:GetInstanceProfile
  • iam:ListAttachedRolePolicies
  • iam:PassRole
  • s3:GetBucketAcl
  • s3:PutObject
  • s3:GetBucketPolicyStatus
  • s3:ListBucket
  • ssm:GetAutomationExecution
  • ssm:GetCommandInvocation
  • ssm:DescribeInstanceInformation
  • ec2:DescribeInstances
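
As an illustration, the actions above could be collected into an IAM policy document for the role passed in AutomationAssumeRole. The sketch below is only a starting point under stated assumptions: the helper name `build_policy_document` is hypothetical, and `"Resource": "*"` is a placeholder that you would normally scope down (particularly for iam:PassRole and the Amazon S3 actions).

```python
import json

# Actions listed in the "Required IAM permissions" section.
REQUIRED_ACTIONS = [
    "iam:SimulatePrincipalPolicy",
    "iam:GetContextKeysForPrincipalPolicy",
    "iam:GetInstanceProfile",
    "iam:ListAttachedRolePolicies",
    "iam:PassRole",
    "s3:GetBucketAcl",
    "s3:PutObject",
    "s3:GetBucketPolicyStatus",
    "s3:ListBucket",
    "ssm:GetAutomationExecution",
    "ssm:GetCommandInvocation",
    "ssm:DescribeInstanceInformation",
    "ec2:DescribeInstances",
]


def build_policy_document(actions=REQUIRED_ACTIONS, resource="*"):
    """Return an IAM policy document (as a dict) allowing the given actions.

    resource="*" is a placeholder assumption; restrict it where possible.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": list(actions),
                "Resource": resource,
            }
        ],
    }


# Print the policy as JSON so it can be reviewed before use.
print(json.dumps(build_policy_document(), indent=2))
```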

To run the AnalyzeAWSEndpointReachabilityFromEC2 step, make sure that your IAM user or role has the permissions listed in the AnalyzeAWSEndpointReachabilityFromEC2 required IAM permissions section.

Instructions

  1. Open the AWSSupport-TroubleshootCloudWatchAgent runbook in the AWS Systems Manager console.
  2. Choose Execute automation.
  3. For the input parameters enter the following:
    • AutomationAssumeRole (optional): This is the Amazon Resource Name (ARN) of the AWS Identity and Access Management (IAM) role that allows Systems Manager Automation to perform the actions on your behalf. If no role is specified, Systems Manager Automation will use the permissions of the user that starts this runbook.
    • InstanceId (required): The ID of the Amazon EC2 instance on which you want to troubleshoot the Amazon CloudWatch Agent.
    • S3UploadBucket (optional): The name of an Amazon S3 bucket to which the collected Amazon CloudWatch Agent logs are uploaded. The Amazon EC2 instance profile must have the correct permissions to upload files to this bucket. This option also requires the target Amazon EC2 instance to be an AWS Systems Manager managed instance.
    • S3BucketOwnerAccountId (optional): The AWS account ID that owns the Amazon S3 bucket where you want to upload the Amazon CloudWatch Agent logs. If you do not modify this parameter, the runbook uses the AWS account ID of the user or role in which the automation runs.
    • CheckEC2Endpoint (optional): Specify true if your agent configuration uses the append_dimensions option to append Amazon EC2 metric dimensions to the metrics collected by the agent. When append_dimensions is used, the Amazon CloudWatch Agent requires connectivity to the Amazon EC2 endpoint, so an additional connectivity test is performed by the extended checks. Default value is false.
    • RunVpcReachabilityAnalyzer (optional): Specify true to run the AnalyzeAWSEndpointReachabilityFromEC2 automation step if the extended checks determine that there is a network issue, or if the specified instance ID is not a managed instance. Default value is false.
    • RetainVpcReachabilityAnalysis (optional): Only relevant if RunVpcReachabilityAnalyzer is true. Specify true to retain the network insight path and related analyses created by VPC Reachability Analyzer. By default, those resources are deleted after a successful analysis. If you choose to retain the analysis, you can visualize it in the VPC console. The console link is available in the AnalyzeAWSEndpointReachabilityFromEC2 automation output. Default value is false.

The following example demonstrates how to use the AWSSupport-TroubleshootCloudWatchAgent automation runbook to troubleshoot why your Amazon CloudWatch Agent is not running correctly.

The runbook input parameters
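
The same input parameters can also be supplied programmatically instead of through the console. The following is a minimal sketch, assuming the boto3 SDK and valid AWS credentials are available for the final call. The helper names `build_parameters` and `start_troubleshooting` are hypothetical, the instance ID is a placeholder, and passing the boolean inputs as the lowercase strings "true" and "false" is an assumption based on the parameter descriptions above.

```python
DOCUMENT_NAME = "AWSSupport-TroubleshootCloudWatchAgent"


def build_parameters(
    instance_id,
    automation_assume_role=None,
    s3_upload_bucket=None,
    s3_bucket_owner_account_id=None,
    check_ec2_endpoint=False,
    run_vpc_reachability_analyzer=False,
    retain_vpc_reachability_analysis=False,
):
    """Build the Parameters map for StartAutomationExecution.

    Systems Manager expects every parameter value as a list of strings.
    Optional parameters that are not set are omitted.
    """
    params = {
        "InstanceId": [instance_id],
        # Assumption: boolean inputs are passed as "true" / "false".
        "CheckEC2Endpoint": [str(check_ec2_endpoint).lower()],
        "RunVpcReachabilityAnalyzer": [str(run_vpc_reachability_analyzer).lower()],
        "RetainVpcReachabilityAnalysis": [str(retain_vpc_reachability_analysis).lower()],
    }
    optional = {
        "AutomationAssumeRole": automation_assume_role,
        "S3UploadBucket": s3_upload_bucket,
        "S3BucketOwnerAccountId": s3_bucket_owner_account_id,
    }
    for name, value in optional.items():
        if value:
            params[name] = [value]
    return params


def start_troubleshooting(parameters):
    """Start the runbook (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so build_parameters works without it

    ssm = boto3.client("ssm")
    response = ssm.start_automation_execution(
        DocumentName=DOCUMENT_NAME, Parameters=parameters
    )
    return response["AutomationExecutionId"]


# The instance ID below is a placeholder.
print(build_parameters("i-0123456789abcdef0", check_ec2_endpoint=True))
```

Keeping the parameter construction separate from the API call makes it possible to inspect the request before starting the automation.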

  4. Choose Execute.
  5. Verify that the automation has been initiated.
  6. The runbook performs the following steps:
    • getInstanceProfile: Verifies if the provided Amazon EC2 instance has an IAM instance profile attached.
    • branchOnInstanceProfileStatus: Branches the automation to check for necessary instance profile permissions if the instance profile is attached to the instance.
    • verifyIamPermissions: Checks the instance profile associated with the instance to determine if the necessary permissions are applied.
    • getInstanceInformation: Checks if the instance has an active AWS Systems Manager agent, and fetches the OS type of the instance.
    • branchOnManagedInstance: Branches the automation to perform extended checks if the instance is managed.
    • getAgentStatus: Checks the status of the Amazon CloudWatch Agent on the instance.
    • branchOnInstanceOsType: Branches the automation to run a specific log collection/analysis command based on the OS.
    • analyzeLogs: Analyzes and outputs findings of Amazon CloudWatch Agent logs on Linux OS.
    • collectLogs: Bundles and outputs the relevant Amazon CloudWatch Agent troubleshooting files on Linux OS.
    • checkEndpointReachability: Checks if the instance can reach the required endpoints on Linux OS.
    • analyzeLogsWindows: Analyzes and outputs findings of Amazon CloudWatch Agent logs on Windows OS.
    • collectLogsWindows: Bundles and outputs the relevant Amazon CloudWatch Agent troubleshooting files on Windows OS.
    • checkEndpointReachabilityWindows: Checks if the instance can reach the required endpoints on Windows OS.
    • branchOnRunVpcReachabilityAnalyzer: Branches the automation to run the reachability analysis when RunVpcReachabilityAnalyzer is true and the conditions for it are met.
    • generateEndpoints: Generates the endpoints to check from the extended check failures and the value of CheckEC2Endpoint.
    • analyzeAwsEndpointReachabilityFromEC2: Calls the automation runbook AWSSupport-AnalyzeAWSEndpointReachabilityFromEC2 to check the reachability of the selected instance to the required endpoints.
    • outputFindings: Outputs the results of the automation execution steps.
  7. After the automation completes, review the Outputs section for the detailed results of the execution:

Output of the runbook execution

If the AnalyzeAWSEndpointReachabilityFromEC2 step runs, the execution Outputs section includes a URL for the referenced automation output. Navigate to this URL to see further details of the referenced automation.
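
The outputs can also be read outside the console. The sketch below assumes a response shaped like the one returned by the Systems Manager GetAutomationExecution API and extracts the execution status and outputs. The helper names are hypothetical, and the output key used in the sample (`outputFindings.Findings`) is a made-up placeholder, because the runbook's exact output keys are not listed in this article.

```python
def summarize_execution(response):
    """Extract status and outputs from a GetAutomationExecution response."""
    execution = response["AutomationExecution"]
    return {
        "status": execution["AutomationExecutionStatus"],
        # Outputs maps an output name to a list of string values.
        "outputs": execution.get("Outputs", {}),
    }


def fetch_execution(execution_id):
    """Fetch an execution by ID (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so summarize_execution works without it

    ssm = boto3.client("ssm")
    return ssm.get_automation_execution(AutomationExecutionId=execution_id)


# Hand-written response for illustration; the output key is hypothetical.
sample = {
    "AutomationExecution": {
        "AutomationExecutionStatus": "Success",
        "Outputs": {"outputFindings.Findings": ["example finding text"]},
    }
}
summary = summarize_execution(sample)
print(summary["status"])
for name, values in summary["outputs"].items():
    print(name, values)
```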

Output when input parameter RunVpcReachabilityAnalyzer is set to true.

Output when one of the endpoint requests in the checkEndpointReachability step resulted in a connection failure in a subsequent execution of the runbook. See Note 2 above for more information on when this step runs.

Conclusion

In this article, I demonstrated how to troubleshoot Amazon CloudWatch Agent issues on your Amazon EC2 instance by using the automation runbook AWSSupport-TroubleshootCloudWatchAgent, available in AWS Systems Manager.

References

Systems Manager Automation

Run this Automation (console)

Running a simple automation: https://docs.aws.amazon.com/systems-manager/latest/userguide/automation-working-executing.html

Setting up Automation: https://docs.aws.amazon.com/systems-manager/latest/userguide/automation-setup.html

Documentation related to the AWS service

Use the following information to help troubleshoot problems with the CloudWatch agent.

To help you troubleshoot, remediate, manage, and reduce costs on your AWS resources, AWS Support maintains a subset of the AWS-provided predefined runbooks. These runbooks are prefixed with “AWSSupport-” or “AWSPremiumSupport-”.