How do I use SSM Agent logs to troubleshoot issues with SSM Agent in my managed instance?

AWS Systems Manager Agent (SSM Agent) fails to run successfully, but I don't know how to troubleshoot the issue using the SSM Agent logs.

Short description

SSM Agent runs on your managed Amazon Elastic Compute Cloud (Amazon EC2) instance and processes requests from the AWS Systems Manager service. SSM Agent requires that the following conditions are met:

  • SSM Agent must connect to the required service endpoints.
  • SSM Agent requires AWS Identity and Access Management (IAM) permissions to make Systems Manager API calls.
  • Amazon EC2 must assume valid credentials from the IAM instance profile.

If any of these conditions aren't met, then SSM Agent fails to run successfully.

To identify the root cause of the SSM Agent failure, review SSM Agent logs in the following locations:

Linux

/var/log/amazon/ssm/amazon-ssm-agent.log
/var/log/amazon/ssm/errors.log

Windows

%PROGRAMDATA%\Amazon\SSM\Logs\amazon-ssm-agent.log
%PROGRAMDATA%\Amazon\SSM\Logs\errors.log

Note: Because SSM Agent is updated frequently with new capabilities, it's a best practice to configure automated updates for SSM Agent.

Resolution

First, review the logs and identify whether the issue is caused by missing endpoint connections, missing permissions, or missing credentials. Then, follow the relevant troubleshooting steps for your issue.
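As a rough aid for that first review, the following Python sketch sorts log lines into the categories covered in this article. The match strings are taken from the sample messages shown in the sections below. The function names and category labels are hypothetical, and real log lines might be worded differently, so treat the patterns as a starting point rather than a complete list.

```python
import re

# Patterns come from the sample log messages in this article.
# Order matters: the metadata error also contains "RequestError",
# so the metadata pattern is checked first.
PATTERNS = [
    ("metadata service unreachable",
     re.compile(r"Failed to fetch instance ID|169\.254\.169\.254")),
    ("endpoint unreachable",
     re.compile(r"RequestError: send request failed")),
    ("missing permissions",
     re.compile(r"AccessDeniedException")),
    ("missing credentials",
     re.compile(r"NoCredentialProviders|no valid credentials could be retrieved")),
    ("throttling",
     re.compile(r"ThrottlingException")),
]


def classify_line(line):
    """Return the first matching category label, or None if nothing matches."""
    for label, pattern in PATTERNS:
        if pattern.search(line):
            return label
    return None


def summarize(lines):
    """Count how many lines fall into each category."""
    counts = {}
    for line in lines:
        label = classify_line(line)
        if label is not None:
            counts[label] = counts.get(label, 0) + 1
    return counts
```

For example, on a Linux instance you might open /var/log/amazon/ssm/errors.log and pass the file object to summarize to see which category dominates.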

SSM Agent can't talk to the required endpoints

SSM Agent can't reach the metadata service

When SSM Agent can't reach the metadata service, it also can't locate the AWS Region information, IAM role, or instance ID from that service. In this case, you see an error message in the SSM Agent logs that's similar to the following:

"INFO- Failed to fetch instance ID. Data from vault is empty. RequestError: send request failed caused by: Get http://169.254.169.254/latest/meta-data/instance-id"

The most common reason for this error is using a proxy for outbound internet connections from your instance without configuring SSM Agent for a proxy. Be sure to configure SSM Agent to use a proxy.

On Windows instances, this error might also occur from a misconfigured persistent network route when you use a custom AMI to launch your instance. You must verify that the route for the metadata service IP points to the correct default gateway.

Note: If you receive errors when you run AWS Command Line Interface (AWS CLI) commands, make sure that you're using the most recent version of the AWS CLI.

To verify whether metadata is activated for your instance, run the following AWS CLI command. Replace i-1234567898abcdef0 with your instance ID:

aws ec2 describe-instances --instance-ids i-1234567898abcdef0 --query 'Reservations[*].Instances[*].MetadataOptions'

You receive an output that's similar to the following:

[
  [{
    "State": "applied",
    "HttpTokens": "optional",
    "HttpPutResponseHopLimit": 1,
    "HttpEndpoint": "enabled",
    "HttpProtocolIpv6": "disabled",
    "InstanceMetadataTags": "disabled"
  }]
]

In this output, "HttpEndpoint": "enabled" indicates that metadata is activated for your instance.

If metadata isn't activated, then you can turn it on with the aws ec2 modify-instance-metadata-options command. For more information, see Modify instance metadata options for existing instances.
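If you check many instances, you can evaluate the command output in a script. The following Python sketch parses JSON output in the shape shown above and reports whether HttpEndpoint is enabled. The function name is hypothetical, and the sketch assumes the nested-list structure that the --query expression in this article produces.

```python
import json


def metadata_endpoint_enabled(cli_output):
    """Return True if every MetadataOptions entry has HttpEndpoint enabled.

    cli_output is the JSON text printed by the describe-instances command
    with the --query expression used in this article (a list of lists).
    """
    reservations = json.loads(cli_output)
    # Flatten Reservations[*].Instances[*] into one list of option dicts.
    options = [opts for instances in reservations for opts in instances]
    if not options:
        raise ValueError("no MetadataOptions found in output")
    return all(opts.get("HttpEndpoint") == "enabled" for opts in options)
```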

SSM Agent can't reach Systems Manager service endpoints

If SSM Agent can't connect to the service endpoints, then SSM Agent fails. SSM Agent must make outbound connections to the following Systems Manager service endpoints on port 443:

  • SSM endpoint: ssm.REGION.amazonaws.com
  • EC2 messaging endpoint: ec2messages.REGION.amazonaws.com
  • SSM messaging endpoint: ssmmessages.REGION.amazonaws.com

Note: SSM Agent replaces the REGION value in these endpoints with the Region information that it retrieves from the instance metadata service.
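To test reachability from the instance, you can build the three endpoint names for your Region and try a TCP connection to each on port 443. The following Python sketch is one way to do that. The function names are hypothetical. It assumes the amazonaws.com domain pattern listed above and checks only that a TCP connection opens, so it doesn't validate TLS and doesn't account for a proxy.

```python
import socket

# Service prefixes from the endpoint list above.
SERVICES = ("ssm", "ec2messages", "ssmmessages")


def ssm_endpoints(region):
    """Build the three endpoint hostnames for a Region."""
    return [f"{service}.{region}.amazonaws.com" for service in SERVICES]


def can_connect(host, port=443, timeout=5.0):
    """Return True if a TCP connection to host:port opens within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures, refused connections, and timeouts.
        return False


def check_endpoints(region, connect=can_connect):
    """Map each endpoint hostname to whether the connection succeeded."""
    return {host: connect(host) for host in ssm_endpoints(region)}
```

For example, check_endpoints("ap-southeast-2") returns a dictionary with one entry per endpoint. A False value suggests that a security group, route table, or proxy setting blocks the connection.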

When SSM Agent can't connect with the Systems Manager endpoints, you see error messages similar to the following in the SSM Agent logs:

"ERROR [HealthCheck] error when calling AWS APIs. error details - RequestError: send request failed caused by: Post https://ssm.ap-southeast-2.amazonaws.com/: dial tcp 172.31.24.65:443: i/o timeout"

"DEBUG [MessagingDeliveryService] RequestError: send request failed caused by: Post https://ec2messages.ap-southeast-2.amazonaws.com/: net/http: request cancelled while waiting for connection (Client.Timeout exceeded while awaiting headers)"

The following are some common reasons why SSM Agent can't connect with the Systems Manager API endpoints on port 443:

  • Instance egress security group rules don't allow outgoing connections on port 443.
  • Virtual private cloud (VPC) endpoint ingress and egress security group rules don't allow incoming and outgoing connections to the VPC interface endpoint on port 443.
  • When the instance lives in a public subnet, routing table rules aren't configured to direct traffic using an internet gateway.
  • When the instance lives in a private subnet, routing table rules aren't configured to direct traffic using a NAT gateway or VPC endpoint.
  • Outgoing connections are routed through a proxy, but SSM Agent isn't configured to use a proxy.

SSM Agent doesn't have permissions to make the required Systems Manager API calls

SSM Agent fails to register itself as online in Systems Manager when it isn't authorized to make UpdateInstanceInformation API calls to the service.

SSM Agent uses the UpdateInstanceInformation API call to maintain a connection with the service so that the service knows that SSM Agent is functioning as expected. SSM Agent calls the Systems Manager service every five minutes to provide health check information. If SSM Agent doesn't have the correct IAM permissions, then you see an error message in the SSM Agent logs.

If the IAM role that SSM Agent uses doesn't allow the call, then you see an error that's similar to the following:

"ERROR [instanceID=i-XXXXX] [HealthCheck] error when calling AWS APIs. error details - AccessDeniedException: User: arn:aws:sts::XXX:assumed-role/XXX /i-XXXXXX is not authorized to perform: ssm:UpdateInstanceInformation on resource: arn:aws:ec2:ap-southeast-2:XXXXXXX:instance/i-XXXXXX
status code: 400, request id: XXXXXXXX-XXXX-XXXXXXX
INFO [instanceID=i-XXXX] [HealthCheck] increasing error count by 1"

If SSM Agent doesn't have any IAM permissions, then you see an error that's similar to the following:

"ERROR [instanceID=i-XXXXXXX] [HealthCheck] error when calling AWS APIs. error details - NoCredentialProviders: no valid providers in chain. Deprecated. For verbose messaging see aws.Config.CredentialsChainVerboseErrors
2018-05-08 10:58:39 INFO [instanceID=i-XXXXXXX] [HealthCheck] increasing error count by 1"

Verify that the IAM role that's attached to the instance contains the required permissions to allow an instance to use Systems Manager service core functionality. Or, if an instance profile role isn't already attached, then attach an instance profile role and include AmazonSSMManagedInstanceCore permissions.

For more information about the required IAM permissions for Systems Manager, see Additional policy considerations for managed instances.
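As a quick sanity check on a policy document, the following Python sketch tests whether any Allow statement covers ssm:UpdateInstanceInformation. The function names are hypothetical, and the check is deliberately simplified: it ignores Deny statements, NotAction, Resource, and Condition elements, so it can't replace the IAM policy simulator or a review of the full policy.

```python
from fnmatch import fnmatchcase


def _as_list(value):
    """IAM policy elements can be a single value or a list; normalize to a list."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def policy_allows(policy, action):
    """Return True if an Allow statement's Action matches the given action.

    Simplified: wildcard matching on Action only, compared case-insensitively.
    """
    wanted = action.lower()
    for statement in _as_list(policy.get("Statement")):
        if statement.get("Effect") != "Allow":
            continue
        for pattern in _as_list(statement.get("Action")):
            if fnmatchcase(wanted, pattern.lower()):
                return True
    return False
```

For example, policy_allows(policy, "ssm:UpdateInstanceInformation") returns True for a policy whose Allow statement lists that action or a wildcard such as ssm:*.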

Systems Manager API call throttling

If a high volume of managed instances that run SSM Agent make concurrent UpdateInstanceInformation API calls, then those calls might get throttled.

If the UpdateInstanceInformation API call for your instance is throttled, then you see error messages similar to the following in the SSM Agent logs:

"INFO [HealthCheck] HealthCheck reporting agent health.
ERROR [HealthCheck] error when calling AWS APIs. error details - ThrottlingException: Rate exceeded
status code: 400, request id: XXXXX-XXXXX-XXXX
INFO [HealthCheck] increasing error count by 1"

Use the following troubleshooting steps to prevent ThrottlingException errors:

  • Reduce the frequency of API calls.
  • Implement error retries and exponential backoffs when you make API calls.
  • Stagger the intervals of API calls so that they don't all run at the same time.
  • Request a throttling limit increase for UpdateInstanceInformation API calls.
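For API calls that your own scripts make, retries with exponential backoff can be sketched as follows in Python. ThrottledError is a hypothetical stand-in for whatever throttling exception your API client raises, and the retry count, base delay, and cap are illustrative values only. The random jitter spreads retries out so that many callers don't retry at the same moment.

```python
import random
import time


class ThrottledError(Exception):
    """Hypothetical stand-in for a ThrottlingException from an API client."""


def backoff_delays(retries, base=1.0, cap=60.0):
    """Return the delay ceiling for each retry: base * 2^attempt, capped."""
    return [min(cap, base * 2 ** attempt) for attempt in range(retries)]


def call_with_backoff(func, retries=5, base=1.0, cap=60.0,
                      sleep=time.sleep, rng=random.random):
    """Call func, retrying on ThrottledError with jittered exponential backoff."""
    for ceiling in backoff_delays(retries, base, cap):
        try:
            return func()
        except ThrottledError:
            # Full jitter: wait a random fraction of the current ceiling.
            sleep(rng() * ceiling)
    # Final attempt: let any remaining error propagate to the caller.
    return func()
```

The sleep and rng parameters exist so that the function can be exercised without real waiting. In normal use, you can leave them at their defaults.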

Amazon EC2 can't assume valid credentials from the IAM instance profile

If Amazon EC2 can't assume the IAM role, then you see a message that's similar to the following example in the SSM Agent logs:

2023-01-25 09:56:19 ERROR [CredentialRefresher] Retrieve credentials produced error: no valid credentials could be retrieved for ec2 identity
2023-01-25 09:56:19 INFO [CredentialRefresher] Sleeping for 1s before retrying retrieve credentials
2023-01-25 09:56:20 ERROR [CredentialRefresher] Retrieve credentials produced error: no valid credentials could be retrieved for ec2 identity
2023-01-25 09:56:20 INFO [CredentialRefresher] Sleeping for 2s before retrying retrieve credentials
2023-01-25 09:56:22 ERROR [CredentialRefresher] Retrieve credentials produced error: no valid credentials could be retrieved for ec2 identity
2023-01-25 09:56:22 INFO [CredentialRefresher] Sleeping for 4s before retrying retrieve credentials
2023-01-25 09:56:26 ERROR [CredentialRefresher] Retrieve credentials produced error: no valid credentials could be retrieved for ec2 identity
2023-01-25 09:56:26 INFO [CredentialRefresher] Sleeping for 9s before retrying retrieve credentials
2023-01-25 09:56:35 ERROR [CredentialRefresher] Retrieve credentials produced error: no valid credentials could be retrieved for ec2 identity
2023-01-25 09:56:35 INFO [CredentialRefresher] Sleeping for 17s before retrying retrieve credentials
2023-01-25 09:56:52 ERROR [CredentialRefresher] Retrieve credentials produced error: no valid credentials could be retrieved for ec2 identity
2023-01-25 09:56:52 INFO [CredentialRefresher] Sleeping for 37s before retrying retrieve credentials

If you try to retrieve metadata from the EC2 instance, then you also see an error that's similar to the following example:

# curl http://169.254.169.254/latest/meta-data/iam/security-credentials/profile-name
{
  "Code" : "AssumeRoleUnauthorizedAccess",
  "Message" : "EC2 cannot assume the role profile-name. Please see documentation at https://docs.aws.amazon.com/IAM/latest/UserGuide/troubleshoot_iam-ec2.html#troubleshoot_iam-ec2_errors-info-doc.",
  "LastUpdated" : "2023-01-25T09:57:56Z"
}

Note: In this example, profile-name is the name of the instance profile.

To troubleshoot this error, check the trust policy that's attached to the IAM role. The trust policy must specify Amazon EC2 as a service that's allowed to assume the IAM role. Use the UpdateAssumeRolePolicy API to update the role's trust policy so that it appears similar to the following example:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Service": ["ec2.amazonaws.com"]
      },
      "Action": ["sts:AssumeRole"]
    }
  ]
}

For more information, see The iam/security-credentials/[role-name] document indicates "Code":"AssumeRoleUnauthorizedAccess".
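To check an existing trust policy document against the example above, a small script can look for a statement that allows the ec2.amazonaws.com service principal to call sts:AssumeRole. The following Python sketch shows one way to do this. The function names are hypothetical, and the check ignores Condition elements and Deny statements, so a True result doesn't guarantee that Amazon EC2 can assume the role.

```python
def _as_list(value):
    """Policy elements can be a single value or a list; normalize to a list."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def ec2_can_assume(trust_policy):
    """Return True if an Allow statement lets ec2.amazonaws.com call sts:AssumeRole."""
    for statement in _as_list(trust_policy.get("Statement")):
        if statement.get("Effect") != "Allow":
            continue
        principal = statement.get("Principal")
        # Principal can be "*" or a dict; only the dict form names services.
        services = _as_list(principal.get("Service")) if isinstance(principal, dict) else []
        actions = _as_list(statement.get("Action"))
        if "ec2.amazonaws.com" in services and "sts:AssumeRole" in actions:
            return True
    return False
```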


AWS OFFICIAL - Updated a year ago