How do I troubleshoot common permission errors for an EC2 instance that has CloudWatch installed?

I want to troubleshoot common permission errors for Amazon CloudWatch on an Amazon Elastic Compute Cloud (Amazon EC2) instance.

Short description

The following are the most common CloudWatch permission errors on an Amazon EC2 instance:

  • Unattached AWS Identity and Access Management (IAM) roles
  • Missing permissions in attached IAM roles
  • No access to CloudWatch service endpoints
  • Stuck profile association states
  • Incorrectly configured credentials
  • Failed to retrieve metadata information
  • Missing certificate bundles

Resolution

Unattached IAM roles

If no IAM role is attached to the instance, then you might receive the following error in the CloudWatch agent log file amazon-cloudwatch-agent.log:

"Failed to get credential from session: NoCredentialProviders: no valid providers in chain caused by: EnvAccessKeyNotFound: failed to find credentials in the environment."

To run the CloudWatch agent, make sure that your instance is correctly associated with an IAM role. Use the AWS managed policy CloudWatchAgentServerPolicy to create the IAM role required for each server to run the CloudWatch agent. If the agent sends logs to CloudWatch Logs and you want to set log group retention policies, then add the logs:PutRetentionPolicy permission to the role.

For more information, see Create IAM roles and users for use with the CloudWatch agent.
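
To confirm that this error is what the agent reports, you can search the agent log for it. The following is a minimal sketch, not an official tool. The function name is hypothetical, and the default path assumes a standard Linux installation of the agent. Pass a different path as the first argument if your log is stored elsewhere.

```bash
# Hypothetical helper: succeeds if the agent log contains the
# "NoCredentialProviders" error quoted above.
# Assumption: default Linux log location of the CloudWatch agent.
has_credential_error() {
    log_file="${1:-/opt/aws/amazon-cloudwatch-agent/logs/amazon-cloudwatch-agent.log}"
    grep -q "NoCredentialProviders" "$log_file"
}

# Example use on the instance:
#   if has_credential_error; then
#       echo "No credentials found: check that an IAM role is attached"
#   fi
```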

Missing permissions in attached IAM roles

Make sure that the attached IAM roles contain the following permissions for the CloudWatch agent:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "cloudwatch:PutMetricData",
                "ec2:DescribeVolumes",
                "ec2:DescribeTags",
                "logs:PutLogEvents",
                "logs:DescribeLogStreams",
                "logs:DescribeLogGroups",
                "logs:CreateLogStream",
                "logs:CreateLogGroup",
                "logs:PutRetentionPolicy",
                "xray:PutTraceSegments",
                "xray:PutTelemetryRecords",
                "xray:GetSamplingRules",
                "xray:GetSamplingTargets",
                "xray:GetSamplingStatisticSummaries"
            ],
            "Resource": "*"
        },
        {
            "Effect": "Allow",
            "Action": [
                "ssm:GetParameter"
            ],
            "Resource": "arn:aws:ssm:*:*:parameter/AmazonCloudWatch-*"
        }
    ]
}
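
If you save the role's policy document to a local file, then a plain-text comparison can show which of the actions above are absent. The following sketch makes several assumptions: the function name is hypothetical, the file is a JSON policy document that you exported yourself, and the match is literal. Wildcard grants such as "logs:*" are therefore reported as missing even though they allow the action.

```bash
# Hypothetical helper: prints each action from the policy above that does not
# appear literally in the given policy JSON file.
# Limitation: wildcard grants (for example "logs:*") are not understood.
missing_agent_actions() {
    policy_file="$1"
    for action in \
        cloudwatch:PutMetricData ec2:DescribeVolumes ec2:DescribeTags \
        logs:PutLogEvents logs:DescribeLogStreams logs:DescribeLogGroups \
        logs:CreateLogStream logs:CreateLogGroup logs:PutRetentionPolicy \
        xray:PutTraceSegments xray:PutTelemetryRecords xray:GetSamplingRules \
        xray:GetSamplingTargets xray:GetSamplingStatisticSummaries \
        ssm:GetParameter
    do
        # -F treats the pattern as a fixed string, -q suppresses output.
        grep -F -q "\"$action\"" "$policy_file" || echo "$action"
    done
}

# Example use:
#   missing_agent_actions ./exported-policy.json
```

An empty result means that every listed action appears in the file. It doesn't prove that the Resource elements match.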

No access to CloudWatch service endpoints

The connection to service endpoints can be established through an internet connection or Amazon Virtual Private Cloud (Amazon VPC) endpoints. To transmit metrics and logs, the CloudWatch agent must maintain connectivity with specific CloudWatch service endpoints.

  • For metrics, the CloudWatch agent must have connectivity with the https://ec2.example-region.amazonaws.com and https://monitoring.example-region.amazonaws.com service endpoints.
  • For logs, the CloudWatch agent must have connectivity with the https://logs.example-region.amazonaws.com service endpoint.

If you have no access to CloudWatch service endpoints, then you might receive the following error or similar in the CloudWatch agent log file amazon-cloudwatch-agent.log:

"2023-12-30T02:56:37Z W! {"caller":"ec2tagger/ec2tagger.go:485","msg":"ec2tagger: Unable to describe ec2 tags for initial retrieval","kind":"processor","name":"ec2tagger","pipeline":"metrics/host","error":"RequestError: send request failed\ncaused by: Post \"https://ec2.us-west-1.amazonaws.com/\": dial tcp 176.32.118.30:443: i/o timeout"}"

To push metrics from an Amazon EC2 instance that doesn't have an internet connection, either set up a NAT gateway or create an Amazon VPC endpoint. If you create an Amazon VPC endpoint, then make sure that you test the connectivity. Also, select one subnet for every Availability Zone that accesses CloudWatch.

Note: If you set up a NAT gateway or create an Amazon VPC endpoint, then additional charges are incurred. For more information, see AWS PrivateLink pricing and Amazon VPC pricing.
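
To test connectivity from the instance, you can try to reach each endpoint with curl. The following sketch uses hypothetical function names and assumes the standard amazonaws.com endpoint pattern shown above. The timeout values are arbitrary choices. Any HTTP response, even an error status, shows that the network path works. A timeout or connection failure indicates a routing, security group, or endpoint problem.

```bash
# Prints the three service endpoints listed above for the given Region.
agent_endpoints() {
    region="$1"
    for service in ec2 monitoring logs; do
        echo "https://${service}.${region}.amazonaws.com"
    done
}

# Attempts an HTTPS request to each endpoint and reports the result.
# Without -f, curl exits 0 for any HTTP response, so a failure here means a
# DNS, TCP, or TLS problem, or a timeout.
check_agent_endpoints() {
    for url in $(agent_endpoints "$1"); do
        if curl -s -o /dev/null --connect-timeout 5 --max-time 10 "$url"; then
            echo "reachable:   $url"
        else
            echo "unreachable: $url"
        fi
    done
}

# Example use:
#   check_agent_endpoints example-region
```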

Stuck profile association states

Stuck profile association states occur when you update an instance profile and one role is disassociating while another role is associating. If the old role doesn't disassociate completely, then your profile association states are stuck.

To list the associations for the instance, run the following command:

Note: Replace example-instance-id with the required instance ID and example-region with the required AWS Region.

$ aws ec2 describe-iam-instance-profile-associations --filters Name=instance-id,Values=example-instance-id --region example-region

Example output:

{
    "IamInstanceProfileAssociations": [
        {
            "AssociationId": "iip-assoc-aaaaaaaaa",
            "InstanceId": "i-08c9a2ccvssdfgge",
            "IamInstanceProfile": {
                "Arn": "arn:aws:iam::1234567890:instance-profile/CloudWatch_Monitoring",
                "Id": "AIPAdddddsgggggg"
            },
            "State": "disassociating"
        },
        {
            "AssociationId": "iip-assoc-bbbbbbb",
            "InstanceId": "i-08c9a2ccvssdfgge",
            "IamInstanceProfile": {
                "Arn": "arn:aws:iam::1234567890:instance-profile/Grafana",
                "Id": "AIPA3gggdddddddgg"
            },
            "State": "associating"
        }
    ]
}

To resolve stuck profile association states, use either the AWS Management Console or AWS Command Line Interface (AWS CLI).

Use the AWS Management Console

  1. Detach the IAM role from the instance.
  2. Reattach the IAM role to the instance.

Use the AWS CLI

Note: If you receive errors when you run AWS CLI commands, then see Troubleshoot AWS CLI errors. Also, make sure that you're using the most recent AWS CLI version.

  1. Disassociate all associations one at a time. Run the following command for each association that you want to disassociate.
    Note: Replace example-association-id with the association ID that you want to disassociate.

    $ aws ec2 disassociate-iam-instance-profile --association-id example-association-id
  2. List the associations for the instance profile to make sure that all required associations are disassociated:
    Note: Replace example-instance-id with the required instance ID and example-region with the required AWS Region.

    $ aws ec2 describe-iam-instance-profile-associations --filters Name=instance-id,Values=example-instance-id --region example-region
  3. Attach the IAM profile:
    Note: Replace example-instance-id with the required instance ID and example-profile-name with the required profile name.

    $ aws ec2 associate-iam-instance-profile \
        --instance-id example-instance-id \
        --iam-instance-profile Name=example-profile-name
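
To find the associations that need attention without reading the JSON by hand, you can ask the AWS CLI for a two-column text listing and filter it. The following sketch uses hypothetical function names. It treats every state other than associated as a candidate for disassociation, which is an assumption that you might want to adjust.

```bash
# Lists the associations of an instance as "association-id<TAB>state" lines.
list_associations() {
    aws ec2 describe-iam-instance-profile-associations \
        --filters "Name=instance-id,Values=$1" \
        --region "$2" \
        --query 'IamInstanceProfileAssociations[].[AssociationId,State]' \
        --output text
}

# Reads "association-id<TAB>state" lines on stdin and prints the IDs whose
# state is anything other than "associated".
stuck_association_ids() {
    awk '$2 != "associated" { print $1 }'
}

# Example use:
#   list_associations example-instance-id example-region | stuck_association_ids
```

Each printed ID can then be passed to the disassociate-iam-instance-profile command from step 1.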

Incorrectly configured credentials

The CloudWatch agent searches for the following types of credentials in order:

  • env
  • assume-role
  • assume-role-with-web-identity
  • sso
  • shared-credentials-file
  • custom-process
  • config-file
  • ec2-credentials-file
  • boto-config
  • container-role
  • iam-role

Note: The iam-role credential is last in the list of credentials that the CloudWatch agent searches for. If other credentials are found before iam-role, then the iam-role credential isn't used.

To check which credentials are in use, run the following command on the instance:

$ aws sts get-caller-identity --debug

To resolve the issue where the iam-role credential isn't used, either attach the appropriate policy to the detected credentials or remove the detected credentials. When you remove the detected credentials, the iam-role credential becomes the primary credential.
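
The following sketch looks for two of the credential sources that take precedence over iam-role. It is intentionally partial: it checks only the env and shared-credentials-file entries from the list above, and the function name is hypothetical. Run it as the same user that runs the agent, because both the environment and the home directory depend on the user.

```bash
# Hypothetical helper: prints the higher-priority credential sources that it
# can detect. Only two sources from the list above are checked.
detect_credential_overrides() {
    # env: access keys exported in the environment.
    if [ -n "${AWS_ACCESS_KEY_ID:-}" ]; then
        echo "env"
    fi
    # shared-credentials-file: the default file, or the path that the
    # AWS_SHARED_CREDENTIALS_FILE variable points to.
    if [ -f "${AWS_SHARED_CREDENTIALS_FILE:-$HOME/.aws/credentials}" ]; then
        echo "shared-credentials-file"
    fi
}

# Example use:
#   sudo -i sh -c '. ./detect.sh; detect_credential_overrides'
```

If the function prints nothing, then neither of these two sources is present. Other entries in the list might still apply.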

Failed to retrieve metadata information

If the CloudWatch agent fails to retrieve metadata information, then you might receive an EC2MetadataError in the log file amazon-cloudwatch-agent.log. To resolve this error, complete the following steps:

Check the metadata accessibility

  • To check the metadata accessibility, run the following commands:

    ```bash
    curl http://169.254.169.254/latest/meta-data/iam/security-credentials
    curl http://169.254.169.254/latest/meta-data/identity-credentials/ec2/security-credentials/ec2-instance
    curl http://169.254.169.254/latest/meta-data/identity-credentials/ec2/security-credentials
    curl http://169.254.169.254/latest/meta-data
    ```
  • For different metadata versions, run the following commands:
    IMDSv1:

    wget -q -O - http://169.254.169.254/latest/meta-data/instance-id  
    curl http://169.254.169.254/

    IMDSv2:

    TOKEN=`curl -X PUT "http://169.254.169.254/latest/api/token" -H "X-aws-ec2-metadata-token-ttl-seconds: 21600"` \
    && curl -H "X-aws-ec2-metadata-token: $TOKEN" -v http://169.254.169.254/latest/meta-data/
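
The two request styles above can be combined into one helper that tries IMDSv2 first and falls back to IMDSv1. The following is a sketch under stated assumptions: the function name is hypothetical, the 60-second token lifetime and 2-second timeout are arbitrary, and the fallback works only if IMDSv1 is still allowed on the instance.

```bash
# Hypothetical helper: fetches a metadata path, for example "instance-id".
# It requests an IMDSv2 session token first. If no token is returned, it
# retries the request without a token (IMDSv1).
imds_get() {
    base="http://169.254.169.254"
    token=$(curl -s -f --max-time 2 -X PUT "$base/latest/api/token" \
        -H "X-aws-ec2-metadata-token-ttl-seconds: 60") || token=""
    if [ -n "$token" ]; then
        curl -s -f --max-time 2 \
            -H "X-aws-ec2-metadata-token: $token" \
            "$base/latest/meta-data/$1"
    else
        curl -s -f --max-time 2 "$base/latest/meta-data/$1"
    fi
}

# Example use:
#   imds_get instance-id
#   imds_get iam/security-credentials
```

A non-zero exit status from both attempts suggests that the metadata service is unreachable or turned off, rather than a version mismatch.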

Check the network settings

Check the route tables to determine whether a route exists for the metadata IP address 169.254.169.254:

  • For Linux, run the following commands:
    Check the OS kernel level route tables:

    sudo route -n
    netstat -rn
    ip route list

    Check the firewall:

    sudo iptables -L

    If a route is missing, then add a route to the kernel level route table:
    Note: Replace example-gateway with the required gateway.

    sudo ip route add 169.254.169.254 via <example-gateway>
  • For Windows, run the following commands:
    Check the route tables:

    ipconfig /all
    route print

    If a route is missing, then add the static routes:
    Note: Replace example-gateway with the required gateway IP address.

    route -p ADD 169.254.169.251 MASK 255.255.255.255 <example-gateway>
    route -p ADD 169.254.169.250 MASK 255.255.255.255 <example-gateway>
    route -p ADD 169.254.169.254 MASK 255.255.255.255 <example-gateway>
    route -p ADD 169.254.169.249 MASK 255.255.255.255 <example-gateway>
    route -p ADD 169.254.169.123 MASK 255.255.255.255 <example-gateway>
    route -p ADD 169.254.169.253 MASK 255.255.255.255 <example-gateway>

Check the PUT Response Hop Limit for IMDS

By default, the HttpPutResponseHopLimit is set to 1. This allows the packet to be forwarded by only one router or hop and for the request to remain within the instance's local network. If the following scenarios are present, then adjust the hop limit:

  • If you have IMDS requests that pass through more than one network hop, then adjust your HttpPutResponseHopLimit.
  • If you use container orchestration systems on Amazon EC2 instances and additional hops are introduced for metadata service requests, then adjust the hop limit. Adjust the hop limit so that containers can still access IMDS.
  • If you forward or route requests through specific network paths for inspection or logging, then adjust your hop limit.
  • If network issues cause failures when you access IMDS, then adjust the hop limit for diagnostic purposes. Make sure that you roll back the hop limit value when troubleshooting is complete.
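
To view or change the hop limit, you can use the AWS CLI. The following sketch wraps the describe-instances and modify-instance-metadata-options commands in hypothetical helper functions and validates the value first. The accepted range of 1 to 64 is the documented range for HttpPutResponseHopLimit. Region and credentials are assumed to come from your AWS CLI configuration.

```bash
# Succeeds if the argument is a whole number from 1 to 64.
valid_hop_limit() {
    case "$1" in
        ''|*[!0-9]*) return 1 ;;   # empty or contains a non-digit
    esac
    [ "$1" -ge 1 ] && [ "$1" -le 64 ]
}

# Shows the current metadata options, including HttpPutResponseHopLimit.
show_metadata_options() {
    aws ec2 describe-instances --instance-ids "$1" \
        --query 'Reservations[].Instances[].MetadataOptions'
}

# Sets the hop limit for an instance after validating the value.
set_hop_limit() {
    valid_hop_limit "$2" || { echo "hop limit must be from 1 to 64" >&2; return 1; }
    aws ec2 modify-instance-metadata-options \
        --instance-id "$1" \
        --http-put-response-hop-limit "$2" \
        --http-endpoint enabled
}

# Example use:
#   show_metadata_options example-instance-id
#   set_hop_limit example-instance-id 2
```

A value of 2 is a common choice when containers on the instance need to reach IMDS, but the right value depends on how many hops your setup adds.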

Check the proxy settings

Proxy settings might interfere with the CloudWatch agent's ability to retrieve metadata information. If a proxy is configured, then update the common-config.toml with the appropriate proxy values.

Example proxy settings:

```toml
[proxy]
# http_proxy = "{example-http-url}"
# https_proxy = "{example-https-url}"
no_proxy = "169.254.169.254"
```

Check the version compatibility with IMDSv2

Versions earlier than CloudWatch agent 1.23 don't support IMDSv2. If you run an older version of the CloudWatch agent, then update the CloudWatch agent to the latest version.

To check the version of your CloudWatch agent, run the following commands:

For Linux:

```bash
sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl -a status
```

For Windows:

```powershell
& $Env:ProgramFiles\Amazon\AmazonCloudWatchAgent\amazon-cloudwatch-agent-ctl.ps1 -m ec2 -a status
```

Check if the CloudWatch agent assumes it's running on-premises

If the CloudWatch agent is installed in Docker containers on Amazon EC2 and fails to retrieve the metadata information, then the agent assumes that it's running on premises. You might receive the following error:

"2023/05/19 11:43:50 I! access ECS task metadata fail with response unable to get response from http://169.254.170.2/v2/metadata, error: Get "http://169.254.170.2/v2/metadata": context deadline exceeded (Client.Timeout exceeded while awaiting headers), assuming I'm not running in ECS.
I! Detected the instance is OnPremise"

This error occurs when the CloudWatch agent tries to locate the credentials in the /root/.aws/credentials path on a Linux machine and has a timeout. This timeout occurs when the agent tries to connect to the metadata endpoint IP for Amazon Elastic Container Service (Amazon ECS) instead of Amazon EC2. To resolve this, complete one of the following:

  • Use the full version of the CloudWatch agent that's installed on the underlying OS. This allows you to bypass the cluster setup.
  • Use the OnPremise option of the CloudWatch agent in CloudWatch Container Insights and make sure that the correct AWS credentials are provided. This prevents metadata calls to the Amazon ECS endpoint.

Missing certificate bundles

If the certificate bundle doesn't include the necessary Amazon root certificate, then the Amazon EC2 client encounters trust issues. You might receive the following error in the CloudWatch agent log file:

"caused by: Post "https://ec2.ca-central-1.amazonaws.com/": x509: certificate signed by unknown authority, metrics will be dropped until it got fixed
2023-08-08T17:24:10Z E! refresh EC2 Instance Tags failed: RequestError: send request failed caused by: Post "https://ec2.ca-central-1.amazonaws.com/": x509: certificate signed by unknown authority, metrics will be dropped until it got fixed
2023-08-08T17:34:10Z E! refresh EC2 Instance Tags failed: RequestError: send request failed caused by: Post "https://ec2.ca-central-1.amazonaws.com/": x509: certificate signed by unknown authority, metrics will be dropped until it got fixed"

To inspect the certificate bundles, run the following commands:

  • For Linux, certificates are stored in /etc/ssl/certs/. To inspect the primary certificate bundle, run the following command:

    ```bash
    cat /etc/ssl/certs/ca-certificates.crt
    ```
  • For applications that have their own certificates, run the following command:
    Note: Replace example-region with the required AWS Region.

    ```bash
    curl --cacert /path/to/ca-bundle.crt -Iv https://ec2.example-region.amazonaws.com/
    ```
  • For Windows, certificates are managed through the Microsoft Management Console (MMC). To access the MMC from a Windows machine, press the Windows icon button and R at the same time. When the prompt appears, type mmc, and then select Enter.

Note: If the Amazon root certificate is missing, then add it to your system or application's certificate bundle.
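
Because a raw PEM bundle is hard to read, it can help to count its certificates and search their subjects. The following sketch uses hypothetical function names and requires the openssl command line tool for the second function. The bundle path is an assumption that depends on the distribution: /etc/ssl/certs/ca-certificates.crt is common on Debian and Ubuntu, and /etc/pki/tls/certs/ca-bundle.crt is common on Amazon Linux and RHEL.

```bash
# Prints the number of certificates in a PEM bundle.
# -e is needed because the pattern starts with a hyphen.
count_bundle_certs() {
    grep -c -e '-----BEGIN CERTIFICATE-----' "$1"
}

# Succeeds if any certificate in the bundle mentions "Amazon Root CA" in its
# subject or issuer line. Requires openssl.
bundle_has_amazon_root() {
    openssl crl2pkcs7 -nocrl -certfile "$1" \
        | openssl pkcs7 -print_certs -noout \
        | grep -q "Amazon Root CA"
}

# Example use:
#   count_bundle_certs /etc/ssl/certs/ca-certificates.crt
#   bundle_has_amazon_root /etc/ssl/certs/ca-certificates.crt || echo "Amazon root CA not found"
```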

AWS OFFICIAL
Updated 2 months ago