How do I troubleshoot Container Insights issues for my Amazon EKS clusters?


I encounter issues when I configure Amazon CloudWatch Container Insights for my Amazon Elastic Kubernetes Service (Amazon EKS) clusters.

Resolution

Note: If you receive errors when you run AWS Command Line Interface (AWS CLI) commands, then see Troubleshooting errors for the AWS CLI. Also, make sure that you're using the most recent AWS CLI version.

Check your Container Insights installation

To check whether you correctly installed Container Insights on your Amazon EKS cluster, run the following command:

kubectl get pods -n amazon-cloudwatch

Then, run the following command for your pod:

kubectl describe pod pod-name -n amazon-cloudwatch

Note: Replace pod-name with the pod name.

Check the Events section of the command's output.
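The Events section lists scheduling, image pull, and volume mount failures for that pod. To scan recent events across the whole amazon-cloudwatch namespace at once, you can also run the following:

```shell
# Sort events so that the most recent appear last
kubectl get events -n amazon-cloudwatch --sort-by='.lastTimestamp'
```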

To check your CloudWatch logs, run the following command:

kubectl logs pod-name -n amazon-cloudwatch

Install CloudWatch Observability as an Amazon EKS managed add-on

Use the Amazon EKS add-on to install Container Insights with enhanced observability for Amazon EKS.

Note: You can use the CloudWatch Observability EKS add-on only on Amazon EKS clusters that run Kubernetes version 1.23 or later.
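For example, you can install the managed add-on with the AWS CLI. The following is a sketch in which my-cluster is a placeholder for your cluster name:

```shell
# Install the CloudWatch Observability managed add-on
aws eks create-addon \
    --cluster-name my-cluster \
    --addon-name amazon-cloudwatch-observability

# Confirm that the add-on reaches the ACTIVE status
aws eks describe-addon \
    --cluster-name my-cluster \
    --addon-name amazon-cloudwatch-observability \
    --query 'addon.status'
```

The agent pods also need CloudWatch permissions, for example through the CloudWatchAgentServerPolicy managed policy on the worker node IAM role.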

Alternatively, to install CloudWatch Observability as a self-managed add-on, complete the following steps:

  1. To install cert-manager, run the following command:

    kubectl apply -f https://github.com/cert-manager/cert-manager/releases/download/v1.16.1/cert-manager.yaml
  2. To install the custom resource definitions (CRD), run the following command:

    curl https://raw.githubusercontent.com/aws-samples/amazon-cloudwatch-container-insights/main/k8s-quickstart/cwagent-custom-resource-definitions.yaml | kubectl apply --server-side -f -
  3. To install the CloudWatch container agent operator, run the following command:

    ClusterName=my-cluster-name
    RegionName=my-cluster-region
    curl https://raw.githubusercontent.com/aws-samples/amazon-cloudwatch-container-insights/main/k8s-quickstart/cwagent-operator-rendered.yaml | sed 's/{{cluster_name}}/'${ClusterName}'/g;s/{{region_name}}/'${RegionName}'/g' | kubectl apply -f -
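After step 3 completes, you can confirm that the operator rolled out the agent. The DaemonSet name cloudwatch-agent matches the manifest shown later in this article; adjust it if your deployment uses a different name:

```shell
# Wait for the agent DaemonSet to finish rolling out
kubectl rollout status daemonset/cloudwatch-agent -n amazon-cloudwatch

# All pods in the namespace should reach the Running status
kubectl get pods -n amazon-cloudwatch
```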

Troubleshoot metrics that don't appear on the AWS Management Console

If you don't see Container Insights metrics on the AWS Management Console, then confirm that you completed the Container Insights setup.

Troubleshoot Container Insights errors

Unauthorized panic: Cannot retrieve cadvisor data from kubelet

To resolve this issue, make sure to activate the Webhook authorization mode in your kubelet.
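To check the kubelet's effective configuration, you can query the configz endpoint through the API server proxy. The following sketch assumes that you have permission to access the nodes/proxy resource, and it inspects the first node in the cluster:

```shell
# Pick one node name from the cluster
NODE=$(kubectl get nodes -o jsonpath='{.items[0].metadata.name}')

# Fetch the kubelet's running configuration and show the authorization
# section; it should report "mode": "Webhook"
kubectl get --raw "/api/v1/nodes/${NODE}/proxy/configz" \
    | python3 -m json.tool \
    | grep -A 2 '"authorization"'
```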

Invalid endpoint error

Example error message:

"log": "2020-04-02T08:36:16Z E! cloudwatchlogs: code: InvalidEndpointURL, message: invalid endpoint uri, original error: &url.Error{Op:\"parse\", URL:\"https://logs.{{region_name}}.amazonaws.com/\", Err:\"{\"}, &awserr.baseError{code:\"InvalidEndpointURL\", message:\"invalid endpoint uri\", errs:[]error{(*url.Error)(0xc0008723c0)}}\n",

To resolve this issue, make sure that you replace all placeholder values in your commands. For example, when you run AWS CLI commands, confirm that the values that you use for cluster-name and region-name are correct for your deployment.
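The literal {{region_name}} in the endpoint URL above indicates that the template placeholders were never substituted. The following locally runnable sketch shows the same sed substitution that the installation command uses, with example values, and then checks for leftover placeholders:

```shell
# Example values; substitute your own cluster name and AWS Region
ClusterName=my-cluster
RegionName=us-east-1

# The same substitution that the installation command performs
rendered=$(echo 'https://logs.{{region_name}}.amazonaws.com/ cluster: {{cluster_name}}' \
    | sed 's/{{cluster_name}}/'${ClusterName}'/g;s/{{region_name}}/'${RegionName}'/g')
echo "$rendered"

# Any remaining {{...}} means that a placeholder was not replaced
if echo "$rendered" | grep -q '{{'; then
    echo "WARNING: unreplaced placeholder remains"
else
    echo "OK: all placeholders replaced"
fi
```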

Pod metrics missing on Amazon EKS or Kubernetes after cluster upgrade

Example error message:

"W! No pod metric collected"

If your pod metrics are missing after you upgrade your cluster, then check that the container runtime on the node is working as expected.
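After a cluster upgrade, nodes commonly move from the dockershim runtime to containerd. One quick check is the CONTAINER-RUNTIME column of the node listing:

```shell
# The CONTAINER-RUNTIME column shows each node's runtime and version,
# for example containerd://1.7.x
kubectl get nodes -o wide
```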

To resolve this issue, update your deployment manifest to mount the containerd socket from the host into the container.

Example deployment manifest:

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: cloudwatch-agent
  namespace: amazon-cloudwatch
spec:
  template:
    spec:
      containers:
        - name: cloudwatch-agent
# ...
          # Don't change the mountPath
          volumeMounts:
# ...
            - name: dockersock
              mountPath: /var/run/docker.sock
              readOnly: true
            - name: varlibdocker
              mountPath: /var/lib/docker
              readOnly: true
            - name: containerdsock # NEW mount
              mountPath: /run/containerd/containerd.sock
              readOnly: true
# ...
      volumes:
# ...
        - name: dockersock
          hostPath:
            path: /var/run/docker.sock
        - name: varlibdocker
          hostPath:
            path: /var/lib/docker
        - name: containerdsock # NEW volume
          hostPath:
            path: /run/containerd/containerd.sock

For a full example of the manifest, see cwagent-daemonset.yaml on the GitHub website.

No pod metrics when using Bottlerocket for Amazon EKS

Example error message:

"W! No pod metric collected"

Bottlerocket mounts the containerd socket at a different path on the host. If you use Bottlerocket, then you must update the containerdsock volume to point to the Bottlerocket socket location.

Example configuration:

volumes:
  # ... 
    - name: containerdsock
      hostPath:
        # path: /run/containerd/containerd.sock
        # bottlerocket does not mount containerd sock at normal place
        # https://github.com/bottlerocket-os/bottlerocket/commit/91810c85b83ff4c3660b496e243ef8b55df0973b
        path: /run/dockershim.sock

Unexpected log volume increase from CloudWatch agent when collecting Prometheus metrics

To resolve this issue, update the CloudWatch agent to the latest available version. To find your current version, see Finding information about CloudWatch agent versions. To install the latest version, see Install the CloudWatch agent.

CrashLoopBackoff error on the CloudWatch agent

To resolve this issue, make sure that you correctly configured your AWS Identity and Access Management (IAM) permissions.
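For example, the CloudWatch agent typically gets its permissions from the worker node's IAM role. In the following sketch, NodeInstanceRole is a placeholder for your node role name:

```shell
# List the managed policies attached to the node role;
# the list should include CloudWatchAgentServerPolicy
aws iam list-attached-role-policies --role-name NodeInstanceRole
```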

CloudWatch agent or Fluentd pod stuck in pending

Your pod might be stuck in the Pending state, or you might receive a FailedScheduling error from your CloudWatch agent or Fluentd pods. To resolve this issue, confirm that your nodes have enough compute resources based on the number of CPU cores and the amount of RAM that the agents require.
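To compare what is already requested on each node against its allocatable capacity, you can run the following:

```shell
# Show the Allocated resources section for every node
kubectl describe nodes | grep -A 8 'Allocated resources'
```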

To describe the pods, run the following command:

kubectl describe pod cloudwatch-agent-85ppg -n amazon-cloudwatch

Config map for Fluent Bit not deployed correctly

To resolve this issue, confirm that you correctly deployed the fluent-bit-config config map in the amazon-cloudwatch namespace.

Example error messages:

[2024/10/02 11:16:42] [error] [config] inconsistent use of tab and space
[2024/10/02 11:16:42] [error] [config] error in /fluent-bit/etc/..2024_10_02_11_16_29.3759745087//application-log.conf:62: invalid indentation level
[2024/10/02 11:16:42] [error] configuration file contains errors, aborting.
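The first error indicates a tab character in the Fluent Bit configuration, which classic-mode .conf files don't allow for indentation. The following locally runnable sketch shows how to detect a stray tab in a sample snippet; you can pipe the output of `kubectl get configmap fluent-bit-config -n amazon-cloudwatch -o yaml` through the same grep to check the deployed config map:

```shell
# Sample Fluent Bit config snippet; the second line is indented with a tab
conf="$(printf '[INPUT]\n\tName tail\n    Path /var/log/*.log\n')"

# Print any line that contains a tab character
printf '%s\n' "$conf" | grep -n "$(printf '\t')" \
    && echo "Replace tab-indented lines with spaces" \
    || echo "No tabs found"
```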