We have a problem with the etcd server attached to our EKS cluster observability-prod

We are getting the following error message

Error: UPGRADE FAILED: rpc error: code = ResourceExhausted desc = etcdserver: throttle: too many requests

Kubernetes instances are either not being created or are only partially created

We noticed that we have the following issue:

Severity: Production system impaired
State: etcd throttling causing failed deployments
Request: etcd rate limit increase and control plane scaling
Error message: Error: UPGRADE FAILED: rpc error: code = ResourceExhausted desc = etcdserver: throttle: too many requests

Using CloudWatch, we traced the problem to our Grafana instances, which we run one per customer:

Each sidecar opens 120 watch connections/hour on ConfigMaps
A simultaneous Helm release storm compounded the request rate
Result: etcd ResourceExhausted throttling

But we cannot fix the issue, because we are unable to reliably deploy any Kubernetes objects or consistently change any existing ones

We need the etcd rate limit increased so we can at least make some changes

asked 13 days ago, 61 views
1 Answer
I understand you're experiencing etcd throttling issues on your EKS cluster. However, I need to clarify an important point: the etcd rate limit in Amazon EKS is not configurable or adjustable. The etcd database has a fixed size limit, and when exceeded, it can cause your cluster to enter a degraded state.

Based on the symptoms you're describing - excessive watch connections from Grafana sidecars and Helm release storms - you're likely hitting etcd's capacity limits rather than needing a rate limit increase. The solution is to reduce the load on etcd by addressing the root causes:

Immediate actions to reduce etcd load:

  1. Reduce unnecessary Kubernetes objects - Clean up any unnecessary ConfigMaps, Secrets, or other objects that are contributing to the etcd database size

  2. Address the Grafana sidecar watch connections - 120 watch connections per hour per instance is excessive. You'll need to:

  • Optimize or reconfigure the sidecars to reduce watch frequency
  • Consider alternative approaches to ConfigMap monitoring that don't require constant watch connections
  • Potentially batch or consolidate ConfigMap updates
  3. Control Helm deployments - Avoid simultaneous Helm release storms by:
  • Staggering deployments across your customer instances
  • Implementing rate limiting or queuing for Helm operations
  • Using deployment strategies that reduce concurrent API calls
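
As a rough illustration of staggering and queuing Helm operations, here is a minimal Python sketch. It processes releases one at a time, pauses between them, and retries with exponential backoff only when the output contains the throttle message from the question. The function names, the pause and delay values, and the release, chart, and namespace arguments are all assumptions for illustration, so tune them for your own environment rather than treating this as a recommended configuration.

```python
import random
import subprocess
import time
from typing import Callable, Iterable, List, Tuple

# Error text taken from the question; used to tell throttling from other failures.
THROTTLE_MARKER = "etcdserver: throttle: too many requests"

# A runner takes a release name and returns (exit_code, stderr_text).
Runner = Callable[[str], Tuple[int, str]]


def helm_runner(chart: str, namespace: str) -> Runner:
    """Build a runner that shells out to `helm upgrade --install` (hypothetical wiring)."""
    def run(release: str) -> Tuple[int, str]:
        proc = subprocess.run(
            ["helm", "upgrade", "--install", release, chart, "--namespace", namespace],
            capture_output=True,
            text=True,
        )
        return proc.returncode, proc.stderr
    return run


def upgrade_with_backoff(release: str, runner: Runner, max_attempts: int = 5,
                         base_delay: float = 2.0,
                         sleep: Callable[[float], None] = time.sleep) -> bool:
    """Retry one release on throttle errors, doubling the delay each attempt."""
    for attempt in range(max_attempts):
        code, stderr = runner(release)
        if code == 0:
            return True
        if THROTTLE_MARKER not in stderr:
            return False  # not a throttle error, so retrying will not help
        if attempt < max_attempts - 1:
            # Exponential backoff plus jitter so retries do not line up.
            sleep(base_delay * (2 ** attempt) + random.uniform(0, base_delay))
    return False


def staggered_rollout(releases: Iterable[str], runner: Runner,
                      pause_between: float = 30.0,
                      sleep: Callable[[float], None] = time.sleep) -> List[str]:
    """Upgrade releases sequentially with a pause after each; return the failures."""
    failed = []
    for release in releases:
        if not upgrade_with_backoff(release, runner, sleep=sleep):
            failed.append(release)
        sleep(pause_between)  # spread the API load out over time
    return failed
```

A call might look like `staggered_rollout(["customer-a", "customer-b"], helm_runner("./grafana-chart", "monitoring"))`, where the release names, chart path, and namespace are placeholders. Running releases sequentially is the simplest form of queuing; a small worker pool would be a reasonable next step once the cluster is stable again.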

The key issue is that your current architecture is generating too much load on etcd. Rather than increasing limits (which isn't possible), you need to reduce the number of API requests and watch connections your workloads are making. Once you've reduced this load, you should be able to make changes to your cluster more reliably.
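
To get a feel for how the per-sidecar watch rate adds up across a fleet, a back-of-the-envelope calculation can help. The sketch below uses the 120 watch connections per hour figure from the question; the instance count of 300 and the reduced rate of 12 per hour are purely hypothetical numbers chosen for illustration, since the question does not say how many customer instances are running.

```python
def watch_requests_per_second(instances: int,
                              watches_per_hour_per_sidecar: float = 120.0,
                              sidecars_per_instance: int = 1) -> float:
    """Aggregate rate of new watch connections across all instances, per second."""
    per_hour = instances * sidecars_per_instance * watches_per_hour_per_sidecar
    return per_hour / 3600.0


# Hypothetical fleet of 300 Grafana instances, one sidecar each.
current = watch_requests_per_second(300)                                    # 10.0
reduced = watch_requests_per_second(300, watches_per_hour_per_sidecar=12)   # 1.0
print(f"current: {current:.1f}/s, after a 10x reduction: {reduced:.1f}/s")
```

Plugging in your real instance and sidecar counts shows how much of the steady-state load comes from the sidecars alone, before any Helm activity is added on top.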

If you continue experiencing issues after cleanup and optimization, you should contact AWS Support for additional guidance specific to your cluster's situation.
Sources
Troubleshoot problems with Amazon EKS clusters and nodes - Amazon EKS
Control Plane Monitoring - Amazon EKS

answered 13 days ago
