What you're experiencing with API server latency spikes is concerning but not uncommon in Kubernetes environments, even with relatively small clusters like yours.
The high latency for LIST operations specifically is notable. LIST operations are particularly resource-intensive for the Kubernetes API server because each request retrieves and serializes a potentially large set of objects in a single response, and consistent LISTs (those issued without a resourceVersion) typically have to be served from etcd rather than from the API server's watch cache. When these operations take 10-60 seconds as you've observed, they can indeed cause cascading issues such as failed Prometheus scrapes and pod health checks.
Several factors could be contributing to these latency spikes:
- API request patterns: Even in small clusters, certain operations can generate a high volume of API requests, and that volume can become a bottleneck for the control plane, showing up as API latency.
- Inefficient LIST operations: If applications or tools perform broad LIST operations without limiting their scope, they put significant load on the API server (a sketch of a scoped, paginated LIST follows this list).
- API throttling: The API server throttles inbound requests to protect the control plane from overload, so what looks like a bottleneck might actually be this protective mechanism in action.
- Downstream bottlenecks: The latency might not be in the API server itself but in downstream components such as etcd.
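To illustrate the scoping point, here is a minimal sketch of a bounded LIST using client-go. It assumes the standard k8s.io/client-go libraries; the namespace, label selector, and page size are hypothetical values you would adapt to your workloads:

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Build a client from the local kubeconfig (in-cluster config works the same way).
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	// Scope the LIST: a single namespace, a label selector, and a page size.
	// Each request stays small instead of pulling every pod in the cluster at once.
	opts := metav1.ListOptions{
		LabelSelector: "app=my-service", // hypothetical label
		Limit:         50,               // page size; the server returns a continue token
	}
	for {
		pods, err := clientset.CoreV1().Pods("my-namespace").List(context.TODO(), opts)
		if err != nil {
			panic(err)
		}
		for _, p := range pods.Items {
			fmt.Println(p.Name)
		}
		// Follow the continue token until the result set is exhausted.
		if pods.Continue == "" {
			break
		}
		opts.Continue = pods.Continue
	}
}
```

Each page is a separate, small request, so the API server never has to assemble and serialize the entire result set in one response.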
To address this issue, consider these steps:
- Optimize Kubernetes API usage:
  - Limit the scope of LIST operations (namespace, label/field selectors, pagination)
  - Use watch operations instead of repeated LIST calls where possible
  - Use resourceVersion to get only changes since a specific point
- Investigate API Priority and Fairness (APF) settings: Check whether the APF queues are configured appropriately for your API call patterns.
- Monitor control plane metrics: Look at metrics beyond API server latency, such as etcd request latency and APF queue depths (see the second sketch after this list).
- Use CloudWatch Logs Insights: Create queries against your EKS control plane logs to identify the sources of high-latency API calls.
- Implement shared informers: If you have multiple controllers watching the same resources, use shared informers to reduce API server load (the first sketch after this list shows the idea).
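For the watch and shared-informer suggestions, here is a minimal client-go sketch (again assuming the standard client libraries; the namespace and the handler logic are only illustrative). The informer does one initial LIST, then keeps a local cache up to date over a WATCH, so consumers read from memory instead of re-listing:

```go
package main

import (
	"fmt"
	"time"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/labels"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	// One shared factory per process: every consumer of pod data shares a single
	// LIST+WATCH against the API server and reads from the local in-memory cache.
	factory := informers.NewSharedInformerFactory(clientset, 10*time.Minute)
	podInformer := factory.Core().V1().Pods()

	podInformer.Informer().AddEventHandler(cache.ResourceEventHandlerFuncs{
		AddFunc: func(obj interface{}) {
			pod := obj.(*corev1.Pod)
			fmt.Println("pod added:", pod.Name)
		},
	})

	stop := make(chan struct{})
	defer close(stop)
	factory.Start(stop)
	// Wait for the initial LIST to fill the cache; after that only WATCH events flow.
	factory.WaitForCacheSync(stop)

	// Reads now come from the local lister, not from repeated LIST calls.
	pods, err := podInformer.Lister().Pods("my-namespace").List(labels.Everything())
	if err != nil {
		panic(err)
	}
	fmt.Println("pods cached for my-namespace:", len(pods))
}
```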
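For the APF and control plane metrics points, a rough sketch of pulling a few relevant kube-apiserver metrics through the Prometheus HTTP API, assuming your Prometheus already scrapes the apiserver endpoint. The Prometheus address is a placeholder, and you should verify that these metric and label names match what your cluster version actually exposes:

```go
package main

import (
	"context"
	"fmt"
	"time"

	"github.com/prometheus/client_golang/api"
	promv1 "github.com/prometheus/client_golang/api/prometheus/v1"
)

func main() {
	// Placeholder address; point this at the Prometheus that scrapes the apiserver.
	client, err := api.NewClient(api.Config{Address: "http://prometheus.monitoring:9090"})
	if err != nil {
		panic(err)
	}
	promAPI := promv1.NewAPI(client)

	// p99 LIST latency, APF queue wait time, APF rejections, and etcd request latency,
	// to separate "slow to execute" from "slow because it queued" from "slow backend".
	queries := []string{
		`histogram_quantile(0.99, sum by (le, resource) (rate(apiserver_request_duration_seconds_bucket{verb="LIST"}[5m])))`,
		`histogram_quantile(0.99, sum by (le, priority_level) (rate(apiserver_flowcontrol_request_wait_duration_seconds_bucket{execute="true"}[5m])))`,
		`sum(rate(apiserver_flowcontrol_rejected_requests_total[5m]))`,
		`histogram_quantile(0.99, sum by (le, operation) (rate(etcd_request_duration_seconds_bucket[5m])))`,
	}

	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()
	for _, q := range queries {
		result, warnings, err := promAPI.Query(ctx, q, time.Now())
		if err != nil {
			panic(err)
		}
		if len(warnings) > 0 {
			fmt.Println("warnings:", warnings)
		}
		fmt.Printf("%s\n%v\n\n", q, result)
	}
}
```

If APF wait times or rejections rise together with the LIST latency, the delay is queueing rather than execution; if etcd request latency rises instead, the bottleneck is likely downstream of the API server.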
While the EKS control plane is managed by AWS, you still control how your applications and tooling interact with the API server. Focus on reducing the frequency and scope of LIST operations, which should help mitigate these latency spikes.
Sources
Kubernetes Control Plane - Amazon EKS
Control Plane Monitoring - Amazon EKS
Looking at the metrics, there doesn't seem to be any spike in requests during this time period. Prometheus does LIST operations on endpoints, services, and pods, which are the requests that are delayed the most. However, the responses should still be relatively small (fewer than 100 pods), and it makes these requests regularly. I also cannot see any 429 / throttling responses from the API.