Managing webhook failures on Amazon EKS
This article guides you on how to configure your Kubernetes webhook setup and use Amazon Elastic Kubernetes Service (Amazon EKS) to identify webhook failures proactively.
Introduction
Kubernetes admission webhooks are used in Kubernetes applications and open-source projects, such as the AWS Load Balancer Controller and Pod Identity Webhook, to provide the AWS Identity and Access Management (IAM) Roles for Service Accounts (IRSA) feature on Amazon EKS clusters. These projects use webhooks to extend the mutating and validation capabilities at runtime. These extended capabilities can include the following tasks:
- Automatically inject sidecar containers
- Manage external resources
- Validate Kubernetes objects
While webhooks provide additional functionalities for Kubernetes clusters, they can also introduce unexpected failures. As part of the AWS Support team, we work with numerous customers who face challenges with webhook failures in their Amazon EKS clusters. Based on these experiences, we developed a deep understanding of the function of webhooks, common issues that occur, and troubleshooting processes.
Understanding webhooks
On Amazon EKS, MutatingAdmissionWebhook and ValidatingAdmissionWebhook admission controllers are turned on for Kubernetes versions 1.23 and later. These controllers allow Amazon EKS to dynamically configure webhooks and intercept certain types of Kubernetes API requests before persisting the data into etcd. The following diagram shows a general workflow of how webhook mutation and validation are integrated with the Kubernetes API server.
To configure these admission webhooks, you can create MutatingWebhookConfiguration or ValidatingWebhookConfiguration Kubernetes API objects on Amazon EKS.
For example, on an Amazon EKS cluster that runs AWS Load Balancer Controller v2.7.2, the controller creates the following mutating webhook configuration, named aws-load-balancer-webhook, to set the Pod readiness gate. In the following example output, the webhook configuration specifies the Namespace Selector, Object Selector, and Rules that determine whether a request to the API server is sent to aws-load-balancer-webhook-service:
$ kubectl describe MutatingWebhookConfiguration/aws-load-balancer-webhook
Name:         aws-load-balancer-webhook
Namespace:
Labels:       app.kubernetes.io/instance=aws-load-balancer-controller
              app.kubernetes.io/name=aws-load-balancer-controller
              app.kubernetes.io/version=v2.7.2
API Version:  admissionregistration.k8s.io/v1
Kind:         MutatingWebhookConfiguration
Webhooks:
  Admission Review Versions:
    v1beta1
  Client Config:
    Service:
      Name:       aws-load-balancer-webhook-service
      Namespace:  kube-system
      Path:       /mutate-v1-pod
      Port:       443
  Failure Policy:  Fail
  Match Policy:    Equivalent
  Name:            mpod.elbv2.k8s.aws
  Namespace Selector:
    Match Expressions:
      Key:       elbv2.k8s.aws/pod-readiness-gate-inject
      Operator:  In
      Values:
        enabled
  Object Selector:
    Match Expressions:
      Key:       app.kubernetes.io/name
      Operator:  NotIn
      Values:
        aws-load-balancer-controller
  Rules:
    API Versions:
      v1
    Operations:
      CREATE
    Resources:
      pods
  ...
With this mutating webhook configuration, the AWS Load Balancer Controller automatically injects the custom Pod readinessGates property. When you create Pods in a Kubernetes namespace that has the label elbv2.k8s.aws/pod-readiness-gate-inject: enabled, the Kubernetes API server calls the AWS Load Balancer Controller webhook service. The webhook service then sets a readiness condition on the Pods that make up your Kubernetes service.
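For the injection to occur, the target namespace must carry the label that the Namespace Selector matches. The following commands are a minimal sketch; the namespace name my-namespace is an example:

# Label an example namespace so that the webhook injects readiness gates into new Pods (my-namespace is a placeholder)
kubectl label namespace my-namespace elbv2.k8s.aws/pod-readiness-gate-inject=enabled

# Verify the label
kubectl get namespace my-namespace --show-labels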
Kubernetes webhook failures might cause the following issues:
- Unscheduled Pods
- Failed deployments
Webhook failures can also block cluster operations, such as updates to node status. Because the troubleshooting process is complex, this situation is particularly challenging during events that affect production workloads.
To maintain cluster stability, you must identify and address webhook failures proactively. To mitigate webhook failures, monitor the webhook status regularly. In this article, you will learn how to identify common patterns of webhook failure, outline a strategy for detecting failures proactively, and resolve webhook failures on Amazon EKS.
Part 1: Identify common patterns of Amazon EKS cluster webhook failure
Unavailable endpoints that block system functionality
To find out how the cluster handles an unsuccessful webhook invocation request, check Failure policy in the webhook configuration. If the value of Failure policy is Fail, then a webhook call error causes the admission to fail and the API server to reject the request:
$ kubectl describe MutatingWebhookConfiguration/aws-load-balancer-webhook
Name:          aws-load-balancer-webhook
API Version:   admissionregistration.k8s.io/v1
Kind:          MutatingWebhookConfiguration
Webhooks:
  ...
  Failure Policy:  Fail
  ...
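To quickly review the failure policy of every webhook in a cluster, you can use a custom-columns query such as the following sketch:

# List all mutating and validating webhooks with their failure policies
kubectl get mutatingwebhookconfigurations,validatingwebhookconfigurations \
  -o custom-columns='KIND:.kind,NAME:.metadata.name,FAILURE_POLICY:.webhooks[*].failurePolicy'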
A webhook service might also reject a request because of an internal error, which causes the admission webhook to fail. To identify this type of error, review your Kubernetes events. For example, the following error occurs because the webhook call to the AWS Load Balancer Controller fails when the associated certificate has expired:
$ kubectl describe ingress demo-ingress
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedDeployModel 53m (x9 over 63m) ingress (combined from similar events): Failed deploy model due to Internal error occurred: failed calling webhook "mtargetgroupbinding.elbv2.k8s.aws": Post "https://aws-load-balancer-webhook-service.kube-system.svc:443/mutate-elbv2-k8s-aws-v1beta1-targetgroupbinding?timeout=30s": x509: certificate has expired or is not yet valid: current time 2022-03-03T07:37:16Z is after 2022-02-26T11:24:26Z
Pod status is stuck or can't be deleted
When the webhook fails to start or respond, it can lead to incorrect Kubernetes Pod statuses. This might potentially affect the entire cluster. In this case, you might not be able to delete the Pods, even after you delete the Kubernetes deployment.
Note: If you receive errors when you run AWS Command Line Interface (AWS CLI) commands, then see Troubleshooting errors for the AWS CLI. Make sure that you're using the most recent AWS CLI version.
Example:
To delete your Kubernetes deployment, use the kubectl delete deployment command:
# Delete deployment
$ kubectl delete deployment demo-deployment
To check for existing ReplicaSets, use the kubectl get replicasets command:
# ReplicaSet and Pods still exist
$ kubectl get replicasets
NAME                         DESIRED   CURRENT   READY   AGE
demo-deployment-7b95fd5f56   2         2         2       75m
demo-deployment-fc84c4f49    2         2         2       5m19s
The output shows that ReplicaSets that are associated with the deployment are still present.
To verify if the Pods exist and their statuses, use the kubectl get pods command:
$ kubectl get pods
NAME                               READY   STATUS    RESTARTS   AGE
demo-deployment-7b95fd5f56-79l9g   1/1     Running   0          82m
demo-deployment-7b95fd5f56-lqjqj   1/1     Running   0          82m
demo-deployment-fc84c4f49-5jw2l    1/1     Running   0          12m
demo-deployment-fc84c4f49-cq624    1/1     Running   0          12m
The output shows that the Pods that the deployment managed are still running, even after the deployment was deleted.
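To confirm that a webhook is blocking the cleanup, you can check the events on one of the remaining ReplicaSets and in the namespace for failed calling webhook messages. The following commands are a sketch that uses the ReplicaSet name from the example output:

# Review events on a remaining ReplicaSet (name taken from the example output above)
kubectl describe replicaset demo-deployment-7b95fd5f56

# Review recent warning events in the current namespace
kubectl get events --field-selector type=Warning --sort-by=.lastTimestamp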
Cluster status can't be updated because the kube-controller-manager is blocked
A common issue in cluster management is the inability to update cluster resource statuses, such as the status of Pods, deployments, or jobs. This issue can occur when the kube-controller-manager enters a retry loop. For example, you might not be able to update the cluster status when a Kubernetes deployment continuously tries to reschedule Pods and call the webhook through the deployment controller. These loops occur when there's a problem with the webhook configuration, or the webhook service endpoint fails to respond.
Because of the loop, controllers that are embedded in the kube-controller-manager, such as the job controller, can't take actions or create jobs. This situation persists until the webhook request completes successfully and the Kubernetes deployment reaches the required number of running replicas.
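If control plane logging is turned on (see Part 2), then you can check the kube-controller-manager logs for webhook call errors. The following CloudWatch Logs Insights query is a sketch that mirrors the API server query used later in this article:

fields @timestamp, @message, @logStream
| filter @logStream like /kube-controller-manager/
| filter @message like 'failed calling webhook'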
Critical add-ons fail to deploy and cause worker nodes to be NotReady
When a webhook fails or is unresponsive, critical components, such as kube-proxy or aws-node, the Amazon Virtual Private Cloud (Amazon VPC) CNI plugin, might fail to deploy. Because these components are missing, worker nodes can enter a NotReady status. For example, a webhook failure commonly blocks the CNI plugin from deploying to worker nodes, which leaves those nodes unavailable:
$ kubectl describe daemonsets aws-node --namespace kube-system
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedCreate 18s (x14 over 60s) daemonset-controller Error creating: Internal error occurred: failed calling webhook "mpod.elbv2.k8s.aws": failed to call webhook: Post "https://webhook-service.system.svc:443/mutate-v1-pod?timeout=10s": service "webhook-service" not found
Part 2: Monitor the webhook
The Kubernetes API server is responsible for calling webhooks. For example, in Kubernetes 1.29, the API server uses an embedded dispatcher to delegate mutation and validation operations. Because webhook calls originate from the API server, you can apply the following observability strategies on Amazon EKS to detect webhook failures proactively.
Amazon EKS control plane logging
Amazon EKS provides built-in control plane logging that streams logs directly from the Amazon EKS control plane to Amazon CloudWatch Logs in your account. For more information about how to turn on these logs for your account, see Send control plane logs to CloudWatch Logs.
To capture and diagnose webhook failures for your Kubernetes cluster components, turn on API server and controller manager logs to filter errors that occur in the Amazon EKS cluster.
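For example, you can turn on the API server and controller manager log types with the AWS CLI. This is a minimal sketch; the cluster name my-cluster and the Region are placeholders:

# Turn on API server and controller manager logs (cluster name and Region are placeholders)
aws eks update-cluster-config \
  --region us-east-1 \
  --name my-cluster \
  --logging '{"clusterLogging":[{"types":["api","controllerManager"],"enabled":true}]}'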
Note: For logs that Amazon EKS sends to CloudWatch Logs from your clusters, you're charged additional standard CloudWatch Logs data ingestion and storage costs. To optimize your log costs, you can set up log retention in CloudWatch Logs.
CloudWatch Logs Insights
Amazon CloudWatch Logs Insights is an extended feature of CloudWatch Logs that streamlines discovering, categorizing, and visualizing your CloudWatch log data. After you turn on control plane logging, you can use CloudWatch Logs Insights queries to find the timestamp when the failed to call webhook error message first occurs:
fields @timestamp, @message, @logStream
| filter @logStream like /kube-apiserver/
| filter @message like 'failed to call webhook'
CloudWatch alarms
With CloudWatch alarms, you can create alarms that alert you when specific thresholds are breached. To get notifications about webhook issues, you can set these alarms for webhook failure metrics. Also, you can create CloudWatch alarms that use a metric filter to search for Kubernetes webhook failure messages. The metric filter looks for specific patterns in your logs that indicate webhook failures, such as failed calling webhook.
Prometheus metrics
The Kubernetes API server that runs on Amazon EKS exposes metrics in the Prometheus format. You can use these metrics to monitor and analyze your webhook data, including the total number of admission webhook calls and the latency of these calls. For example, you can analyze the following metrics from the admission metrics source code of Kubernetes 1.29:
- webhook_rejection_count
- webhook_fail_open_count
To identify performance issues or failures in your webhooks, use the following kubectl command and filter the raw response:
[ec2-user@ip-1-1-1-1 ~]$ kubectl get --raw /metrics | grep "apiserver_admission_webhook"
...
apiserver_admission_webhook_request_total{code="400",name="mpod.elbv2.k8s.aws",operation="CREATE",rejected="true",type="admit"} 17
[ec2-user@ip-1-1-1-1 ~]$ kubectl get --raw /metrics | grep "apiserver_admission_webhook_rejection"
...
apiserver_admission_webhook_rejection_count{error_type="calling_webhook_error",name="mpod.elbv2.k8s.aws",operation="CREATE",rejection_code="400",type="admit"} 17
Also, you can check the Prometheus metrics that the application provides to maintain webhook service availability and avoid failure. For example, AWS Load Balancer Controller v2.7.2 provides several Prometheus metrics to view the application runtime status.
You can collect cluster Prometheus metrics with Amazon Managed Service for Prometheus and visualize the data with Grafana open source or Amazon Managed Grafana. To use this process, see Amazon Managed Service for Prometheus collector provides agentless metric collection for Amazon EKS.
Solution overview
To manage webhook failures proactively, you can use Amazon EKS control plane logging and CloudWatch. The following example configuration involves turning on Kubernetes API server logs that are then processed by a metric filter to capture errors in CloudWatch Logs. Then, these errors are converted into CloudWatch metrics that you can use to monitor webhook errors in real time.
Prerequisites
Make sure that you meet the following prerequisites:
- You have an Amazon EKS cluster with control plane logging turned on (kube-apiserver).
- You have permissions to access the CloudWatch console and features.
Solution walkthrough
Step 1: Create a CloudWatch metric filter
In the CloudWatch console, create a metric filter for the log group that contains your Kubernetes API server logs. Use the following configuration:
- Define the filter pattern to match the webhook failure messages. For example, use the term failed to call webhook.
- Enter a name and namespace for the new custom metric. Example:
  - Filter Name: FilterWebhookCallFailure
  - Metric namespace: EKSCluster
  - Metric name: WebhookCallErrorCount
  - Metric value: 1
    Note: This specifies that the count is incremented by 1 for every log event that contains failed to call webhook.
  - Default value: 0
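Alternatively, you can create an equivalent metric filter with the AWS CLI. The following command is a minimal sketch; the log group name /aws/eks/my-cluster/cluster assumes a cluster named my-cluster:

# Create a metric filter that counts "failed to call webhook" log events (log group name is an example)
aws logs put-metric-filter \
  --log-group-name /aws/eks/my-cluster/cluster \
  --filter-name FilterWebhookCallFailure \
  --filter-pattern '"failed to call webhook"' \
  --metric-transformations metricName=WebhookCallErrorCount,metricNamespace=EKSCluster,metricValue=1,defaultValue=0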
After you create the metric, you can review the available CloudWatch metrics to capture the error proactively. You can also search and view metrics, including the WebhookCallErrorCount metric that you created.
Step 2: Detect errors with a CloudWatch alarm
After you define the metric, create a CloudWatch alarm based on a static threshold to extend proactive detection and verify the webhook status.
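For example, the following AWS CLI command is a sketch of an alarm that fires when the WebhookCallErrorCount metric records one or more errors within a 5-minute period. The alarm name and the SNS topic ARN are examples:

# Alarm on one or more webhook call errors in a 5-minute period (alarm name and SNS topic ARN are placeholders)
aws cloudwatch put-metric-alarm \
  --alarm-name EKSWebhookCallErrors \
  --namespace EKSCluster \
  --metric-name WebhookCallErrorCount \
  --statistic Sum \
  --period 300 \
  --evaluation-periods 1 \
  --threshold 1 \
  --comparison-operator GreaterThanOrEqualToThreshold \
  --treat-missing-data notBreaching \
  --alarm-actions arn:aws:sns:us-east-1:111122223333:webhook-alerts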
To view the logs, choose Related Logs. To identify the webhook error, you can review relevant logs and query the logs for failed to call webhook.
When you set up a CloudWatch alarm, you can include additional actions that CloudWatch takes when the metric breaches the threshold. For example, you can configure a CloudWatch alarm to send an Amazon Simple Notification Service (Amazon SNS) notification to AWS Chatbot in Slack. When the webhook call generates an error and the CloudWatch alarm breaches the expected threshold, the integration sends a Slack notification that alerts you so that you can respond and resolve the issue immediately. The following screenshot shows an example Slack notification when the webhook call fails:
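If you don't already have a notification target, then you can create an Amazon SNS topic and subscribe an endpoint to it. The following commands are a sketch; the topic name, account ID, and email address are placeholders:

# Create an SNS topic for webhook alerts (topic name is a placeholder)
aws sns create-topic --name webhook-alerts

# Subscribe an email endpoint to the topic (ARN and address are placeholders)
aws sns subscribe \
  --topic-arn arn:aws:sns:us-east-1:111122223333:webhook-alerts \
  --protocol email \
  --notification-endpoint ops@example.com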
Step 3: Fix the webhook failure
Option 1: Recover from "no endpoints available for service" errors
Most webhook configurations rely on a Kubernetes service and its endpoints to serve mutation or validation requests. The following example shows unreachable endpoints for the AWS Load Balancer Controller. To view the service and endpoint status, run the following kubectl commands:
[ec2-user@ip-1-1-1-1 ~]$ kubectl describe service aws-load-balancer-webhook-service --namespace kube-system
Name:              aws-load-balancer-webhook-service
Namespace:         kube-system
...
Port:              webhook-server  443/TCP
TargetPort:        webhook-server/TCP
Endpoints:         <none>

[ec2-user@ip-1-1-1-1 ~]$ kubectl get endpoints --namespace kube-system
NAME                                 ENDPOINTS   AGE
aws-load-balancer-webhook-service    <none>      4d19h
The output shows that the Endpoints value is <none>, so the webhook service is unreachable.
The endpoints are empty because the webhook Pods aren't running correctly. To review the current Pod status, use the kubectl get pods command:
[ec2-user@ip-1-1-1-1 ~]$ kubectl get pods --namespace kube-system
NAME                                            READY   STATUS    RESTARTS   AGE
aws-load-balancer-controller-5568b494f7-j597c   0/1     Pending   0          17s
aws-load-balancer-controller-5568b494f7-ljd5q   0/1     Pending   0          4s
Either of the following conditions might affect the availability of a webhook:
- A node status of NotReady
- An internal error of your webhook service
To inspect worker node details and Pod runtime status, run the following commands to check the container logs:
kubectl describe node NODE_NAME
kubectl logs POD_NAME --namespace POD_NAMESPACE
Option 2: Turn off a webhook
Temporarily ignoring connection errors and timeouts can sometimes relieve the production impact on the cluster. To skip mutation and validation actions, you can edit the webhook and set the failure policy to Ignore:
kubectl edit ValidatingWebhookConfiguration WEBHOOK_NAME
kubectl edit MutatingWebhookConfiguration WEBHOOK_NAME
Note: Replace WEBHOOK_NAME in the preceding commands with the name of the webhook.
Search for failurePolicy, and then update the value of this parameter to Ignore. Save the configuration, and exit the editor. The following snippet shows a sample mutating webhook configuration of the AWS Load Balancer Controller with failurePolicy set to Ignore:
apiVersion: admissionregistration.k8s.io/v1
kind: MutatingWebhookConfiguration
metadata:
  name: aws-load-balancer-webhook
webhooks:
- clientConfig:
    service:
      name: aws-load-balancer-webhook-service
      namespace: kube-system
      path: /mutate-v1-pod
  failurePolicy: Ignore # Search for failurePolicy and replace "Fail" with "Ignore"
  name: mpod.elbv2.k8s.aws
...
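If you prefer a non-interactive change, then a kubectl patch command can make the same update. The following command is a sketch; the webhook index 0 matches the single webhook in the preceding snippet, so confirm the index for your own configuration:

# Set failurePolicy to Ignore on the first webhook entry (index 0 is an assumption; verify it for your configuration)
kubectl patch mutatingwebhookconfiguration aws-load-balancer-webhook \
  --type=json \
  -p='[{"op":"replace","path":"/webhooks/0/failurePolicy","value":"Ignore"}]'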
If the failure policy update doesn't resolve the issue and you want to roll back the change, then repeat the previous steps and set the value of the parameter back to Fail.
Option 3: Remove a webhook
If the preceding strategies don't unblock the webhook failure, then temporarily remove the webhook as the final option.
Save the existing webhook configuration to a file before you delete it. Replace WEBHOOK_NAME in the following command with the name of the webhook:
kubectl get ValidatingWebhookConfiguration WEBHOOK_NAME -o yaml > validating-webhook-config.yaml
kubectl get MutatingWebhookConfiguration WEBHOOK_NAME -o yaml > mutating-webhook-config.yaml
Then, run the following commands to delete the webhook configuration:
kubectl delete ValidatingWebhookConfiguration WEBHOOK_NAME
kubectl delete MutatingWebhookConfiguration WEBHOOK_NAME
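After you resolve the underlying issue, you can restore the webhook from the files that you saved earlier. For example:

# Reapply the saved configuration files
# If the API server rejects server-generated fields, remove resourceVersion, uid, and creationTimestamp from the files first
kubectl apply -f validating-webhook-config.yaml
kubectl apply -f mutating-webhook-config.yaml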
Cleanup
After you troubleshoot your webhook failure, follow these steps to clean up:
- Delete any resources that you created during this walkthrough.
- Delete the CloudWatch alarm.
- If you created a custom metric, then the metric automatically expires according to the retention schedule.
- If you tested the chatbot integration, then delete the Amazon SNS topic and configured AWS Chatbot clients.
- Turn off any Amazon EKS control plane logs.
- Delete the CloudWatch metric filter that you created in the previous steps to avoid incurring further costs.
Conclusion
Use the strategies in this article to learn about the types of webhook failures that might affect your cluster availability, how to use alarms to detect these errors proactively, and how to use CloudWatch Logs to review the status of webhooks. These strategies can help you improve the operational state of your Amazon EKS environment and quickly address any webhook-related issues to recover your Amazon EKS cluster state.
With Premium Support plans, you get expert guidance, best practices, and technical support to help you maximize the value of your AWS services. To learn more, see AWS Support. To learn more about managing webhooks for Amazon EKS, check out the Admission Webhooks section of the Amazon EKS best practices guide.
About the authors
Eason Cao
Eason is a Senior Cloud Support Engineer with over 5 years of industry experience specializing in AWS container solutions. As a subject matter expert in container services at AWS, he's dedicated to helping customers overcome cloud environment challenges and optimize distributed systems.
Kuo-Le Mei
Kuo-Le is a Cloud Support Engineer specializing in compute and container solutions at AWS. He brings his expertise and enthusiasm for technology to help clients scale their businesses efficiently so that they can focus on their core activities.