Based on the information provided, you have run into a significant issue while updating your EKS cluster from version 1.27 to 1.28, specifically with the GPU worker node. The problem appears to stem from a version mismatch between the worker node and the cluster that has persisted through previous updates.
The most critical finding is the AMI version mismatch: the warning message shows the worker node AMI is still at version 1.25 while the cluster is being updated to 1.28. A version gap that wide is the likely reason node creation fails and new nodes cannot join the cluster. You can confirm the versions involved with the checks below.
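Before changing anything, it is worth confirming exactly which version the control plane, each node's kubelet, and the node group AMI report. A quick check, using my-cluster and gpu-nodes as placeholder names:
# Control plane version
aws eks describe-cluster --name my-cluster --query 'cluster.version'
# Kubelet version reported by every worker node (CPU and GPU)
kubectl get nodes -o wide
# Kubernetes version and AMI release version of the managed node group
aws eks describe-nodegroup --cluster-name my-cluster --nodegroup-name gpu-nodes --query 'nodegroup.[version,releaseVersion]'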
To resolve this issue, you should focus on the following steps:
- Update Worker Node AMIs: Before attempting the cluster version update, ensure that all worker nodes, both CPU and GPU, are running an AMI version compatible with Kubernetes 1.28. Ideally, bring the worker nodes up to 1.27 or 1.28 first (a CLI sketch follows this list).
- Staged Update Process: Rather than jumping the worker nodes directly from 1.25 to 1.28, update them one minor version at a time: 1.26, then 1.27, and finally 1.28. This gradual approach helps avoid compatibility issues.
- GPU Node Specific Checks: Since the GPU node is the one failing to update, verify that you are using a GPU-compatible AMI and that all required GPU drivers and software are up to date and compatible with the target Kubernetes version.
- Add-on Compatibility: Ensure that all cluster add-ons, especially networking (VPC CNI) and GPU-related ones, are updated to versions compatible with Kubernetes 1.28 (see the add-on commands after this list).
- Node Group Update Strategy: Review and, if needed, adjust your node group update strategy. A blue/green approach for worker nodes minimizes downtime and reduces the risk of update failures.
- Cluster Upgrade Process: In future upgrades, bring worker nodes to a compatible version before upgrading the cluster control plane. This keeps versions aligned and reduces the risk of compatibility issues.
- CloudWatch Logs Deep Dive: Analyze the CloudWatch logs more thoroughly, focusing on the GPU node's startup process and any errors it reports while trying to join the cluster.
- IAM Role Review: Double-check the IAM roles attached to the GPU nodes to confirm they have all necessary permissions, including any GPU-specific requirements (see the log and IAM checks after this list).
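For the first two items, a managed node group's AMI update can be driven from the CLI. This is only a sketch, with my-cluster and gpu-nodes as placeholder names, and it assumes an EKS managed node group rather than self-managed nodes:
# Move the node group up one minor version at a time (it cannot be newer than the control plane)
aws eks update-nodegroup-version --cluster-name my-cluster --nodegroup-name gpu-nodes --kubernetes-version 1.26
# Track progress with the update id returned by the previous command; repeat for the next version once it reports Successful
aws eks describe-update --name my-cluster --nodegroup-name gpu-nodes --update-id <update-id>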
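For the add-on compatibility item, compatible versions can be listed and applied from the CLI as well; vpc-cni is shown as an example and the cluster name is a placeholder:
# List VPC CNI add-on versions that support Kubernetes 1.28
aws eks describe-addon-versions --addon-name vpc-cni --kubernetes-version 1.28 --query 'addons[].addonVersions[].addonVersion'
# Apply a compatible version once you have chosen one
aws eks update-addon --cluster-name my-cluster --addon-name vpc-cni --addon-version <compatible-version>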
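For the log and IAM items, two quick checks; they assume control plane logging is enabled on the cluster, AWS CLI v2 for aws logs tail, and placeholder names for the cluster and node role:
# Follow the control plane logs while the GPU node tries to join
aws logs tail /aws/eks/my-cluster/cluster --follow
# Confirm the node instance role still has the expected managed policies attached
# (typically AmazonEKSWorkerNodePolicy, AmazonEC2ContainerRegistryReadOnly and AmazonEKS_CNI_Policy)
aws iam list-attached-role-policies --role-name eksNodeRole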
If these steps do not resolve the issue, you may need to consider creating a new node group with the correct AMI version and gradually migrating workloads to the new group. In extreme cases, you might need to create a new cluster with the desired version and migrate your workloads to ensure a clean, up-to-date environment.
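A minimal sketch of that migration for a managed GPU node group; every name, subnet ID, and role ARN below is a placeholder, and the instance type and AMI type should match your existing GPU nodes:
# Create a replacement node group on a GPU AMI (its version defaults to the cluster's current version)
aws eks create-nodegroup --cluster-name my-cluster --nodegroup-name gpu-nodes-new --node-role arn:aws:iam::111122223333:role/eksNodeRole --subnets subnet-aaaa1111 subnet-bbbb2222 --instance-types p3.2xlarge --ami-type AL2_x86_64_GPU --scaling-config minSize=1,maxSize=3,desiredSize=1
# Once the new nodes are Ready, cordon and drain the old group so workloads reschedule onto it
kubectl cordon -l eks.amazonaws.com/nodegroup=gpu-nodes
kubectl drain -l eks.amazonaws.com/nodegroup=gpu-nodes --ignore-daemonsets --delete-emptydir-data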
Sources
Update existing cluster to new Kubernetes version - Amazon EKS
UpdateClusterVersion - Amazon EKS
Updating an Amazon EKS add-on - Amazon EKS
Community | Amazon EKS Upgrade Guidance (v1.25 to 1.26)
Hello,
If your cluster has node groups with GPU support (for example, p3.2xlarge), you must update the NVIDIA device plugin for Kubernetes DaemonSet on your cluster. Replace vX.X.X with your desired NVIDIA/k8s-device-plugin version before running the following command.
kubectl apply -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/vX.X.X/deployments/static/nvidia-device-plugin.yml
More info: https://docs.aws.amazon.com/eks/latest/userguide/update-cluster.html#step4
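After applying the manifest, it is worth confirming that the plugin is running and that the GPU node advertises its GPUs again; the DaemonSet name below matches the default in that manifest, so adjust it if yours differs:
# Plugin DaemonSet should show the GPU nodes as desired/ready
kubectl get daemonset nvidia-device-plugin-daemonset -n kube-system
# The node should report nvidia.com/gpu under Capacity and Allocatable (replace the node name)
kubectl describe node <your-gpu-node-name> | grep -i nvidia.com/gpu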