I believe you are on the right path for your use case. Since you already have a CloudWatch alarm that detects when your nodes are in the Not Ready or Unknown state, the next step is to set up an EventBridge rule.
When you create the EventBridge rule, select AWS service as the target type and Lambda function as the target. Whenever the alarm changes state, the rule then invokes the Lambda function you specified.
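If you would rather create the rule from code than from the console, the following is a minimal sketch using boto3. The rule name, target ID, alarm names and Lambda ARN are placeholders of my own, and `events_client` is assumed to be `boto3.client("events")`.

```python
import json


def build_alarm_event_pattern(alarm_names):
    """Return an EventBridge pattern matching the given alarms entering ALARM."""
    return {
        "source": ["aws.cloudwatch"],
        "detail-type": ["CloudWatch Alarm State Change"],
        "detail": {
            "alarmName": list(alarm_names),
            "state": {"value": ["ALARM"]},
        },
    }


def create_rule(events_client, rule_name, alarm_names, lambda_arn):
    """Create the rule and attach the Lambda function as its target.

    events_client is assumed to be boto3.client("events").
    """
    pattern = build_alarm_event_pattern(alarm_names)
    events_client.put_rule(
        Name=rule_name,
        EventPattern=json.dumps(pattern),
        State="ENABLED",
    )
    events_client.put_targets(
        Rule=rule_name,
        # "reboot-node-lambda" is a hypothetical target ID.
        Targets=[{"Id": "reboot-node-lambda", "Arn": lambda_arn}],
    )
    return pattern
```

Note that the console adds the resource-based permission that lets EventBridge invoke the function for you. When creating the rule from code, you have to grant that permission on the Lambda function yourself.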
Within the Lambda function, you can retrieve the node name from the received event and make an API call to the EKS cluster's Kubernetes API server to check the node's status.
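As a rough sketch of the first part of that handler, the code below pulls the node name out of a CloudWatch Alarm State Change event. It assumes your alarm is defined on a metric that carries the node name in a dimension called `NodeName` (as Container Insights node metrics do); if your alarm uses a different dimension, pass its name instead.

```python
def extract_node_name(event, dimension="NodeName"):
    """Return the node name from the alarm's metric dimensions, or None.

    Events that are not a transition into ALARM are ignored.
    """
    detail = event.get("detail", {})
    if detail.get("state", {}).get("value") != "ALARM":
        return None
    for metric in detail.get("configuration", {}).get("metrics", []):
        dimensions = (
            metric.get("metricStat", {}).get("metric", {}).get("dimensions", {})
        )
        if dimension in dimensions:
            return dimensions[dimension]
    return None


def lambda_handler(event, context):
    node_name = extract_node_name(event)
    if node_name is None:
        return {"action": "ignored"}
    # The Kubernetes API call to check the node would go here.
    return {"action": "check_node", "node": node_name}
```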
If the node status is Not Ready or Unknown, retrieve its node labels and get its instance ID (make sure that your Kubernetes nodes are labeled with their corresponding instance IDs).
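The check itself could look something like the sketch below, which works on the node object as a plain dict (the JSON the Kubernetes API returns for a node). The label key `example.com/instance-id` is a made-up placeholder for whatever label you apply. As a fallback, the sketch also tries `spec.providerID`, which on EC2-backed nodes normally ends with the instance ID; treat that as an assumption to verify on your own cluster.

```python
import re

# EC2 instance IDs are "i-" followed by 8 or 17 hex characters.
_INSTANCE_ID = re.compile(r"i-[0-9a-f]{8,17}$")


def ready_status(node):
    """Return the status of the Ready condition: "True", "False" or "Unknown"."""
    for condition in node.get("status", {}).get("conditions", []):
        if condition.get("type") == "Ready":
            return condition.get("status", "Unknown")
    return "Unknown"


def needs_reboot(node):
    """A node shown as NotReady reports Ready as "False" or "Unknown"."""
    return ready_status(node) in ("False", "Unknown")


def instance_id_from_node(node, label_key="example.com/instance-id"):
    """Look up the instance ID from a label, then fall back to providerID."""
    labels = node.get("metadata", {}).get("labels", {})
    if label_key in labels:
        return labels[label_key]
    provider_id = node.get("spec", {}).get("providerID", "")
    match = _INSTANCE_ID.search(provider_id)
    return match.group(0) if match else None
```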
Finally, perform a RebootInstances API call to reboot the instance. I'd advise that you set up an SES email notification to notify you whenever the Lambda function succeeds or fails.
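A minimal sketch of the reboot-and-notify step follows. The `ec2` and `ses` arguments are assumed to be `boto3.client("ec2")` and `boto3.client("ses")`, and the sender and recipient addresses are placeholders (the sender must be an identity verified in SES).

```python
def reboot_and_notify(ec2, ses, instance_id, sender, recipient):
    """Request a reboot of one instance and email the outcome.

    Returns True if the RebootInstances request was accepted.
    """
    try:
        ec2.reboot_instances(InstanceIds=[instance_id])
        ok = True
        subject = f"Reboot requested for {instance_id}"
        body = "The RebootInstances request was accepted."
    except Exception as exc:  # boto3 raises ClientError; kept broad for the sketch
        ok = False
        subject = f"Reboot FAILED for {instance_id}"
        body = f"RebootInstances raised: {exc}"
    ses.send_email(
        Source=sender,
        Destination={"ToAddresses": [recipient]},
        Message={
            "Subject": {"Data": subject},
            "Body": {"Text": {"Data": body}},
        },
    )
    return ok
```

Keep in mind that RebootInstances is asynchronous: a successful call only means the request was queued, not that the node has come back as Ready.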
Note: Please be advised that a Not Ready node status can happen for a variety of reasons, and some failures might persist even after a reboot. If that happens, your controller workflow might go into an infinite loop, trying to reboot a faulty node over and over without success.
To mitigate this, I suggest that you store the instance IDs that you rebooted in a database (e.g. DynamoDB) so that you can check whether the node has already been rebooted recently, and set a max-tries condition. After max-tries on a particular node, stop rebooting and send an SES email notification so that you can investigate further.
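One way to implement the max-tries check is an atomic counter per instance, sketched below. It assumes `table` is a boto3 DynamoDB `Table` resource whose partition key is a string attribute named `instance_id`; the table layout, the `attempts` attribute and the default of 3 tries are my own choices, not anything prescribed.

```python
def record_attempt(table, instance_id):
    """Atomically increment and return the reboot count for an instance."""
    response = table.update_item(
        Key={"instance_id": instance_id},
        # ADD creates the attribute with value 1 if the item does not exist yet.
        UpdateExpression="ADD attempts :one",
        ExpressionAttributeValues={":one": 1},
        ReturnValues="UPDATED_NEW",
    )
    return int(response["Attributes"]["attempts"])


def decide(table, instance_id, max_tries=3):
    """Return "reboot" while under the limit, otherwise "escalate"."""
    attempts = record_attempt(table, instance_id)
    return "reboot" if attempts <= max_tries else "escalate"
```

With `max_tries=3`, the first three alarms for an instance return `"reboot"` and the fourth returns `"escalate"`, at which point you would send the SES notification instead of rebooting. The sketch never resets the counter; you would probably want to delete the item once the node is Ready again, or expire it with a TTL attribute.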
If your nodes are reachable through SSM, you can run the AWSSupport-CollectEKSInstanceLogs automation runbook on the failed node so that it collects the logs needed for further troubleshooting.
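Starting the runbook from the same Lambda function might look like the sketch below, where `ssm` is assumed to be `boto3.client("ssm")`. The parameter names `EKSInstanceId` and `LogDestination` are how I recall them from the runbook's documentation, so please check them against the document in your account before relying on this.

```python
def collect_node_logs(ssm, instance_id, log_bucket=None):
    """Start AWSSupport-CollectEKSInstanceLogs and return the execution ID."""
    # Automation parameters are passed as lists of strings.
    parameters = {"EKSInstanceId": [instance_id]}  # assumed parameter name
    if log_bucket:
        parameters["LogDestination"] = [log_bucket]  # assumed parameter name
    response = ssm.start_automation_execution(
        DocumentName="AWSSupport-CollectEKSInstanceLogs",
        Parameters=parameters,
    )
    return response["AutomationExecutionId"]
```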
I hope this is helpful. Please leave a comment if you have any questions.