Custom controller to monitor the node state and make AWS API calls to reboot EKS node


Hi! I'd like some help creating a custom controller that monitors the status of EKS worker nodes and reboots a node when it goes into the "NotReady" or "Unknown" state. I already have an alarm in CloudWatch that triggers when a node fails. As far as I know, there is no native AWS solution for resetting a failed node. So my idea is that when the alarm (let's call it "EKSfailNodeCount") goes off, it triggers a Lambda function, which gets the node status from the cluster, finds the instance IDs of the NotReady/Unknown nodes, and makes an EC2 RebootInstances API call.

So the custom solution will look like this:

A node goes into the "NotReady/Unknown" state ==> the alarm for the "EKSfailNodeCount" metric is triggered ==> based on the alarm, EventBridge triggers a Lambda function ==> the Lambda function runs a script that executes kubectl commands to find the not-ready nodes and calls the AWS API to reboot the instance.

This flow would let us reboot the EC2 instance whenever it goes into the "NotReady" state.

Does anyone have ideas on how to achieve this?


1 Answer

Hello ChrisAth,

I believe you are on the right path to achieving your use case. Since you already have a CloudWatch alarm that detects when your nodes are in the NotReady or Unknown state, the next step is to set up an EventBridge rule.

When creating the EventBridge rule, select AWS Service as the target type and Lambda function as the target. This ensures that whenever your alarm changes state, the EventBridge rule invokes the Lambda function you specified.
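As a rough sketch of that rule, the snippet below builds an event pattern that matches the "CloudWatch Alarm State Change" event for one alarm entering the ALARM state, and registers it with boto3. The alarm name comes from the question; the rule name, target ID, and Lambda ARN are placeholders I made up for illustration.

```python
import json

# Alarm name taken from the question; replace with your own.
ALARM_NAME = "EKSfailNodeCount"


def build_event_pattern(alarm_name: str) -> dict:
    """Match only transitions of the given alarm into the ALARM state."""
    return {
        "source": ["aws.cloudwatch"],
        "detail-type": ["CloudWatch Alarm State Change"],
        "detail": {
            "alarmName": [alarm_name],
            "state": {"value": ["ALARM"]},
        },
    }


def create_rule(rule_name: str, lambda_arn: str, alarm_name: str = ALARM_NAME) -> None:
    """Create the rule and point it at the Lambda function (hypothetical names)."""
    import boto3  # imported lazily so the pattern builder works without boto3

    events = boto3.client("events")
    events.put_rule(
        Name=rule_name,
        EventPattern=json.dumps(build_event_pattern(alarm_name)),
        State="ENABLED",
    )
    events.put_targets(
        Rule=rule_name,
        Targets=[{"Id": "reboot-lambda", "Arn": lambda_arn}],
    )
```

When the rule is created outside the console, the Lambda function also needs a resource-based permission that allows events.amazonaws.com to invoke it.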

Within the Lambda function, you can retrieve the node name from the event it receives and make an API call to the EKS cluster's Kubernetes API server to check the node's status.
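Here is a minimal sketch of that check. It assumes the alarm event may not carry a node name (a count-style metric such as EKSfailNodeCount would not), so it inspects every node instead. The input is the node list in the JSON shape returned by the Kubernetes API (the same shape as `kubectl get nodes -o json`). Fetching that list from inside Lambda, including authentication to the cluster, is left out, and the function names are my own.

```python
from typing import List


def alarm_is_firing(event: dict) -> bool:
    """True if the EventBridge event is an alarm transition into ALARM."""
    return (
        event.get("detail-type") == "CloudWatch Alarm State Change"
        and event.get("detail", {}).get("state", {}).get("value") == "ALARM"
    )


def unhealthy_node_names(node_list: dict) -> List[str]:
    """Return names of nodes whose Ready condition is not "True".

    A Ready status of "False" or "Unknown", or a missing Ready condition,
    is treated as unhealthy.
    """
    names = []
    for node in node_list.get("items", []):
        conditions = node.get("status", {}).get("conditions", [])
        ready = next((c for c in conditions if c.get("type") == "Ready"), None)
        if ready is None or ready.get("status") != "True":
            names.append(node["metadata"]["name"])
    return names
```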

If the node status is NotReady or Unknown, retrieve its node labels and get its instance ID (make sure that your Kubernetes nodes are tagged with their corresponding instance IDs).
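If you would rather not maintain a label or tag, one possible alternative is to read the instance ID from the node's `spec.providerID`. On EC2-backed nodes this usually looks like `aws:///us-east-1a/i-0123456789abcdef0`. This is a sketch that assumes that format, not the only way to do the mapping; the function name is hypothetical.

```python
import re
from typing import Optional

# EC2 instance IDs are "i-" followed by 8 or 17 hex characters.
_INSTANCE_ID = re.compile(r"(i-(?:[0-9a-f]{17}|[0-9a-f]{8}))$")


def instance_id_from_node(node: dict) -> Optional[str]:
    """Extract the EC2 instance ID from a node object's spec.providerID.

    Returns None when the providerID is missing or does not end in an
    instance ID (for example, on nodes that are not backed by EC2 instances).
    """
    provider_id = node.get("spec", {}).get("providerID", "")
    match = _INSTANCE_ID.search(provider_id)
    return match.group(1) if match else None
```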

Finally, make a RebootInstances API call to reboot the instance. I'd advise setting up an SES email notification so that you are told whenever the Lambda function succeeds or fails.
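The reboot and notification steps could look roughly like this with boto3. The sender and recipient addresses are whatever you have verified in SES, and the function names are my own. The clients are passed in as optional arguments only so that the functions can be exercised without AWS access.

```python
from typing import Iterable, List


def reboot_instances(instance_ids: Iterable[str], ec2_client=None) -> List[str]:
    """Request a reboot for the given instances and return the IDs sent."""
    ids = sorted(set(instance_ids))
    if not ids:
        return []  # nothing to do; avoid calling the API with an empty list
    if ec2_client is None:
        import boto3  # lazy import so the function can be tested with a stub

        ec2_client = boto3.client("ec2")
    ec2_client.reboot_instances(InstanceIds=ids)
    return ids


def notify(subject: str, body: str, sender: str, recipient: str, ses_client=None) -> None:
    """Send a plain-text email through SES (addresses must be verified)."""
    if ses_client is None:
        import boto3

        ses_client = boto3.client("ses")
    ses_client.send_email(
        Source=sender,
        Destination={"ToAddresses": [recipient]},
        Message={
            "Subject": {"Data": subject},
            "Body": {"Text": {"Data": body}},
        },
    )
```

The Lambda execution role would need ec2:RebootInstances and ses:SendEmail permissions for these calls to succeed.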

Note: Please be aware that the NotReady node status can occur for a variety of reasons, and some failures might persist even after a reboot. If that happens, your controller workflow might go into an infinite loop, rebooting a faulty node over and over without success.

To mitigate this, I suggest storing the IDs of the instances you reboot in a database (e.g. DynamoDB) so that you can check whether a node has already been rebooted recently and enforce a max-tries condition. After max-tries attempts on a particular node, stop rebooting it and send an SES email notification so that you can investigate further.
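A minimal sketch of that guard, assuming a DynamoDB table keyed on a string attribute named `instance_id` (the table layout, attribute names, and limits are all hypothetical). It uses an atomic counter so that concurrent Lambda invocations do not lose updates, and writes an `expires_at` timestamp that you could configure as the table's TTL attribute so that old records eventually disappear.

```python
import time
from typing import Optional

MAX_TRIES = 3          # assumed limit; tune to taste
WINDOW_SECONDS = 3600  # assumed window for what counts as "recently"


def may_reboot(table, instance_id: str, max_tries: int = MAX_TRIES,
               window_seconds: int = WINDOW_SECONDS,
               now: Optional[int] = None) -> bool:
    """Record one reboot attempt and report whether it is within the limit.

    `table` is a boto3 DynamoDB Table resource, e.g.
    boto3.resource("dynamodb").Table("reboot-attempts") (hypothetical name).
    """
    now = int(time.time()) if now is None else now
    response = table.update_item(
        Key={"instance_id": instance_id},
        # ADD creates the counter at 1 if the item does not exist yet.
        UpdateExpression="ADD #attempts :one SET #expires = :expires",
        ExpressionAttributeNames={"#attempts": "attempts", "#expires": "expires_at"},
        ExpressionAttributeValues={":one": 1, ":expires": now + window_seconds},
        ReturnValues="UPDATED_NEW",
    )
    attempts = int(response["Attributes"]["attempts"])
    return attempts <= max_tries
```

When `may_reboot` returns False, the Lambda function would skip the reboot and send the notification instead. Note that each attempt pushes the expiry forward, and that DynamoDB TTL deletion is not immediate, so treat the window as approximate.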

If your nodes are accessible through SSM, you can run the AWSSupport-CollectEKSInstanceLogs automation runbook on the failed node so that it collects the logs needed for further troubleshooting.
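Starting that runbook from the same Lambda function might look like the sketch below. The parameter names `EKSInstanceId` and `LogDestination` are what I recall from the runbook's documentation, so please treat them as assumptions and verify them against the runbook reference before relying on this. The bucket name is a placeholder.

```python
from typing import Optional


def collect_eks_logs(instance_id: str, log_bucket: Optional[str] = None,
                     ssm_client=None) -> str:
    """Start the log-collection runbook and return the execution ID."""
    if ssm_client is None:
        import boto3  # lazy import so the function can be tested with a stub

        ssm_client = boto3.client("ssm")
    # Automation parameters are passed as lists of strings.
    parameters = {"EKSInstanceId": [instance_id]}  # assumed parameter name
    if log_bucket:
        parameters["LogDestination"] = [log_bucket]  # assumed parameter name
    response = ssm_client.start_automation_execution(
        DocumentName="AWSSupport-CollectEKSInstanceLogs",
        Parameters=parameters,
    )
    return response["AutomationExecutionId"]
```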

I hope this is helpful. Please leave a comment if you have any questions.

Thank you!

answered 10 days ago
