Hello ChrisAth,
I believe you are on the right path for this use case. Since you already have a CloudWatch alarm that detects when your nodes are in the NotReady or Unknown state, the next step is to set up an EventBridge rule.
When you create the EventBridge rule, select AWS service as the target type and Lambda function as the target. That way, whenever the alarm changes state, the EventBridge rule invokes the Lambda function you specified.
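If you would rather script the rule than click through the console, here is a rough sketch using boto3. The event pattern fields (`aws.cloudwatch`, `CloudWatch Alarm State Change`) are the standard ones for alarm events; the rule name, alarm name, target ID, and Lambda ARN are placeholders I made up, so substitute your own.

```python
import json


def build_alarm_event_pattern(alarm_name: str) -> dict:
    """Event pattern matching one CloudWatch alarm entering the ALARM state."""
    return {
        "source": ["aws.cloudwatch"],
        "detail-type": ["CloudWatch Alarm State Change"],
        "detail": {
            "alarmName": [alarm_name],
            "state": {"value": ["ALARM"]},
        },
    }


def create_rule(events_client, rule_name: str, alarm_name: str, lambda_arn: str) -> dict:
    """Create the rule and attach the Lambda function as its target.

    events_client is expected to be boto3.client("events").
    """
    pattern = build_alarm_event_pattern(alarm_name)
    events_client.put_rule(
        Name=rule_name,
        EventPattern=json.dumps(pattern),
        State="ENABLED",
    )
    events_client.put_targets(
        Rule=rule_name,
        # "reboot-lambda" is an arbitrary target ID chosen for this example.
        Targets=[{"Id": "reboot-lambda", "Arn": lambda_arn}],
    )
    return pattern


# Example call (names are placeholders):
#   import boto3
#   create_rule(boto3.client("events"), "node-not-ready-rule",
#               "my-node-not-ready-alarm", "<your Lambda function ARN>")
```

Keep in mind that when the rule is created through the API rather than the console, the Lambda function also needs a resource-based permission that allows events.amazonaws.com to invoke it.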
Within the Lambda function, you can retrieve the node name from the event received and call your EKS cluster's Kubernetes API server to check the node's status.
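As a minimal sketch of the first half of that step, the handler below pulls the node name out of the alarm event. It assumes your alarm is built on a metric that carries a `NodeName` dimension (as Container Insights node metrics do); if your metric uses a different dimension, pass that name instead. The return values are just illustrative.

```python
from typing import Optional


def node_name_from_alarm_event(event: dict, dimension: str = "NodeName") -> Optional[str]:
    """Return the node name from a CloudWatch Alarm State Change event, if present.

    The dimension name is an assumption; it depends on the metric behind the alarm.
    """
    metrics = event.get("detail", {}).get("configuration", {}).get("metrics", [])
    for metric in metrics:
        dimensions = metric.get("metricStat", {}).get("metric", {}).get("dimensions", {})
        if dimension in dimensions:
            return dimensions[dimension]
    return None


def lambda_handler(event, context):
    node_name = node_name_from_alarm_event(event)
    if node_name is None:
        # Nothing to act on, e.g. the alarm uses a metric math expression only.
        return {"action": "skipped", "reason": "no node dimension in event"}
    # Next step (not shown): query the Kubernetes API server for this node.
    return {"action": "check_node", "node": node_name}
```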
If the node status is NotReady or Unknown, retrieve its node labels and get its instance ID (make sure that your Kubernetes nodes are labeled with their corresponding instance IDs).
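Here is one way that check could look once you have the node object as a dict (the JSON you get back from the Kubernetes API for a node). The label key `example.com/instance-id` is a hypothetical name I picked for illustration; use whatever label you apply to your nodes. As a fallback, the sketch also tries `spec.providerID`, which on AWS nodes normally ends with the instance ID. That fallback is my addition, not something you have to rely on.

```python
from typing import Optional

# Hypothetical label key; replace with the label you actually set on your nodes.
INSTANCE_ID_LABEL = "example.com/instance-id"


def ready_status(node: dict) -> str:
    """Return the status of the node's Ready condition: "True", "False" or "Unknown"."""
    for condition in node.get("status", {}).get("conditions", []):
        if condition.get("type") == "Ready":
            return condition.get("status", "Unknown")
    # No Ready condition reported at all: treat it as Unknown.
    return "Unknown"


def needs_reboot(node: dict) -> bool:
    """A node shown as NotReady has Ready=False; Unknown means the kubelet stopped reporting."""
    return ready_status(node) in ("False", "Unknown")


def instance_id(node: dict) -> Optional[str]:
    """Read the instance ID from the node label, falling back to spec.providerID."""
    label = node.get("metadata", {}).get("labels", {}).get(INSTANCE_ID_LABEL)
    if label:
        return label
    # providerID on AWS usually looks like "aws:///<availability-zone>/<instance-id>".
    candidate = node.get("spec", {}).get("providerID", "").rsplit("/", 1)[-1]
    return candidate if candidate.startswith("i-") else None
```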
Finally, make a RebootInstances API call to reboot the instance. I'd advise that you set up an SES email notification so that you know whenever the Lambda function succeeds or fails.
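A rough sketch of the reboot and notification step with boto3 is below. The function name and the email wording are my own; the sender address has to be an identity you have verified in SES.

```python
def reboot_and_notify(ec2_client, ses_client, instance_id: str, sender: str, recipient: str) -> bool:
    """Request a reboot and email the outcome. Returns True if the request was accepted.

    ec2_client and ses_client are expected to be boto3.client("ec2") and boto3.client("ses").
    """
    try:
        ec2_client.reboot_instances(InstanceIds=[instance_id])
        subject = f"Reboot requested for {instance_id}"
        body = f"A reboot request for instance {instance_id} was accepted."
        accepted = True
    except Exception as exc:  # report any failure instead of failing silently
        subject = f"Reboot FAILED for {instance_id}"
        body = f"Could not reboot instance {instance_id}: {exc}"
        accepted = False

    ses_client.send_email(
        Source=sender,
        Destination={"ToAddresses": [recipient]},
        Message={
            "Subject": {"Data": subject},
            "Body": {"Text": {"Data": body}},
        },
    )
    return accepted
```

Note that RebootInstances is asynchronous: a successful call only means the reboot request was queued, not that the node came back healthy, so it is worth re-checking the node status afterwards.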
Note: A NotReady node status can happen for a variety of reasons, and some failures persist even after a reboot. If that happens, your workflow might go into an infinite loop, rebooting a faulty node over and over without success.
To mitigate this, I suggest that you store the instance IDs you have rebooted in a database (e.g. DynamoDB) so that you can check whether a node has already been rebooted recently and enforce a max-tries condition. After max-tries on a particular node, stop rebooting and send an SES email notification so that you can investigate further.
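To make the max-tries idea concrete, here is a minimal sketch against a DynamoDB table. The table layout (partition key `instance_id`, plus `attempts` and `window_start` attributes) and the defaults of 3 tries per hour are assumptions of mine; tune them to your environment.

```python
import time
from typing import Optional


def record_attempt(table, instance_id: str, max_tries: int = 3,
                   window_seconds: int = 3600, now: Optional[int] = None) -> bool:
    """Return True and record the attempt if a reboot is still allowed.

    table is expected to be boto3.resource("dynamodb").Table(<your table name>).
    Returns False once max_tries was reached inside the current time window.
    """
    now = int(time.time()) if now is None else now
    item = table.get_item(Key={"instance_id": instance_id}).get("Item")

    if item and now - int(item["window_start"]) < window_seconds:
        # Still inside the window: continue counting from the stored value.
        attempts, window_start = int(item["attempts"]), int(item["window_start"])
    else:
        # First attempt, or the previous window expired: start a fresh count.
        attempts, window_start = 0, now

    if attempts >= max_tries:
        return False  # caller should stop rebooting and send the SES alert

    table.put_item(Item={
        "instance_id": instance_id,
        "attempts": attempts + 1,
        "window_start": window_start,
    })
    return True
```

This read-then-write approach is not atomic. If the function can be invoked concurrently for the same instance, a conditional update in DynamoDB would be the safer choice.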
If your nodes are accessible through SSM, you can run the AWSSupport-CollectEKSInstanceLogs automation runbook on the failed node to collect the logs needed for further troubleshooting.
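If you want the Lambda function to kick off the runbook itself, something along these lines should work. The parameter names `EKSInstanceId` and `LogDestination` are what I understand the runbook to accept, so please verify them against the runbook's documentation before relying on this; the function name is my own.

```python
from typing import Optional


def collect_eks_logs(ssm_client, instance_id: str, log_bucket: Optional[str] = None) -> str:
    """Start the log-collection runbook and return the automation execution ID.

    ssm_client is expected to be boto3.client("ssm").
    Parameter names are assumed from the runbook documentation; double-check them.
    """
    parameters = {"EKSInstanceId": [instance_id]}
    if log_bucket:
        # Optional S3 bucket that receives the collected logs.
        parameters["LogDestination"] = [log_bucket]

    response = ssm_client.start_automation_execution(
        DocumentName="AWSSupport-CollectEKSInstanceLogs",
        Parameters=parameters,
    )
    return response["AutomationExecutionId"]
```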
I hope this is helpful. Please leave a comment if you have any questions.
Thank you!