ECS tasks: CPU hard limit without increasing CPU reservation

Hi :)

I am currently trying to resolve an issue with our ECS EC2-based cluster. Our task definitions use the container-level soft CPU limit (set to 50 CPU units) but do not use the task-level hard CPU limit. We have more than 2000 services running, each with a single task. New revisions of these tasks are redeployed frequently and often at the same time, and during startup they can reach CPU usage above 1000%. This makes entire EC2 instances unresponsive, so the whole machine has to be restarted. We have worked around this with an alarm and a Lambda that quickly reboots failing, unresponsive instances.
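
For reference, a stripped-down version of what we register today looks roughly like this (sketch using boto3; the family, image, and memory values are placeholders, and the only relevant parts are the 50-unit container-level cpu and the absence of a task-level cpu):

```python
import boto3

ecs = boto3.client("ecs")

# Sketch of the current setup: container-level soft limit only,
# no task-level hard "cpu" value.
ecs.register_task_definition(
    family="example-service",           # placeholder
    requiresCompatibilities=["EC2"],
    containerDefinitions=[
        {
            "name": "app",
            "image": "example/app:latest",  # placeholder
            "cpu": 50,        # soft limit: 50 CPU units (CPU shares)
            "memory": 256,    # placeholder
            "essential": True,
        }
    ],
    # no task-level "cpu" here, so the task has no hard CPU cap
)
```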

But this is only a temporary fix, not a solution. What we would like is to limit the CPU usage of each task so that it cannot exceed the soft limit by a factor of 10 or more. I have found a way to do this using the hard CPU limit, but that approach is also not great, mainly for the following reasons:

  1. Even with the soft limit, our tasks use at most 50% of that reservation, but the minimum value for the hard limit on ECS with EC2 is 128 units (compared to the current 50).
  2. The hard limit automatically increases the reservation value for the task, which means that setting it to 128 for all 2000+ services/tasks would require us to host more than double the number of EC2 instances (see the rough math after this list), without any real gain, since our cluster's CPU usage currently hovers around 5-10%.
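
To put rough numbers on point 2 (back-of-the-envelope only, using the fact that ECS counts 1024 CPU units per vCPU):

```python
# Rough reservation math: ECS counts 1024 CPU units per vCPU.
SERVICES = 2000
SOFT_UNITS = 50     # what each task reserves today
HARD_UNITS = 128    # minimum task-level hard limit on EC2

today = SERVICES * SOFT_UNITS / 1024      # ~98 vCPUs reserved
with_hard = SERVICES * HARD_UNITS / 1024  # ~250 vCPUs reserved

print(f"today:          ~{today:.0f} vCPUs reserved")
print(f"with 128 units: ~{with_hard:.0f} vCPUs ({with_hard / today:.1f}x)")
```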

So my question is: is there a way to limit the maximum CPU usage of each task/container without using the task-level hard limit? Our EC2 instances run Ubuntu.
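
For concreteness, the per-container effect I am after is roughly what Docker's CFS quota gives you. A purely hypothetical host-level sketch (using the Docker SDK for Python; the 0.5 vCPU cap, roughly 10x our 50-unit reservation, and applying it to every running container are my own assumptions, and this is not an ECS-native setting):

```python
import docker

# Hypothetical illustration of the desired effect, applied directly on the
# host rather than through ECS: cap each running container at ~0.5 vCPU via
# the CFS quota, leaving the 50-unit soft limit (CPU shares) as it is.
client = docker.from_env()

for container in client.containers.list():
    container.update(
        cpu_period=100_000,  # CFS period in microseconds
        cpu_quota=50_000,    # 50 ms per 100 ms period ≈ 0.5 vCPU hard cap
    )
```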

asked a year ago · 125 views
No Answers
