ECS Fargate service is getting scaled up despite no autoscaling policies being defined

I have an ECS cluster and service. The service polls for data from an external system; there is no load balancer and no API. CPU usage is typically around 5-10%, but once a month or so it may briefly peak at 70% or more. I've noticed that when this happens, another instance of the service gets spun up and the old instance is shut down a minute or two later. This has caused problems because the service was never designed to run more than one instance: the two concurrent instances compete for resources and cause race conditions while working on the same data. I'm trying to understand why it's scaling and how to configure it not to.

In the console, the service shows 1 desired task and no Auto Scaling resources configured. The cluster has no default capacity provider strategy and no capacity providers specified. Someone on Stack Overflow suggested that ECS may be treating the service as unhealthy and replacing it with a new instance. However, the docs I've been reading about health checks don't seem to apply in my case. I have no explicit health checks defined, and there is no HTTP endpoint for ECS to query. The service also continues running without errors during the high CPU load, processes the data, and resumes idle polling until a graceful shutdown is eventually initiated (presumably) by ECS.

I've found no hints or clues in the logs. The application logs have no errors or warnings, just standard business logic logging. I can see the second instance start initializing and enter a running state, polling for data and potentially processing it. Roughly 1-2 minutes later I can see the first instance start to shut down gracefully, cleaning up resources and so on.

In the service deployment logs I can see "service MyService has started 1 tasks: task <instance 2 id>" and then, 1 minute later, "service MyService has stopped 1 running tasks: task <instance 1 id>".

I went back 6 months and found 5 cases of noticeably high CPU spikes. In all 5 cases this new task instance was spun up right after. Where can I look for more clues as to why this is happening?

2 Answers

I suggest checking your Amazon CloudWatch alarms, because auto scaling is often triggered by CloudWatch alarms. If any alarms are set up on CPU usage, they might be scaling your service. You could also refer to this AWS documentation to troubleshoot the issue: https://docs.aws.amazon.com/AmazonECS/latest/developerguide/troubleshoot-service-auto-scaling.html

I hope this clarifies things. If it does, I would appreciate the answer being accepted so that the community can benefit. Thanks ;)

EXPERT
answered 3 months ago
  • Good suggestion, but there are no alarms configured for this service.

Accepted Answer

Sometimes, Fargate tasks need to be replaced spontaneously. This can happen due to several causes:

  • An underlying hardware degradation has occurred, and so the task must be migrated to a replacement instance
  • To ensure customers are secure, we must periodically apply patches to the underlying hardware, OS, and/or container runtime. Sometimes these require tasks to be migrated to a replacement instance or restarted.

Fargate will make every effort to respect your service's deployment rules, so one thing you can do is set your deployment configuration's maximumPercent to 100 and its minimumHealthyPercent to 0. With a desired count of 1, these values mean a replacement that follows the deployment rules has to stop the old task before it starts the new one, instead of starting the new one first. However, this is not a completely safe solution, as there are still situations in which more than one task could be running. For so-called singleton services such as yours, we strongly advise customers to employ a distributed lock (basically, a mutex shared across different tasks) so that only one task is able to process work at a time. An inexpensive distributed lock pattern that uses DynamoDB can be found here. Redis, via Amazon ElastiCache, can also be used to implement a distributed lock.
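To make the distributed lock idea more concrete, here is a minimal sketch of a lease-based lock in Python. It is not the DynamoDB pattern referred to above and not an official implementation. The names InMemoryLeaseStore, LeaseLock, put_if_free and delete_if_owner are hypothetical, and the in-memory store only stands in for a shared table with conditional writes (for example DynamoDB or Redis), so that the acquire, renew and release logic can be read and run on its own.

```python
import threading
import time
import uuid
from typing import Callable, Dict, Tuple


class InMemoryLeaseStore:
    """Hypothetical stand-in for a shared table that supports conditional writes."""

    def __init__(self) -> None:
        # lock name -> (owner id, expiry timestamp)
        self._items: Dict[str, Tuple[str, float]] = {}
        # Mimics the atomicity a real conditional write would provide.
        self._guard = threading.Lock()

    def put_if_free(self, name: str, owner: str, now: float, ttl: float) -> bool:
        """Write the lease only if it is absent, expired, or already held by owner."""
        with self._guard:
            current = self._items.get(name)
            if current is not None:
                holder, expires_at = current
                if holder != owner and expires_at > now:
                    return False  # another task holds a live lease
            self._items[name] = (owner, now + ttl)
            return True

    def delete_if_owner(self, name: str, owner: str) -> bool:
        """Remove the lease only if owner currently holds it."""
        with self._guard:
            current = self._items.get(name)
            if current is not None and current[0] == owner:
                del self._items[name]
                return True
            return False


class LeaseLock:
    """One instance per task; each task gets its own random owner id."""

    def __init__(self, store: InMemoryLeaseStore, name: str,
                 ttl_seconds: float = 30.0,
                 clock: Callable[[], float] = time.time) -> None:
        self.store = store
        self.name = name
        self.ttl = ttl_seconds
        self.clock = clock
        self.owner = uuid.uuid4().hex

    def try_acquire(self) -> bool:
        """Acquire the lease, or renew it if this task already holds it."""
        return self.store.put_if_free(self.name, self.owner, self.clock(), self.ttl)

    def release(self) -> bool:
        """Give up the lease, e.g. during graceful shutdown."""
        return self.store.delete_if_owner(self.name, self.owner)


if __name__ == "__main__":
    store = InMemoryLeaseStore()
    old_task = LeaseLock(store, "poller")
    new_task = LeaseLock(store, "poller")

    print(old_task.try_acquire())  # True: the old task holds the lease
    print(new_task.try_acquire())  # False: the replacement must wait
    old_task.release()             # old task shuts down gracefully
    print(new_task.try_acquire())  # True: the replacement takes over
```

In a polling service, each task would call try_acquire() before every polling cycle and skip the cycle when it returns False. The lease has a time-to-live so that a task that dies without releasing it cannot block its replacement forever, which also means the holder needs to renew the lease more often than the TTL while it is working.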

AWS
EXPERT
answered 3 months ago
  • Thanks, I'll add the percentage config and look into setting up a lock.
