Memory usage increase on ECS services after migrating to amazon linux 2023

0

Hello, we recently migrated our ECS clusters from Amazon Linux 2 to Amazon Linux 2023, and we are seeing a large increase in memory usage on all of our containers.

We use ECS on EC2 with ECS-optimized AMIs and no custom configuration; we changed nothing except the AMI.

Memory usage on our ECS services almost doubled right after the AMI change.

[Graph: Container memory usage]

This in turn increases memory usage on our EC2 instances.

[Graph: EC2 memory usage]

We observe this behavior on all of our services in all clusters (around 40 services across 12 clusters). Our applications still run correctly and we have not seen any OOM kills or other issues so far, which leads me to believe this is a matter of different memory management between the two OS versions rather than a real overconsumption problem.

Is there anything that could explain this issue?

3 Answers
0

Hi,

Do you have old versions of some packages in the ECS container images, or is every package up to date?

I have seen similar issues caused by old versions of some packages not working well with a newer Linux kernel.

Best,

Didier

EXPERT
answered 2 years ago
  • Thanks for your answer. Our base images are "amazoncorretto:21.0.3-al2023" or "amazoncorretto:21.0.2-al2023" for most of our containers; the rest are essentially (pretty much up-to-date) node:alpine.

    I'm (almost) certain we don't use packages older than a few months.

0

AL2 used cgroups v1 for memory; AL2023 uses cgroups v2: https://docs.aws.amazon.com/linux/al2023/ug/resource-limiting-raw-cgroups.html

cgroups v2 tracks RAM more accurately. One specific example of something cgroups v2 tracks correctly, and that cgroups v1 didn't bother with because it is usually relatively small, is the RAM used by dirty writes. (This is an optimization where the OS writes to RAM before disk: after writing to RAM, the OS tells the program that the data has been written to disk, and a background process later flushes it from RAM to disk.)
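To see where a container's charged memory goes under cgroups v2, one option is to read the cgroup's memory.current and memory.stat files from inside the container. The following is a minimal Python sketch rather than a definitive tool. It assumes the container sees its own cgroup at /sys/fs/cgroup (the usual cgroups v2 layout), and the function names parse_memory_stat and summarize are made up for illustration.

```python
from pathlib import Path

# Assumption: inside a container on a cgroups v2 host, the container's own
# cgroup is usually visible at /sys/fs/cgroup.
CGROUP_ROOT = Path("/sys/fs/cgroup")


def parse_memory_stat(text):
    """Parse the 'key value' lines of a cgroups v2 memory.stat file."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            stats[parts[0]] = int(parts[1])
    return stats


def summarize(current_bytes, stats):
    """Split the total charged memory into a few coarse buckets (bytes)."""
    return {
        "total": current_bytes,
        "anon": stats.get("anon", 0),                 # heap, stacks, etc.
        "page_cache": stats.get("file", 0),           # file-backed pages
        "dirty": stats.get("file_dirty", 0),          # written, not yet flushed
        "writeback": stats.get("file_writeback", 0),  # currently being flushed
        "kernel": stats.get("kernel", 0),             # missing on older kernels
        # Rough "working set": total minus cache that is cheap to reclaim.
        "working_set": max(current_bytes - stats.get("inactive_file", 0), 0),
    }


if __name__ == "__main__":
    current_file = CGROUP_ROOT / "memory.current"
    if current_file.exists():
        current = int(current_file.read_text())
        stats = parse_memory_stat((CGROUP_ROOT / "memory.stat").read_text())
        for name, value in summarize(current, stats).items():
            print(f"{name:12s} {value / 2**20:10.1f} MiB")
    else:
        print("memory.current not found; this does not look like cgroups v2")
```

If most of the reported usage sits under page_cache, dirty, or writeback rather than anon, that would be consistent with the accounting difference described above rather than with the application itself using more memory.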

The move to cgroups v2 is also a significant enough change that per-pod overhead may have changed too, and a small amount of additional overhead adds up once it is multiplied by the number of pods.

There are probably also changes in OS-level overhead. This one is easy to detect: run kubectl describe node on an old node and check its allocatable RAM, then run kubectl describe node on a new node (of the same size), and you will probably see a different amount of allocatable RAM.
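The comparison can also be scripted. The sketch below is an assumption-laden illustration: instead of reading kubectl describe node output, it parses the JSON printed by kubectl get node <name> -o json and reads status.allocatable.memory. The helper names are hypothetical, and only the common quantity suffixes are handled.

```python
import json

# Binary and decimal suffixes used by Kubernetes resource quantities.
# Only the common ones are covered in this sketch.
_SUFFIXES = {
    "Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40,
    "k": 10**3, "M": 10**6, "G": 10**9, "T": 10**12,
}


def quantity_to_bytes(quantity):
    """Convert a quantity string such as '512Mi' to a number of bytes."""
    for suffix, factor in _SUFFIXES.items():
        if quantity.endswith(suffix):
            return int(quantity[: -len(suffix)]) * factor
    return int(quantity)  # no suffix: already a plain number of bytes


def allocatable_memory(node_json):
    """Read status.allocatable.memory from `kubectl get node <name> -o json`."""
    node = json.loads(node_json)
    return quantity_to_bytes(node["status"]["allocatable"]["memory"])


def allocatable_diff(old_node_json, new_node_json):
    """Bytes of allocatable memory lost (positive) or gained (negative)."""
    return allocatable_memory(old_node_json) - allocatable_memory(new_node_json)
```

Save the JSON for one old node and one new node of the same instance size, then pass both strings to allocatable_diff to see how much the allocatable memory changed.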

answered a year ago
0

We encountered similar issues when migrating from AL2 to AL2023. In our case, we had some containers that were using an older JVM that isn't compatible with cgroupsv2. This prevented the container from querying operating system metrics. The following issue filed in the amazon-eks-ami repo was helpful in our troubleshooting: https://github.com/awslabs/amazon-eks-ami/issues/1866

To check whether your Java containers are encountering this problem, you can run java -XshowSettings:system -version in the pod. If the result is similar to the following, then you should be good to go:

Operating System Metrics:
    Provider: cgroupv2

If instead you see this, then you will need to upgrade to a version that supports cgroupsv2:

Operating System Metrics:
    No metrics available for this platform
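If there are many containers to check, the same test can be automated. This is a rough Python sketch under a few assumptions: the java binary is on the PATH of wherever the script runs, the output contains a "Provider:" line as in the examples above, and the function names are hypothetical. Both stdout and stderr are scanned because the stream used for these settings may differ between JVM versions.

```python
import subprocess


def metrics_provider(output):
    """Return the provider named in the output, or None if none is listed."""
    for line in output.splitlines():
        line = line.strip()
        if line.startswith("Provider:"):
            return line.split(":", 1)[1].strip()
    return None


def check_jvm(java="java"):
    """Run `java -XshowSettings:system -version` and return the provider."""
    result = subprocess.run(
        [java, "-XshowSettings:system", "-version"],
        capture_output=True, text=True, check=False,
    )
    # Scan both streams; which one carries the settings is not assumed here.
    return metrics_provider(result.stdout + result.stderr)


if __name__ == "__main__":
    try:
        provider = check_jvm()
    except FileNotFoundError:
        print("java executable not found")
    else:
        if provider is None:
            print("JVM reports no OS metrics; it may not support cgroups v2")
        else:
            print(f"JVM reads container limits via: {provider}")
```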
answered 7 months ago