
ECS Fargate: Service Connect proxy memory keeps growing after 28 Aug redeploy (Go and Kotlin/Spring Boot apps); gRPC traffic eventually stalls

1

Hi AWS re:Post,

We’re seeing the same symptom described in this thread, but in our case it’s even more pronounced:

Environment

  1. ECS Fargate service with Service Connect (sidecar proxy)

  2. Applications written in Go and Kotlin (Spring Boot)

  3. No code changes between deployments

Timeline & Symptom

  1. After a routine redeploy on August 28 (no version change in our app), we observed monotonic memory growth in the Service Connect proxy container.
  2. The proxy’s memory does not drop after traffic subsides.


What we verified

  1. Using ECS Exec and top/ps, we confirmed our app processes return memory as expected after load.
  2. Container Insights shows only the Service Connect sidecar’s memory rising; the app container’s memory goes back down.
  3. This setup had been stable for months; the single redeploy (without code changes) appears correlated with the new behavior.
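
The comparison above (the sidecar keeps climbing while the app returns memory) can be turned into a repeatable check by fitting a line to each container's memory samples. This is only a minimal Python sketch, not an AWS API: the function names, the 1 MiB/hour threshold, the 95% "did not drop back" rule, and the sample values are all illustrative assumptions, and real samples would have to be exported from your own monitoring.

```python
from typing import Sequence, Tuple

Sample = Tuple[float, float]  # (hours since start, memory in MiB)


def growth_rate_mib_per_hour(samples: Sequence[Sample]) -> float:
    """Least-squares slope of memory over time, in MiB per hour."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_m = sum(m for _, m in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        raise ValueError("samples must span more than one timestamp")
    cov = sum((t - mean_t) * (m - mean_m) for t, m in samples)
    return cov / var_t


def looks_like_leak(samples: Sequence[Sample],
                    min_rate_mib_per_hour: float = 1.0) -> bool:
    """True if memory trends upward and has not dropped back from its peak."""
    rate = growth_rate_mib_per_hour(samples)
    peak = max(m for _, m in samples)
    last = samples[-1][1]
    return rate >= min_rate_mib_per_hour and last >= 0.95 * peak


# Made-up samples for illustration only (not data from this thread).
sidecar_samples = [(0, 100), (1, 130), (2, 160), (3, 190), (4, 220)]
app_samples = [(0, 300), (1, 600), (2, 650), (3, 350), (4, 310)]
```

With these made-up numbers, `looks_like_leak(sidecar_samples)` is true and `looks_like_leak(app_samples)` is false, which mirrors the pattern we saw in Container Insights.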

Impact

  1. Eventually, gRPC traffic via Service Connect stopped responding and we had an outage. The Service Connect proxy hit memory limits and died.

Ask

  1. Has anyone resolved this behavior?
  2. If so, what worked (e.g., proxy image/version change, specific configuration tweaks, task sizing, or OS/kernel settings)?
  3. Any confirmed mitigations (connection pool limits, circuit breakers, outlier detection, idle timeouts, etc.) that stop the memory from climbing?
  4. Beyond workarounds, how can this be fundamentally resolved?

We believe the root cause is within the Service Connect proxy since the app container releases memory and only the sidecar continues to grow. Any detailed guidance or lessons learned from others who’ve fixed this would be extremely helpful. Thanks!

  • I've observed this issue as well; for us it started on 8/27. I've raised a support ticket with AWS.

asked 9 months ago · 314 views
3 Answers
2

I am seeing the same behavior in the Service Connect sidecar. It uses most of the memory of a 2 GB ECS Fargate service.

[Image: Memory consumption of Service Connect sidecar]

I raised an AWS Support ticket, and it is currently under internal investigation by the dev team.

In general I think it is reasonable to define resource limits on the service-connect container.

answered 9 months ago
EXPERT
reviewed 8 months ago
  • Thanks for sharing your experience.

    It appears the issue was caused by something internal to AWS ECS Service Connect.

    Even after rebuilding everything from scratch, the same problem persisted. However, when we ran the exact same application without Service Connect, we confirmed that memory usage no longer increased.

    So here’s how I’m working around it:

    • Don’t use Service Connect.
    • Place an internal NLB on the private network and switch inter-service communication to go through it. (You will need to distinguish services by port number.)

    In my view, NLBs offer much better performance and shouldn’t trigger this kind of latent issue, so they’re a solid choice.

    Also, if you run multiple tasks, each task spins up its own Service Connect sidecar, which I consider a waste of resources. With a single NLB, you avoid that overhead, and it’s much more reliable for scaling out.
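
Distinguishing services by port behind a single internal NLB could look roughly like the sketch below. The NLB DNS name, service names, and port numbers are hypothetical placeholders, not values from this thread; each port is assumed to correspond to its own NLB listener and target group.

```python
# Hypothetical placeholder: replace with your own internal NLB DNS name.
NLB_HOST = "internal-nlb.example.internal"

# Assumed layout: one NLB listener (and target group) per service,
# distinguished only by port. Names and ports are illustrative.
SERVICE_PORTS = {
    "orders": 50051,
    "billing": 50052,
    "inventory": 50053,
}

# Two services sharing a port would be routed to the same target group.
if len(set(SERVICE_PORTS.values())) != len(SERVICE_PORTS):
    raise ValueError("each service needs a unique NLB listener port")


def grpc_target(service: str) -> str:
    """Return the host:port a client should dial for the given service."""
    if service not in SERVICE_PORTS:
        raise ValueError(f"no NLB listener configured for service {service!r}")
    return f"{NLB_HOST}:{SERVICE_PORTS[service]}"
```

Keeping the mapping in one place makes port collisions visible early, which matters because the port is the only thing that tells services apart in this setup.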

1

AWS Support applied a fix for this and asked us to do a force deployment for it to take effect. We've confirmed that this addressed the issue.
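
For anyone who wants to script the force deployment, the ECS `UpdateService` API accepts a `forceNewDeployment` flag. The sketch below assumes a boto3 ECS client is passed in; the helper name and the cluster and service names in the usage comment are placeholders, and this only triggers the redeploy, it does not apply the AWS-side fix by itself.

```python
def force_new_deployment(ecs_client, cluster: str, service: str) -> dict:
    """Ask ECS to replace the service's tasks without changing its definition.

    ecs_client is expected to be a boto3 ECS client, e.g. boto3.client("ecs").
    It is injected so this helper has no import-time dependency on boto3.
    """
    return ecs_client.update_service(
        cluster=cluster,
        service=service,
        forceNewDeployment=True,
    )


# Usage (requires boto3 and AWS credentials; names are placeholders):
#   import boto3
#   force_new_deployment(boto3.client("ecs"), "my-cluster", "my-service")
```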

answered 9 months ago
-2

Based on your description, you're experiencing a memory leak in the Service Connect Proxy sidecar container after a routine redeploy, which eventually leads to service outages when the proxy hits memory limits.

Here are some approaches to troubleshoot and potentially resolve this issue:

  1. Monitor Service Connect Proxy Metrics: You can set up specific monitoring for the Service Connect Proxy sidecar container. AWS provides metrics specifically for Service Connect that can help identify issues with the proxy. This would give you visibility into the proxy's behavior and help confirm your suspicion about memory leaks.

  2. Vulnerability Scanning: Consider activating Amazon Inspector on your container images, particularly the Service Connect sidecar. Some vulnerabilities can manifest as memory leaks, and identifying and addressing these could potentially resolve the issue.

  3. Health Checks and Deployment Configuration: Implement health checks for your tasks to better monitor the health of the Service Connect container. This can help ECS determine when containers are becoming unhealthy and replace them before they cause outages.

  4. Task Resource Allocation: Review your task definition resources. You might need to increase the memory allocation for your tasks to accommodate the Service Connect proxy's behavior until a more permanent solution is found.

  5. Deployment Strategies: Consider implementing different deployment strategies, such as rolling updates, to minimize the impact of any issues that arise with the Service Connect proxy.

  6. Verify Proxy Configuration: Ensure your Service Connect configuration is correct. You can verify that your connection is using the Service Connect proxy by checking for the "server: envoy" header in responses.

  7. Restart Strategy: As a temporary mitigation, you might consider implementing a scheduled task restart before memory usage reaches critical levels, though this is more of a workaround than a solution.
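
For the restart strategy in item 7, the restart interval can be estimated from the observed growth rate rather than guessed. The following is a rough sketch that assumes approximately linear growth; the function name, the 80% safety threshold, and the numbers in the usage note are illustrative assumptions, not measurements from this thread.

```python
def hours_until_threshold(current_mib: float,
                          limit_mib: float,
                          growth_mib_per_hour: float,
                          threshold: float = 0.8) -> float:
    """Hours until memory reaches threshold * limit, assuming linear growth.

    Returns 0.0 if usage is already at or above the threshold, and
    float("inf") if memory is not growing.
    """
    if limit_mib <= 0:
        raise ValueError("limit_mib must be positive")
    if not 0 < threshold <= 1:
        raise ValueError("threshold must be in (0, 1]")
    target = threshold * limit_mib
    if current_mib >= target:
        return 0.0
    if growth_mib_per_hour <= 0:
        return float("inf")
    return (target - current_mib) / growth_mib_per_hour
```

For example, a proxy at 400 MiB growing by 30 MiB per hour in a 2048 MiB task would reach the 80% mark in roughly 41 hours, so a daily scheduled restart would leave some margin under these assumed numbers.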

Since this issue appeared after a routine redeploy without code changes, it's possible that there was an underlying change in the Service Connect proxy version or configuration. If the issue persists after trying these approaches, reaching out to AWS Support might be necessary as they can provide more specific guidance based on your exact setup and the version of the Service Connect proxy you're using.
Sources
Monitoring Service Connect Proxy in ECS Fargate for Potential Memory Issues | AWS re:Post
ECS connect container unhealthy during new deployments to ECS EC2 | AWS re:Post
Troubleshoot service connect issues in Amazon ECS | AWS re:Post

answered 9 months ago
