
ECS service connect: nginx connection is not stable


I have two services, nginx and Node.js, in the same VPC and cluster. I'm using ECS Service Connect to forward requests from nginx to the Node.js server. Now I'm facing an issue: everything works fine until I force a new deployment of the nginx service, after which the nginx log shows this error:

  • recv() failed (104: Connection reset by peer) while reading response header from upstream

After that, I redeployed the Nginx service, and it started working again. However, this issue keeps occurring after each subsequent deployment. The only way to get it working again is to redeploy it one more time. nginx.conf:

http {
    ...
    upstream endpoint {
        server nodejs_service_dns_name:node_port max_fails=5 fail_timeout=30s;
    }

    server {
        location ~* ^/(location|category)/$ {
            proxy_http_version 1.1;
            proxy_ignore_headers "Set-Cookie" "Expires" "Cache-Control";
            proxy_set_header Host $host;
            proxy_set_header X-Real-IP $remote_addr;
            proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
            proxy_set_header X-Forwarded-Proto $scheme;
            proxy_pass http://endpoint$request_uri;
            proxy_intercept_errors on;
            break;
        }
    }
    ...
}
  • Hi Hien,

    Thank you for sharing your detailed setup! To provide more accurate guidance, could you clarify a couple of points about your architecture?

    1. Are all your services (Nginx and Node.js) hosted entirely within AWS, or do they involve any external systems (e.g., on-premises resources, third-party services, or external DNS)?
    2. Are you using any specific load balancer (like an AWS ALB/ELB) between Nginx and Node.js, or is the communication strictly via ECS Service Connect?
    3. Have you checked the Service Connect logs or CloudWatch metrics to identify any DNS resolution or connection issues during deployments?

    This additional information will help pinpoint the root cause and tailor the guidance to your specific scenario. Looking forward to your response! 😊 🚀

  • Hi Aaron! Thank you for your response. Below are the answers to your questions:

    1. Yes, both the Nginx service and the Node.js service are deployed on ECS (within the same cluster and VPC).
    2. The connection between Nginx and Node.js is made exclusively through ECS Service Connect, without using a Load Balancer (LB).
    3. I have checked the logs in CloudWatch. When the issue occurs, the Nginx service shows the following error logs:

    recv() failed (104: Connection reset by peer) while reading response header from upstream.

    Meanwhile, the Node.js service doesn't log any errors (it seems the request didn't reach the Node.js service). I also checked the /etc/hosts file of both services, and everything is correct (the nodejs_service_dns_name is mapped to 127.255.0.1 and an IPv6 address).

    What puzzles me is that this issue doesn't happen consistently but occurs intermittently after redeploying the Nginx service each time.

    If you need any additional information, please let me know. Thanks a lot!

  • Would it be possible to provide a GitHub repository with a sample application and steps to reproduce the issue?

  • Sorry, this is a private project, so I can't publish it on GitHub. I also have some additional information: the upstream disconnection issue only occurs with network mode set to bridge and the EC2 launch type. When I tested with Fargate (awsvpc), everything works normally.

2 Answers

Greeting

Hi Hien,

Thank you for providing such detailed insights into your architecture and the troubleshooting steps you've already taken. I appreciate your clarity—it really helps in diagnosing the root cause of the issue you're experiencing! 😊 Let’s break this down and get your ECS Service Connect working seamlessly. 🚀


Clarifying the Issue

You’ve set up two services, Nginx and Node.js, within the same ECS cluster and VPC, connected through ECS Service Connect. The issue arises intermittently after redeploying the Nginx service, where you see the error:

recv() failed (104: Connection reset by peer) while reading response header from upstream

This indicates that Nginx fails to maintain a stable connection to the Node.js service after deployment. Although the /etc/hosts file and DNS mappings appear correct, the connection doesn’t seem to consistently forward requests to the Node.js service. The puzzling part is that the issue resolves only after a second redeployment of Nginx, and logs from Node.js show no received requests when the issue occurs.

This behavior points to a transient issue with Service Connect’s connection handling during the initial redeployment, potentially related to stale connections, DNS propagation delays, or misaligned configurations. Let’s address this step-by-step.


Why This Matters

When services in a microservices architecture fail to communicate reliably, it can lead to cascading failures, user-facing downtime, and a loss of trust in the system's reliability. In modern deployments, where frequent updates and redeployments are common, stable service-to-service communication is critical. Addressing this ensures that your applications remain robust, even during routine updates, reducing operational stress and improving user experience.


Key Terms

  • ECS Service Connect: A feature that simplifies communication between ECS services by managing service discovery, traffic routing, and DNS resolution.
  • Connection Reset by Peer (Error 104): A low-level error indicating the remote side of the connection abruptly closed it.
  • Upstream: The backend server (in this case, Node.js) that Nginx proxies requests to.
  • Transient Error: Temporary glitches in the system that resolve after retries or reinitialization.
  • Graceful Deregistration: The process of allowing connections to complete before removing a service during updates or scale-downs.

The Solution (Our Recipe)

Steps at a Glance:

  1. Configure Nginx to use a health check endpoint for the Node.js service.
  2. Modify the Nginx upstream block to include additional connection options.
  3. Ensure ECS Service Connect handles redeployments gracefully with deregister_delay.
  4. Test and validate the setup to ensure stability across redeployments.
  5. Monitor and analyze connection stability during redeployment tests.

Step-by-Step Guide:

  1. Configure Nginx to Use a Health Check Endpoint: Ensure your Node.js service exposes a lightweight health check endpoint (e.g., /health) that Nginx can use to verify backend availability.

    Example Node.js health check:

    const express = require('express');  // assuming an Express app
    const app = express();

    app.get('/health', (req, res) => {
        res.status(200).send('OK');
    });

    This ensures that Nginx can confirm the Node.js service is ready to handle requests.


  2. Modify the Nginx upstream Block: Add options to handle connection retries, persistent connections, and graceful failover.

    Updated nginx.conf:

    upstream endpoint {
        server nodejs_service_dns_name:node_port max_fails=3 fail_timeout=10s;
        keepalive 16;  # Enable persistent connections
    }

    server {
        location /health {
            proxy_pass http://endpoint/health;
        }

        location ~* ^/(location|category)/$ {
            proxy_http_version 1.1;
            proxy_set_header Connection "";  # required for upstream keepalive to take effect
            proxy_set_header Host $host;
            proxy_set_header X-Real-IP $remote_addr;
            proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
            proxy_set_header X-Forwarded-Proto $scheme;
            proxy_pass http://endpoint$request_uri;
        }
    }

  3. Ensure ECS Service Connect Handles Redeployments Gracefully: Update your ECS service definition to include the deregisterDelay parameter, allowing ECS to gracefully drain connections before terminating tasks:

    {
        "serviceRegistries": [
            {
                "deregisterDelay": 30
            }
        ]
    }

    This ensures existing connections to the Node.js service are not abruptly terminated during a redeployment.


  4. Test and Validate the Setup:
    • Deploy the updated Nginx configuration and monitor for stability across redeployments.
    • Simulate concurrent requests during deployments to test the resilience of your setup.

  5. Monitor and Analyze Connection Stability: Use CloudWatch and other AWS tools to monitor key metrics and troubleshoot issues:
    • CloudWatch Metrics:
      • Service Connect DNS Resolution Time: Ensure there are no significant delays in resolving the Node.js service DNS.
      • ECS Service Registry Updates: Check if Service Connect updates propagate correctly during redeployment.
      • Task Network Traffic Metrics: Look for unexpected traffic patterns or dropped packets.
    • CloudWatch Logs:
      • Review logs for both Nginx and Node.js services for errors or anomalies during deployment cycles.
    • AWS X-Ray (optional): Use distributed tracing to visualize and debug request flows across services.

Closing Thoughts

The intermittent nature of the issue suggests transient problems with connection management during redeployments. By implementing health checks, optimizing Nginx’s connection settings, and ensuring ECS handles deregistration properly, you can eliminate these disruptions.

Let me know if you have further questions or need additional guidance. I’m here to help! 😊🚀


Farewell

Hien, I hope this solution provides clarity and helps you resolve the issue. Keep up the great work, and feel free to reach out if you encounter any other challenges—I'd be glad to assist! 💡🌟


Cheers,

Aaron 😊

answered a year ago
  • Thank you for your suggestion. I tried it, but unfortunately, it didn't work. I also just realized that when the upstream error occurs, redeploying the Node.js service (in addition to redeploying the Nginx service) can also resolve the issue. It's strange, as I've tried many different approaches but still haven't found a solution. If you have any other suggestions, please let me know.


Greeting

Hi Hien,

Thank you for the additional details! The fact that redeploying either Nginx or Node.js resolves the issue suggests the problem might involve how ECS Service Connect manages service discovery or connection handling during deployments. Let’s dive deeper and refine the approach to resolve this once and for all. 🚀


Updated Analysis

Your setup involves Nginx and Node.js services connected via ECS Service Connect, all within the same ECS cluster and VPC. The issue arises intermittently after deploying either service, with Nginx logging connection reset errors (recv() failed (104: Connection reset by peer)). Redeploying either service temporarily fixes the problem.

This pattern suggests transient issues related to DNS propagation, stale connections, or ECS Service Connect configurations. Here's an improved troubleshooting approach to stabilize communication and prevent these disruptions.


Refined Solution

1. Enhance DNS Resolution in Nginx

Nginx resolves the server names in an upstream block once at startup and caches them until reload, so it can keep routing to stale addresses after a deployment. Point Nginx at the container's DNS resolver with a short TTL:

resolver 127.0.0.11 ipv6=off valid=10s;  # Docker's embedded DNS resolver (bridge network mode)

upstream endpoint {
    server nodejs_service_dns_name:node_port max_fails=3 fail_timeout=10s;
    keepalive 16;
}

This minimizes DNS caching issues during service updates. Note that in open-source Nginx, the resolver directive only takes effect when the proxy_pass target contains a variable (see step 3); it does not re-resolve names inside an upstream block.


2. Enable Graceful Connection Draining

Ensure ECS tasks allow existing connections to complete before termination during deployments. Update your ECS service definition to include deregisterDelay for Service Connect:

"deploymentConfiguration": {
    "maximumPercent": 200,
    "minimumHealthyPercent": 100
},
"serviceRegistries": [
    {
        "deregisterDelay": 60
    }
]

This gives ECS time to drain connections gracefully before new tasks start.


3. Force Nginx to Flush DNS Cache

In open-source Nginx, the resolver directive is only consulted when the proxy_pass target contains a variable; names written directly into an upstream block or proxy_pass are resolved once at startup. Using a variable forces Nginx to re-resolve on each request:

location ~* ^/(location|category)/$ {
    resolver 127.0.0.11 valid=5s;  # Docker's embedded DNS; refresh every 5 seconds
    set $backend http://nodejs_service_dns_name:node_port;  # variable forces runtime resolution
    proxy_set_header Host $host;
    proxy_set_header X-Real-IP $remote_addr;
    proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    proxy_set_header X-Forwarded-Proto $scheme;
    proxy_pass $backend$request_uri;
}

This ensures Nginx resolves the current IP for the Node.js service, even during redeployments.


4. Add Debug Logging

Enable debug-level logging in Nginx to gather more information on connection resets:

error_log /var/log/nginx/error.log debug;

Review these logs during deployments to identify potential DNS resolution or connection handling issues.


5. Introduce Deployment Delays

Minimize overlap between old and new tasks during deployment by adjusting deployment parameters:

aws ecs update-service --service my-nginx-service --cluster my-cluster \
    --deployment-configuration maximumPercent=200,minimumHealthyPercent=50

This ensures tasks are replaced gradually, reducing potential connection interruptions.


Advanced Diagnostics

If these steps don’t resolve the issue, consider:

  • Service Connect Proxy Logs: Check for errors or anomalies in Service Connect’s handling of connections.
  • AWS X-Ray: Use distributed tracing to visualize and debug request flows.
  • Bypass Service Connect Temporarily: Test direct communication using private IPs to rule out Service Connect-specific issues.

Closing Thoughts

Hien, I hope these refined suggestions help stabilize your service communication during deployments. If you encounter any challenges or need further clarification, feel free to reach out. Keep up the great work—you’re close to solving this! 💪😊

Best regards,
Aaron 🚀 😊

answered a year ago
