
Silent WebSocket disconnects on r8i / r8i-flex / m8i instances after ~5 minutes of inactivity (no close/error event detected)


We’re running Node.js processes on r7i/r7iz instances with ECS-optimized AMIs. These processes connect to a WebSocket server inside our VPC; there are no load balancers or proxies in between.

On r7i/r7iz, the WebSocket connection is stable and persistent. When we restart the WebSocket server, clients immediately detect the disconnect and reconnect as expected.

After migrating to r8i / r8i-flex / m8i, the behavior changes: WebSocket connections go silent after roughly five minutes of inactivity. Clients stop receiving messages, attempts to send fail, and no close or error event ever fires, so clients still believe the connection is alive. This happens consistently across all of the new instance types, regardless of the AMI used.

What changed in the new generation that could explain this silent connection timeout or TCP idle behavior? Are there tunable parameters (e.g., ENA, Nitro, connection tracking, or kernel-level timeouts) that can restore the same WebSocket stability seen on r7i/r7iz instances?

1 Answer

The behavior you're experiencing on r8i/r8i-flex/m8i instances appears to be related to connection tracking and TCP idle timeout settings.

In AWS, tracked TCP connections are removed from the connection-tracking table if they remain idle (no data transmitted) for a configured period. On EC2, the default idle timeout for established TCP connections is 432,000 seconds (5 days), but the effective value can vary by instance type and configuration.

The consistent disconnect after roughly 5 minutes (~300 seconds) suggests that a TCP idle timeout is being enforced somewhere in your connection path. This could stem from differences in the underlying network infrastructure or in the default connection-tracking configuration between the r7i and r8i/m8i instance families.

To address this issue, you have several options:

  1. Implement WebSocket ping/pong frames or application-level heartbeats to keep the connection active by sending data periodically (before the 5-minute mark).

  2. Adjust the established-TCP timeout on your instances' network interfaces. You can set the tcpEstablishedTimeout parameter in the ConnectionTrackingSpecification of each elastic network interface to any value between 60 seconds and 432,000 seconds (5 days).

  3. Check if there are any differences in the network configuration between your instance generations that might be affecting connection tracking.
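Option 2 can be applied per ENI with the AWS CLI. A sketch, assuming a placeholder interface ID (substitute your own) and an example timeout of one day:

```shell
# Inspect the current connection-tracking configuration of an ENI
# (eni-0123456789abcdef0 is a placeholder; use your interface ID).
aws ec2 describe-network-interfaces \
  --network-interface-ids eni-0123456789abcdef0 \
  --query 'NetworkInterfaces[0].ConnectionTrackingConfiguration'

# Raise the established-TCP idle timeout to 1 day (valid range: 60-432000 s).
aws ec2 modify-network-interface-attribute \
  --network-interface-id eni-0123456789abcdef0 \
  --connection-tracking-specification TcpEstablishedTimeout=86400
```

Note that this must be done for each network interface whose connections should keep the longer timeout; new interfaces get the default unless configured at launch.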

For WebSocket applications, it's generally a good practice to implement heartbeat mechanisms regardless of infrastructure, as various components in a network path might have different idle timeout policies.
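A minimal client-side heartbeat sketch in Node.js, assuming a WebSocket client that exposes ping/pong events and a hard-close method (for example the "ws" npm package; the helper names here are illustrative, not part of any API):

```javascript
// Ping interval well under the ~5-minute idle window observed on r8i/m8i.
const PING_INTERVAL_MS = 30_000;

// A connection is considered dead if no pong arrived within two ping intervals.
function isStale(lastPongAt, nowMs, intervalMs) {
  return nowMs - lastPongAt > 2 * intervalMs;
}

// Attach a heartbeat to an already-connected client socket. Assumes the
// socket emits 'pong' and 'close' events and has ping()/terminate() methods,
// as the "ws" package does (an assumption about your client library).
function attachHeartbeat(ws, intervalMs = PING_INTERVAL_MS) {
  let lastPongAt = Date.now();
  ws.on('pong', () => { lastPongAt = Date.now(); });

  const timer = setInterval(() => {
    if (isStale(lastPongAt, Date.now(), intervalMs)) {
      ws.terminate(); // force-close so reconnect logic can take over
      return;
    }
    ws.ping();        // periodic traffic keeps conntrack entries alive
  }, intervalMs);

  ws.on('close', () => clearInterval(timer));
}
```

The explicit staleness check matters because a silently dropped conntrack entry produces exactly the symptom described above: no close event ever arrives, so only a missed pong reveals the dead connection.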
Sources
ConnectionTrackingConfiguration - Amazon Elastic Compute Cloud
ConnectionTrackingSpecification - Amazon Elastic Compute Cloud
Introducing NLB TCP configurable idle timeout | Networking & Content Delivery

answered a month ago
