SSM Port Forwarding tunnel silently dies after hours -- plugin fails to reconnect after periodic WebSocket recycling

Environment:

SSM Agent: 3.3.x (on-prem hybrid instance, Amazon Linux 2023)
EC2 Security Group: Allow all outbound traffic
session-manager-plugin: latest (on EC2 t4g.micro, AL2023)
Region: us-west-2
Document: AWS-StartPortForwardingSession
Target: Hybrid managed instance (mi-*)
Idle timeout: 20m + ResumeSession
Max Session Timeout: Not set

Description:

We're running long-lived SSM port forwarding sessions (12-24h) from an EC2 instance to an on-premises hybrid managed instance. The tunnel works correctly for hours (we use ResumeSession to handle the idle timeout), but eventually dies silently -- all traffic starts timing out with no response.

Root cause from logs:

The SSM service closes WebSocket connections approximately every 60 minutes, sending websocket: close 1000 (normal): Bye. Both the on-prem agent and the EC2-side session-manager-plugin attempt to reconnect. The on-prem agent consistently reconnects successfully. The session-manager-plugin usually reconnects too, but occasionally fails.

When the plugin fails to reconnect, the on-prem agent receives "Session is already terminated" when trying to recreate the data channel -- meaning the EC2 side has already given up and killed the session.

On-prem agent log -- successful reconnections every ~60 min (same session, same pattern):

2026-03-28 13:27:39 WARN [pluginName=Port] Reach the retry limit 5 for receive messages. Error: websocket: close 1000 (normal): Bye
2026-03-28 14:27:41 WARN [pluginName=Port] Reach the retry limit 5 for receive messages. Error: websocket: close 1000 (normal): Bye
2026-03-28 15:27:43 WARN [pluginName=Port] Reach the retry limit 5 for receive messages. Error: websocket: close 1000 (normal): Bye
2026-03-28 16:27:44 WARN [pluginName=Port] Reach the retry limit 5 for receive messages. Error: websocket: close 1000 (normal): Bye

All of the above resulted in successful reconnections -- tunnel continued working.

On-prem agent log -- the fatal disconnect (same pattern, but EC2 side failed):

2026-03-28 16:57:51 WARN  [pluginName=Port] Reach the retry limit 5 for receive messages. Error: websocket: close 1000 (normal): Bye
2026-03-28 16:57:51 INFO  [pluginName=Port] The session was cancelled
2026-03-28 16:57:51 ERROR [pluginName=Port] Unable to read from connection: use of closed network connection
2026-03-28 16:57:51 ERROR [pluginName=Port] Unable to accept stream: io: read/write on closed pipe
2026-03-28 16:57:52 INFO  [pluginName=Port] Setting task to cancelled as session is already terminated
2026-03-28 16:57:52 ERROR [pluginName=Port] CreateDataChannel failed: Session is already terminated

Sessions testing:

Duration      Tunnel survived?
1h, 3h, 6h    Yes, 0% failure
12h           No, died at ~5 hours
24h           No, died at ~19 hours

asked 2 months ago, 50 views
1 Answer

The behavior you're experiencing is related to how AWS Systems Manager Session Manager handles WebSocket connections and session resumption. Here's what's happening:

WebSocket Recycling and Session Resumption: The idle session timeout timer resets when Session Manager receives client-side inputs, including session reconnection (ResumeSession). These reconnections can occur due to network interruptions, browser tab management, or WebSocket disconnections. Because these events reset the idle timer, a session might remain active longer than the configured timeout period even without direct terminal commands.

The Silent Failure Pattern: Your logs show that the WebSocket connections are being closed approximately every 60 minutes with a normal closure (close 1000). Both sides attempt to reconnect, but when the EC2-side session-manager-plugin fails to reconnect successfully, the session becomes terminated on the service side. When your on-prem agent then tries to recreate the data channel, it receives "Session is already terminated" because the session has already been ended.
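
One way to check the recycling interval is to diff the timestamps of the close 1000 warnings in the agent log. The following is a minimal sketch that uses the five WARN lines quoted in the question as input; the function name close_intervals and the regular expression are illustrative assumptions, not part of any AWS tooling.

```python
import re
from datetime import datetime

# Matches agent WARN lines that report a normal WebSocket closure.
CLOSE_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\s+WARN\s+.*websocket: close 1000"
)


def close_intervals(lines):
    """Return the gaps, in seconds, between consecutive close-1000 warnings."""
    stamps = []
    for line in lines:
        match = CLOSE_RE.match(line)
        if match:
            stamps.append(datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S"))
    return [int((b - a).total_seconds()) for a, b in zip(stamps, stamps[1:])]


# The WARN lines from the question, copied verbatim.
LOG = """\
2026-03-28 13:27:39 WARN [pluginName=Port] Reach the retry limit 5 for receive messages. Error: websocket: close 1000 (normal): Bye
2026-03-28 14:27:41 WARN [pluginName=Port] Reach the retry limit 5 for receive messages. Error: websocket: close 1000 (normal): Bye
2026-03-28 15:27:43 WARN [pluginName=Port] Reach the retry limit 5 for receive messages. Error: websocket: close 1000 (normal): Bye
2026-03-28 16:27:44 WARN [pluginName=Port] Reach the retry limit 5 for receive messages. Error: websocket: close 1000 (normal): Bye
2026-03-28 16:57:51 WARN  [pluginName=Port] Reach the retry limit 5 for receive messages. Error: websocket: close 1000 (normal): Bye
"""

print(close_intervals(LOG.splitlines()))  # [3602, 3602, 3601, 1807]
```

On the quoted lines, the first three gaps are about 60 minutes, while the last close arrives roughly 30 minutes after the previous one. That detail may be worth including if you open a support case.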

Why This Happens Intermittently: ResumeSession can handle idle timeouts and normal WebSocket recycling, but the logs suggest a race condition or timing issue in which the plugin occasionally fails to reconnect before the service considers the session terminated. Once a session is terminated, it cannot be resumed: the connection is permanently ended and the data channel is closed.

Recommendations:

  1. Consider implementing session monitoring and automatic restart logic in your application layer to detect when the tunnel dies and establish a new session
  2. If you need strict control over session duration, you can use the Maximum session duration setting to enforce limits, though this won't solve the reconnection issue
  3. Monitor for connection failures and implement retry logic at the application level rather than relying solely on the plugin's reconnection mechanism
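
The monitoring and restart logic in recommendations 1 and 3 could be sketched roughly as below. This is a minimal illustration under stated assumptions, not a definitive implementation: the names build_command, tcp_probe, and supervise are hypothetical, the target ID and port numbers in the usage comment are placeholders, and it assumes the AWS CLI and session-manager-plugin are installed on the EC2 side. Because the failure described is silent, a plain TCP connect to the local port may still succeed while traffic times out, so the probe is passed in as a callable and is best replaced with an end-to-end check against the forwarded service.

```python
import json
import socket
import subprocess
import time


def build_command(target, remote_port, local_port):
    """Build the AWS CLI command line for an SSM port forwarding session."""
    params = {"portNumber": [str(remote_port)], "localPortNumber": [str(local_port)]}
    return [
        "aws", "ssm", "start-session",
        "--target", target,
        "--document-name", "AWS-StartPortForwardingSession",
        "--parameters", json.dumps(params),
    ]


def tcp_probe(host, port, timeout=5.0):
    """Cheap liveness check: can we open a TCP connection to host:port?

    Assumption: this only shows the local listener is up. Prefer a real
    request/response against the forwarded service where possible.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def supervise(command, probe, interval=30.0, max_failures=3, startup_grace=10.0,
              max_launches=None, spawn=subprocess.Popen, sleep=time.sleep):
    """Run the tunnel, restarting it when it exits or the probe keeps failing.

    Returns the number of times the tunnel process was launched. With
    max_launches=None the loop runs until interrupted.
    """
    launches = 0
    while max_launches is None or launches < max_launches:
        proc = spawn(command)
        launches += 1
        sleep(startup_grace)  # give the session time to establish
        failures = 0
        while proc.poll() is None and failures < max_failures:
            failures = 0 if probe() else failures + 1
            sleep(interval)
        if proc.poll() is None:
            # Process is still running but the tunnel looks dead: replace it.
            proc.terminate()
            try:
                proc.wait(timeout=10)
            except subprocess.TimeoutExpired:
                proc.kill()
    return launches


# Example usage (placeholder target and ports):
# cmd = build_command("mi-0123456789abcdef0", 5432, 15432)
# supervise(cmd, probe=lambda: tcp_probe("127.0.0.1", 15432))
```

Each restart starts a new session rather than resuming the old one, which matches the point above that a terminated session cannot be recovered. Connections that were in flight through the old tunnel are lost, so clients still need their own retry handling.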

The fundamental issue is that long-lived port forwarding sessions (12-24 hours) are experiencing occasional reconnection failures during the periodic WebSocket recycling, and once the session is terminated, it cannot be recovered.
Sources
Specify an idle session timeout value - AWS Systems Manager
terminate_session - Boto3 1.42.50 documentation

answered 2 months ago
EXPERT
reviewed 2 months ago