How do I troubleshoot connectivity issues when I use a NAT gateway on my private Amazon VPC?


I want to troubleshoot connectivity issues when I use a NAT gateway on my private Amazon Virtual Private Cloud (Amazon VPC).

Short description

Resources that are in private subnets might experience connectivity timeouts, sudden connection drops, or slow connectivity for the following reasons:

  • Network access control list (network ACL) rule restrictions on ephemeral port ranges
  • ErrorPortAllocation error on the NAT gateway
  • Port exhaustion on the client instance
  • IdleTimeoutCount error because of idle connections
  • NAT gateway bandwidth limitations

Resolution

Network ACL rule restrictions on ephemeral port ranges

Verify that the network ACL that you associated with the public subnet of the NAT gateway allows traffic on the ephemeral port range, 1024-65535.

If the network ACL allows only a subset of this range and a client uses a port outside that subset, then traffic drops. For more information, see Example: VPC with servers in private subnets and NAT.
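To check a rule set offline, you can use the following Python sketch. It reports which parts of 1024-65535 a list of allow rules leaves uncovered. The port ranges are hypothetical inputs that you copy from your network ACL, and the function name is illustrative. The sketch considers allow rules only and ignores rule numbers and deny rules, which real network ACLs evaluate in order, so treat the result as a first pass.

```python
# Minimal sketch: find gaps in the ephemeral port range that a set of
# network ACL *allow* rules leaves uncovered. Assumption: allow rules only;
# rule numbers and deny rules (evaluated in order by real ACLs) are ignored.
from typing import List, Tuple

EPHEMERAL_RANGE = (1024, 65535)

def uncovered_ports(allow_ranges: List[Tuple[int, int]],
                    target: Tuple[int, int] = EPHEMERAL_RANGE) -> List[Tuple[int, int]]:
    """Return the inclusive (start, end) sub-ranges of target that no allow range covers."""
    start, end = target
    gaps = []
    cursor = start  # lowest port not yet known to be covered
    for lo, hi in sorted(allow_ranges):
        if hi < cursor:
            continue  # this rule only covers ports that are already covered
        if lo > end:
            break  # this rule (and all later ones) start past the target range
        if lo > cursor:
            gaps.append((cursor, lo - 1))  # ports between the last rule and this one
        cursor = max(cursor, hi + 1)
        if cursor > end:
            break
    if cursor <= end:
        gaps.append((cursor, end))
    return gaps

if __name__ == "__main__":
    # Hypothetical rule set that allows only 32768-65535.
    print(uncovered_ports([(32768, 65535)]))  # [(1024, 32767)]
```

An empty result means that the allow rules cover the whole ephemeral range.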

ErrorPortAllocation error on the NAT gateway

Each NAT gateway supports up to 55,000 simultaneous connections to each destination. If the connections exceed the threshold, then new connections to the destination fail and the ErrorPortAllocation metric for the NAT gateway increases in Amazon CloudWatch.

To resolve this issue, take the following actions:

  • Associate up to seven secondary IPv4 addresses with your NAT gateway, in addition to its primary address.
  • Add secondary IPv4 addresses until the available ports cover your peak number of concurrent connections to each destination.

Note: Each IPv4 address supports up to 55,000 simultaneous connections to each destination, so a NAT gateway with one primary and seven secondary addresses supports up to 440,000.

For more information, see How do I resolve the "ErrorPortAllocation" error on my NAT gateway in Amazon VPC?
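To estimate how many addresses a workload needs, the following Python sketch divides a peak connection count by 55,000 per address and caps the result at eight addresses per NAT gateway. The peak connection count is an input that you measure yourself, and the function name is illustrative.

```python
# Back-of-the-envelope sizing for NAT gateway IPv4 addresses.
# Assumes 55,000 simultaneous connections per destination per address and
# at most 8 addresses (1 primary + 7 secondary) per NAT gateway.
import math

CONNECTIONS_PER_ADDRESS = 55_000
MAX_ADDRESSES = 8

def addresses_needed(peak_connections_to_destination: int) -> int:
    """Return how many IPv4 addresses one NAT gateway needs for a peak number
    of simultaneous connections to a single destination."""
    if peak_connections_to_destination <= 0:
        return 1  # every NAT gateway has at least its primary address
    needed = math.ceil(peak_connections_to_destination / CONNECTIONS_PER_ADDRESS)
    if needed > MAX_ADDRESSES:
        raise ValueError(
            f"{peak_connections_to_destination} connections exceeds the "
            f"{MAX_ADDRESSES * CONNECTIONS_PER_ADDRESS} that one NAT gateway supports"
        )
    return needed

if __name__ == "__main__":
    print(addresses_needed(120_000))  # 3, that is, 1 primary + 2 secondary
```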

Port exhaustion on the client instance

The client instances that are in the private subnet might have reached their operating system (OS) connection quotas.

To check the number of active connections, run the following commands:

Linux:

netstat -ano | grep ESTABLISHED | wc -l

netstat -ano | grep TIME_WAIT | wc -l

Windows:

netstat -ano | find /i "estab" /c

netstat -ano | find /i "TIME_WAIT" /c

If these counts approach the number of ports in the OS local (ephemeral) port range, then port exhaustion might be the cause.
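On Linux instances that don't have netstat installed, the following Python sketch counts ESTABLISHED and TIME_WAIT sockets by reading /proc/net/tcp and /proc/net/tcp6 directly. It assumes the standard /proc layout, where the fourth column holds the TCP state as a hex code (01 for ESTABLISHED, 06 for TIME_WAIT). The function names are illustrative.

```python
# Minimal sketch (Linux only): count ESTABLISHED and TIME_WAIT sockets from
# /proc/net/tcp and /proc/net/tcp6, as an alternative to netstat.
import os
from collections import Counter

# Hex codes in the "st" (state) column of /proc/net/tcp.
TCP_STATES = {"01": "ESTABLISHED", "06": "TIME_WAIT"}

def count_states(proc_text: str) -> Counter:
    """Count ESTABLISHED and TIME_WAIT rows in /proc/net/tcp-style text."""
    counts = Counter()
    for line in proc_text.splitlines()[1:]:  # first line is the column header
        fields = line.split()
        if len(fields) > 3 and fields[3] in TCP_STATES:
            counts[TCP_STATES[fields[3]]] += 1
    return counts

def local_socket_states() -> Counter:
    """Sum the counts over IPv4 and IPv6 sockets; empty if /proc is missing."""
    total = Counter()
    for path in ("/proc/net/tcp", "/proc/net/tcp6"):
        if os.path.exists(path):
            with open(path) as f:
                total += count_states(f.read())
    return total

if __name__ == "__main__":
    states = local_socket_states()
    print("ESTABLISHED:", states["ESTABLISHED"], "TIME_WAIT:", states["TIME_WAIT"])
```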

To reduce port exhaustion, take the following actions:

  • Resolve application-level issues that drain the available connections.
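  • On Linux, run the following command to widen the OS ephemeral port range:
    sudo sysctl -w net.ipv4.ip_local_port_range="1025 61000"
    To keep this setting after a reboot, add the line net.ipv4.ip_local_port_range = 1025 61000 to /etc/sysctl.conf.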

Note: A broader port range might not resolve port allocation issues because of silent connection closures.
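To compare the connection counts with the current range, the following Python sketch reads /proc/sys/net/ipv4/ip_local_port_range (Linux only) and computes how many ports the range provides. The 80% warning threshold is an arbitrary assumption, and the comparison is rough, because Linux can reuse the same local port for connections to different destinations.

```python
# Minimal sketch (Linux only): how many ports does the ephemeral range provide?
# The 0.8 threshold is an arbitrary assumption for illustration.
import os

RANGE_PATH = "/proc/sys/net/ipv4/ip_local_port_range"

def port_range_size(range_text: str) -> int:
    """Parse 'low high' (tab- or space-separated) and return the inclusive count."""
    low, high = (int(v) for v in range_text.split())
    return high - low + 1

def near_exhaustion(ports_in_use: int, range_size: int, threshold: float = 0.8) -> bool:
    """Return True when ports_in_use reaches threshold * range_size."""
    return ports_in_use >= threshold * range_size

if __name__ == "__main__" and os.path.exists(RANGE_PATH):
    with open(RANGE_PATH) as f:
        size = port_range_size(f.read())
    print(f"Ephemeral port range provides {size} ports")
```

For example, the range 1025 61000 provides 59,976 ports.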

IdleTimeoutCount error because of idle connections

When a connection is idle for 350 seconds or more, the NAT gateway times it out and the IdleTimeoutCount metric increases. If a client then tries to resume the timed-out connection, the NAT gateway sends a TCP Reset (RST) packet, not a TCP Finish (FIN) packet.

To reduce idle connection timeouts, take the following actions:

  • Review the IdleTimeoutCount metric in Amazon CloudWatch to identify idle connections.
  • Use CloudWatch Contributor Insights to view what's causing clients to remain in the Idle state.
  • Close idle connections from clients to release capacity.
  • Send traffic more frequently over long-running connections.
  • Turn on TCP keepalive on the client instance, with a keepalive time of less than 350 seconds.
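For a single application socket, the following Python sketch turns on keepalive with a 300-second idle time, which is below the 350-second NAT gateway timeout. The 300/30/5 values and the function name are illustrative assumptions. Socket option names differ by OS, so the sketch checks which ones are available. On Linux, a socket that turns on keepalive without these per-socket options falls back to the system-wide net.ipv4.tcp_keepalive_time setting, which defaults to 7200 seconds.

```python
# Minimal sketch: turn on TCP keepalive for one client socket so that probes
# start before the NAT gateway's 350-second idle timeout.
# The 300/30/5 defaults are illustrative assumptions, not AWS recommendations.
import socket

NAT_IDLE_TIMEOUT_SECONDS = 350

def enable_keepalive(sock: socket.socket, idle: int = 300,
                     interval: int = 30, probes: int = 5) -> None:
    if idle >= NAT_IDLE_TIMEOUT_SECONDS:
        raise ValueError("idle must be less than the 350-second NAT gateway timeout")
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # Seconds of idleness before the first probe: TCP_KEEPIDLE where available
    # (Linux, recent Windows), otherwise TCP_KEEPALIVE (macOS, Python 3.10+).
    if hasattr(socket, "TCP_KEEPIDLE"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle)
    elif hasattr(socket, "TCP_KEEPALIVE"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPALIVE, idle)
    # Seconds between probes, and unanswered probes before the connection drops.
    if hasattr(socket, "TCP_KEEPINTVL"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval)
    if hasattr(socket, "TCP_KEEPCNT"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, probes)

if __name__ == "__main__":
    client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    enable_keepalive(client)
    # A call such as client.connect((host, port)) would follow here.
    client.close()
```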

NAT gateway bandwidth limitations

A NAT gateway starts at 5 Gbps of bandwidth and automatically scales up to 100 Gbps. If the combined network throughput across all instances that use the NAT gateway reaches 100 Gbps, then traffic slows down. For more information, see NAT gateway metrics and dimensions.

To resolve a bandwidth limitation from the NAT gateway, distribute traffic across multiple NAT gateways in separate subnets.
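To estimate how many NAT gateways to distribute traffic across, the following Python sketch divides a measured peak throughput by the 100 Gbps per-gateway ceiling. It assumes that traffic splits evenly across gateways. The headroom factor, which keeps each gateway below its maximum, is an assumption that you can tune, and the function name is illustrative.

```python
# Rough sizing sketch: how many NAT gateways keep each one under its ceiling.
# Assumptions: traffic splits evenly, and headroom is a tunable safety margin.
import math

MAX_GBPS_PER_NAT_GATEWAY = 100

def nat_gateways_needed(peak_gbps: float, headroom: float = 0.8) -> int:
    """Return the number of NAT gateways needed so that each one carries at most
    headroom * 100 Gbps of the measured peak throughput."""
    if not 0 < headroom <= 1:
        raise ValueError("headroom must be in (0, 1]")
    usable_gbps = MAX_GBPS_PER_NAT_GATEWAY * headroom
    return max(1, math.ceil(peak_gbps / usable_gbps))

if __name__ == "__main__":
    print(nat_gateways_needed(250))  # 250 / 80 = 3.125, so 4 gateways
```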

For more information, see How can I use Amazon CloudWatch metrics to identify NAT gateway bandwidth issues?

Related information

How do I resolve intermittent connection issues when using a NAT instance?

Troubleshoot NAT gateways

AWS OFFICIAL
Updated 3 months ago
2 Comments

The Linux command above doesn't work... needs a line break or something to look like this:

netstat -ano | grep ESTABLISHED | wc --l

netstat -ano | grep TIME_WAIT | wc --l

replied a year ago

Thank you for your comment. We'll review and update the Knowledge Center article as needed.

AWS
MODERATOR
replied a year ago