
Why do I have a "Network allowance exceeded" metric in my self-managed ElastiCache cluster?


I see a "Network allowance exceeded" metric in my Amazon ElastiCache environment.

Short description

When the application workload exceeds network capabilities of the underlying ElastiCache node, traffic shaping can occur. To track traffic shaping, use the following metrics:

  • NetworkBandwidthInAllowanceExceeded and NetworkBandwidthOutAllowanceExceeded
  • NetworkPacketsPerSecondAllowanceExceeded
  • NetworkConntrackAllowanceExceeded

Resolution

NetworkBandwidthInAllowanceExceeded and NetworkBandwidthOutAllowanceExceeded

The NetworkBandwidthInAllowanceExceeded and NetworkBandwidthOutAllowanceExceeded metrics track the number of network packets that ElastiCache shapes when the throughput exceeds the aggregated bandwidth limit.

When you review NetworkBandwidthInAllowanceExceeded and NetworkBandwidthOutAllowanceExceeded, also review the NetworkBytesIn and NetworkBytesOut metrics in Amazon CloudWatch. Even when the CloudWatch bandwidth usage metrics NetworkBytesIn and NetworkBytesOut are below the node-level limits, the network performance metrics might still show that ElastiCache exceeded an allowance. For more information, see Monitor instance bandwidth.
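To compare the CloudWatch byte counts against a node's bandwidth limit, you can convert them into average throughput. The following is a minimal Python sketch; the 60-second period, 45 GB sample value, and 5 Gbps baseline are assumed example values, not properties of any specific node type:

```python
def avg_throughput_gbps(network_bytes: int, period_seconds: int) -> float:
    """Convert a NetworkBytesOut (or NetworkBytesIn) Sum over a
    CloudWatch period into average throughput in gigabits per second."""
    return (network_bytes * 8) / period_seconds / 1e9

# Hypothetical sample: a node reported 45 GB of NetworkBytesOut
# over a 60-second CloudWatch period.
avg = avg_throughput_gbps(45_000_000_000, 60)
print(f"{avg:.2f} Gbps")  # 6.00 Gbps

# Compare against an assumed baseline of 5 Gbps. Note that an average
# below the baseline can still hide microbursts that trigger
# NetworkBandwidthOutAllowanceExceeded.
BASELINE_GBPS = 5.0
print("Above baseline" if avg > BASELINE_GBPS else "Within baseline")
```

Check the actual baseline and burst bandwidth for your node type in the Available instance bandwidth documentation before you compare.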

Note: Small bursts of traffic can cause traffic shaping, even if your average bandwidth is within your limits. If there are occasional spikes in these bandwidth allowance metrics with no effect on the application side, then no further action is required. Because Valkey and Redis OSS use TCP, dropped packets are retransmitted.

If these bandwidth allowance metrics are consistently high and your application sees latency issues, then review the timestamps of the latency issues. If the error timestamps match the times of the metric spikes, then scale up your cluster. For more information, see Scaling self-designed clusters.

Also, review the cache node type for the cluster. If the application workload constantly bursts network usage beyond the baseline bandwidth, then you might get traffic shaping. For more information, see Available instance bandwidth.

Note: For every byte that ElastiCache writes to the primary node, ElastiCache replicates the same information to all other replicas. When the cluster tries to process the replication backlog, clusters with small node types, multiple replicas, and intensive write requests might have issues. This backlog can lead to high NetworkBandwidthOutAllowanceExceeded values on the primary nodes.

To determine what caused a spike in metrics on the application side, look for commands that operate on multiple keys. This includes the following examples:

  • MGET
  • MSET
  • HGETALL

If you work with multiple large keys, such as large JSON objects or hash values, then you might exceed the bandwidth limits of your node type. ElastiCache then either drops the excess traffic or adds it to a queue based on the current load.
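One application-side mitigation is to split large multi-key calls into smaller batches so that no single response exceeds the node's bandwidth in one burst. The following is a minimal Python sketch; the chunk size of 100 and the client call are illustrative assumptions, not ElastiCache requirements:

```python
from typing import Iterator

def chunked(keys: list, size: int) -> Iterator[list]:
    """Yield fixed-size batches of keys so that each MGET stays small."""
    for i in range(0, len(keys), size):
        yield keys[i:i + size]

keys = [f"user:{n}" for n in range(250)]
batches = list(chunked(keys, 100))

# Instead of one MGET with 250 keys, send three MGETs of at most 100 keys:
# for batch in batches:
#     client.mget(batch)   # hypothetical Valkey/Redis client call
print([len(b) for b in batches])  # [100, 100, 50]
```

Smaller batches spread the same traffic over more responses, which smooths the bursts that trigger the bandwidth allowance metrics.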

NetworkPacketsPerSecondAllowanceExceeded

If this metric is a value other than 0, then the network usage on the underlying cache nodes crossed the packets per second (PPS) limit. This limit is specific to the node type that you use. ElastiCache drops or queues the excess packets that exceed the node limit.

For applications that drive high queries per second (QPS) of small requests, the node might cross the PPS limits. To determine the rate of command executions, use CloudWatch metrics for each Command data type. For more information, see Metrics for Valkey and Redis OSS.
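For example, you can convert a per-period command count, such as the Sum statistic of the GetTypeCmds metric, into queries per second. The following is a minimal Python sketch with an assumed sample value:

```python
def commands_per_second(command_count: int, period_seconds: int = 60) -> float:
    """Convert a per-period command count (for example, the Sum of
    GetTypeCmds over one CloudWatch period) into queries per second."""
    return command_count / period_seconds

# Hypothetical sample: 9,000,000 GetTypeCmds in a 60-second period.
qps = commands_per_second(9_000_000, 60)
print(f"{qps:,.0f} QPS")  # 150,000 QPS
```

Compare the resulting rate across the command-type metrics to see which operations drive most of the packet volume.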

To resolve this issue, you can temporarily scale up the cluster to a bigger node type. If the operations are read-heavy, then you can add more read replicas to the cluster or shard to spread the load. For cluster mode enabled (CME) clusters, if the operations are write-heavy, then add more shards to scale out the cluster.

Note: For cluster mode disabled (CMD) clusters, you must move your cluster to a bigger node to scale write operations.

NetworkConntrackAllowanceExceeded

If this metric is a value other than 0, then ElastiCache exceeded the maximum number of connections tracked across all of the node's security groups. After you reach the connection limit, new connections fail until ElastiCache closes the existing connections. For more information, see Amazon Elastic Compute Cloud (Amazon EC2) security group connection tracking.

When workloads create a large number of network connections that aren't properly closed, the connections remain open and idle on the server side and lead to conntrack exhaustion. Then, ElastiCache can't accept new connections until the existing connections close.

To resolve this issue in ElastiCache, use the timeout parameter to review and modify idle connection timeout. Configure the parameter for the ElastiCache server to close connections after an idle period.
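For example, you might set the timeout parameter in a custom parameter group with the AWS CLI. The following is a sketch that assumes your cluster uses a custom parameter group; the group name and the 300-second value are placeholders:

```shell
# Close server-side connections that stay idle for more than 300 seconds.
# "my-custom-params" is a placeholder parameter group name.
aws elasticache modify-cache-parameter-group \
  --cache-parameter-group-name my-custom-params \
  --parameter-name-values "ParameterName=timeout,ParameterValue=300"
```

The default parameter groups are read-only, so create a custom parameter group and associate it with the cluster before you modify the parameter.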

To resolve this issue on the application side, use the connection pool settings to reuse existing connections and reduce the rate of NewConnections. For more information, see Best practices: Redis clients and Amazon ElastiCache for Redis.
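The pooling idea can be sketched in plain Python. Real clients such as redis-py ship their own pool implementation; the connection factory and pool size here are illustrative assumptions:

```python
import queue

class ConnectionPool:
    """Minimal illustration of client-side pooling: reuse a fixed set
    of connections instead of opening a new one for each request."""

    def __init__(self, factory, size: int):
        self._pool = queue.Queue(maxsize=size)
        for _ in range(size):
            self._pool.put(factory())    # open connections up front

    def acquire(self):
        return self._pool.get()          # block until a connection is free

    def release(self, conn):
        self._pool.put(conn)             # return for reuse, don't close

# Hypothetical connection factory that records each "connection" it opens.
created = []
pool = ConnectionPool(lambda: created.append(object()) or created[-1], size=4)

for _ in range(100):                     # 100 requests...
    conn = pool.acquire()
    pool.release(conn)

print(len(created))  # 4 -- NewConnections stays flat under load
```

Because every request reuses one of the four pooled connections, the NewConnections rate stays constant no matter how many requests the application sends.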

Note: Both the timeout parameter on the server side and the connection pool settings on the client side affect how connections are established and closed. Load test your application to determine the optimal configuration for both settings.

Related information

Monitoring use with CloudWatch Metrics

Scaling ElastiCache

Network

Redis OSS 2.6.13 parameters

Amazon EC2 instance-level network performance metrics uncover new insights

AWS OFFICIAL · Updated 10 months ago