ElastiCache shows network in and out as exceeded, but how?


I have a small Redis instance running in ElastiCache with one shard and cluster mode off.

It is showing that we have exceeded our network bandwidth in and out:

[Chart image not preserved]

It also shows that we're barely using our network, with peaks below 30Mbps:

[Chart image not preserved]

That's not much bandwidth that we're using, but to confirm what we have available, I went to the pricing page, which says that my instance (cache.t4g.small) provides "Up to 5 Gigabit" of network performance. Seems like 30Mbps is a lot less than 5Gbps!

Is this an error in the metrics reporting or am I missing something? Should I be worried about the exceeded bandwidth problems? If so, I don't have many options. The next bigger instance that provides more network performance is a cache.m6g.large, which costs about 10× what I'm paying now!

I'm not doing anything crazy with this cache. I'm surprised I'm running into all this.

Edit, after scale-up

Per the comments, I scaled up ElastiCache to the next larger size, which has twice the baseline network capacity. I hoped this would remove the exceeded network problem or at least halve it, but it seems to have made no impact at all. Here are the latest charts. The spike in the middle is the scale-up event (when the entire DB was copied to the larger instance), so left of the spike is the small instance and right of it is the larger one:

[Chart image not preserved]

asked 2 years ago, 22196 views
2 Answers

Hello,

I don't see any major issues with the graphs you're sharing here because there are no plateaus in the graphs, they're all spikes. Are you seeing any performance issues with the application that's actively querying these cache clusters or are you just concerned with the metrics themselves?

A few things I'd like to note about the graphs:

  • "Network Bytes In" and "Network Bytes Out" (NetworkBytesIn/NetworkBytesOut) are measured in bytes, not bits, so at the peak of those spikes you're effectively using ~287Mbps out and ~207Mbps in, or ~500Mbps combined. The instance type you're using (cache.t4g.small) can burst up to 5Gbps, but that's not a rate that can be sustained over long periods of time. Per this Available instance bandwidth doc:

Typically, instances with 16 vCPUs or fewer (size 4xlarge and smaller) are documented as having "up to" a specified bandwidth; for example, "up to 10 Gbps". These instances have a baseline bandwidth. To meet additional demand, they can use a network I/O credit mechanism to burst beyond their baseline bandwidth. Instances can use burst bandwidth for a limited time, typically from 5 to 60 minutes, depending on the instance size.

  • The values you're seeing on the "Network Bandwidth In Allowance Exceeded" and "Network Bandwidth Out Allowance Exceeded" or NetworkBandwidthInAllowanceExceeded and NetworkBandwidthOutAllowanceExceeded graphs are individual packets that are dropped due to exceeding the inbound/outbound aggregate rates. Relevant Docs.

Because you're effectively using ~300-500Mbps at the peak of your spikes, some packets may be dropped due to the small size of the individual instance and the supported network throughput baseline. It would be beneficial to either spread the load out over multiple cache.t4g.small instances or upgrade the size of the current instance to a larger instance that supports additional network throughput.
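The bytes-to-bits arithmetic above can be sketched as follows. This is only an illustration: the byte counts are hypothetical values chosen to reproduce the ~287Mbps and ~207Mbps estimates, not readings taken from the actual graphs, and `bytes_to_mbps` is a made-up helper name. It also assumes the chart values are bytes per second; if the chart shows a sum over a longer period (for example 60 seconds), pass that period so the rate is averaged correctly.

```python
def bytes_to_mbps(byte_count, period_seconds=1.0):
    """Convert a byte count observed over period_seconds to megabits per second."""
    return byte_count * 8 / period_seconds / 1_000_000


# Hypothetical peak readings in bytes per second (assumed, not from the real charts).
peak_out = bytes_to_mbps(35_875_000)
peak_in = bytes_to_mbps(25_875_000)

print(f"out: {peak_out:.0f} Mbps, in: {peak_in:.0f} Mbps, "
      f"combined: {peak_out + peak_in:.0f} Mbps")
```

Note that a value that looks like "30M" on a bytes chart is eight times larger once expressed in bits, which is where the gap between "below 30Mbps" and "~300-500Mbps" comes from.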

If you're not seeing any performance degradation of the application that's querying the cluster, then you should be fine in the current configuration.

AWS
EXPERT
Chris_G
answered 2 years ago
  • Thanks, this is very helpful. So now the question is, "What is the baseline bandwidth for various instance types?" I understand that it bursts up to 5Gbps, but if that's true a baseline of ~500Mbps seems off. That's a huge delta but I can't find what the "baseline" actually is. I also have been trying to see if going up to the next instance size (which also has "up to 5Gbps" network) would increase the baseline or not. The docs mention network credits, but there seems to be no way to see them in metrics. Finally, I'm having network problems, but they could be from something else. Thank you!

  • Ah ha! I found the baseline bandwidths here. They really are totally different from the "Up to" amounts, wow. It looks like the baseline for a t4g.small is 0.128Gbps, and for t4g.medium is 0.256Gbps, so hopefully scaling up will help. I just gave that a try. Will know soon if it worked. I'd still love to see a metric of network credits though, if such a thing exists.

  • Sorry to keep commenting. I scaled up the instance and it made no impact at all. I updated my question with the details. Thank you again for your thoughts.
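One possible reading of why the scale-up changed nothing is sketched below. It compares the peak outbound estimate from the answer (~287Mbps) against the two baselines quoted in the comments (0.128Gbps and 0.256Gbps). This is a simplification under stated assumptions: it ignores burst credits entirely, and the way AWS actually accounts for bandwidth is not described in this thread, so treat it as a rough sanity check rather than an explanation of the service's behavior.

```python
# Baselines quoted in the comments above, converted from Gbps to Mbps.
BASELINE_MBPS = {"t4g.small": 128.0, "t4g.medium": 256.0}

# Peak outbound estimate from the answer above (an approximation, not a measurement).
PEAK_OUT_MBPS = 287.0


def over_baseline(peak_mbps, baseline_mbps):
    """Return True when a peak rate is above the instance's baseline bandwidth."""
    return peak_mbps > baseline_mbps


for node_type, baseline in BASELINE_MBPS.items():
    status = "above" if over_baseline(PEAK_OUT_MBPS, baseline) else "within"
    print(f"{node_type}: peak {PEAK_OUT_MBPS:.0f} Mbps is {status} "
          f"the {baseline:.0f} Mbps baseline")
```

Under these numbers the peaks sit above the baseline on both sizes, so doubling the baseline would not necessarily make the exceeded counters go away.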


There are two types of bandwidth limits, baseline limits and burst limits, and both vary by instance type. These limits are imposed by EC2, not ElastiCache. Bandwidth usage is tracked by metrics (bytes in and bytes out are calculated separately), and the service reports the allowance as exceeded based on that utilization. I would reach out to support or your account rep for additional details. Let me know if you have any questions, thanks.
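For anyone who wants to pull these metrics outside the console, here is a minimal sketch of the request parameters for CloudWatch's `get_metric_statistics` call. The helper name `allowance_query` and the cluster ID `my-redis-001` are hypothetical, and the choice of the `Sum` statistic with a 60-second period is an assumption rather than a recommendation from this thread. The block only builds the parameters, so it runs without AWS credentials.

```python
from datetime import datetime, timedelta, timezone


def allowance_query(metric_name, cache_cluster_id, hours=3, period_seconds=60):
    """Build keyword arguments for a CloudWatch get_metric_statistics request."""
    end = datetime.now(timezone.utc)
    return {
        "Namespace": "AWS/ElastiCache",
        "MetricName": metric_name,
        # "my-redis-001" style IDs are placeholders; use your own cluster ID.
        "Dimensions": [{"Name": "CacheClusterId", "Value": cache_cluster_id}],
        "StartTime": end - timedelta(hours=hours),
        "EndTime": end,
        "Period": period_seconds,
        "Statistics": ["Sum"],
    }


params = allowance_query("NetworkBandwidthOutAllowanceExceeded", "my-redis-001")
print(params["MetricName"], params["Period"])

# With boto3 installed and credentials configured, the request could be sent as:
#   boto3.client("cloudwatch").get_metric_statistics(**params)
```

The same helper can be reused for NetworkBandwidthInAllowanceExceeded, NetworkBytesIn, and NetworkBytesOut by changing the metric name.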

AWS
answered 2 years ago
