EC2 outbound bandwidth dropped on July 21st and stays very low (eu-west-1)

Hello,

In our organization, we run our CI/CD on GitLab. We spawn runners on demand, and each runner is an EC2 instance. Since July 21st, uploads performed by these runners take about 10 times longer than before.

From our tests, we see that:

  • outbound bandwidth from any instance type and any AMI we can spawn in eu-west-1 is now <2.5Mb/s (~300KB/s),
    • we tested t3.medium and m6a.large,
    • we tested various AMIs (ubuntu, amazon linux),
    • this seems far from any quota/limit/advertised bandwidth,
  • before July 21st, it was around 25Mb/s (~3MB/s) for t3.medium runners,
  • download speed is 500Mb/s,
  • upload/download target is the GitLab server host itself,
    • located in Paris,
    • can be reached through 2 different physical connections and we get very good speed from other places,
  • all speed tests mentioned here are performed using iperf3,
  • runners are in a VPC, routed through an Internet Gateway,
  • if we spawn a runner in eu-west-3, outbound speed is 24Mb/s (and download is 1Gb/s),
    • that's not great, but at least it’s 10 times what we get in eu-west-1 (and close to what we had before “the incident”),
  • we haven’t changed anything in our configuration, runners, GitLab version, or code. The problem appeared suddenly for pipelines run after July 21st, 4pm CEST.
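
As a sanity check on the units quoted above, here is a minimal sketch of the conversion between megabits per second (what iperf3 reports) and kilobytes per second, plus the resulting upload time. Decimal units are assumed, and the 100 MB artifact size is a hypothetical figure chosen for illustration, not one of our measurements.

```python
def mbit_to_kbyte_per_s(mbit_per_s: float) -> float:
    """Convert megabits per second to kilobytes per second (decimal units)."""
    return mbit_per_s * 1_000_000 / 8 / 1_000


def transfer_seconds(size_mbyte: float, mbit_per_s: float) -> float:
    """Time in seconds to move size_mbyte megabytes at mbit_per_s megabits per second."""
    return size_mbyte * 8 / mbit_per_s


if __name__ == "__main__":
    # Rates from the list above: current eu-west-1 ceiling vs. the earlier t3.medium figure.
    for rate in (2.5, 25.0):
        print(rate, "Mb/s =", mbit_to_kbyte_per_s(rate), "KB/s")
        # 100 MB is a hypothetical artifact size.
        print("  100 MB upload takes", transfer_seconds(100, rate), "s")
```

With these definitions, 2.5Mb/s works out to 312.5KB/s, which is consistent with the ~300KB/s figure.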

We may consider spawning our runners in eu-west-3, or even moving our GitLab server to EC2. However, since 1) the problem appeared suddenly without any action on our side and 2) either option would also require moving other resources, with the associated time, effort and cost, we’d prefer a logical explanation of what’s happening here before taking action.

We are looking for clues to understand:

  • how to monitor this in order to identify whether we are actually hitting a limit/quota,
  • why this changed suddenly,
  • why outbound bandwidth is that low (is that to be expected, or not).
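
For the monitoring point, one possible approach is to read the allowance counters that the ENA driver reports through `ethtool -S` on the instance; a growing `bw_out_allowance_exceeded` would suggest that the instance-level bandwidth allowance is being hit. The sketch below assumes the AMI ships an ENA driver recent enough to expose these counters and that `ethtool` is installed. The helper names are hypothetical, and the interface name has to be supplied by the caller (it varies by AMI).

```python
import subprocess

# Allowance counters reported by the ENA driver (availability depends on driver version).
ALLOWANCE_COUNTERS = (
    "bw_in_allowance_exceeded",
    "bw_out_allowance_exceeded",
    "pps_allowance_exceeded",
    "conntrack_allowance_exceeded",
    "linklocal_allowance_exceeded",
)


def parse_allowance_counters(ethtool_output: str) -> dict[str, int]:
    """Extract the allowance counters from the text printed by `ethtool -S`."""
    counters = {}
    for line in ethtool_output.splitlines():
        name, sep, value = line.strip().partition(":")
        if sep and name in ALLOWANCE_COUNTERS:
            counters[name] = int(value.strip())
    return counters


def read_allowance_counters(interface: str) -> dict[str, int]:
    """Run `ethtool -S <interface>` and return the allowance counters it reports."""
    result = subprocess.run(
        ["ethtool", "-S", interface],
        capture_output=True, text=True, check=True,
    )
    return parse_allowance_counters(result.stdout)


def counters_that_grew(before: dict[str, int], after: dict[str, int]) -> dict[str, int]:
    """Return the increase of every counter that went up between two samples."""
    return {
        name: after[name] - before.get(name, 0)
        for name in after
        if after[name] > before.get(name, 0)
    }
```

A possible use is to call `read_allowance_counters` once before and once after an iperf3 upload and pass both samples to `counters_that_grew`. An empty result would indicate that none of these instance-level allowances was exceeded during the test, which would point away from an EC2 instance limit.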

Any help investigating would be greatly appreciated!

Best,

EDIT: adding MTR

HOST: runner-5nsscsoo-project-32- Loss%   Snt   Last   Avg  Best  Wrst StDev
  1.|-- ip-172-17-0-1.eu-west-1.c  0.0%    10    0.0   0.1   0.0   0.1   0.0
  2.|-- ec2-3-248-240-211.eu-west  0.0%    10    4.0  11.9   0.7  72.0  21.3
  3.|-- 240.1.116.13               0.0%    10    0.3   0.3   0.3   0.5   0.1
  4.|-- 240.1.116.30               0.0%    10    0.3   0.3   0.2   0.3   0.0
  5.|-- 240.1.116.8                0.0%    10    0.3   0.3   0.3   0.3   0.0
  6.|-- 240.1.108.8                0.0%    10    0.3   0.3   0.3   0.4   0.0
  7.|-- 240.1.108.22               0.0%    10    0.3   0.3   0.3   0.4   0.0
  8.|-- 240.1.108.3                0.0%    10    0.3   0.3   0.3   0.4   0.0
  9.|-- 242.3.201.129              0.0%    10    0.7   0.7   0.3   2.1   0.5
 10.|-- 100.95.19.135              0.0%    10    1.1   3.0   0.2  16.0   5.4
 11.|-- 100.100.20.32              0.0%    10    0.5   2.7   0.3  13.2   4.0
 12.|-- 100.91.209.9               0.0%    10   11.0  12.9  10.9  25.0   4.5
 13.|-- 100.100.6.57               0.0%    10  119.7  21.6  10.5 119.7  34.5
 14.|-- 100.100.81.134             0.0%    10   12.1  10.7  10.5  12.1   0.5
 15.|-- 100.100.81.133             0.0%    10   11.4  14.4  11.3  41.4   9.5
 16.|-- 100.100.4.104              0.0%    10   11.5  11.9  11.4  14.8   1.0
 17.|-- 195.66.226.94              0.0%    10   17.9  17.8  17.7  17.9   0.1
 18.|-- cbv-pa7-n5k2-core-01.cele  0.0%    10   17.7  17.7  17.6  17.8   0.1
 19.|-- par-gsp-n5k1-pe-01.celest  0.0%    10   16.9  16.8  16.7  17.0   0.1
 20.|-- REDACTED.in-addr.a         0.0%    10   16.3  16.4  16.3  16.5   0.0
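
The report above shows no packet loss over 10 probes and an average round-trip time of about 16.4 ms to the server. As a rough way to relate that RTT to the measured rates, the sketch below computes the bandwidth-delay product, i.e. how many bytes a single stream keeps in flight at a given throughput. This only restates the measurements in another form and assumes one TCP stream; it does not identify a cause.

```python
def bytes_in_flight(mbit_per_s: float, rtt_ms: float) -> float:
    """Bandwidth-delay product in bytes for a given rate and round-trip time."""
    bits_per_second = mbit_per_s * 1_000_000
    rtt_seconds = rtt_ms / 1_000
    return bits_per_second * rtt_seconds / 8


if __name__ == "__main__":
    rtt_ms = 16.4  # average RTT of the last hop in the MTR report
    for rate in (2.5, 25.0):  # current and previous upload rates in Mb/s
        print(f"{rate} Mb/s at {rtt_ms} ms RTT: about {bytes_in_flight(rate, rtt_ms):.0f} bytes in flight")
```

At 2.5Mb/s and 16.4 ms this is roughly 5 KB in flight, against roughly 51 KB at 25Mb/s.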

  • Are the servers in a public subnet or in a private subnet behind a NAT? If it's a NAT backed by an instance, is that showing any issues? If it's a T class, does it still have compute credits left? What does an MTR look like for the remote host you are testing against? Large hop times, packet loss, high RTT?

  • Thanks for your answer.

    The subnet is configured to auto-assign a public IPv4 address, and uses an IGW. Does that answer your 1st question?

    The CPU credit balance is 0, but we only have a single data point. The instances are spawned on demand by GitLab, though, so how can they start at 0? Note that we had the same issue with a fresh m6a.large instance (which has no compute credits, right?).

    MTR in next comment

  • New info: the credit balance increases with time, but the upload is still slow (no change at all).

  • It's impossible to post an MTR in comments (blocked by re:Post), so I've added it at the end of the question. Note: it's the MTR from the instance to the server.

No Answers
