
CodeBuild GitHub runners are randomly failing

2

This is about the self-hosted runner integration in CodeBuild: https://docs.aws.amazon.com/codebuild/latest/userguide/action-runner.html

This worked well for a month or so.

For the past several days, most of the workflow runs have been getting stuck because the webhooks are not picked up. Sometimes a CB execution does get triggered and runs, but no logs appear in GH; the CB build then runs for a long time and eventually fails, wasting CI minutes.

This only started a few days ago, and we did not change anything on our end.

The only thing I can think of is a bad release that introduced a bug, which seems very likely given the recent announcement of support for org webhook events. The problem started the same day or the day before.

https://aws.amazon.com/about-aws/whats-new/2024/06/aws-codebuild-organization-global-github-webhooks/

Is anyone else experiencing this?

asked 2 years ago, 1.2K views
6 Answers
1

My issue was because I had 3 jobs in the same workflow, but I was explicitly defining an instance size for 2 of them and relying on the default for 1 of them.

So my workflow file was

jobs:
  job-1:
    runs-on:
       - codebuild-myproject--${{ github.run_id }}-${{ github.run_attempt }}
       - instance-size:large
  job-2:
    runs-on:
       - codebuild-myproject--${{ github.run_id }}-${{ github.run_attempt }}
       - instance-size:large
  job-3:
    runs-on:
       - codebuild-myproject--${{ github.run_id }}-${{ github.run_attempt }}

My default instance size was set to small, so 2 large instances and 1 small instance would get started. Sometimes job 3 would take a large instance, and then job 1 or job 2 would get stuck because the only instance still available was the small one.

I was able to solve it by explicitly listing an instance size for each job in the workflow.

jobs:
  job-1:
    runs-on:
       - codebuild-myproject--${{ github.run_id }}-${{ github.run_attempt }}
       - instance-size:large
  job-2:
    runs-on:
       - codebuild-myproject--${{ github.run_id }}-${{ github.run_attempt }}
       - instance-size:large
  job-3:
    runs-on:
       - codebuild-myproject--${{ github.run_id }}-${{ github.run_attempt }}
       - instance-size:small
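
To catch this before pushing, a small check along the following lines could flag jobs that rely on the default size. This is only a sketch: `jobs_missing_instance_size` is a hypothetical helper, not part of any AWS or GitHub tooling, and it assumes the `jobs` section has already been parsed into a dict and that `runs-on` is either a string or a list of strings. It only recognizes the separate `instance-size:` label form used above.

```python
# Hypothetical helper (not an AWS or GitHub API): list the jobs that use a
# CodeBuild runner label but do not pin an explicit instance-size label.
def jobs_missing_instance_size(jobs):
    missing = []
    for name, job in jobs.items():
        labels = job.get("runs-on", [])
        if isinstance(labels, str):
            # A single label may be written as a plain string.
            labels = [labels]
        uses_codebuild = any(label.startswith("codebuild-") for label in labels)
        has_size = any(label.startswith("instance-size:") for label in labels)
        if uses_codebuild and not has_size:
            missing.append(name)
    return missing


# The first workflow from this answer, written out as already-parsed data.
RUNNER = "codebuild-myproject--${{ github.run_id }}-${{ github.run_attempt }}"
jobs = {
    "job-1": {"runs-on": [RUNNER, "instance-size:large"]},
    "job-2": {"runs-on": [RUNNER, "instance-size:large"]},
    "job-3": {"runs-on": [RUNNER]},
}

print(jobs_missing_instance_size(jobs))  # prints ['job-3']
```

Any job it reports is one that would fall back to the project's default instance size.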
answered a year ago
  • Thank you! This seems to be my case and the suggestion seems to help!

1

Can someone from the AWS CodeBuild service team please look into this?

I can reproduce this bug consistently.

Here's a repro case:

  • Set up a GitHub Actions workflow that triggers two jobs at the same time via a matrix
  • Use CodeBuild-based runners
  • Trigger a run, for example by pushing a commit

What happens then:

  • One of the jobs gets picked up and runs
  • The second job gets stuck "Waiting for a runner to pick up this job... " and it never completes
  • CodeBuild UI meanwhile shows no running jobs

My hunch is that there's a queue that receives these webhook events, and it probably deduplicates them by something like a project name or a run ID, neither of which is unique per job.

answered 2 years ago
0
Accepted Answer

Ok, thanks for confirming, AWS. I know when AWS says nothing, that means the bug is there, and they are working to fix it. 👍

answered 2 years ago
EXPERT
reviewed 2 years ago
0

I think I found the issue.

In the Webhook request history in GitHub I see:

{"message":"Cannot have more than 1 builds in queue for the account"}

Which is really weird, given that I did not make any changes on my account, and I have had more than 2 jobs running in parallel before.

And all of the quotas are set to 15.

answered 2 years ago
  • For this issue, you are likely hitting instance limits for your account. You will probably need to open a support ticket with CodeBuild to increase your account-level limits.

0

Yep, I started experiencing the same issue yesterday. I think it started happening the moment I introduced multiple GHA jobs starting concurrently. I see jobs starting in GH and runners starting in CB, but then it looks like some of the jobs won't ever get connected to the runner. Not seeing any issues in the Webhook request history - looks like 200s everywhere and yet the builds get stuck.

answered a year ago
0

Just started experimenting with CodeBuild GitHub Runners and saw this issue immediately. We do have multiple jobs that get started concurrently and one of them will always randomly hang with "Waiting for a runner to pick up this job..." in GitHub and in CodeBuild it will be stuck on "Listening for Jobs".

answered a year ago
