AWS Batch - CloudWatch Logs operation error

Hello,

Does anyone know why AWS Batch jobs have been failing to start since around 6 AM UTC this morning in the Frankfurt region?

I get this as the reason in AWS Batch job attempt details:

Reason

CannotStartContainerError: Error response from daemon: failed to create task for container: 
failed to initialize logging driver: failed to create Cloudwatch log stream: operation error 
CloudWatch Logs: CreateLogStream, https response error StatusCode: 400
failed to create Cloudwatch log stream: operation error

Any ideas what operation error means?

There is no information about what exactly is wrong. I can create a CloudWatch log stream manually or from ECS, so it's not a quota limit issue. Only AWS Batch jobs get the above 400 error about CloudWatch Logs.

Thank you for any help.

Radu
asked 2 months ago, 351 views
2 Answers
Accepted Answer

The problem was the underlying EC2 instance, which had stopped operating normally for some reason. After I terminated it and started a new instance for the Batch cluster, the jobs started running again, and the log streams were created and populated with events.
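
For anyone who wants to script this fix, here is a minimal Python sketch, assuming boto3 is installed and the compute environment is a managed EC2 one. The compute environment name `my-batch-ce` and the helper names are hypothetical. Treating a disconnected ECS agent as the sign of a bad instance is also an assumption, since the answer does not say how the instance was identified.

```python
def find_suspect_instances(batch, ecs, compute_env_name):
    """Return EC2 instance IDs in the compute environment whose ECS agent is disconnected.

    Assumption: a disconnected agent marks the unhealthy instance.
    Only the first page of container instances is inspected.
    """
    envs = batch.describe_compute_environments(computeEnvironments=[compute_env_name])
    cluster_arn = envs["computeEnvironments"][0]["ecsClusterArn"]
    arns = ecs.list_container_instances(cluster=cluster_arn)["containerInstanceArns"]
    if not arns:
        return []
    described = ecs.describe_container_instances(
        cluster=cluster_arn, containerInstances=arns
    )
    return [
        ci["ec2InstanceId"]
        for ci in described["containerInstances"]
        if not ci["agentConnected"]
    ]


def terminate(ec2, instance_ids):
    """Terminate the given instances (no-op for an empty list) and return the IDs."""
    if instance_ids:
        ec2.terminate_instances(InstanceIds=instance_ids)
    return instance_ids


def main():
    import boto3  # imported lazily so the helpers above can be tested without it

    region = "eu-central-1"  # Frankfurt
    batch = boto3.client("batch", region_name=region)
    ecs = boto3.client("ecs", region_name=region)
    ec2 = boto3.client("ec2", region_name=region)
    # "my-batch-ce" is a hypothetical compute environment name.
    print(terminate(ec2, find_suspect_instances(batch, ecs, "my-batch-ce")))
```

Calling `main()` terminates every flagged instance, so it is worth printing the result of `find_suspect_instances` first. A managed compute environment should then launch a replacement when jobs are waiting in the queue.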

Radu
answered 2 months ago
EXPERT
reviewed a month ago

It sounds like the issue may be related to throttling errors when creating CloudWatch log streams from AWS Batch jobs. Can you check the following?

  • Verify that the IAM role used by the AWS Batch job has permissions to create log streams in CloudWatch Logs. On EC2 compute environments the awslogs driver normally runs with the container instance role, so check that role as well.
  • Check whether you are hitting the CloudWatch Logs throttling limits. CloudWatch Logs applies per-account, per-Region request quotas, including on CreateLogStream calls. Many Batch jobs creating log streams concurrently could exceed them.
  • Try setting a logging retries option in the AWS Batch job definition, so that the job retries creating log streams on throttling errors before failing:
"containerProperties": {
  "logConfiguration": {
    "logDriver": "awslogs",
    "options": {
      "awslogs-group": "/aws/batch/job",
      "awslogs-region": "us-east-1",
      "awslogs-stream-prefix": "job",
      "awslogs-create-group": "true",
      "awslogs-retries": "5"
    }
  }
}
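
To tell a permissions problem from throttling, one option is to repeat the failing call and read the full error code. Below is a minimal Python sketch, assuming boto3 is installed and the script runs on the affected container instance, so that it picks up the same credentials as the Docker daemon (an assumption about the setup). The helper name `probe_create_log_stream` and the `probe-` stream prefix are made up for illustration.

```python
import uuid


def probe_create_log_stream(logs, group_name):
    """Try CreateLogStream once and return (error_code, message), or ("OK", "")."""
    stream_name = f"probe-{uuid.uuid4()}"  # throwaway stream name
    try:
        logs.create_log_stream(logGroupName=group_name, logStreamName=stream_name)
    except Exception as exc:
        # botocore's ClientError carries the service error in exc.response["Error"].
        error = getattr(exc, "response", {}).get("Error", {})
        return error.get("Code", type(exc).__name__), error.get("Message", str(exc))
    # Clean up the probe stream if the call succeeded.
    logs.delete_log_stream(logGroupName=group_name, logStreamName=stream_name)
    return "OK", ""


def main():
    import boto3  # imported lazily so the helper above can be tested without it

    logs = boto3.client("logs", region_name="eu-central-1")  # Frankfurt
    print(probe_create_log_stream(logs, "/aws/batch/job"))
```

A result such as `AccessDeniedException` would point at the role, `ThrottlingException` at request quotas, and `ResourceNotFoundException` at a missing log group. If the probe returns `OK` while jobs still fail, the instance itself becomes the more likely suspect.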

EXPERT
answered 2 months ago
    • It is not a permissions error, because a few jobs do still manage to start successfully.

    • How can I see whether I'm hitting the CloudWatch Logs throttling limits? Also, as far as I know, throttling errors have a specific error message rather than just "operation error", something like ThrottlingException: Rate exceeded, status code: 400.

    • It looks like you can't set retries for the awslogs log driver: Log driver awslogs disallows options: awslogs-retries
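
On what "operation error" means: as far as I can tell, it is the generic prefix the AWS SDK for Go v2 puts in front of a failed call, followed by the service, the operation, the HTTP status and, when the message is not cut off, a request ID and an error code. The Batch reason above appears to be truncated before the error code. The Python sketch below pulls those fields out of a reason string. The function name `parse_operation_error` is hypothetical, and the regular expression is based on my understanding of the message layout, not on a documented format.

```python
import re

# Assumed layout: "operation error <service>: <operation>, https response error
# StatusCode: <n>[, RequestID: <id>][, <ErrorCode>: <message>]"
PATTERN = re.compile(
    r"operation error (?P<service>[^:]+): (?P<operation>\w+), "
    r"https response error StatusCode: (?P<status>\d+)"
    r"(?:, RequestID: (?P<request_id>[^,\s]+))?"
    r"(?:, (?:api error )?(?P<code>\w+): (?P<message>.*))?"
)


def parse_operation_error(reason):
    """Return the parsed fields as a dict, or None if the text does not match."""
    flat = " ".join(reason.split())  # undo line wrapping in the console output
    match = PATTERN.search(flat)
    if match is None:
        return None
    fields = match.groupdict()
    fields["status"] = int(fields["status"])
    return fields


reason = """CannotStartContainerError: Error response from daemon: failed to create task for container:
failed to initialize logging driver: failed to create Cloudwatch log stream: operation error
CloudWatch Logs: CreateLogStream, https response error StatusCode: 400
failed to create Cloudwatch log stream: operation error"""

print(parse_operation_error(reason))
```

For the reason text in the question, `code` and `message` come back as `None`, which is why the message says nothing about throttling or permissions either way.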
