AWS Batch - CloudWatch Logs operation error


Hello,

Does anyone know why AWS Batch jobs in the Frankfurt region (eu-central-1) have been unable to start since around 6 AM UTC this morning?

I get this as the reason in the AWS Batch job attempt details:

Reason

CannotStartContainerError: Error response from daemon: failed to create task for container: 
failed to initialize logging driver: failed to create Cloudwatch log stream: operation error 
CloudWatch Logs: CreateLogStream, https response error StatusCode: 400
failed to create Cloudwatch log stream: operation error

Any ideas what "operation error" means?

There is no information about what exactly is wrong. I can create a CloudWatch log stream manually or from ECS, so it's not a quota issue. Only AWS Batch jobs get the 400 error above from CloudWatch Logs.

Thank you for any help.

Radu
asked 2 months ago, 370 views
2 Answers
Accepted Answer

The problem was the underlying EC2 instance, which had stopped operating normally for some reason. After I terminated it and started a new instance for the Batch cluster, the jobs started running again and the log streams were created and populated with events.
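
One way to spot a bad instance like this is to check whether the failed attempts cluster on a single container instance. The following is only a minimal sketch, not what was actually run here: it assumes job records shaped like the "jobs" list returned by the Batch DescribeJobs API (each attempt carrying container.reason and container.containerInstanceArn), and the sample ARNs and job names are hypothetical.

```python
from collections import Counter


def failed_attempts_by_instance(jobs):
    """Count attempts that failed with CannotStartContainerError,
    grouped by the container instance ARN they were placed on."""
    counts = Counter()
    for job in jobs:
        for attempt in job.get("attempts", []):
            container = attempt.get("container", {})
            reason = container.get("reason", "")
            if "CannotStartContainerError" in reason:
                arn = container.get("containerInstanceArn", "unknown")
                counts[arn] += 1
    return counts


# Hypothetical ARNs, for illustration only.
INSTANCE_A = "arn:aws:ecs:eu-central-1:111122223333:container-instance/example-a"
INSTANCE_B = "arn:aws:ecs:eu-central-1:111122223333:container-instance/example-b"

# Hypothetical sample data in the shape of a DescribeJobs response.
sample_jobs = [
    {"jobName": "job-1", "attempts": [{"container": {
        "containerInstanceArn": INSTANCE_A,
        "reason": "CannotStartContainerError: Error response from daemon"}}]},
    {"jobName": "job-2", "attempts": [{"container": {
        "containerInstanceArn": INSTANCE_A,
        "reason": "CannotStartContainerError: Error response from daemon"}}]},
    {"jobName": "job-3", "attempts": [{"container": {
        "containerInstanceArn": INSTANCE_B}}]},
]

# Instances with the most start failures are listed first.
for arn, failures in failed_attempts_by_instance(sample_jobs).most_common():
    print(failures, arn)
```

If nearly all failures map to one ARN, replacing that instance, as described above, is a reasonable first step.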

Radu
answered 2 months ago
EXPERT
reviewed 2 months ago

It sounds like the issue is related to throttling errors when creating CloudWatch log streams from AWS Batch jobs. Can you check the following?

  • Verify that the IAM role used by the AWS Batch job has permission to create log streams in CloudWatch Logs.
  • Check whether you are hitting the CloudWatch Logs throttling limits. CloudWatch Logs applies per-account request-rate quotas to API calls such as CreateLogStream, so Batch jobs that create log streams concurrently could exceed them.
  • Try increasing the logging retries configuration in the AWS Batch job definition. This will make the job retry creating log streams on throttling errors before failing.
"containerProperties": {
  "logConfiguration": {
    "logDriver": "awslogs",
    "options": {
      "awslogs-group": "/aws/batch/job",
      "awslogs-region": "us-east-1",
      "awslogs-stream-prefix": "job",
      "awslogs-create-group": "true",
      "awslogs-retries": "5"
    }
  }
}

EXPERT
answered 2 months ago
    • It is not a permissions error, because a few jobs do manage to start successfully.

    • How can I check whether I'm hitting the CloudWatch Logs throttling limits? Also, as far as I know, throttling errors have a specific error message rather than a generic "operation error", something like: ThrottlingException: Rate exceeded status code: 400.

    • It looks like you can't set retries for the awslogs log driver: Log driver awslogs disallows options: awslogs-retries
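
To make the second point concrete, here is a rough sketch of how the reason string from a job attempt could be checked for known markers. It is an illustration under assumptions, not an AWS-provided classifier: the function name and category labels are made up, and it relies on throttling and permission failures containing the service error names ThrottlingException and AccessDeniedException in the message.

```python
def classify_start_failure(reason):
    """Roughly bucket a job attempt's reason string by the marker it contains."""
    if "ThrottlingException" in reason or "Rate exceeded" in reason:
        return "throttled"
    if "AccessDeniedException" in reason:
        return "permissions"
    if "CreateLogStream" in reason:
        # A log stream failure without a named service error, as in this thread.
        return "unclassified_log_stream_error"
    return "other"


# The reason reported in the question, joined onto one line.
reported = (
    "CannotStartContainerError: Error response from daemon: "
    "failed to create task for container: "
    "failed to initialize logging driver: "
    "failed to create Cloudwatch log stream: operation error "
    "CloudWatch Logs: CreateLogStream, https response error StatusCode: 400"
)

print(classify_start_failure(reported))
print(classify_start_failure("ThrottlingException: Rate exceeded status code: 400"))
```

The reported reason names neither exception, so the message alone does not point to throttling or to IAM permissions.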
