2 Answers
1
The problem was the underlying EC2 instance: it had stopped operating normally for some reason. After terminating it and starting a new instance for the Batch cluster, the jobs started running again, and the log streams were created and populated with events.
answered 2 months ago
0
It sounds like the issue is related to throttling errors when creating CloudWatch log streams from AWS Batch jobs. Can you check the following?
- Verify that the IAM role used by the AWS Batch job has permission to create log streams in CloudWatch Logs (`logs:CreateLogStream` and `logs:PutLogEvents`).
- Check whether you are hitting the CloudWatch Logs throttling limits. Each AWS account has quotas on log-event ingestion and on CloudWatch Logs API calls; many Batch jobs creating log streams concurrently could exceed them.
- Try increasing the logging retries configuration in the AWS Batch job definition, so the job retries creating log streams on throttling errors before failing.
```json
"containerProperties": {
  "logConfiguration": {
    "logDriver": "awslogs",
    "options": {
      "awslogs-group": "/aws/batch/job",
      "awslogs-region": "us-east-1",
      "awslogs-stream-prefix": "job",
      "awslogs-create-group": "true",
      "awslogs-retries": "5"
    }
  }
}
```
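If retrying at the log-driver level turns out not to be possible, throttling can still be absorbed wherever your own code calls the CloudWatch Logs API directly (for example, a job that creates its own log streams with boto3). A minimal retry-with-backoff sketch, assuming the usual AWS throttling error names; `retry_on_throttle` is a hypothetical helper, not part of any AWS SDK:

```python
import random
import time

def retry_on_throttle(fn, max_attempts=5, base_delay=0.5, sleep=time.sleep):
    """Call fn(), retrying with exponential backoff and jitter when the
    raised exception's text looks like an AWS throttling error."""
    throttling_markers = ("ThrottlingException", "Throttling", "TooManyRequestsException")
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception as exc:
            last_try = attempt == max_attempts - 1
            if last_try or not any(m in str(exc) for m in throttling_markers):
                raise  # not a throttle, or out of attempts: propagate
            # Full-jitter exponential backoff: sleep 0..base_delay * 2^attempt seconds.
            sleep(random.uniform(0, base_delay * (2 ** attempt)))
```

With boto3 this would wrap a call such as `retry_on_throttle(lambda: logs.create_log_stream(logGroupName="/aws/batch/job", logStreamName=name))`.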
It's not a permission error, because a few jobs do manage to start successfully.
How can I see if I'm hitting the CloudWatch Logs throttling limits? Also, as far as I know, throttling errors carry a specific error message rather than a generic `operation error`, something like `ThrottlingException: Rate exceeded, status code: 400`.
It also looks like you can't set retries for the awslogs log driver: `Log driver awslogs disallows options: awslogs-retries`.
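The point about the error message can be made concrete: with boto3, a throttled call raises `botocore.exceptions.ClientError`, and the structured error code on `exc.response` distinguishes throttling from, say, a permissions failure. A small sketch using hand-built dicts in the shape botocore attaches to `ClientError` (the helper name and the example payloads are mine):

```python
def is_throttling_error(error_response):
    """True if a parsed AWS error response (the dict found on
    botocore's ClientError.response) indicates request throttling."""
    code = error_response.get("Error", {}).get("Code", "")
    return code in {"ThrottlingException", "Throttling", "TooManyRequestsException"}

# Hand-built examples in the ClientError.response shape:
throttled = {"Error": {"Code": "ThrottlingException", "Message": "Rate exceeded"},
             "ResponseMetadata": {"HTTPStatusCode": 400}}
denied = {"Error": {"Code": "AccessDeniedException",
                    "Message": "not authorized to perform logs:CreateLogStream"},
          "ResponseMetadata": {"HTTPStatusCode": 400}}
```

In a real `except ClientError as exc` handler, `is_throttling_error(exc.response)` would make the split; an `AccessDeniedException` would point back at IAM instead.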