AWS Batch: Fargate container stuck


Hello,

I have an AWS Batch job running on Fargate, monitored with CloudWatch Container Insights, that gets stuck for more than 12 hours.

Locally, on my laptop's i9, the exact same job takes only 1 hour. The job is configured to use 4 vCPUs and 30 GB of RAM.

Container Insights does not show the container using anywhere near its maximum resources. Memory usage peaks at around 2 GB and then the job gets stuck.

I'm trying to work out whether this is specific to AWS Batch or caused by a combination of my workload and Batch.

Are there any other things I could do to troubleshoot this?

Thanks.

Asked 2 years ago · Viewed 301 times
1 Answer

Use CloudWatch to log what the process is doing and identify the step that causes the performance degradation.
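As a rough sketch of what that logging could look like, assuming the job is a Python program and that the job definition sends container stdout/stderr to CloudWatch Logs (for example through the `awslogs` log driver): the step names, the `run_job` function, and the 600-second watchdog interval below are placeholders, not taken from the question.

```python
import faulthandler
import logging
import sys
import time
from contextlib import contextmanager

# Log to stdout so the container's log driver can forward each line.
logging.basicConfig(
    stream=sys.stdout,
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(message)s",
)
log = logging.getLogger("job")


@contextmanager
def step(name):
    """Log when a named step starts and how long it took when it ends."""
    start = time.monotonic()
    log.info("START %s", name)
    try:
        yield
    finally:
        log.info("END %s after %.1fs", name, time.monotonic() - start)


def run_job():
    # Hypothetical steps; replace with the real stages of the workload.
    with step("load_input"):
        data = list(range(1000))
    with step("process"):
        total = sum(data)
    return total


if __name__ == "__main__":
    # If the process is still alive after 600 s, dump every thread's stack
    # to stderr, and keep doing so at the same interval while it runs.
    faulthandler.dump_traceback_later(600, repeat=True, file=sys.stderr)
    run_job()
    faulthandler.cancel_dump_traceback_later()
```

The last `START` line without a matching `END` shows which step is hanging, and the periodic stack dump shows where inside that step the process is waiting.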

You can also use ECS Exec to check what is going on inside your Fargate container during those 12 hours:

https://docs.aws.amazon.com/AmazonECS/latest/developerguide/ecs-exec.html
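For illustration, here is a small helper that assembles the AWS CLI call for opening a shell with ECS Exec. It is only a sketch: it assumes ECS Exec is already enabled for the task, that the AWS CLI and the Session Manager plugin are installed locally, and that the container image has `/bin/sh`. The cluster, task, and container values are placeholders.

```python
import shlex


def ecs_exec_command(cluster, task, container, command="/bin/sh"):
    """Build the argument list for `aws ecs execute-command`."""
    return [
        "aws", "ecs", "execute-command",
        "--cluster", cluster,
        "--task", task,
        "--container", container,
        "--interactive",
        "--command", command,
    ]


if __name__ == "__main__":
    # Placeholder identifiers; substitute the values of the stuck task.
    args = ecs_exec_command("my-cluster", "0123456789abcdef", "default")
    # Print a copy-pasteable command line rather than running it here.
    print(shlex.join(args))
```

Once inside the container, standard tools such as `top` or `ps` (if present in the image) can show whether the process is busy, sleeping, or blocked on I/O.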

Does your process use any hardware acceleration feature, such as a GPU?

Answered 2 years ago
