Long-running Java application on Fargate quits without an exception


Hi, I have a long-running Java program on Fargate with 1024 CPU units (1 vCPU) and 4 GB of RAM, serving as one of the steps in a Step Functions workflow. The Java program does the following:

  • Runs an infinite loop polling messages from an SQS queue
  • Downloads a 17 GB .tar.gz file from S3 in another account using TransferManager (takes only a little over 3 minutes)
  • Uncompresses the .tar.gz file, which produces a 50 GB file and a small .md5 file containing the checksum value
  • Verifies the checksum of the 50 GB file against the content of the .md5 file (about 7 minutes)
  • Uploads the extracted file to another S3 bucket through multipart upload (upload the parts, then combine them), which takes around 12 minutes
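For reference, the checksum step in the list above could be sketched roughly as follows. This is not the poster's code: the class name `Md5Check` is hypothetical, and it assumes the .md5 file stores the hex digest as its first whitespace-separated token (the `md5sum` layout), which the post does not confirm. The file is streamed through a fixed buffer so a 50 GB input does not need to fit in the 4 GB of task memory, but hashing that much data still keeps one CPU busy for minutes.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper, for illustration only.
final class Md5Check {

    /** Streams the file through MD5 using a 1 MiB buffer and returns the lowercase hex digest. */
    static String md5Hex(Path file) throws IOException, NoSuchAlgorithmException {
        MessageDigest digest = MessageDigest.getInstance("MD5");
        byte[] buffer = new byte[1 << 20];
        try (InputStream in = Files.newInputStream(file)) {
            int n;
            while ((n = in.read(buffer)) != -1) {
                digest.update(buffer, 0, n);
            }
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : digest.digest()) {
            hex.append(String.format("%02x", b & 0xff));
        }
        return hex.toString();
    }

    /** Assumes the first whitespace-separated token of the .md5 file is the expected digest. */
    static boolean matches(Path dataFile, Path md5File) throws IOException, NoSuchAlgorithmException {
        String content = new String(Files.readAllBytes(md5File), StandardCharsets.UTF_8).trim();
        String expected = content.split("\\s+")[0];
        return expected.equalsIgnoreCase(md5Hex(dataFile));
    }
}
```

Logging a timestamp before and after a call such as `Md5Check.matches(dataFile, md5File)` makes it easier to see in the task logs which step was running when the process stopped.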

The problem is that, based on my logs, the program sometimes just stops without an exception, sometimes during the untar step and sometimes during the upload. I am guessing there could be a system-level (OS) issue, or the JVM died. Can somebody please share your thoughts on:

  • Ways of debugging it?
  • Guesses at the problem(s), or any other suggestions? Thank you.
  • Could you provide the task error code? It should be helpful as an initial indication of what went wrong.

Asked 2 years ago, viewed 446 times
2 answers
Accepted answer

You can describe the details of the stopped task by using the command shown below. This will provide insights into the stopCode and stoppedReason values which can be used to determine the reason for the failure.

aws ecs describe-tasks --tasks <task-id> --cluster <cluster-name>

In the same output, you can review the containers section and retrieve the exitCode of the container process. This will be very useful in determining the reason for the container process termination.

If the container was terminated due to an out-of-memory (OOM) condition, the exitCode would be 137 (128 + 9, meaning the process was killed with SIGKILL).
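As a rough aid for reading the exitCode value, the sketch below decodes it using the common convention that a code above 128 means the process was ended by signal number (code minus 128). The class `ExitCodes` and its wording are hypothetical and not part of any AWS SDK. SIGKILL does not always mean OOM, so the stoppedReason field should be read alongside it.

```java
// Hypothetical helper, for illustration only; not part of any AWS SDK.
final class ExitCodes {

    /** Describes a container exit code using the 128 + signal-number convention. */
    static String describe(int exitCode) {
        if (exitCode == 0) {
            return "exited normally";
        }
        if (exitCode > 128 && exitCode < 256) {
            int signal = exitCode - 128;
            switch (signal) {
                case 9:
                    return "killed by SIGKILL (signal 9), often the out-of-memory killer";
                case 11:
                    return "crashed with SIGSEGV (signal 11)";
                case 15:
                    return "terminated by SIGTERM (signal 15)";
                default:
                    return "terminated by signal " + signal;
            }
        }
        return "application exited with error code " + exitCode;
    }

    public static void main(String[] args) {
        System.out.println(describe(137));
    }
}
```

A JVM that is killed with SIGKILL gets no chance to log an exception or run shutdown hooks, which would be consistent with logs that simply stop.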

You can also set up CloudWatch Container Insights metrics to gather task-level resource metrics and identify any issues with system-level resources.

I hope this is helpful to you. Please let me know if you need more clarification on this information.

AWS Support Engineer
Answered 2 years ago

After looking at the Fargate task metrics, I found 100% CPU usage. Adding one more vCPU made everything run smoothly.
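To spot this kind of saturation from inside the application as well, one option is to log a small resource snapshot periodically. The sketch below uses only the Java standard library. The class `ResourceLogger` and its log format are hypothetical, and it assumes a container-aware JVM so that `availableProcessors()` reflects the task's CPU limit. `getSystemLoadAverage()` returns a negative value on platforms where it is unavailable.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical helper, for illustration only.
final class ResourceLogger {

    /** Builds a one-line summary of CPU count, system load average, and heap usage. */
    static String snapshot() {
        Runtime rt = Runtime.getRuntime();
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        long usedMb = (rt.totalMemory() - rt.freeMemory()) / (1024 * 1024);
        long maxMb = rt.maxMemory() / (1024 * 1024);
        return String.format("cpus=%d loadAvg=%.2f heapUsedMb=%d heapMaxMb=%d",
                rt.availableProcessors(), os.getSystemLoadAverage(), usedMb, maxMb);
    }

    /** Prints a snapshot every periodSeconds on a daemon thread, so it never keeps the JVM alive. */
    static ScheduledExecutorService start(long periodSeconds) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor(r -> {
            Thread t = new Thread(r, "resource-logger");
            t.setDaemon(true);
            return t;
        });
        scheduler.scheduleAtFixedRate(
                () -> System.out.println(snapshot()), 0, periodSeconds, TimeUnit.SECONDS);
        return scheduler;
    }

    /** Logs a final snapshot on orderly shutdown; this hook does not run if the JVM gets SIGKILL. */
    static void installShutdownHook() {
        Runtime.getRuntime().addShutdownHook(
                new Thread(() -> System.out.println("JVM shutting down: " + snapshot())));
    }
}
```

Calling `ResourceLogger.start(30)` and `ResourceLogger.installShutdownHook()` at the top of `main` would leave a trail in the task logs. A load average that stays at or above the `cpus` value during the untar or upload steps would point to the same CPU bottleneck that the task metrics showed.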

Answered 2 years ago
