Long-running Java application on Fargate quits without an exception

Hi, I have a long-running Java program on Fargate with 1024 CPU units (1 vCPU) and 4 GB of RAM, serving as one of the steps in a Step Functions workflow. The Java program does the following:

  • Runs an infinite loop pulling messages from an SQS queue
  • Downloads a 17 GB .tar.gz file from S3 in another account using TransferManager (this takes only a little over 3 minutes)
  • Uncompresses the .tar.gz file, which produces a 50 GB file and a small .md5 file containing its checksum value
  • Verifies the checksum of the 50 GB file against the contents of the .md5 file (about 7 minutes)
  • Uploads the extracted file to another S3 bucket through multipart upload (upload the parts, then combine them), which takes around 12 minutes
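
The checksum step in the list above can be sketched as follows. This is a minimal illustration in Python rather than the original Java. It assumes the .md5 file holds the hex digest as its first whitespace-separated token (the md5sum convention), and the function names are hypothetical. Reading in fixed-size chunks keeps memory use flat regardless of file size, which matters when hashing a 50 GB file with 4 GB of RAM.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=8 * 1024 * 1024):
    """Return the hex MD5 digest of a file, reading it in fixed-size chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops once read() returns b"" at end of file.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def checksum_matches(data_path, md5_path):
    """Compare a file's MD5 with the digest stored in a companion .md5 file.

    Assumption: the digest is the first whitespace-separated token in the
    .md5 file, as written by the md5sum tool.
    """
    expected = Path(md5_path).read_text().split()[0].lower()
    return md5_of_file(data_path) == expected
```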

The problem is that, based on my logs, the program just stops without an exception, sometimes during the untar step and sometimes during the upload. I am guessing there could be a system-level (OS) issue, or the JVM died. Can somebody please share thoughts on:

  • Ways of debugging it?
  • Guesses at the problem(s), or any other suggestions? Thank you.
  • Could you provide the task error code? Should be helpful for an initial indication of what went wrong.

asked 2 years ago, 434 views
2 Answers
Accepted Answer

You can describe the stopped task with the command shown below. The output includes the stopCode and stoppedReason values, which can be used to determine the reason for the failure.

aws ecs describe-tasks --tasks <task-id> --cluster <cluster-name>

In the same output, you can review the containers section and retrieve the exitCode of the container process. This will be very useful in determining the reason for the container process termination.
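
As a rough sketch of how those fields could be pulled out of the JSON that the command prints, assuming the output has been parsed into a dictionary: the helper name `summarize_stopped_task` is hypothetical, and the sample data is made up for illustration rather than taken from the asker's task.

```python
import json


def summarize_stopped_task(describe_tasks_output):
    """Extract stop details from parsed `aws ecs describe-tasks` JSON output."""
    summaries = []
    for task in describe_tasks_output.get("tasks", []):
        summaries.append({
            "stopCode": task.get("stopCode"),
            "stoppedReason": task.get("stoppedReason"),
            "containers": [
                {
                    "name": container.get("name"),
                    # exitCode may be missing, so .get() yields None then.
                    "exitCode": container.get("exitCode"),
                    "reason": container.get("reason"),
                }
                for container in task.get("containers", [])
            ],
        })
    return summaries


# Illustrative, made-up output; real responses carry many more fields.
SAMPLE = json.loads("""
{
  "tasks": [
    {
      "stopCode": "EssentialContainerExited",
      "stoppedReason": "Essential container in task exited",
      "containers": [
        {"name": "app", "exitCode": 137, "reason": "example reason text"}
      ]
    }
  ]
}
""")

if __name__ == "__main__":
    print(json.dumps(summarize_stopped_task(SAMPLE), indent=2))
```

To use this against a real task, the command output could be saved to a file and loaded with `json.load` before calling the helper.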

If the container was terminated due to an OOM issue, the exitCode would be 137.
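
The value 137 follows the common convention that a process killed by a signal reports 128 plus the signal number, so 137 corresponds to signal 9 (SIGKILL). The sketch below decodes an exit code on that assumption; the function name is hypothetical. Note that SIGKILL is consistent with an out-of-memory kill but does not prove it, so the stoppedReason and container reason fields are still worth checking.

```python
import signal


def describe_exit_code(exit_code):
    """Explain a container exit code using the 128 + signal-number convention."""
    if exit_code == 0:
        return "exited normally"
    if exit_code > 128:
        number = exit_code - 128
        try:
            name = signal.Signals(number).name
        except ValueError:
            # Not a signal number known on this platform.
            name = "unknown signal"
        return f"killed by signal {number} ({name})"
    return f"exited with application error status {exit_code}"


if __name__ == "__main__":
    for code in (0, 1, 137, 143):
        print(code, "->", describe_exit_code(code))
```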

You can also set up CloudWatch Container Insights to gather task-level resource metrics, which help identify problems with system-level resources.

I hope this is helpful to you. Please let me know if you need more clarification on this information.

AWS SUPPORT ENGINEER
answered 2 years ago

After looking at the Fargate task metrics, I found 100% CPU usage. Adding one more CPU made everything run smoothly.

answered 2 years ago
