Long-running Java application on Fargate quits without an exception

Hi, I have a long-running Java program on Fargate with 1024 CPU units (1 vCPU) and 4 GB of RAM, serving as one step in a Step Functions workflow. The Java program does the following:

  • Runs an infinite loop that pulls messages from an SQS queue
  • Downloads a 17 GB .tar.gz file from S3 in another account using TransferManager (takes only a little over 3 minutes)
  • Uncompresses the .tar.gz file, which produces a 50 GB file and a small .md5 file containing its checksum
  • Verifies the checksum of the 50 GB file against the content of the .md5 file (about 7 minutes)
  • Uploads the extracted file to another S3 bucket through multipart upload (upload the parts, then combine them), which takes around 12 minutes
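For reference, the checksum step in the list above can be done as a stream, so the 50 GB file never has to fit into the 4 GB of task memory. The following is only a minimal sketch and not the original program: the class name StreamingMd5 is hypothetical, and it assumes the .md5 file starts with a hex MD5 digest, optionally followed by a file name (md5sum style), which the post does not confirm.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper for illustration; not taken from the original program.
class StreamingMd5 {

    // Computes the MD5 digest of a file in fixed-size chunks, so memory use
    // stays constant regardless of the file size.
    static String md5Hex(Path file) throws IOException, NoSuchAlgorithmException {
        MessageDigest digest = MessageDigest.getInstance("MD5");
        byte[] buffer = new byte[1 << 20]; // 1 MiB read buffer
        try (InputStream in = Files.newInputStream(file)) {
            int read;
            while ((read = in.read(buffer)) != -1) {
                digest.update(buffer, 0, read);
            }
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : digest.digest()) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    // Assumption: the .md5 file begins with the hex digest, optionally
    // followed by whitespace and a file name.
    static boolean matches(Path dataFile, Path md5File)
            throws IOException, NoSuchAlgorithmException {
        String expected = Files.readString(md5File).trim().split("\\s+")[0];
        return expected.equalsIgnoreCase(md5Hex(dataFile));
    }
}
```

A call such as StreamingMd5.matches(dataFile, md5File) returns true when the digests agree. Hashing this way is mostly sequential I/O plus CPU work, which may matter on a 1 vCPU task.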

The problem is that, based on my logs, the program sometimes just stops without any exception, sometimes during the un-tar step and sometimes during the upload. I am guessing there could be a system-level (OS) issue or that the JVM died. Can somebody please share thoughts on:

  • Ways of debugging it?
  • Guesses at the problem(s), or any other suggestions? Thank you.

Comment: Could you provide the task error code? It should be helpful as an initial indication of what went wrong.

asked 2 years ago, 446 views
2 Answers
Accepted Answer

You can describe the stopped task with the command shown below. The output includes the stopCode and stoppedReason values, which can be used to determine why the task stopped.

aws ecs describe-tasks --tasks <task-id> --cluster <cluster-name>

In the same output, review the containers section and retrieve the exitCode of the container process. It is very useful for determining why the container process terminated.
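Because the question reports that the program stops with nothing in its own log, it may also help to make the JVM record why it is going down, so the application log can be compared with the exitCode from ECS. This is a minimal sketch under stated assumptions: the class name ExitVisibility and the log messages are hypothetical, and it only uses standard JDK calls. A shutdown hook runs on a normal exit or on SIGTERM, but not on SIGKILL, so a missing shutdown line is itself a hint that the process was killed from outside.

```java
import java.time.Instant;

// Hypothetical class for illustration; not taken from the original program.
class ExitVisibility {

    // Installs handlers that log why the JVM is terminating.
    static void install() {
        // Logs any exception or Error (for example OutOfMemoryError) that
        // escapes a thread without being caught.
        Thread.setDefaultUncaughtExceptionHandler((thread, error) -> {
            System.err.println("Uncaught in thread " + thread.getName() + ": " + error);
            error.printStackTrace();
        });
        // Runs on normal exit and on SIGTERM. It does NOT run on SIGKILL,
        // which is what an out-of-memory kill sends.
        Runtime.getRuntime().addShutdownHook(new Thread(
                () -> System.err.println("JVM shutting down at " + Instant.now()),
                "exit-logger"));
    }

    // Runs one unit of work and reports whether it completed. Throwable is
    // caught (not just Exception) so that Errors are logged as well.
    static boolean runLogged(Runnable step) {
        try {
            step.run();
            return true;
        } catch (Throwable t) {
            System.err.println("Step failed: " + t);
            t.printStackTrace();
            return false;
        }
    }
}
```

Calling ExitVisibility.install() once at startup and wrapping each message's processing in runLogged(...) would be one way to apply it to the polling loop.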

If the container was terminated due to an out-of-memory (OOM) condition, the exitCode would typically be 137 (128 + 9, meaning the process was killed with SIGKILL).
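The value 137 follows the common convention that an exit code above 128 means the process was ended by signal number (code - 128). A small sketch of that arithmetic, using a hypothetical helper class ExitCodes; the wording of the descriptions is mine, not ECS output:

```java
// Hypothetical helper for illustration; the messages are not ECS output.
class ExitCodes {

    // By shell and container convention, an exit code above 128 means the
    // process was terminated by signal number (exitCode - 128).
    static String describe(int exitCode) {
        if (exitCode == 0) {
            return "exited normally";
        }
        if (exitCode > 128 && exitCode < 256) {
            int signal = exitCode - 128;
            switch (signal) {
                case 9:
                    return "killed by SIGKILL (9); possible OOM kill or forced stop";
                case 15:
                    return "terminated by SIGTERM (15); a stop was requested";
                default:
                    return "terminated by signal " + signal;
            }
        }
        return "application exit code " + exitCode;
    }
}
```

For example, describe(137) reports SIGKILL and describe(143) reports SIGTERM. Note that SIGKILL alone does not prove an OOM condition, so the stoppedReason text is still worth reading alongside it.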

You can also set up CloudWatch Container Insights to gather task-level resource metrics and identify any issues with system-level resources.

I hope this is helpful to you. Please let me know if you need more clarification on this information.

AWS
SUPPORT ENGINEER
answered 2 years ago

After looking at the Fargate task metrics, I found CPU usage at 100%. Adding one more CPU made everything run smoothly.
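CPU saturation like this can also be made visible from inside the JVM by logging a resource snapshot periodically, next to the application's own progress messages. The sketch below is an assumption-laden illustration, not what was actually used here: the class name ResourceSnapshot and the log format are hypothetical, and the thread does not confirm how closely these in-process values track the Fargate task metrics.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;
import java.util.Locale;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical helper for illustration; not taken from the original program.
class ResourceSnapshot {

    // Builds a one-line summary of CPU count, system load average, and heap use.
    // getSystemLoadAverage() returns a negative value where it is unavailable.
    static String snapshot() {
        Runtime rt = Runtime.getRuntime();
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        long usedHeapMb = (rt.totalMemory() - rt.freeMemory()) / (1024 * 1024);
        long maxHeapMb = rt.maxMemory() / (1024 * 1024);
        return String.format(Locale.ROOT,
                "cpus=%d loadAvg=%.2f heapUsedMB=%d heapMaxMB=%d",
                rt.availableProcessors(), os.getSystemLoadAverage(),
                usedHeapMb, maxHeapMb);
    }

    // Prints a snapshot every periodSeconds on a daemon thread, so the logger
    // never keeps the JVM alive on its own.
    static ScheduledExecutorService startLogging(long periodSeconds) {
        ScheduledExecutorService scheduler =
                Executors.newSingleThreadScheduledExecutor(runnable -> {
                    Thread t = new Thread(runnable, "resource-logger");
                    t.setDaemon(true);
                    return t;
                });
        scheduler.scheduleAtFixedRate(
                () -> System.out.println(snapshot()), 0, periodSeconds, TimeUnit.SECONDS);
        return scheduler;
    }
}
```

A load average that stays well above the cpus value during the un-tar or upload step would be consistent with the CPU bottleneck described above.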

answered 2 years ago
