
ECS Fargate Spot ignores stopTimeout


As per the docs, when a Fargate Spot task is interrupted the container first receives a SIGTERM signal and then has up to stopTimeout seconds (maximum 120) to exit before it is force killed.

However, my Fargate Spot task was killed after only 21 seconds despite having stopTimeout: 120 configured.

Task Definition:

"containerDefinitions": [
    {
        "name": "default",
        "stopTimeout": 120,
        ...
    }
]

Application Logs Timeline:

18:08:30.619Z: "Received SIGTERM" logged by my application  
18:08:51.746Z: Process killed with SIGKILL (exitCode: 137)

Task Execution Details:

"stopCode": "SpotInterruption",
"stoppedReason": "Your Spot Task was interrupted.",
"stoppingAt": "2025-06-06T18:08:30.026000+00:00",
"executionStoppedAt": "2025-06-06T18:08:51.746000+00:00",
"exitCode": 137

Delta: 21.7 seconds (not 120 seconds)
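
For reference, the delta can be recomputed from the two task timestamps above. This is only a small sketch of the arithmetic; the variable names are mine and are not part of any ECS API.

```python
from datetime import datetime

# Timestamps copied from the task execution details above.
stopping_at = datetime.fromisoformat("2025-06-06T18:08:30.026000+00:00")
execution_stopped_at = datetime.fromisoformat("2025-06-06T18:08:51.746000+00:00")

# Time between ECS starting to stop the task and the process exiting.
delta_seconds = (execution_stopped_at - stopping_at).total_seconds()
print(f"{delta_seconds:.2f} s")  # prints "21.72 s"
```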

The container received SIGKILL (exitCode: 137) after only 21 seconds, completely ignoring the configured stopTimeout: 120.

Is this documented behavior? Should stopTimeout be ignored during Spot interruptions, or is this a bug?

asked 5 months ago, 198 views
2 Answers
Accepted Answer

SOLVED: This was my mistake, not AWS behavior

After digging deeper into this issue, I discovered that AWS was correctly respecting my stopTimeout: 120 configuration. The early termination was caused by my own container command configuration.

Root Cause: timeout Command Kill-After Logic

Because ECS has no setting for a maximum task execution time, my container was started through this command:

timeout -k 10s 3600 python ./main.py

The -k 10s parameter was the culprit. Here's what actually happened:

  1. AWS sent SIGTERM to my container during spot interruption (correctly)
  2. timeout process received SIGTERM and forwarded it to my Python script
  3. timeout immediately started its own 10-second kill timer due to -k 10s
  4. After 10 seconds, timeout sent SIGKILL to my Python script
  5. Process terminated with exit code 137

The Technical Details

When applying its kill-after logic, the GNU timeout command's signal handler doesn't distinguish between its own internal timeout and a signal sent from outside. When it receives a signal it handles (including an external SIGTERM from ECS), it starts the kill-after timer if the -k parameter is specified.

From the timeout source code (abridged):

static void cleanup (int sig) {
  if (0 < monitored_pid) {
    if (kill_after) {  // My -k 10s parameter
      settimeout (kill_after, false);  // Starts 10s kill timer!
    }
    send_sig (monitored_pid, sig);  // Forwards signal to child
  }
}

Solution

I fixed this by updating my container command to:

timeout -k 120s 3600 python ./main.py

This allows my application the full 120 seconds for graceful shutdown, matching my ECS stopTimeout configuration.
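
To make the reasoning explicit, here is a minimal sketch of how the two timers interact. It assumes the wrapper starts its kill timer as soon as the SIGTERM arrives, as described above. The function name effective_grace_seconds is hypothetical and only illustrates the arithmetic; it is not part of ECS or timeout.

```python
from typing import Optional


def effective_grace_seconds(stop_timeout: float, kill_after: Optional[float]) -> float:
    """Seconds between SIGTERM and the first SIGKILL from either source.

    stop_timeout: the ECS stopTimeout value, in seconds.
    kill_after:   the value given to `timeout -k`, in seconds,
                  or None when -k is not used.
    """
    if kill_after is None:
        # Without -k, only ECS will send SIGKILL.
        return stop_timeout
    # Whichever timer fires first ends the graceful shutdown window.
    return min(stop_timeout, kill_after)


print(effective_grace_seconds(120, 10))   # original command: -k 10s
print(effective_grace_seconds(120, 120))  # fixed command: -k 120s
```

Under that assumption, the -k value acts as an upper bound on the grace period, so it should be at least as large as stopTimeout.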

Key Takeaway

AWS Fargate correctly respects stopTimeout during spot interruptions. If you're seeing early termination, check your application's signal handling and any wrapper commands that might be interfering with the graceful shutdown process.
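
On the application side, a rough sketch of SIGTERM handling in a Python entry point could look like the following. This is not my actual main.py; work_step, cleanup and poll_seconds are placeholder names, and the sketch assumes the work can be split into short steps so the loop notices the shutdown request quickly.

```python
import signal
import threading

# Set once SIGTERM has been received.
shutdown_requested = threading.Event()


def handle_sigterm(signum, frame):
    """Signal handler: only record the request, do the real work elsewhere."""
    shutdown_requested.set()


def install_handler():
    """Register the handler; must be called from the main thread."""
    signal.signal(signal.SIGTERM, handle_sigterm)


def run(work_step, cleanup, poll_seconds=1.0):
    """Run work_step repeatedly until shutdown is requested, then clean up."""
    while not shutdown_requested.is_set():
        work_step()
        # Sleep between steps, but wake up early if SIGTERM arrives.
        shutdown_requested.wait(poll_seconds)
    cleanup()
```

A caller would run install_handler() once at startup and then run(...) with its own step and cleanup functions. The cleanup still has to finish within the smaller of stopTimeout and the -k value.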

answered 5 months ago
EXPERT
reviewed 5 months ago

The behavior you're experiencing with Fargate Spot tasks being terminated before the configured stopTimeout value is not explicitly documented as expected behavior.

According to the documentation, when using Fargate Spot capacity, tasks can be interrupted with a two-minute warning when AWS needs to reclaim the capacity. During this process, a stopTimeout value of up to 120 seconds can be specified in the container definition to allow containers to exit gracefully before being forcefully stopped. The default stopTimeout value is 30 seconds if not specified.

For the container to properly utilize this grace period:

  1. The SIGTERM signal must be received and processed within the container
  2. The container must implement cleanup actions in response to this signal

The documentation states that failure to process the SIGTERM signal will result in the task receiving a SIGKILL signal after the configured stopTimeout, which should be up to 120 seconds as you've configured.

Your logs show that your application did receive the SIGTERM signal, but was killed after only 21 seconds instead of the configured 120 seconds. This appears to be inconsistent with the documented behavior, where your container should have had the full 120 seconds to gracefully shut down before receiving SIGKILL.

This could potentially be a bug or undocumented limitation with how Fargate Spot handles task termination during spot interruptions. Since your application is correctly logging the receipt of SIGTERM but not getting the full configured grace period, this seems to deviate from the expected behavior described in the documentation.

Sources
ContainerDefinition - Amazon Elastic Container Service
Amazon ECS task definition parameters for the Fargate launch type - Amazon Elastic Container Service
Amazon ECS clusters for Fargate - Amazon Elastic Container Service

answered 5 months ago
