SageMaker Batch Transform - "upstream prematurely closed connection" - Unable to serve requests that take longer than 30 minutes


This is a duplicate of a question I asked on Stack Overflow.

I am serving a SageMaker model through a custom Docker container, following the guide that AWS provides. The container runs a simple nginx -> gunicorn/WSGI -> Flask server.

I am facing an issue where my transform requests always time out at around 30 minutes, even though they should be able to run for 60 minutes. I need requests to be able to run up to SageMaker's maximum of 60 minutes because of the data-intensive nature of each request.


From working with this setup for some months, I know of three factors that determine how long my server has to respond to a request:

  1. SageMaker itself caps invocation requests according to the InvocationsTimeoutInSeconds parameter set when creating the batch transform job.
  2. The nginx.conf file must be configured so that keepalive_timeout, proxy_read_timeout, proxy_send_timeout, and proxy_connect_timeout are all equal to or greater than the maximum timeout.
  3. The gunicorn server must have its timeout configured to be equal to or greater than the maximum timeout.

I have verified that when I create my batch transform job, InvocationsTimeoutInSeconds is set to 3600 (1 hour).
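For reference, this timeout is passed via ModelClientConfig when the job is created. The following is a minimal boto3 sketch, not my exact call; the job name, model name, S3 URIs, and instance type are placeholders:

import boto3

sagemaker = boto3.client('sagemaker')

# Minimal sketch of the create_transform_job call; all names and S3 URIs
# below are placeholders, not the actual values from my job.
sagemaker.create_transform_job(
    TransformJobName='my-transform-job',
    ModelName='my-model',
    ModelClientConfig={
        'InvocationsTimeoutInSeconds': 3600,  # allow each /invocations call up to 1 hour
        'InvocationsMaxRetries': 1,
    },
    TransformInput={
        'DataSource': {
            'S3DataSource': {'S3DataType': 'S3Prefix', 'S3Uri': 's3://my-bucket/input/'}
        },
        'ContentType': 'application/json',
    },
    TransformOutput={'S3OutputPath': 's3://my-bucket/output/'},
    TransformResources={'InstanceType': 'ml.m5.xlarge', 'InstanceCount': 1},
)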

My nginx.conf looks like this:

worker_processes 1;
daemon off; # Prevent forking


pid /tmp/nginx.pid;
error_log /var/log/nginx/error.log;

events {
  # defaults
}

http {
  include /etc/nginx/mime.types;
  default_type application/octet-stream;
  access_log /var/log/nginx/access.log combined;

  sendfile        on;
  client_max_body_size 30M;
  keepalive_timeout  3920s;
  
  upstream gunicorn {
    server unix:/tmp/gunicorn.sock;
  }

  server {
    listen 8080 deferred;
    client_max_body_size 80m;

    keepalive_timeout 3920s;
    proxy_read_timeout 3920s;
    proxy_send_timeout 3920s;
    proxy_connect_timeout 3920s;
    send_timeout 3920s;

    location ~ ^/(ping|invocations) {
      proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
      proxy_set_header Host $http_host;
      proxy_redirect off;
      proxy_pass http://gunicorn;
    }

    location / {
      return 404 "{}";
    }
  }
}

I start the gunicorn server like this:

import os
import signal
import subprocess

# model_server_workers, model_server_timeout, and sigterm_handler are defined
# elsewhere in the serve script.
def start_server():
    print('Starting the inference server with {} workers.'.format(model_server_workers))
    print('Model server timeout {}.'.format(model_server_timeout))

    # link the log streams to stdout/err so they will be logged to the container logs
    subprocess.check_call(['ln', '-sf', '/dev/stdout', '/var/log/nginx/access.log'])
    subprocess.check_call(['ln', '-sf', '/dev/stderr', '/var/log/nginx/error.log'])

    nginx = subprocess.Popen(['nginx', '-c', '/opt/program/nginx.conf'])
    gunicorn = subprocess.Popen(['gunicorn',
                                 '--timeout', str(3600),
                                 '-k', 'sync',
                                 '-b', 'unix:/tmp/gunicorn.sock',
                                 '--log-level', 'debug',
                                 '-w', str(1),
                                 'wsgi:app'])

    signal.signal(signal.SIGTERM, lambda a, b: sigterm_handler(nginx.pid, gunicorn.pid))

    # If either subprocess exits, so do we.
    pids = set([nginx.pid, gunicorn.pid])
    while True:
        pid, _ = os.wait()
        if pid in pids:
            break

    sigterm_handler(nginx.pid, gunicorn.pid)
    print('Inference server exiting')

Despite all of this, whenever a transform job takes longer than approximately 30 minutes, I see this message in my logs and the transform job status becomes Failed:

2023/01/07 08:23:14 [error] 11#11: *4 upstream prematurely closed connection while reading response header from upstream, client: 169.254.255.130, server: , request: "POST /invocations HTTP/1.1", upstream: "http://unix:/tmp/gunicorn.sock:/invocations", host: "169.254.255.131:8080"

I am close to concluding that there is a bug in AWS Batch Transform, but perhaps I am missing some other setting (possibly in nginx.conf) that could lead to premature upstream termination of my request.

asked 2 years ago · 985 views
1 Answer
Accepted Answer

By looking at hardware metrics, I was able to determine that the upstream termination only happened when the server was near its memory limit. My guess is that the OS was killing the gunicorn worker, and the 30-minute mark was just a coincidence in my long-running test cases.

My solution was to increase the memory available on the server.
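For anyone debugging a similar failure, below is a minimal sketch of how memory headroom could be logged from inside the container during a long /invocations request, so an out-of-memory kill of the gunicorn worker can be correlated with the error. This is an illustration rather than part of my actual setup; it assumes a Linux container where /proc/meminfo reflects the instance memory, which is typically the case on SageMaker instances:

def log_memory_headroom():
    # Parse /proc/meminfo (values are reported in kB) and print how much memory
    # is still available; call this at the start of /invocations and periodically
    # while processing so the numbers appear in the CloudWatch container logs.
    meminfo = {}
    with open('/proc/meminfo') as f:
        for line in f:
            key, value = line.split(':', 1)
            meminfo[key] = int(value.split()[0])
    total_mb = meminfo['MemTotal'] / 1024.0
    available_mb = meminfo['MemAvailable'] / 1024.0
    print('Memory: {:.0f} MB available of {:.0f} MB total'.format(available_mb, total_mb))

If the available memory trends toward zero in the logs right before the "upstream prematurely closed connection" error, that is consistent with the diagnosis above: the worker was killed mid-request, and a larger instance type (or a smaller memory footprint per request) is the fix.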

answered 2 years ago
