Suggestion on resolving API Gateway timeout at 30sec

Hi,

I have a Python Flask search API endpoint which, on longer time ranges, can take minutes to return. The Python code is basically:

def search:
    submit a search query POST API call
    receive a job id from the call above
    submit GET API call to poll the job id, with sleep(5) in between the polls, until search results are returned

This works fine when the search returns within 30 seconds. When it takes longer, the call returns a 504. I also deployed behind an ALB: queries under 30 seconds worked fine, but queries over 30 seconds failed the same way (strictly, with a 502 Bad Gateway, as a consequence of the 504 and the dropped connection). This is expected: an ALB can't speed things up in this case, because the search sleeps for 5 seconds between polls and is bound by how fast the data source can be searched.

What would be the least painful way to resolve this?

Thank you, Boris

1 Answer

This won't seem like the "least painful" option, but the best way to resolve this problem is to change your application so that the calls to API Gateway are asynchronous rather than synchronous. You'd do something like:

  1. Front end calls API Gateway and submits the request and in return gets a unique token of some sort.
  2. Back end stores a placeholder for the request with the token identifier then goes away and does the work. The status of the placeholder at this point in time will be something like "Work in progress".
  3. When the work is complete, the back end stores the data somewhere (presumably in the same place as the placeholder, but it might be too big for that, so maybe S3) and marks the placeholder as "Complete". If there is a problem, it marks the placeholder as "Error".
  4. Periodically the front end calls a separate API to check on the status of the original request, using the token as the identifier. If the work is complete, it gets the result (or can call another API to retrieve it); if the work is not complete, it gets the current status. You can tune the front end so that it doesn't make too many calls but still gets the result shortly after the work is complete.

So there is effort here but it gives you flexibility to have long-running jobs far beyond 30 seconds.
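
The four steps above can be sketched roughly as follows. This is a minimal illustration only, not a definitive implementation: the function names (submit, status, wait_for_result, run_search) and the in-memory dictionary are all hypothetical, and it assumes a long-lived server process that can keep a background thread running. In a real deployment the placeholder would live in a shared store and the work would be handed to whatever back end suits you.

```python
import threading
import time
import uuid

# Hypothetical in-memory placeholder store, keyed by token.
# A real deployment would use a shared store instead of process memory.
_jobs = {}
_lock = threading.Lock()


def run_search(query):
    """Stand-in for the slow search; replace with the real polling logic."""
    time.sleep(0.1)
    return {"query": query, "hits": []}


def _worker(token, query):
    # Step 3: do the work, then mark the placeholder "Complete" or "Error".
    try:
        update = {"status": "Complete", "result": run_search(query)}
    except Exception as exc:
        update = {"status": "Error", "error": str(exc)}
    with _lock:
        _jobs[token] = update


def submit(query):
    # Steps 1 and 2: store a placeholder, start the work, return the token.
    token = uuid.uuid4().hex
    with _lock:
        _jobs[token] = {"status": "Work in progress"}
    threading.Thread(target=_worker, args=(token, query), daemon=True).start()
    return token


def status(token):
    # Step 4 (server side): report the current state for a token.
    with _lock:
        job = _jobs.get(token)
    return dict(job) if job else {"status": "Unknown"}


def wait_for_result(token, interval=0.05, timeout=5.0):
    # Step 4 (client side): poll until the job leaves "Work in progress".
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        job = status(token)
        if job["status"] != "Work in progress":
            return job
        time.sleep(interval)
    raise TimeoutError("job %s did not finish in time" % token)
```

In a Flask application, submit and status would typically sit behind two separate routes, each of which returns well inside the 30-second limit; the polling loop would live in the front end.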

Note that you can do the same sort of thing by registering a callback at step (1), but this assumes that the front end can supply a callback address (an HTTPS link of some sort). When the job is done, a Lambda connects to the callback address and delivers the result. The challenge here is making sure that the callback address is reachable when the work is complete, which may (or may not!) be easily achieved.
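
As a rough sketch of the delivery side of the callback variant, the function below POSTs the result as JSON to the registered address using only the Python standard library. The function names and the payload shape (token, status, result) are assumptions made for illustration; nothing here is a fixed format.

```python
import json
import urllib.request


def build_callback_request(callback_url, token, result):
    """Build the POST request that carries the finished result."""
    # Assumption: callbacks are only accepted over HTTPS.
    if not callback_url.startswith("https://"):
        raise ValueError("callback address must be an HTTPS URL")
    body = json.dumps(
        {"token": token, "status": "Complete", "result": result}
    ).encode("utf-8")
    return urllib.request.Request(
        callback_url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def deliver(callback_url, token, result, timeout=10):
    """Send the result; return True on a 2xx response, False otherwise."""
    request = build_callback_request(callback_url, token, result)
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return 200 <= response.status < 300
    except OSError:
        # Covers unreachable hosts, timeouts, and HTTP error responses.
        return False
```

A False return is the reachability problem described above showing up in practice, so the caller would likely want to retry or fall back to the polling approach.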

AWS EXPERT, answered 10 months ago
