Skip to content

Calling lambda UDF function (Python) in redshift (serverless) query hangs forever

0

As Python UDF support is getting deprecated in AWS redshift, I moved a simple python UDF function into a lambda function, and setup the lambda UDF as follows (on our serverless redshift cluster)

CREATE or REPLACE EXTERNAL FUNCTION 
fn_lambda_anonymize(a BIGINT, b varchar) returns character varying
STABLE
LAMBDA 'func_name'
IAM_ROLE 'arn:aws:iam::<account>:role/<role_name>' 
RETRY_TIMEOUT 60000  -- 60 seconds
MAX_BATCH_ROWS 100000;
;

Calling the function on single values works fine without issues.

Calling it on a large table (~17M records) also works fine if I call e.g

select min(fn_lambda_anonymize(col_name, 'secret')), max(fn_lambda_anonymize(col_name, 'secret')) from big_table

=> in SYS_LUDF_DETAIL I can see the lambda executions (using the big batch sizes)

Calling

create table derived_table as select col_name, fn_lambda_anonymize(col_name, 'secret') from big_table also works to create a large derived table

=> in SYS_LUDF_DETAIL I can see the lambda executions (but here it is using smaller batch sizes of 300, and hence many more lambda executions)

=> but anyhow it works, so fine, guess this is internal logic in redshift.

However, when I call a more simple (e.g. same select clause as before but with LIMIT 200):

select col_name, fn_lambda_anonymize(col_name, 'secret'), fn_lambda_anonymize(col_name, 'secret') from big_table limit 200, the query seems to hang forever, and remains in the "returning" state (I once left this unchecked and it racked up 24 hours of processing load on our serverless redshift cluster, and you cannot cancel a query in that state as it's supposedly ready and just returning results (200 records) to the client)

=> in SYS_LUDF_DETAIL I can see the lambda executions, so that part is done (there are also no errors on the lambda executions)

=> when cancelling the query on client side the query stops as failed (although sometimes the client does get partially returned results, e.g. tens of rows but not the 200

I cannot determine the logic used here by redshift, and why a more limited query would cause such issues

As support for the working Python UDF alternative will be removed from redshift in couple of months, hope somebody can point me to a solution/workaround for this (have not tried this on provisioned redshift cluster yet, so not sure how it behaves there)

  • If my answer helped solve your problem, I would appreciate it if you click on “accepted answer”

asked 2 months ago72 views
2 Answers
4

As far as I understand your issue:

LIMIT clause is confusing the Redshift execution plan regarding how to truncate the data stream coming back from the Lambda provider. Using a CTE is the most reliable workaround.

So, the issue likely stems from a buffer deadlock caused by the interaction between your high MAX_BATCH_ROWS and the LIMIT 200 clause, combined with calling the function twice in one row.

When you call LIMIT 200 with a MAX_BATCH_ROWS of 100,000, Redshift may still send a full batch of 100,000 rows to Lambda. While Lambda finishes processing, the Redshift Leader Node is only waiting for 200 rows to send to the client. The remaining 99,800 rows in that batch can "clog" the communication pipe between the compute nodes and the leader, leading to the infinite "Returning" state.

Check the following:

  • Materialize via CTE: Wrap the function call in a CTE or Subquery. This forces Redshift to handle the UDF execution logic separately from the result-streaming logic of the LIMIT clause.
WITH processed_data AS (
    SELECT col_name, fn_lambda_anonymize(col_name, 'secret') as anon_col
    FROM big_table
)
SELECT col_name, anon_col, anon_col 
FROM processed_data 
LIMIT 200;
  • Align Batch Size: Temporarily lower MAX_BATCH_ROWS for interactive queries with small limits. If you only want 200 rows, a batch size of 100,000 is overkill and increases the risk of the "Return" hang.
  • Avoid Redundant Calls: In your query, you call the function twice for the same column. Even if Redshift tries to optimize, it often generates two distinct Lambda requests per row. Use a CTE (as shown above) to call the function once and reference the alias twice.
  • Check Result Cache: Disable the result cache (SET enable_result_cache_for_system_queries TO off;) for the session to see if a corrupted cache entry is holding the "Returning" state hostage.

Tip: If you frequently run interactive queries with LIMIT, consider creating a separate version of the External Function with a much lower MAX_BATCH_ROWS (e.g., 1,000) specifically for development and testing to avoid these serverless compute "hangs“.

According to AWS Best Practices for Lambda UDFs (2026) and the Redshift Developer Guide, large batch settings (MAX_BATCH_ROWS) can cause synchronization issues with the Leader Node when combined with LIMIT clauses. Forcing materialization via a CTE is a documented strategy to decouple UDF execution from result streaming.

see also:

EXPERT
answered 2 months ago
1

Based on the behavior you're describing, where your query hangs in the "returning" state after the Lambda UDF executions complete, this appears to be an issue with how Amazon Redshift is handling the result set assembly after the Lambda function calls finish.

A few things to consider and try:

Regarding the RETRY_TIMEOUT setting: Your current RETRY_TIMEOUT of 60000 milliseconds (60 seconds) controls how long Amazon Redshift will retry Lambda requests if they receive exceptions. However, this doesn't control the overall query timeout. You may want to set a statement_timeout to prevent queries from running indefinitely. The statement_timeout determines the maximum time a query can run (including planning, queueing, and execution) before Amazon Redshift terminates it. For example, you could set SET statement_timeout = 300000; (5 minutes) before running your query to ensure it doesn't hang indefinitely.

Batch size considerations: You mentioned that different query patterns result in different batch sizes (100,000 vs 300 rows). The fact that your LIMIT 200 query hangs while larger operations succeed suggests there may be an issue with how smaller result sets are being processed. Try adjusting the MAX_BATCH_ROWS parameter - perhaps setting it lower (like 1000 or 5000) to see if that changes the behavior.

Monitoring Lambda UDF execution: Since you can see the Lambda executions completing successfully in SYS_LUDF_DETAIL, the issue appears to be in the post-processing phase. Check the total_duration, invocations, and total_rows columns in SYS_LUDF_DETAIL to understand if there's something unusual about how the function is being called in the problematic query pattern.

Workaround suggestions:

  • Try breaking your query into smaller chunks using WHERE clauses instead of LIMIT
  • Consider using a CREATE TABLE AS approach (which you mentioned works) and then selecting from that derived table
  • Test if the issue persists when calling the Lambda UDF only once per row instead of twice in the same SELECT clause

Since this is affecting Redshift Serverless specifically, you may want to test on a provisioned cluster to determine if this is a serverless-specific issue that should be reported to AWS Support.
Sources
statement_timeout - Amazon Redshift
SYS_LUDF_DETAIL - Amazon Redshift
CREATE EXTERNAL FUNCTION - Amazon Redshift

answered 2 months ago
EXPERT
reviewed 2 months ago

You are not logged in. Log in to post an answer.

A good answer clearly answers the question and provides constructive feedback and encourages professional growth in the question asker.