
Issue with Postgres Function behavior on Aurora RDS for Postgres


We have a Postgres function running on Aurora RDS for Postgres (db.r6g.large instances) in our DEV, STG, and UAT environments. We are currently facing a strange issue only in the UAT environment: the same function executes successfully five times in a row from a session, then hangs on every sixth execution from that same session. The function returns exactly one row per execution, with two columns, one of which is a jsonb column holding long values. The issue has not been reproduced in any environment other than UAT.
On further analysis we found that the function DDL is involved: the SQL inside the function works fine when executed manually with bind values, any number of times. Postgres by default attaches attributes such as COST, ROWS, and a volatility category (VOLATILE) to a function, and these appear to drive the optimizer to execute the SQL in a way that causes the hang on the sixth execution; we are not sure whether this is related to the default work_mem allotted to the session. Our current workaround is to run ALTER FUNCTION <function_name> RESET ALL so the optimizer treats the function as new, but we do not believe this is the right fix. We need to find the root cause and would appreciate help troubleshooting this issue.

1 Answer

The issue you're experiencing with your PostgreSQL function in Aurora RDS on your UAT environment is indeed unusual and complex. Several factors could be contributing to this behavior:

  1. Query Plan Caching: PostgreSQL caches query plans for performance reasons. Statements inside a PL/pgSQL function are planned like prepared statements: for the first five executions the planner builds a custom plan for the actual parameter values, and from the sixth execution onward it may switch to a cached generic plan if that plan's estimated cost looks comparable. A generic plan that is poor for your real parameter values would explain a function that runs fine five times and stalls on the sixth.

  2. Statistics and Data Distribution: If the data distribution in your UAT environment is significantly different from other environments, it could lead to the query optimizer making poor decisions. This is especially relevant if your function involves complex queries or joins.

  3. Resource Constraints: The db.r6g.large instance type has limited resources (2 vCPUs, 16 GiB of memory). If the function is memory-intensive, especially with the jsonb column, you may be hitting memory limits after multiple executions. This could explain why resetting the function parameters helps.

  4. Function Volatility: The VOLATILE setting on a function tells PostgreSQL that the function can have side effects and that its result can change even with the same inputs. This prevents the optimizer from caching or reusing results and forces it to re-evaluate the function on every call, possibly more often than necessary.

  5. work_mem Setting: As you mentioned, the work_mem parameter could be relevant. If the function requires more memory for sorts or hashes than allocated, the planner may fall back to slower disk-based strategies, which can look like hanging.
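The first point above is the most likely match for a strict five-then-hang pattern. A minimal sketch of how to test and work around the generic-plan switch, assuming PostgreSQL 12 or later and a placeholder function name:

```sql
-- The five-then-stall pattern matches PostgreSQL's plan caching behavior:
-- PL/pgSQL statements are planned like prepared statements, and after five
-- executions the planner may switch from custom plans to a cached generic plan.
-- "my_function" and its signature below are placeholders for your own function.

-- Session-level test: force custom plans so the switch never happens.
-- If the sixth execution now succeeds, the generic plan is the culprit.
SET plan_cache_mode = force_custom_plan;   -- available since PostgreSQL 12

-- Scoped fix: pin the setting to the function itself, so only statements
-- executed inside this function are affected, not the whole session.
ALTER FUNCTION my_function(bigint)
    SET plan_cache_mode = force_custom_plan;
```

This is narrower and more explainable than `ALTER FUNCTION ... RESET ALL`, which only clears cached state until the pattern recurs.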

To troubleshoot this issue:

  1. Capture EXPLAIN ANALYZE output for both a successful and a failing execution to identify differences in query plans. Since the statements run inside a function, use auto_explain with nested-statement logging rather than a plain EXPLAIN on the function call.

  2. Monitor resource utilization during function execution, particularly memory usage.

  3. Consider changing the function's volatility to STABLE if it doesn't modify the database and returns the same output for the same input within a single statement.

  4. Analyze the data distribution in your UAT environment and ensure statistics are up to date.

  5. Review and possibly increase the work_mem setting if memory constraints are an issue.

  6. Consider using pg_stat_statements to gather more detailed information about the function's performance over time.
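A sketch of the diagnostic steps above as SQL, assuming appropriate privileges on Aurora PostgreSQL; the function name, signature, and bind value are placeholders:

```sql
-- 1./2. Log plans for the statements *inside* the function. auto_explain is
-- supported on Aurora PostgreSQL; depending on your parameter-group setup it
-- may need to be preloaded rather than LOADed per session.
LOAD 'auto_explain';
SET auto_explain.log_min_duration = 0;
SET auto_explain.log_analyze = on;
SET auto_explain.log_nested_statements = on;  -- capture plans inside PL/pgSQL

-- Call the function repeatedly from one session and diff the logged plans,
-- paying attention to what changes between the fifth and sixth call.
SELECT * FROM my_function('some_bind_value');

-- 3. If the function is read-only, declare it STABLE.
ALTER FUNCTION my_function(text) STABLE;

-- 5. Raise work_mem for this function only, instead of session- or
-- cluster-wide (the value here is illustrative, not a recommendation).
ALTER FUNCTION my_function(text) SET work_mem = '256MB';

-- 6. Inspect per-statement timing via pg_stat_statements (the extension must
-- be created once per database; column names shown are from PostgreSQL 13+).
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;
SELECT query, calls, mean_exec_time, rows
FROM pg_stat_statements
ORDER BY mean_exec_time DESC
LIMIT 10;
```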

Remember, the fact that this only occurs in the UAT environment suggests that there might be environmental factors at play. Carefully compare the configurations and data between your environments to identify any discrepancies that could be causing this behavior.
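One way to compare configurations between environments is to dump the non-default settings from each and diff the results:

```sql
-- Run in each environment (DEV, STG, UAT) and diff the output.
-- Settings still at their defaults are excluded to keep the diff small.
SELECT name, setting, source
FROM pg_settings
WHERE source NOT IN ('default', 'override')
ORDER BY name;

-- Also compare when the tables the function touches were last analyzed,
-- and their row counts, to spot stale statistics or data-volume skew.
SELECT relname, last_analyze, last_autoanalyze, n_live_tup
FROM pg_stat_user_tables
ORDER BY relname;
```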
Sources
Lock:tuple - Amazon Aurora
Aurora RDS slow performance | AWS re:Post
Handle overloaded Oracle functions in Aurora PostgreSQL-Compatible - AWS Prescriptive Guidance

answered a year ago
EXPERT
reviewed a year ago
  • The overall load on the RDS instance and the relevant metrics all look good from the database perspective. The usage stats show that we have sufficient capacity.

    We have also run maintenance tasks such as REINDEX, VACUUM, and ANALYZE on the DB and schemas, but the issue still exists.
