After upgrading RDS to PostgreSQL 16.1, Lambda invocation times via the aws_lambda extension have increased significantly


We're using the aws_lambda extension's aws_lambda.invoke function to invoke a Lambda function from a query. Over the past week or so, invocations have been taking significantly longer than before, and the only thing that has changed in that time is that we upgraded the RDS engine version from 15.x to 16.1.

  • Cold invocations via RDS take about 6 seconds.
  • Warm invocations via RDS take about 3 seconds.
  • Invoking the Lambda directly via the API or the AWS CLI takes less than 100 ms (warm or cold).
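
One way to make cold/warm numbers like these repeatable across engine versions is to time the same call several times from the client. The sketch below is a hypothetical helper (`time_invocations` is not part of any AWS tooling) that times a callable and reports the first call separately from the median of the rest. It assumes that `invoke` would, in practice, run the aws_lambda.invoke query through a PostgreSQL driver; that part is left out so the example runs on its own, and a sleeping stub stands in for it.

```python
import statistics
import time
from typing import Callable, Dict, List


def time_invocations(invoke: Callable[[], object], runs: int = 5) -> Dict[str, float]:
    """Call `invoke` `runs` times and summarise wall-clock latency in milliseconds.

    The first call is reported on its own because it is the one most likely
    to hit a cold Lambda execution environment (an assumption, not a guarantee).
    """
    if runs < 2:
        raise ValueError("need at least 2 runs to separate the first call from warm calls")
    samples: List[float] = []
    for _ in range(runs):
        start = time.perf_counter()
        invoke()  # e.g. execute the aws_lambda.invoke query via a DB driver
        samples.append((time.perf_counter() - start) * 1000.0)
    warm = samples[1:]
    return {
        "first_ms": samples[0],
        "warm_median_ms": statistics.median(warm),
        "warm_max_ms": max(warm),
    }


if __name__ == "__main__":
    # Hypothetical stand-in that sleeps for 10 ms; replace it with a function
    # that runs the aws_lambda.invoke query against your RDS instance.
    def fake_invoke() -> None:
        time.sleep(0.01)

    print(time_invocations(fake_invoke, runs=5))
```

Running the same harness against an older engine version, the upgraded one, and a direct SDK call to the function would put the three measurements on the same footing.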

I'm not sure how to debug this, since RDS is a fully managed service. :)

RDS Instance specification:

  • Engine: PostgreSQL
  • Engine version: 16.1
  • Instance class: db.t4g.small
  • Parameter/Options group customizations: None
  • Storage type: gp3
  • Networking: Private subnet group

Lambda specification:

  • Architecture: arm64
  • Deployment type: docker images hosted on AWS ECR
  • Memory: 128 MB
  • Timeout: 3 seconds

  • I'm seeing the same behavior after a PostgreSQL upgrade from 13.10 to 14.10. Our Lambda function is kept very warm (it is invoked many times per second). The Lambda invocations take less than 300 ms as observed by X-Ray tracing, and the asynchronous responses are reported as returning in under 20 ms. But even a basic aws_lambda.invoke call takes more than 2 seconds round trip as measured at RDS. This makes no sense for an asynchronous call. I don't know what aws_lambda.invoke is doing to cause this.

  • I was able to reproduce this issue by creating a small 13.10 PostgreSQL database, replicating the Lambda invocation against a simple Lambda function, and testing the invocations. On 13.10, the round trip is under 50 ms. I then upgraded the database to 14.10, and the invocations again took more than 2 seconds. I'm going to try creating a fresh 14.10 instance to see whether the problem persists without the upgrade step.

    EDIT: A new 14.10 instance behaves the same (invocations take more than 2 seconds). I'm going to try a few more versions. This feels like an issue with the aws_lambda extension for PostgreSQL, but I'm not sure.
    EDIT2: 13.13 works fine; 15.5 fails; 16.1 fails. It seems that from version 14 onward there's an issue with the aws_lambda extension.
    EDIT3: I've also tried this on different instance sizes (t4.large, t4.micro), if that matters. Additionally, I've been using gp2 storage, compared to the original poster's gp3.
    EDIT4: New, frustrating discovery: aws_lambda works fine on Aurora PostgreSQL 14.10, just not on RDS for PostgreSQL 14.10.

  • As an update, I had to pay for technical support to file an issue with AWS. They've escalated it to their product team, so it may be an actual bug in RDS. Who knows how long they'll take to provide an update. The only interim solution they suggested was to upgrade to a db.t3.2xlarge or db.t4g.2xlarge instance. We run a t4g.large, so I'm not happy with their suggestion to quadruple our costs.

    Another sign that it's definitely an AWS issue: I created a blue/green deployment with our original t3.large/14.10 as blue and an upgraded t4g.large/16.1 as green. I ran tests for a day showing that the green environment didn't have the latency issue. As soon as I did the switchover, the new blue instance (t4g.large/16.1) started exhibiting the latency.

  • Thanks for the update @stephen. We are experiencing the same issues (on db.m6g.xlarge) and I've just lodged a technical support ticket too.

  • Update: We developed our own workaround, as AWS was taking too long. They have since replied with the following:

    "The increased latency in aws_lambda extension on your PostgreSQL 14.10 instance is due to a compatibility issue we have identified in the aws_lambda extension and the AWS SDK library version. We are working on the fixing this, and the fix will be available in a future minor PostgreSQL version."

    Their recommended mitigation remained "Upgrade to something with 8 vCPUs or larger", and it is patently ridiculous that we should pay 4x more as a stopgap for their issue. My guess is that their fix to the aws_lambda extension will ship with a new PostgreSQL minor version. 16.2 is scheduled for release on February 8, but their fix could well take longer.

1 Answer

The first thing I would look at is RDS Performance Insights while your Lambda functions are running. This will give you a good view of how the instance and your queries are performing.

https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/USER_PerfInsights.Overview.html

It is quite possible that you will have to optimize your queries as you move to a newer version of Postgres. Here is a link with some good suggestions for query optimization:

https://aws.amazon.com/blogs/database/optimizing-and-tuning-queries-in-amazon-rds-postgresql-based-on-native-and-external-tools/

Without knowing more about your environment, I'd also note that a db.t4g.small has only 2 GB of memory and 2 vCPUs; it's a small instance.

AWS
answered 5 months ago
  • Hi! Thanks for taking the time to reply :) Unfortunately, I don't think it has anything to do with the query or the RDS instance size itself, since the query is nothing more than a simple Lambda invocation. This is the entire query:

    SELECT * from aws_lambda.invoke(aws_commons.create_lambda_function_arn('arn:aws:lambda:ap-southeast-1:444455556666:function:example', 'ap-southeast-1'), '{"body": "test"}'::json );

    So I believe the problem lies in how the aws_lambda extension calls the Lambda API from PostgreSQL.
