Occasionally getting "MongoServerSelectionError: Server selection timed out..." errors


Hi,

We have a lambda application that uses DocumentDB as the database layer. The lambdas are set in the same VPC as the DocumentDB cluster, and we're able to connect, and do all query (CRUD) operations as normal. The cluster is a simple cluster with 1 db.t4g.medium instance.

One of the lambdas is triggered by an SNS queue and gets executed ~1M times over a 24h period. There is a database query involved in each one, and the vast majority of these executions go fine.

The MongoClient is created outside of the handler in a separate file, as detailed here: https://www.mongodb.com/docs/atlas/manage-connections-aws-lambda/, so that "warm" lambda executions will re-use the same connection. Our lambdas are executed as async handlers, not using a callback. The MongoClient itself is created in its own file like so:

import { MongoClient } from 'mongodb';

const uri = `mongodb://${process.env.DB_USER}:${process.env.DB_PASSWORD}@${process.env.DB_ENDPOINT}:${process.env.DB_PORT}/?tls=true&replicaSet=rs0&readPreference=secondaryPreferred&retryWrites=false`;

const client = new MongoClient(uri, {
    tlsCAFile: 'certs/rds-combined-ca-bundle.pem'
});

export const mongoClient = client.connect()

A sample handler would be something like this (TypeScript):

import type { SNSEvent } from "aws-lambda";
import { mongoClient } from "./mongo.client";

const DB_NAME = 'MyDB';

export const snsHandler = async (event: SNSEvent): Promise<void> => {
    const notif = JSON.parse(event.Records[0].Sns.Message);

    const item = await mongoClient
        .then(client => client.db(DB_NAME).collection(notif.collection).findOne({ _id: notif.id }))
        .catch(err => {
            console.error(`Couldn't find item with id ${notif.id} from collection ${notif.collection}`, err);
            return null;
        });

    // do something with item
}


Every so often (~100 times a day), we get specific errors along the lines of:


MongoServerSelectionError: Server selection timed out after 30000 ms
    at Timeout._onTimeout (/var/task/src/settlement.js:5:157446)
    at listOnTimeout (internal/timers.js:557:17)
    at processTimers (internal/timers.js:500:7)

or

[MongoClient] Error when connecting to mongo xg [MongoServerSelectionError]: Server selection timed out after 30000 ms
    at Timeout._onTimeout (/var/task/src/settlement.js:5:157446)
    at listOnTimeout (internal/timers.js:557:17)
    at processTimers (internal/timers.js:500:7) {
  reason: To {
    type: 'ReplicaSetNoPrimary',
    servers: Map(1) {
      '[REDACTED].docdb.amazonaws.com:[REDACTED]' => [ry]
    },
    stale: false,
    compatible: true,
    heartbeatFrequencyMS: 10000,
    localThresholdMS: 15,
    setName: 'rs0',
    logicalSessionTimeoutMinutes: undefined
  },
  code: undefined,
  [Symbol(errorLabels)]: Set(0) {}
}

or

Lambda exited with error: exit status 128 Runtime.ExitError


In the Monitoring tab of the DocumentDB instance, the CPU doesn't go higher than 10%, and the database connections peak at ~170 (the connection limit on the t4g.medium is 500, unless I'm mistaken), with an average of around 30-40. For the lambda itself, the max concurrent executions peak at ~100. The errors aren't correlated with the peaks: they can happen at any time, throughout the day.

Can anyone provide any insight as to why the connection might be timing out from time to time, please? The default parameters of the MongoClient should keep the connection alive as long as the lambda is still active, and we don't seem to be close enough to the max connection limit.

I'm assuming the way we have it set up is wrong, but I'm not sure how to go about fixing it.

Thanks

asked 2 years ago, 2841 views
1 Answer

Hi,

Thanks for reaching out. From the information provided, there doesn't appear to be any obvious issue with your setup. You have followed best practices, such as creating the MongoClient outside of the handler function, which should ensure that a warm Lambda reuses the connection on subsequent requests.

You mentioned that the Lambda is triggered by SNS with ~1 million requests per 24 hours and that you see ~100 errors in the same time frame, which is an error rate of 0.01%. As you may already know, Lambda is a highly distributed system, and some intermittent or transient issues, usually network-related, are to be expected.

The issue does not necessarily have to be with the Lambda service itself. Each external request made by the Lambda function can pass through numerous components such as DNS servers, load balancers, switches and the like. From the errors here, it seems that the Lambda function made a request to the Mongo server that timed out. This request could have been dropped by any of the network components mentioned earlier. You can read more about this in our documentation.

The best practice here is simply to retry the request. Since the Lambda is being triggered by SNS, we will not be able to trigger the client (SNS) to retry the actual request. Therefore, this must be handled inside the Lambda function itself. You can add code to retry the request in the Lambda, or you can handle it differently, such as by re-sending the message to the SNS topic. It seems you already have a catch statement here. May I ask whether you have added any retry behavior to your code for when these timeouts occur?
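As a rough illustration of retrying inside the function, here is a minimal sketch of a generic retry wrapper with exponential backoff. The name `withRetry` and its parameters `maxAttempts` and `baseDelayMs` are hypothetical and not part of any AWS or MongoDB API; the default values are placeholders rather than recommendations.

```typescript
// Hypothetical helper (not from any SDK): retries an async operation,
// waiting baseDelayMs, then 2x, then 4x, ... between failed attempts.
async function withRetry<T>(
    operation: () => Promise<T>,
    maxAttempts = 3,
    baseDelayMs = 100,
): Promise<T> {
    let lastError: unknown;
    for (let attempt = 1; attempt <= maxAttempts; attempt++) {
        try {
            // Call the operation afresh on every attempt.
            return await operation();
        } catch (err) {
            lastError = err;
            if (attempt < maxAttempts) {
                const delayMs = baseDelayMs * 2 ** (attempt - 1);
                await new Promise<void>(resolve => setTimeout(resolve, delayMs));
            }
        }
    }
    // Every attempt failed: surface the last error to the caller.
    throw lastError;
}
```

In the handler from the question, the query could be wrapped as `withRetry(() => mongoClient.then(client => client.db(DB_NAME).collection(notif.collection).findOne({ _id: notif.id })))`. Note that each failed attempt may wait for the full 30000 ms server selection timeout seen in the errors above, so the function timeout would need to allow for several attempts, or the driver's `serverSelectionTimeoutMS` option could be lowered.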

Since the vast majority of your Lambda invocations are successful, retrying the request should lead to a successful execution. Please do let me know if you have any questions regarding this.

AWS
SUPPORT ENGINEER
Ryan_A
answered 2 years ago
  • Hi Ryan,

    Thanks for the detailed answer. Am I correct in then assuming that this would be considered a "normal" error in that there's nothing to be done outside of implementing redundancy in the Lambda?

    In relation to this: "Since the Lambda is being triggered by SNS, we will not be able to trigger the client(SNS) to retry the actual request", I thought SNS automatically retries delivery 3 times? https://docs.aws.amazon.com/sns/latest/dg/sns-message-delivery-retries.html. Is that just for when the SNS is under YOUR control, vs when subscribing to an external queue?

  • As a second question, I implemented basic retrying along the lines of:

    const client = createClient();
    export const mongoClient = client.connect()
        .then(client => client)
        .catch(_ => {
            const client = createClient();
            return client.connect();
        })
        .then(client => client)
        ...
        .catch(err => {
            throw err; // throw it to recatch later
        });

    for a total of 3 attempts, though it still fails (3 separate timeouts). If a lambda/connection fails once, can we retry in that lambda, or is it considered "burned" and I need to retrigger a brand new lambda?

  • Hi,

    Thanks for your reply. SNS will only follow that retry behavior if it is unable to reach the Lambda endpoint. SNS invokes Lambda asynchronously, and it retries only if that invocation cannot be delivered. In this case, SNS was able to reach the Lambda endpoint, so from the point of view of SNS it has successfully passed the event to Lambda. As the documentation puts it: "The delivery policy defines how Amazon SNS retries the delivery of messages when server-side errors occur (when the system that hosts the subscribed endpoint becomes unavailable)."

  • It is strange that it fails with a timeout each time. You should be able to retry within the Lambda if the request fails. The various SDKs, for example, have built-in retry behavior. Are you recreating the connection that was created outside of the handler function?

  • Hi Ryan, thanks for the clarification on the SNS delivery behaviour. For the retry: yes, the code above lives in a separate file, and its result is awaited in the handler itself. Is there an example (e.g. git) somewhere that I can compare against? Or, if necessary, I can separate out the code in question and send it somewhere. As it stands, in the file where I create the client and call connect(), I catch any errors, then create a new client (createClient()) and retry the connection. I do this 3 times before throwing, hence my question on whether a lambda can be considered "burned" (it'll never work). Thanks
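One point relevant to the "burned" question: a promise exported at module level, such as `export const mongoClient = client.connect()`, is evaluated once per execution environment, so if it rejects, every later warm invocation awaits the same rejected promise. Whether that is what happens here is not confirmed by the thread. The following is a minimal sketch of one way to avoid it, by caching the connection promise and discarding it on failure. `createLazyConnection` is a hypothetical helper, not a driver API.

```typescript
// Hypothetical helper: caches the promise returned by `connect` so warm
// invocations share one connection, but forgets a rejected promise so the
// next call starts a fresh connection attempt.
function createLazyConnection<C>(connect: () => Promise<C>): () => Promise<C> {
    let cached: Promise<C> | undefined;
    return () => {
        if (cached) {
            return cached;
        }
        const attempt = connect();
        cached = attempt;
        // On failure, clear the cache (only if it still holds this attempt).
        attempt.catch(() => {
            if (cached === attempt) {
                cached = undefined;
            }
        });
        return attempt;
    };
}
```

Assuming the `createClient()` function from the comment above returns a new MongoClient, usage might look like `export const getClient = createLazyConnection(() => createClient().connect());`, with the handler calling `await getClient()` on each invocation instead of awaiting a module-level promise.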
