Occasionally getting "MongoServerSelectionError: Server selection timed out..." errors

Question

Hi,

We have a lambda application that uses DocumentDB as the database layer. The lambdas are set in the same VPC as the DocumentDB cluster, and we're able to connect, and do all query (CRUD) operations as normal. The cluster is a simple cluster with 1 db.t4g.medium instance.

One of the lambdas is triggered by an SNS topic and gets executed ~1M times over a 24h period. Each execution runs one database query, and the vast majority of these executions succeed.

The MongoClient is created outside of the handler in a separate file, as detailed here: https://www.mongodb.com/docs/atlas/manage-connections-aws-lambda/ so that "warm" lambda executions will re-use the same connection. Our lambdas are executed as async handlers, not using a callback. The MongoClient itself is created in its own file like so:

```
import { MongoClient } from "mongodb";

const uri = `mongodb://${process.env.DB_USER}:${process.env.DB_PASSWORD}@${process.env.DB_ENDPOINT}:${process.env.DB_PORT}/?tls=true&replicaSet=rs0&readPreference=secondaryPreferred&retryWrites=false`;

const client = new MongoClient(uri, {
    tlsCAFile: 'certs/rds-combined-ca-bundle.pem'
});

// Export the connection promise so that warm invocations share one client.
export const mongoClient = client.connect();
```

A sample handler would be something like this (TypeScript):
```
import type { SNSEvent } from "aws-lambda";
import { mongoClient } from "./mongo.client";

const DB_NAME = 'MyDB';

export const snsHandler = async (event: SNSEvent): Promise<void> => {
    const notif = JSON.parse(event.Records[0].Sns.Message);

    const item = await mongoClient
        .then(client => client.db(DB_NAME).collection(notif.collection).findOne({ _id: notif.id }))
        .catch(err => {
            // Any failure (including a connection timeout) is logged and turned into null.
            console.error(`Couldn't find item with id ${notif.id} from collection ${notif.collection}`, err);
            return null;
        });

    // do something with item
};
```
---
---

Every so often (~100 times a day), we get specific errors along the lines of:

```

MongoServerSelectionError: Server selection timed out after 30000 ms
    at Timeout._onTimeout (/var/task/src/settlement.js:5:157446)
    at listOnTimeout (internal/timers.js:557:17)
    at processTimers (internal/timers.js:500:7)
```

or

```
[MongoClient] Error when connecting to mongo xg [MongoServerSelectionError]: Server selection timed out after 30000 ms
    at Timeout._onTimeout (/var/task/src/settlement.js:5:157446)
    at listOnTimeout (internal/timers.js:557:17)
    at processTimers (internal/timers.js:500:7) {
  reason: To {
    type: 'ReplicaSetNoPrimary',
    servers: Map(1) {
      '[REDACTED].docdb.amazonaws.com:[REDACTED]' => [ry]
    },
    stale: false,
    compatible: true,
    heartbeatFrequencyMS: 10000,
    localThresholdMS: 15,
    setName: 'rs0',
    logicalSessionTimeoutMinutes: undefined
  },
  code: undefined,
  [Symbol(errorLabels)]: Set(0) {}
}
```

or

```
Lambda exited with error: exit status 128 Runtime.ExitError
```

---
---

In the Monitoring tab of the DocumentDB instance, the CPU doesn't go higher than 10%, and the database connections peak at ~170 (the connection limit on the t4g.medium is 500, unless I'm mistaken), with an average of around 30-40. For the lambda itself, concurrent executions peak at ~100. The errors aren't correlated with the peaks: they can happen at any time throughout the day.

Can anyone provide any insight as to why the connection might be timing out from time to time, please? The default parameters of the MongoClient should keep the connection alive as long as the lambda is still active, and we don't seem to be close enough to the max connection limit.

I'm assuming the way we have it set up is wrong, but I'm not sure how to go about fixing it.

Thanks

Answer

Hi,

Thanks for reaching out. From the information provided, I don't see any obvious issue with your setup. You are already following the recommended practices, such as creating the MongoClient outside of the handler, which should ensure that a warm Lambda reuses the connection on subsequent requests.

You mentioned that the Lambda is triggered by SNS with around 1 million requests per 24 hours and that you see ~100 errors over the same time frame, which is an error rate of about 0.01%. As you may already know, Lambda is a highly distributed system, and some intermittent or transient issues, usually network-related, are to be expected.

The issue does not necessarily have to be with the Lambda service itself. Each external request made by the Lambda function can pass through numerous components such as DNS servers, load balancers, switches and the like. From the errors here, it seems that the Lambda function made a request to the Mongo server and that request timed out. It could have been dropped by any of the network components mentioned above. You can read more about this in our [documentation](https://docs.aws.amazon.com/general/latest/gr/api-retries.html).

The recommended practice here is simply to retry the request. Since the Lambda is triggered by SNS, we cannot make the client (SNS) resend the actual request, so the retry has to be handled inside the Lambda function itself. You can add code that retries the request in the Lambda, or you can handle it differently, for example by re-sending the message to the SNS topic. I see you already have a `catch` statement, which logs the error and returns `null`. May I ask if you have added any retry behavior to your code for when these timeouts occur?
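To illustrate the in-function retry, here is a minimal sketch of a retry wrapper with exponential backoff. `withRetry`, `RetryOptions` and `isServerSelectionError` are hypothetical names, not part of the MongoDB driver or of your code, and the check assumes that the error's `name` property matches the class name shown in your logs.

```typescript
// Hypothetical helper (not part of the MongoDB driver): retries an async
// operation with exponential backoff when the error looks transient.
interface RetryOptions {
  maxAttempts: number; // total tries, including the first one
  baseDelayMs: number; // wait before the second attempt; doubles each time
  isRetryable: (err: unknown) => boolean;
}

const sleep = (ms: number): Promise<void> =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// Assumption: the error's `name` equals the class name seen in the logs.
const isServerSelectionError = (err: unknown): boolean =>
  err instanceof Error && err.name === "MongoServerSelectionError";

async function withRetry<T>(
  operation: () => Promise<T>,
  { maxAttempts, baseDelayMs, isRetryable }: RetryOptions
): Promise<T> {
  for (let attempt = 1; ; attempt++) {
    try {
      return await operation();
    } catch (err) {
      // Give up on the last attempt or on errors that are not transient.
      if (attempt >= maxAttempts || !isRetryable(err)) throw err;
      // Backoff schedule: baseDelayMs, 2 * baseDelayMs, 4 * baseDelayMs, ...
      await sleep(baseDelayMs * 2 ** (attempt - 1));
    }
  }
}

// Possible use inside the handler (sketch only):
//   const item = await withRetry(
//     () => mongoClient.then(client =>
//       client.db(DB_NAME).collection(notif.collection).findOne({ _id: notif.id })),
//     { maxAttempts: 3, baseDelayMs: 200, isRetryable: isServerSelectionError }
//   );
```

Two things to keep in mind with a wrapper like this. Each failed attempt can take up to the 30000 ms server selection timeout seen in your logs, so the total retry time needs to fit within the Lambda timeout. Also, if the shared `mongoClient` promise itself was rejected, awaiting it again returns the same rejection, so retries only help when the failure happens in the query rather than in the initial `connect()`.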

Since the vast majority of your Lambda invocations are successful, retrying the request should lead to a successful execution. Please do let me know if you have any questions regarding this.
