VERY peculiar failure after switching from t3.micro to t3a.micro

I switched to a t3a.micro reservation after my t3.micro reservation expired. Since then, a very, very strange networking failure mode has appeared. Details:

Debian 'Bullseye' v11.1
'standard' exim4 packages, 4.94.2-7, which is built against gnutls
'standard' gnutls packages, 3.7.1-5

Email had been working normally, and for the most part still does. However, after a few days of uptime, messages to gmail.com addresses (either forwarded through the server or sent by one of my users) stall in the queue, with the following error repeating:

2021-11-22 07:01:26.400 [166789] 1mp428-000gDO-Bj H=alt1.gmail-smtp-in.l.google.com [64.233.171.26] TLS error on connection (recv): Error in the pull function.
2021-11-22 07:01:26.400 [166789] 1mp428-000gDO-Bj H=alt1.gmail-smtp-in.l.google.com [64.233.171.26]:25: Remote host closed connection in response to end of data

That specific error is from GnuTLS.
Google has four MXs, which are CNAME'd to multiple IPs. It doesn't appear to matter which one of them the connection is to.

I've gone over my exim settings with a fine-toothed comb - everything is correct, and my server setup is correct, as evidenced by the lack of errors elsewhere.

As part of the diagnostic process, I upgraded the ENA driver to 2.6.0g, with no effect on the errors.
I also switched the MTU of the interface from 9001 (which is what it boots up with) to 1500. No effect.
Notably, regardless of MTU, the interface shows NO errors either on RX or TX.
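
For anyone wanting to repeat the interface check, here is a minimal sketch that reads the MTU and the error/drop counters the Linux kernel exposes under sysfs. The interface name `ens5` and the helper name `read_iface_counters` are assumptions for illustration, not taken from my setup notes above; substitute whatever `ip link` shows on your instance.

```python
from pathlib import Path


def read_iface_counters(iface, sys_root="/sys/class/net"):
    """Return the MTU and RX/TX error and drop counters for one interface.

    sys_root is a parameter only so the function can be pointed at a
    test directory; on a real system leave it at the default.
    """
    iface_dir = Path(sys_root) / iface
    stats_dir = iface_dir / "statistics"
    names = ("rx_errors", "tx_errors", "rx_dropped", "tx_dropped")
    # Each counter is a small text file holding a single integer.
    counters = {name: int((stats_dir / name).read_text().strip()) for name in names}
    counters["mtu"] = int((iface_dir / "mtu").read_text().strip())
    return counters


# Example (interface name is an assumption):
# print(read_iface_counters("ens5"))
```

All-zero error counters at both MTU settings would match what I saw, but that only rules out errors the driver itself counts; it says nothing about what happens further along the path.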

What does have an effect is rebooting. A few moments after coming back online, the messages will go out to the destination with no error.

The vast majority of email between my server and Google's goes through fine, with no errors. The problem has only appeared with emails that come inbound to my server and are then forwarded to a Google address - with one exception, in which a particularly large message from one of my users to multiple gmail users also stuck in the queue, with the same TLS error.

It may be coincidental that this began after the switch from t3 to t3a; however, the errors didn't begin happening in earnest until after the switch.

I've exhausted my searches elsewhere. There's a smattering of reports of the same error, but mostly related to a bug in curl from a while back. None specific to exim, or gmail servers. I'm aware that gnutls is...a bit twitchy, might be the best descriptor.

It's not really tenable to have to reboot my server regularly in order to push through a few stalled messages.

My next test will be to revert to a t3.micro for a short while and see if there's a difference - but in the meantime, does anyone have a clue what could be happening? I realize the probability of an answer is very low with such a bizarro issue...

but - thanks in advance!

cheers.

asked 2 years ago · 212 views
4 Answers
Update: I switched back to t3.micro. An email came in from the same sender and to the same gmail destination. The same error occurred, but instead of sitting in the queue for a day or more, it went through in six minutes.

Now to switch back to t3a.micro and see what happens. Like I said - VERY strange.

answered 2 years ago
Update: I let the machine run for another day, as these inbound emails arrive once per day.

The message in question came in and was forwarded straight out to gmail, with no TLS errors.

I have now switched back to t3a.micro and will report back the results, which I'm reasonably confident will show the same failure mode.

If this does fail again, it points to some sort of issue between the t3.micro and t3a.micro architectures. At worst I sell my 3yr/upfront reserved instance and go back to t3.micro, but it does leave a peculiar problem dangling.

answered 2 years ago
Digging deeper, I've found instances of this 'Error in the pull function' going back as far as August. However, those errors were one-offs, which would delay messages for minutes, not hours or days.

Given that evidence, and how inconsistent it is, I'm going to have to declare this a wild goose chase. Since resuming the t3a, there have still been pull function errors - but they likewise are only delaying messages a few minutes, not hours or days.
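
In case it helps anyone repeat the log digging, here is a rough sketch of how the errors can be tallied per day and per message ID. It assumes log lines in the same format as the two quoted in the question (timestamp with milliseconds, PID in brackets, message ID, then `H=host [ip]`); the function name `tally_pull_errors` and the log path in the usage comment are my own choices for illustration, so adjust them to your setup.

```python
import re
from collections import Counter

# Matches the leading fields of an exim mainlog line in the format quoted
# in the question: date, time, [pid], message ID, H=host [ip].
LINE_RE = re.compile(
    r"^(?P<date>\d{4}-\d{2}-\d{2}) (?P<time>[\d:.]+) "
    r"\[(?P<pid>\d+)\] (?P<msgid>\S+) H=(?P<host>\S+) \[(?P<ip>[^\]]+)\]"
)
NEEDLE = "Error in the pull function"


def tally_pull_errors(lines):
    """Count GnuTLS pull-function errors per day and per message ID."""
    per_day = Counter()
    per_msg = Counter()
    for line in lines:
        if NEEDLE not in line:
            continue
        match = LINE_RE.match(line)
        if match is None:
            continue  # line has the error text but not the expected layout
        per_day[match.group("date")] += 1
        per_msg[match.group("msgid")] += 1
    return per_day, per_msg


# Example (path is an assumption; Debian's exim4 packages usually log here):
# with open("/var/log/exim4/mainlog") as fh:
#     per_day, per_msg = tally_pull_errors(fh)
#     print(per_day.most_common())
```

A message ID with many hits spread over hours or days is a stalled message; a single hit followed by a successful delivery is the one-off pattern I found going back to August.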

I'll mark this answered. I'm unsure whether I should just delete the whole thread, since the evidence doesn't support the theory, so my bloviations are unlikely to be helpful to anyone else...???

answered 2 years ago