40-byte TCP packets randomly rejected in VPC Flow Logs


Possibly related: https://repost.aws/questions/QUcNiaV2eCSm2_eWZgajO9Ig/timeouts-on-reverse-proxy-after-enabling-dns-hostnames

We have a typical VPC with a public subnet running an nginx reverse proxy on 10.0.0.5 (which also has a public IP) and a private subnet hosting our application, which serves a web frontend on 10.0.1.123 connected to a backend, also on the private network. We are having difficulty accessing the application from the internet: users experience random timeouts and 503 gateway errors. Restarting the backend usually resolves the issue, but after a few hours it reappears. We haven't found anything indicating errors in the application logs, and all processes appear to be responding normally. SSH access to the instances sometimes suffers similar delays, but retrying usually works.

While trying to debug this, I enabled VPC Flow Logs to see where connectivity is failing. I can see normal traffic between the reverse proxy and the app frontend on port 3050, with the app responding normally to the ephemeral port that opened the connection. But sometimes I see REJECT records in the flow logs, always in the direction from the frontend to the proxy, and always for a single 40-byte packet. I can't find any other significant REJECT records on other network interfaces in the VPC.

2 68xxxxxxxx00 eni-04xxxxxxxxxxxxxc2 10.0.1.123 10.0.0.5 3050 57408 6 7 811 1707817290 1707817348 ACCEPT OK    
2 68xxxxxxxx00 eni-04xxxxxxxxxxxxxc2 10.0.0.5 10.0.1.123 57384 3050 6 9 1515 1707817290 1707817348 ACCEPT OK   # normal communication
2 68xxxxxxxx00 eni-04xxxxxxxxxxxxxc2 10.0.1.123 10.0.0.5 3050 57384 6 1 40 1707817290 1707817348 REJECT OK     # why is this blocked? there is a response on this port 2 lines down
2 68xxxxxxxx00 eni-04xxxxxxxxxxxxxc2 10.0.0.5 10.0.1.123 57408 3050 6 9 1519 1707817290 1707817348 ACCEPT OK   
2 68xxxxxxxx00 eni-04xxxxxxxxxxxxxc2 10.0.1.123 10.0.0.5 3050 57384 6 7 811 1707817290 1707817348 ACCEPT OK    # normal communication
2 68xxxxxxxx00 eni-04xxxxxxxxxxxxxc2 10.0.0.5 10.0.1.123 57416 3050 6 10 1507 1707817290 1707817348 ACCEPT OK  # normal communication
2 68xxxxxxxx00 eni-04xxxxxxxxxxxxxc2 10.0.1.123 10.0.0.5 3050 57416 6 9 892 1707817290 1707817348 ACCEPT OK    # normal communication
2 68xxxxxxxx00 eni-04xxxxxxxxxxxxxc2 10.0.1.123 10.0.0.5 3050 57446 6 7 724 1707817290 1707817348 ACCEPT OK    
2 68xxxxxxxx00 eni-04xxxxxxxxxxxxxc2 10.0.1.123 10.0.0.5 3050 57446 6 1 40 1707817290 1707817348 REJECT OK     # why is this blocked? the previous 7 packets went through fine
2 68xxxxxxxx00 eni-04xxxxxxxxxxxxxc2 10.0.1.123 10.0.0.5 3050 57408 6 1 40 1707817290 1707817348 REJECT OK     
2 68xxxxxxxx00 eni-04xxxxxxxxxxxxxc2 10.0.1.123 10.0.0.5 3050 55158 6 1 40 1707817350 1707817408 REJECT OK     
2 68xxxxxxxx00 eni-04xxxxxxxxxxxxxc2 10.0.1.123 10.0.0.5 3050 55158 6 7 810 1707817350 1707817408 ACCEPT OK    # why is this accepted? the previous packet was blocked
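
(In case it helps anyone reproduce this: the REJECT records above can be isolated with a CloudWatch Logs Insights query along these lines, assuming the flow logs are delivered to a CloudWatch Logs log group in the default format, where Logs Insights discovers the flow log fields automatically. The interface ID is the one from the records above.)

fields @timestamp, srcAddr, srcPort, dstAddr, dstPort, protocol, action
| filter action = "REJECT" and interfaceId = "eni-04xxxxxxxxxxxxxc2"
| sort @timestamp desc
| limit 50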

What could these 40-byte packets be? Is it some sort of TCP/HTTP keep-alive traffic, or something related to DNS? It's not ICMP, because the protocol field is 6 (TCP). Why are they being rejected? I also noticed that the difference between the start and end times is almost always 58 or 60 seconds; what could this indicate? nginx keepalive_timeout is configured to 65 seconds, and the operating system TCP keepalive settings are as follows:

net.ipv4.tcp_keepalive_intvl = 75
net.ipv4.tcp_keepalive_probes = 9
net.ipv4.tcp_keepalive_time = 7200

Thanks for any help!

strophy
asked 2 months ago
2 Answers

Given that the time difference is (generally) around a minute, the lone packet you're seeing would seem to be some sort of retransmission. The only way to be sure would be to run a packet capture using (say) tcpdump on the reverse proxy or on the frontend server. You could also use VPC Traffic Mirroring to do the same thing, but it's a bit more effort to set up. Either way, you need to see the packet dump to really know what the packet is and what the root cause is.
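
For what it's worth, 40 bytes is consistent with a 20-byte IPv4 header plus a bare 20-byte TCP header, i.e. a packet with no payload such as an ACK, FIN or RST, which would fit a retransmitted control packet. A rough capture on the frontend server (the sender of the rejected packets) along these lines should show exactly which flags those packets carry; the interface name and output path are just examples:

# capture the frontend <-> proxy conversation on port 3050
sudo tcpdump -i eth0 -nn -s 0 -w /tmp/frontend-3050.pcap 'tcp port 3050 and host 10.0.0.5'

# afterwards, look only at packets carrying FIN or RST flags
tcpdump -nn -r /tmp/frontend-3050.pcap 'tcp[tcpflags] & (tcp-fin|tcp-rst) != 0'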

As for the other article you've linked: I can't speak to the behaviour of those systems, but if a reverse DNS lookup happens before the host responds and it takes longer than the TCP handshake timeout, that could be why there is a retransmission. But that's just a guess.

AWS
EXPERT
answered 2 months ago
Accepted Answer

This problem was eventually traced to net.netfilter.nf_conntrack_max being set too low on the reverse proxy. It was identified by inspecting the dmesg output, which showed netfilter dropping packets. Running sudo sysctl -w net.netfilter.nf_conntrack_max=131072 resolved the issue.
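
For anyone else hitting the same symptoms, a rough checklist on the reverse proxy looks like this (the exact kernel message wording and the sysctl.d file name can vary by distribution, so treat these as examples):

# look for conntrack table exhaustion messages, typically "nf_conntrack: table full, dropping packet"
sudo dmesg | grep -i conntrack

# compare the current number of tracked connections against the limit
sysctl net.netfilter.nf_conntrack_count net.netfilter.nf_conntrack_max

# raise the limit immediately (value from above)
sudo sysctl -w net.netfilter.nf_conntrack_max=131072

# persist the change across reboots
echo 'net.netfilter.nf_conntrack_max = 131072' | sudo tee /etc/sysctl.d/99-conntrack.conf
sudo sysctl --system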

strophy
answered 2 months ago
