AWS IoT via MQTT randomly succeeds and fails on subscribe and publish messages


I have an application using the Mosquitto MQTT library that publishes and subscribes using MQTT messages to AWS IoT. The connection succeeds, and the permissions are correct. I am logging using CloudWatch, and can see the successful connection in the log.

The application subscribes to about 20 topics when it connects, then sends current values for those same topics. More than half the time, this fails. I get messages like this in the CloudWatch log:

{
    "timestamp": "2022-10-13 10:58:21.254",
    "logLevel": "ERROR",
    "traceId": "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx",
    "accountId": "xxxxxxxxxx",
    "status": "Failure",
    "eventType": "Subscribe",
    "protocol": "MQTT",
    "topicName": "soft/increment/UI2",
    "clientId": "58af8c4d-d863-449e-b632-d2aae465797b",
    "principalId": "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx",
    "sourceIp": "xxxxxxxxxxxxxxxxx",
    "sourcePort": 52524
}

After this, AWS IoT drops the connection.

If I disable the subscription step in my application, I get similar messages, but for eventType: Publish-In instead of Subscribe.

As far as I can tell, I'm not exceeding any of the limits described here: https://docs.aws.amazon.com/general/latest/gr/iot-core.html#limits_iot

If I allow my application to re-try the connection, it will fail repeatedly until one time it succeeds. At that point it can publish and receive data without any further trouble. This confirms that the permissions are correct.

If I reduce the number of topics from 20 to 8, the chance of success rises substantially, to the point where the connection is almost reliable.

If I add a 100ms delay between each subscription request, the chance of success rises substantially.

If I add a 100ms delay between subscriptions, make a successful connection, then remove the 100ms delay and try again, the connection almost always succeeds, even though it was always failing prior to adding the 100ms delay the first time.

If I remove the subscription step altogether, the chance of success does not apparently change.

Using this same application to connect to the Mosquitto broker never fails.

  • What is going on here? This cannot be explained by permissions, and isn't obviously related to service limits.
  • How can I find more information than just "Failure" when AWS IoT rejects a message?
  • How can I make this reliable?
asked 2 years ago, 1210 views
2 Answers

Hello Greg,

Thank you for responding. I'll try to answer your questions first, then share some interesting observations from Wireshark.

> Are you putting multiple subscriptions in one request? Please note IoT Core has this limit: A single SUBSCRIBE request has a quota of 8 subscriptions.

No, I am sending one subscription request per message. If I completely eliminate the subscription step and only attempt to publish, I get the same error, just with a message type of "Publish-In" instead of "Subscribe".

> Do you just have a single application/client? Are you saying that the application/client publishes on the same topics it subscribes to? If so, and you have 20 topics, you may be quickly approaching the publish requests per second limit for the connection. That is 100 messages per second (the sum of both Publish-In and Publish-Out).

The connection never gets to 100 messages before it fails. When it fails, it typically starts failing on the very first subscribe or publish message, immediately after the successful connection. I have seen occasions in the log where successes and failures are intermingled, but that is only after the connection has successfully started and is exchanging data. This intermingling of successes and failures is what I'm treating as the "reliable" state, since the connection is not dropped. The documentation says that messages beyond the limit are dropped, but not that the connection is closed, so I could be hitting the limit once the connection is running.

> There should be Disconnect events in CloudWatch then.

That is a very good point. I am not seeing disconnect messages in CloudWatch. I guess that means that the connection is not being terminated gracefully. Wireshark tells me that AWS is initiating the socket closure. It initiates both the "Encrypted Alert" exchange and the FIN.

> Although you are confident you don't have a permission problem, I would still recommend you temporarily change the IoT policy to a fully permissive policy to ensure this factor is removed from the equation.

I am using a maximally permissive policy. I don't see how my symptoms could be explained by a policy issue. I'm always connecting with the same credentials. The connection always succeeds, and the publish and subscribe requests succeed sometimes. If this were policy related then it could only be explained by a bug in IAM, which I doubt.

Some new info ...

When I was investigating the missing disconnect messages, Wireshark exposed an interesting issue in my application. When the application connects, it sends subscribe and publish messages with some delay between them to account for the AWS transmission limits. This works exactly as expected when connecting over a plain-text socket to a test broker. However, when using SSL, a combination of thread synchronization in my application and OpenSSL behaviour causes all these messages to be accumulated and transmitted in a single burst. The result is that 41 messages (1 connect, 20 subscribe, 20 publish) are all transmitted to AWS within about 40 milliseconds. AWS never sends back a single MQTT message, not even a CONNACK. It simply drops the connection, though it does so gracefully at the SSL level.
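One way to test whether the pre-CONNACK burst matters is to hold back every SUBSCRIBE and PUBLISH until the CONNACK has arrived, and then release them at a controlled pace. The following is a minimal, transport-agnostic sketch of that idea in Python. It is not the application's actual code: `ConnackGate`, `send`, and `interval_s` are hypothetical names, packets are represented as plain strings, and the sketch is single-threaded. In a real client the pacing would have to be applied at the point where bytes actually reach the TLS socket, since delays applied earlier can be collapsed again by buffering.

```python
import time
from collections import deque
from typing import Callable, Deque


class ConnackGate:
    """Queue outgoing packets until CONNACK is seen, then send them paced.

    `send` is a hypothetical callable that hands one packet to the transport.
    This sketch is not thread-safe.
    """

    def __init__(self, send: Callable[[str], None], interval_s: float = 0.1,
                 sleep: Callable[[float], None] = time.sleep) -> None:
        self._send = send
        self._interval_s = interval_s
        self._sleep = sleep
        self._pending: Deque[str] = deque()
        self._connected = False

    def submit(self, packet: str) -> None:
        """Send at once if CONNACK has been seen, otherwise queue the packet."""
        if self._connected:
            self._send(packet)
        else:
            self._pending.append(packet)

    def on_connack(self) -> None:
        """Call from the client's connect callback; flushes the queue in order."""
        self._connected = True
        while self._pending:
            self._send(self._pending.popleft())
            if self._pending:
                # Pause between packets, but not after the last one.
                self._sleep(self._interval_s)


# Usage with a fake transport that just records what was sent.
wire = []
gate = ConnackGate(wire.append, interval_s=0.0)
gate.submit("SUBSCRIBE soft/increment/UI2")
gate.submit("PUBLISH soft/increment/UI2")
gate.on_connack()
print(wire)
```

If the failures disappear with a gate like this in place, that would point at the burst rather than at permissions. It would not by itself show whether the broker objects to packets sent before CONNACK or to the rate at which they arrive.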

That brings up some questions:

  1. Does AWS have a problem with clients sending publish and subscribe packets before they receive a CONNACK? If so, the AWS broker violates the specification. Section 3.1.4 says:

> Clients are allowed to send further MQTT Control Packets immediately after sending a CONNECT packet; Clients need not wait for a CONNACK packet to arrive from the Server.

  2. How does the packet limit actually get applied? My application is sending 20 subscribe and 20 publish messages (or just 20 publish if I eliminate the subscriptions) over a period of 0.04 seconds. That does not exceed any of the limits in AWS, though it clearly would exceed them if it continued at that rate. Is that considered a breach of the limit?

  3. Even if AWS thinks my application has exceeded a limit, where is the disconnect log message? AWS is clearly initiating the disconnection, but no indication appears in the log, even though the connect, subscribe and publish events are all logged.
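The arithmetic behind the rate-limit question can be made explicit. The sketch below compares the two readings of a "100 per second" limit for the 20 publishes sent in 0.04 seconds: counted over a full second, or treated as an instantaneous rate. How AWS IoT actually measures the limit is not stated in this thread, so both readings are assumptions, and `burst_summary` is a hypothetical helper.

```python
def burst_summary(messages: int, duration_s: float, limit_per_s: int) -> dict:
    """Compare a burst against a per-second limit under two interpretations."""
    instantaneous_rate = messages / duration_s
    return {
        # Rate the burst would represent if it continued for a full second.
        "instantaneous_rate_per_s": instantaneous_rate,
        # Interpretation 1: the limit applies to the instantaneous rate.
        "exceeds_as_instantaneous_rate": instantaneous_rate > limit_per_s,
        # Interpretation 2: the limit applies to the count within one second.
        "exceeds_as_count_in_one_second": messages > limit_per_s,
    }


# 20 publishes in 0.04 s against the 100 publish requests per second figure.
summary = burst_summary(messages=20, duration_s=0.04, limit_per_s=100)
print(summary)
```

Under the first reading the burst is about five times over the limit. Under the second it is well inside it. This is why the enforcement method matters for the question.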

answered 2 years ago

Hi asthomas.

> If I reduce the number of topics from 20 to 8, the chance of success rises substantially, to the point where the connection is almost reliable.

Are you putting multiple subscriptions in one request? Please note IoT Core has this limit: A single SUBSCRIBE request has a quota of 8 subscriptions.
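If a client does batch several topic filters into one SUBSCRIBE request, the quota of 8 means a list of 20 topics has to be split into several requests. A minimal sketch of that split in Python follows. `chunk_topics` is a hypothetical helper, and the topic names other than `soft/increment/UI2` are invented for illustration.

```python
from typing import List

# Quota quoted above: at most 8 subscriptions in a single SUBSCRIBE request.
MAX_SUBSCRIPTIONS_PER_REQUEST = 8


def chunk_topics(topics: List[str],
                 size: int = MAX_SUBSCRIPTIONS_PER_REQUEST) -> List[List[str]]:
    """Split a topic list into consecutive batches of at most `size` topics."""
    if size < 1:
        raise ValueError("size must be at least 1")
    return [topics[i:i + size] for i in range(0, len(topics), size)]


# Hypothetical topic names modelled on the one seen in the log.
topics = [f"soft/increment/UI{n}" for n in range(1, 21)]
batches = chunk_topics(topics)
print([len(batch) for batch in batches])
```

Each batch would then go out as one SUBSCRIBE request. A client that already sends one topic per request is unaffected by this quota.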

> The application subscribes to about 20 topics when it connects, then sends current values for those same topics.

Do you just have a single application/client? Are you saying that the application/client publishes on the same topics it subscribes to? If so, and you have 20 topics, you may be quickly approaching the publish requests per second limit for the connection. That is 100 messages per second (the sum of both Publish-In and Publish-Out).

> After this, AWS IoT drops the connection.

There should be Disconnect events in CloudWatch then. Please find those to gain additional insight. In particular, I think you need to see the disconnectReason. Beside CloudWatch, you can also gain insight by monitoring lifecycle events in the MQTT test client or using the Thing activity tab (Manage->All devices->Things-><thingName>->Activity). More information here: https://docs.aws.amazon.com/iot/latest/developerguide/ota-troubleshooting-fleet-disconnects.html
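As a rough illustration of searching the logs for those events, the sketch below filters log entries for one client and one event type and prints the `disconnectReason` if present. It assumes the log has been exported as one JSON object per line, which may not match your export format. `events_for_client` is a hypothetical helper, and the sample entries, including `EXAMPLE_REASON`, are made up and are not real AWS output.

```python
import json
from typing import Iterable, List


def events_for_client(lines: Iterable[str], client_id: str,
                      event_type: str = "Disconnect") -> List[dict]:
    """Return parsed entries matching the given clientId and eventType."""
    matches = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # Skip anything that is not a JSON object on one line.
        if (entry.get("clientId") == client_id
                and entry.get("eventType") == event_type):
            matches.append(entry)
    return matches


# Made-up sample entries, shaped like the log entry in the question.
sample = [
    '{"eventType": "Subscribe", "status": "Failure", "clientId": "client-a"}',
    '{"eventType": "Disconnect", "clientId": "client-a", '
    '"disconnectReason": "EXAMPLE_REASON"}',
]
for entry in events_for_client(sample, "client-a"):
    print(entry.get("timestamp"), entry.get("disconnectReason", "<none logged>"))
```

An empty result for a client whose connection was demonstrably closed would itself be worth reporting, since it suggests the closure was not logged as a Disconnect event.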

Although you are confident you don't have a permission problem, I would still recommend you temporarily change the IoT policy to a fully permissive policy to ensure this factor is removed from the equation. A very common cause of disconnects is an incorrect IoT policy.

Greg_B (AWS, EXPERT)
answered 2 years ago
