Ordered message delivery to downstream consumers when transient faults occur



We are architecting a solution and considering Kinesis. One of the primary reasons is the guaranteed message ordering per shard (SQS FIFO is too slow). Our likely exception handling process would be:

  1. Invocation faults get handled as per the event source mapping policy.
  2. Application non-transient faults (e.g. an invalid payload) get caught and manually forwarded to SQS to prevent pointless retries.
  3. Application transient faults are left uncaught. The event source mapping policy will be configured with a maximum message age (e.g. 5 minutes), after which it moves the failed batch metadata to SQS (awaiting replay) and continues.
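The split between steps 2 and 3 could be sketched roughly as below. This is only an illustration, not a definitive implementation: `NonTransientError`, `process_record`, `handle_batch` and `forward_to_dlq` are hypothetical names, and the record shape is simplified to a JSON string under `"data"` (a real Kinesis event carries base64-encoded data in a nested structure).

```python
import json


class NonTransientError(Exception):
    """Hypothetical marker for faults a retry cannot fix (e.g. invalid payload)."""


def process_record(payload):
    # Hypothetical business logic: reject payloads without an "id" field.
    if "id" not in payload:
        raise NonTransientError("missing id")
    return payload["id"]


def handle_batch(records, forward_to_dlq, process=process_record):
    """Process records in order.

    Non-transient faults are caught and handed to forward_to_dlq (step 2).
    Any other exception propagates, so the whole batch fails and the
    event source mapping retry policy takes over (step 3).
    """
    processed = []
    for record in records:
        try:
            payload = json.loads(record["data"])
            processed.append(process(payload))
        except (NonTransientError, json.JSONDecodeError) as exc:
            # Assumption: forward_to_dlq wraps an SQS send in the real system.
            forward_to_dlq(record, str(exc))
    return processed
```

On the configuration side, the 5-minute maximum message age would correspond to the event source mapping's `MaximumRecordAgeInSeconds` setting (300), with the on-failure destination pointing at the SQS queue.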

So, if a network / DB fault occurs, the shard will be blocked on the current message until the message age expires. If that happens, then messages can be delivered to downstream consumers (by our Lambda) out of order.
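To make the out-of-order scenario concrete, here is a small simulation. It is a toy model under stated assumptions, not a description of how Lambda or Kinesis is implemented: messages are plain integers in stream order, and `deliver` and `out_of_order` are hypothetical helper names. One batch expires, is parked for replay, and is delivered only after the later batches.

```python
def deliver(batches, failed_batch_index):
    """Return the order in which a downstream consumer sees messages when
    one batch expires, is skipped, and is replayed after the others."""
    delivered, parked = [], []
    for index, batch in enumerate(batches):
        if index == failed_batch_index:
            parked.extend(batch)  # expired: parked in SQS, awaiting replay
        else:
            delivered.extend(batch)
    delivered.extend(parked)  # the replay happens some time later
    return delivered


def out_of_order(sequence):
    """Return the messages that arrive after a higher-numbered message."""
    highest, late = None, []
    for seq in sequence:
        if highest is not None and seq < highest:
            late.append(seq)
        else:
            highest = seq
    return late


order = deliver([[1, 2], [3, 4], [5, 6]], failed_batch_index=1)
print(order)                # [1, 2, 5, 6, 3, 4]
print(out_of_order(order))  # [3, 4]
```

A check like `out_of_order` is one way a downstream consumer could detect late arrivals, provided each message carries a monotonically increasing sequence value.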

Is the best we can achieve to provide something like an SLA that each consumer can consider / design around?

Have I missed anything obvious in my conclusion?


asked 2 years ago, 188 views
1 Answer

A thought based on the question: Kinesis and SQS operate differently. So when you say "the shard will be blocked" - individual consumers on the shard might choose not to consume the next message in the stream but there's no concept of "blocking". Unlike SQS, messages in the stream are visible to all of the consumers so they can choose what they're going to do with each message - which is great if you have several different processes that need to happen on a single message - you can use different consumers and they don't get in each other's way.

So in the Kinesis world, if you want to maintain ordering you can only have a single consumer on the stream (shard, really). If there is a fault the blocking happens in the consumer, not in Kinesis.

Probably not helpful - but I'd question why Kinesis is better than SQS FIFO in this case.

Finally: I'm a little concerned about the comment "SQS FIFO is too slow" - have you tested Kinesis to ensure that it meets your performance requirements?

Given the complexity of the question and the challenges you appear to be facing I'd contact your local AWS Solutions Architect to discuss further...

AWS
answered 2 years ago
  • Hi there, thanks for replying.

    Regarding SQS FIFO being too slow, it's been a long time since I checked FIFO throughput. Checking now, 3000 msg/sec is a lot higher than it was, so probably not a concern any longer, but I'd like to focus on Kinesis. Yes, we have run soak tests through Kinesis and we're very happy with throughput.

    Regarding the "shard will be blocked" comment - of course, not the shard itself; I meant the consumer will not be able to continue receiving new batches.

    You mention "So in the Kinesis world, if you want to maintain ordering you can only have a single consumer on the stream (shard, really)." This is not what I understand. If multiple consumers are reading from a single shard, they are all receiving the same data completely independently of other consumers, but the shard guarantees order, so they all get ordered data?

    Finally, I have reached out to AWS locally - it's just taking some time and I was looking to expedite.
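The point about multiple consumers reading one shard independently can be illustrated with a toy in-memory model. This is a sketch only: `Shard` and `Consumer` are hypothetical classes that stand in for the stream and its readers, not the Kinesis API. Each consumer keeps its own read position, so a slow or stalled consumer does not affect what the others see, and each one still observes records in shard order.

```python
class Shard:
    """Toy stand-in for a single shard: an append-only, ordered log."""

    def __init__(self):
        self._records = []

    def put(self, record):
        self._records.append(record)

    def read(self, position, limit):
        # Reading never removes records; every consumer can see all of them.
        return self._records[position:position + limit]


class Consumer:
    """Each consumer tracks its own position in the shard."""

    def __init__(self, shard):
        self.shard = shard
        self.position = 0
        self.seen = []

    def poll(self, limit=2):
        batch = self.shard.read(self.position, limit)
        self.seen.extend(batch)
        self.position += len(batch)
        return batch


shard = Shard()
for n in range(5):
    shard.put(n)

fast, slow = Consumer(shard), Consumer(shard)
fast.poll(); fast.poll(); fast.poll()
slow.poll()
print(fast.seen)  # [0, 1, 2, 3, 4]
print(slow.seen)  # [0, 1]
```

In this model, ordering is lost only when a consumer itself skips ahead, which matches the skip-and-replay case discussed in the question.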

