Best practices for reading messages from a Kafka topic and writing them to S3


Hi AWS,

What are the best practices for reading messages from a topic and writing them to S3? Could there be a rare condition where a message is missed or not written due to some unknown error? In that case, can we use a Kafka topic in between the compute that reads the messages and writes them to S3, and what are the pros and cons of that approach?

In other words, if a record is missed, can it be fetched again later, and do we avoid needing a second custom application, assuming we use the AWS Kafka-S3 sink connector?

2 Answers

Hi! With the StreamSets Kafka origin you can read Kafka messages, batch them into appropriately sized groups, and push them to the desired destination. Producers send messages to Kafka, StreamSets consumes them, and then writes them out to the various destinations.

For more information please refer to this: https://streamsets.com/blog/send-kafka-messages-to-s3/

Nitin_K
answered 9 months ago

Hello,

To ensure a dependable and consistent transfer of data when reading messages from a Kafka topic and storing them in Amazon S3, consider the following best practices:

  1. Implement Idempotent Processing: Designing your consumer application to be idempotent ensures that processing the same message multiple times yields the same result as processing it once, enabling effective handling of duplicates while maintaining data integrity.
  2. Use Consumer Groups: Consumer groups are Kafka's mechanism for parallel processing and fault tolerance. By associating your consumer application with a consumer group, you can distribute the workload across multiple consumers, and if one consumer fails, its partitions are reassigned to the others so the system keeps running.
  3. Use a Dead-Letter Queue: Implementing a dead-letter queue (DLQ) is a recommended approach for capturing failed messages during processing.
  4. Enable Message Acknowledgment: Configure the consumer to acknowledge (commit) offsets only after a message has been successfully processed and durably stored in S3. Committing earlier risks data loss: if the writer fails after the commit, the record will never be reprocessed.
  5. Monitor Consumer Lag: Lag is the difference between the latest offset in the topic and the offset your consumer has committed. Monitoring it tells you whether the consumer is keeping up with incoming messages or falling behind.
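The idempotence, acknowledgment, and dead-letter-queue points above can be sketched in a few lines of Python. This is a minimal, self-contained illustration: the Kafka consumer and the S3/DLQ clients are passed in as plain callables (`write_to_s3`, `send_to_dlq` are hypothetical names, not a real SDK), and the idempotence comes from deriving the S3 key deterministically from the record's topic, partition, and offset, so a retry overwrites the same object instead of creating a duplicate.

```python
def s3_key(topic, partition, offset):
    """Derive a deterministic S3 key from the record's coordinates.

    Reprocessing the same record overwrites the same object rather than
    creating a duplicate -- this is what makes the write idempotent.
    """
    return f"{topic}/partition={partition}/offset={offset:020d}.json"

def process_batch(records, write_to_s3, send_to_dlq):
    """Write each record to S3; route failures to a dead-letter queue.

    Returns the (topic, partition, offset) tuples that may safely be
    committed (acknowledged) back to Kafka -- only offsets whose S3
    write actually succeeded are included, so an uncommitted failed
    record will be redelivered and retried.
    """
    committable = []
    for rec in records:
        key = s3_key(rec["topic"], rec["partition"], rec["offset"])
        try:
            write_to_s3(key, rec["value"])
            committable.append((rec["topic"], rec["partition"], rec["offset"]))
        except Exception as exc:
            # Failed record goes to the DLQ for later inspection/replay;
            # its offset is NOT committed, so it can be reprocessed.
            send_to_dlq(rec, str(exc))
    return committable
```

In a real consumer you would disable auto-commit (`enable.auto.commit=false`) and commit only the offsets returned by `process_batch`, which gives you at-least-once delivery with idempotent writes.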

When considering an intermediary Kafka topic positioned between the consumer and S3, there are several pros and cons to weigh:

Pros:

  • Retry Mechanism: If there is an error during the writing process, the messages remain in the intermediate topic, enabling retries without losing data.
  • Decoupling: By incorporating an intermediate Kafka topic, the consumer and S3 writing process become decoupled. This decoupling enables the consumer to concentrate on message reading and processing, while a separate component or application takes charge of writing the messages to S3.
  • Flexibility and Scalability: The intermediate topic provides flexibility and scalability. You can scale the consumer and writer independently, optimising resource utilisation and accommodating varying workloads.

Cons:

  • Latency: The intermediate topic adds an extra step in the pipeline, which introduces a slight delay in message delivery.
  • Increased Complexity: Adding an additional Kafka topic introduces complexity to the overall architecture, requiring additional management, monitoring, and potential points of failure.

A Kafka Connect S3 sink connector (for example, the Confluent S3 sink connector, which can also run on Amazon MSK Connect) offers a simpler approach by automating the transfer of messages from Kafka to S3, with built-in fault tolerance and scalability, and usually removes the need for a second custom application. By following the best practices above and weighing the pros and cons, you can build a dependable pipeline for reading Kafka messages and storing them in S3.
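As a rough sketch, a Kafka Connect S3 sink configuration looks like the following (the topic and bucket names are placeholders; exact property names follow the Confluent S3 sink connector, so check the documentation for the connector version you deploy). The connector commits offsets back to Kafka only after flushing records to S3, which gives you at-least-once delivery out of the box.

```json
{
  "name": "s3-sink",
  "config": {
    "connector.class": "io.confluent.connect.s3.S3SinkConnector",
    "topics": "my-topic",
    "s3.bucket.name": "my-bucket",
    "s3.region": "us-east-1",
    "storage.class": "io.confluent.connect.s3.storage.S3Storage",
    "format.class": "io.confluent.connect.s3.format.json.JsonFormat",
    "flush.size": "1000",
    "tasks.max": "2"
  }
}
```

Here `flush.size` controls how many records are batched into each S3 object, and `tasks.max` scales the connector across partitions.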

Please find below some documentation to guide you further

https://aws.amazon.com/blogs/big-data/best-practices-for-running-apache-kafka-on-aws/

https://docs.aws.amazon.com/msk/latest/developerguide/bestpractices.html

AWS
Kenan_M
answered 9 months ago
