Does the AWS Clickstream solution take into account session_id stickiness?


I have followed the implementation guide for the AWS Clickstream solution and am currently using Kinesis in on-demand mode, sinking to an S3 bucket (just following the steps in the setup). I'm not interested in setting up the data processing and analytics dashboard modules. I simply want to use the data ingestion module and consume from the stream (MSK or Kinesis) using custom-built processing consumers.

Requirements: I need to spin up a fleet of consumers, as I will potentially be dealing with huge volumes of data that require parallelised consumption and processing of the stream. All events that belong to a single session (e.g. `_session_id`) MUST end up in the same consumer.

If events get randomly distributed across partitions or shards, then a session becomes split across multiple consumers, which completely breaks my processing of the data stream. I know Kinesis and Kafka support partition keys, so that records arriving in the stream with the same partition key (ideally a session ID of some kind) are guaranteed to land in the same partition and therefore reach the same consumer (see the sketch below). The documentation is not clear about how incoming events sent from the client application within a single `_session_id` (or some other identifier of a user's events) are split across partitions within the data stream.
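
For concreteness, here is a minimal sketch of the Kinesis partition-key behaviour I'm relying on, written against boto3 directly (the stream name is hypothetical, and this illustrates general Kinesis semantics, not anything the Clickstream ingestion server is confirmed to do). `put_record` returns the shard ID, so you can observe that records sharing a partition key always land on the same shard:

```python
import json
import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")

STREAM = "clickstream-ingestion-stream"  # hypothetical stream name

def send_event(event: dict, session_id: str) -> str:
    """Write one event, using the session ID as the partition key.

    Kinesis hashes the PartitionKey to pick a shard, so every record
    sharing a session_id lands on the same shard and therefore reaches
    the same consumer.
    """
    resp = kinesis.put_record(
        StreamName=STREAM,
        Data=json.dumps(event).encode("utf-8"),
        PartitionKey=session_id,
    )
    return resp["ShardId"]  # same session_id -> same ShardId every time

# Two events from one session land on the same shard:
shard_a = send_event({"event_name": "page_view"}, session_id="sess-123")
shard_b = send_event({"event_name": "click"}, session_id="sess-123")
assert shard_a == shard_b
```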

The Clickstream solution seems designed to dump the data into an S3 bucket, which would only be fine if I weren't doing real-time processing of my data. I could simply sort my S3 data into sessions and do batch processing then, but this is not what I want.

Are there any experts on the AWS Clickstream solution who can tell me whether partition keys are taken into account in order to sort a session's events into partitions/shards? Or is the ingestion module designed to just dump all incoming data randomly into whatever partition/shard it wants?

EDIT: A related question: if the solution does not partition the data by `_session_id` or a similar ID, is it possible to modify it? I'm guessing the steps would be: clone the AWS Clickstream Analytics repo -> modify the source code for the Vector server (the configuration TOML files, I'm assuming) to partition by session ID -> re-bootstrap the CDK and re-deploy the stack with the modified Vector server configuration. Is this possible? Or are there components that the solution pulls from image repositories that are not created by the CDK and are therefore non-modifiable? In other words, is the entire solution completely customisable by forking and modifying the GitHub repo? (After re-deploying, I'd verify the partitioning with something like the check sketched below.)
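
If I do modify and redeploy, my rough plan for verifying the behaviour would be the following (stream name hypothetical, assuming the Kinesis sink): read each shard from the start and confirm that no partition key ever appears in more than one shard, and that the keys look like session IDs rather than per-record random values:

```python
from collections import defaultdict
import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")
STREAM = "clickstream-ingestion-stream"  # hypothetical stream name

# Map each partition key to the set of shards it was observed in.
key_to_shards = defaultdict(set)

for shard in kinesis.list_shards(StreamName=STREAM)["Shards"]:
    shard_id = shard["ShardId"]
    iterator = kinesis.get_shard_iterator(
        StreamName=STREAM,
        ShardId=shard_id,
        ShardIteratorType="TRIM_HORIZON",  # read from the oldest record
    )["ShardIterator"]

    # Sample a few batches from this shard.
    for _ in range(10):
        resp = kinesis.get_records(ShardIterator=iterator, Limit=1000)
        for record in resp["Records"]:
            key_to_shards[record["PartitionKey"]].add(shard_id)
        iterator = resp.get("NextShardIterator")
        if not iterator:
            break

split_keys = {k: s for k, s in key_to_shards.items() if len(s) > 1}
print(f"{len(split_keys)} partition keys span more than one shard")
# If the ingestion server assigns a random key per request, you'd also
# see one distinct PartitionKey per record rather than one per session.
```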

Thanks.

1 Answer

Hello. Kinesis distributes records across shards by hashing each record's partition key; if the keys carry no meaning, the effect is a random but roughly even spread of records across shards over time. On the consumption side, you would have the KCL (dressed up inside the Clickstream solution) driving how shards are assigned to consumers.

You can find more on this here: https://docs.aws.amazon.com/solutions/latest/clickstream-analytics-on-aws/data-sink-kinesis.html
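
To make the consumer side concrete, here is a simplified sketch of the shard-to-worker model that the KCL automates for you (stream name hypothetical; a real fleet would use the KCL itself, which adds lease management, checkpointing, and rebalancing across workers). Each shard is polled by exactly one worker, so as long as all of a session's records share a shard, they are processed by a single consumer in order:

```python
import threading
import time
import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")
STREAM = "clickstream-ingestion-stream"  # hypothetical stream name

def process(record: dict) -> None:
    # Stand-in for your per-session processing logic.
    print(record["PartitionKey"], len(record["Data"]))

def consume_shard(shard_id: str) -> None:
    """Poll a single shard. Because each shard has one dedicated worker,
    all records sharing a partition key (e.g. a session ID) are handled
    by the same worker, in order."""
    iterator = kinesis.get_shard_iterator(
        StreamName=STREAM,
        ShardId=shard_id,
        ShardIteratorType="LATEST",
    )["ShardIterator"]
    while iterator:
        resp = kinesis.get_records(ShardIterator=iterator, Limit=500)
        for record in resp["Records"]:
            process(record)
        iterator = resp.get("NextShardIterator")
        time.sleep(1)  # avoid hitting the per-shard read throttle

# One worker thread per shard -- the assignment the KCL manages for you
# across a whole fleet of consumer processes.
for shard in kinesis.list_shards(StreamName=STREAM)["Shards"]:
    threading.Thread(target=consume_shard, args=(shard["ShardId"],)).start()
```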

EXPERT · answered 19 days ago
  • This does not answer whether the ingestion module (specifically the ingestion server) writes to Kinesis/Kafka using a partition key (ideally the `_session_id` from the events generated by the Clickstream SDK) in the first place. Do you know if the ingestion server handles partition keys, or does it send data without assigning one?
