A New Streaming Application


As I am new to AWS streaming technologies (Kafka, Kinesis, Flink, etc.), it has been challenging to wrap my mind around everything that is available and to settle on a proper approach, despite attempts at internet self-education. I am asking for general guidance and some explanation of a proper strategy and approach for building a simple streaming application.

Background:

We currently have a financial application which resides in AWS and interfaces with SQL Server. Say we have a loan table with fields like this:

| LoanID | Date      | Amount | Customer | Agent |
|--------|-----------|--------|----------|-------|
| 1      | 10/6/2023 | $1     | Joe      | Susie |

Now, whenever a new loan is created we would like to update a dashboard immediately. This dashboard could be a TV display, a web app, or a phone app. To accomplish this we are looking into modifying the financial application so that, at the point where it creates a new loan record in the table, it also creates an event which will be streamed to the dashboard to update the new monthly totals for the agents.
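To make the "also create an event" step concrete, here is a minimal sketch of what the hook in the financial application could look like. All field names (`eventType`, `loanId`, etc.) are hypothetical, not an existing schema, and `publish` stands in for whatever stream client is eventually chosen:

```python
import json
from datetime import datetime, timezone

def build_loan_event(loan_id, amount, customer, agent):
    """Build a JSON-serializable 'loan created' event payload.
    Every field name here is a placeholder, not a required schema."""
    return {
        "eventType": "LoanCreated",
        "loanId": loan_id,
        "amount": str(amount),  # keep money as a string to avoid float rounding
        "customer": customer,
        "agent": agent,
        "createdAt": datetime.now(timezone.utc).isoformat(),
    }

def on_loan_created(loan_id, amount, customer, agent, publish):
    """Called right after the SQL insert succeeds. `publish` is whatever
    stream client you end up with (a Kafka producer send, a Kinesis
    put_record wrapper, ...), injected so this logic stays testable."""
    event = build_loan_event(loan_id, amount, customer, agent)
    publish(json.dumps(event))
```

Keeping the payload construction separate from the transport makes it easy to switch between Kafka and Kinesis later without touching the application logic.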

Initial Thoughts

My initial thoughts are as follows, though there may well be better ways to accomplish the goal.

We are a small company with relatively few developers, so a serverless approach might be wise, but we are also nimble and data-driven, so we need flexibility. My research indicates that these kinds of factors are worth weighing, since each strategy has pros and cons and may suit one situation better than another.

The initial thought is to use Kafka (MSK) to record the events: the financial application would submit a new loan event to Kafka after creating the loan record in the SQL table. The Kafka database would have a topic for these events, where events would live for about 90 days before being purged, keeping the topic small with only recent data while still allowing a look back at the prior month if needed. Alternatively, it seems Kinesis could also be used, but then we would have to store state information in RocksDB or some such. Generally, Kafka seems good for smaller scenarios while Kinesis seems good for larger ones. I have no basis at this stage to estimate throttling needs or the likelihood of data back-pressure, but I would assume neither would be drastic. Therefore, it would seem Kafka may be the better option for us.
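For reference, the 90-day purge described above maps directly onto Kafka's per-topic retention settings. A sketch of the relevant topic configuration (the topic name is hypothetical; the property names are standard Kafka topic configs):

```properties
# Hypothetical topic config for a "loan-events" topic.
# 90 days expressed in milliseconds: 90 * 24 * 60 * 60 * 1000
retention.ms=7776000000
# delete (rather than compact) records once they pass retention
cleanup.policy=delete
```

Kinesis Data Streams has an equivalent knob (stream retention period, extendable up to 365 days), so retention alone does not force the Kafka-vs-Kinesis choice.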

Next, we would probably have no more than 80 subscribers to the streaming data. My research indicates that this should be fine for Kafka as long as we don't get into thousands of subscribers. Nonetheless, Flink seems to be an option to sit between Kafka and the apps (keeping the number of Kafka connections down), which would then allow subscriptions from all the apps and dashboard displays. The model would then be data being pushed to all subscribers, keeping them in sync, rather than each subscriber making independent pull requests, which could result in out-of-sync inconsistencies.

Is this a viable approach? Any better alternatives given our scenario?

Another challenge I encountered was getting everything wired together in AWS and then figuring out how to write my .jars and other code to connect to these technologies and/or drive the application (such as building a .jar for the Flink application). Internet documentation says to use pre-made connectors, libraries, and APIs (e.g., KCL/KPL), but the documentation on how and where to get these libraries seems sparse.

Bottom line: I am not sure which direction to pursue, and trial-and-error research exercises have not been too fruitful or insightful. I therefore welcome any voice of wisdom or experience to help guide us in this matter. Any documentation references would be golden.

Thank you

P.S. As I am new to the forum, if this post is inappropriate then I apologize in advance and ask that it be re-posted in a more appropriate forum.

3 Answers
Accepted Answer

Some comments to add some perspective based on your response to the previous answer, and hopefully provide some insight on the streaming side of things, specifically:

Firstly, when considering streaming architectures (and this is just semantics), it's helpful not to think of services like Kinesis Data Streams or Amazon Managed Streaming for Apache Kafka (MSK) as databases, but rather as stream storage layers in your overall architecture, with configurable retention periods that hold your data in the stream bus for some amount of time before it expires. In that respect, Kinesis and MSK provide similar functionality: a stream storage bus that accepts incoming events and holds them for consumers to pull from (it can be helpful to think of streams as a pub/sub mechanism in which consumers use a pull-based model). Here's a link to an AWS Whitepaper on modern data streaming strategies on AWS for further reference.

To go back to your requirement for downstream data consistency in your dashboards: while push models can certainly provide this consistency to subscribed consumers in terms of how they receive data, consumers can vary in processing time. Push models make it harder for systems to monitor consumer capacity and handle back pressure, so they often come at the expense of throughput. This is why streaming architectures often aim to be asynchronous and use consumers in a pull model, where your consumers poll and interact with the stream to process events in parallel and maximize throughput. Here is an interesting read from the Kafka documentation on the push-versus-pull considerations behind Kafka's design (note that Kafka consumers do support a feature known as long polling, which provides a push-like interaction and is mentioned in that link).
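The pull model described above boils down to a simple loop, regardless of whether the stream is Kafka or Kinesis. A generic sketch (the `poll` callable stands in for `KafkaConsumer.poll()` or a Kinesis `GetRecords` call; `max_batches` exists only so the sketch can terminate):

```python
def run_consumer(poll, process_batch, max_batches=None):
    """Generic pull-model consumer loop: repeatedly ask the stream for a
    batch of records and process it. `poll` is a stand-in for whatever
    client call fetches the next batch; real consumers run this loop 24/7."""
    batches = 0
    while max_batches is None or batches < max_batches:
        records = poll()            # returns a (possibly empty) batch
        if records:
            process_batch(records)  # amortize per-call overhead across the batch
        batches += 1
```

The point of the batch parameter is exactly the throughput argument above: the consumer pulls many records per call and processes them together, rather than waiting on individual events.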

As far as recommendations for your use case, there are several ways to go about building your first streaming application. For a stream storage layer, since you mentioned being a smaller team, Kinesis tends to be easier to spin up quickly due to having fewer configurations to manage. However, both Kinesis Data Streams and Amazon MSK have serverless offerings (KDS On-Demand and MSK Serverless), so if you are partial to trying Kafka, MSK may still have what you're looking for.

If you are looking for a stream processing platform to do transformations or otherwise enrich data that is in the stream, you can certainly make use of Amazon Managed Service for Apache Flink (Amazon MSF, previously named Kinesis Data Analytics, or KDA). Here you can connect to your SQL database and enrich your streaming event data, and then move this data back into a stream storage layer for consumption by your consumer applications. There are several interesting ways to use Apache Flink to enrich your streaming data. If you are particularly interested in learning more about Flink, here is a great blog that covers some common patterns.
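To illustrate what "enrich your streaming event data from your SQL database" means in practice, here is a plain-Python sketch of per-record enrichment (not actual Flink code; in real Flink this would be an async lookup operator, and `lookup_agent` here is a hypothetical callable wrapping your DB query, ideally backed by a cache):

```python
def enrich(event, lookup_agent):
    """Join one streaming event with reference data from the database,
    e.g. attaching the agent's branch to a loan event. `lookup_agent`
    is a placeholder for a cached DB lookup; a missing agent yields
    a default rather than failing the stream."""
    agent_info = lookup_agent(event["agent"]) or {}
    enriched = dict(event)  # leave the original event untouched
    enriched["branch"] = agent_info.get("branch", "unknown")
    return enriched
```

The enriched record would then be written back to a stream (or straight to the dashboards' backend) rather than mutated in place.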

Hope this helps!

AWS
EXPERT
agroe
answered 6 months ago
  • Thanks agroe! Yes, I am open to learning more about Flink. Do you have a personal take on the scenarios for which Flink shines and scenarios for which Flink may not be the best choice?

  • Hi agroe, I am sure my thinking needs an adjustment here, but pertaining to the pull/push scenarios... My picture of how streaming applications generally work is like dropping a ball in a river upstream (initial producer): the current takes the ball downstream through bends and rapids (data enrichment, etc.) and someone picks up the ball downstream (final consumer). In this visual the ball is transported (processed) continuously without needing to wait and be "pulled" at steps along the way. How is my thinking flawed in terms of expecting application events to be processed and sent to a dashboard? Thanks

  • I wouldn't say your thinking is flawed, but the key thing to note is that stream consumers should be persistent, running 24/7 alongside the stream, and are using API calls to specifically get records (pull) from the stream (be it Kinesis Data Streams or a Kafka Topic) before pushing either back to a stream or to another downstream application, like your dashboards. It's understandable to want to visualize one ball (record) moving through the stream to simplify things, but then you naturally get concerned about consumers having to "wait" for events. In reality, streaming applications are best suited to workloads that need high throughput and low latency, so consumers are rarely ever waiting for one record to appear (though of course there may be specific use cases where this might happen). Instead, streams are ideally carrying a high volume of data per second, from which consumers make their API calls to acquire batches of data from the stream that they can then process in parallel to achieve the high throughput and low latency requirements.

  • Thanks agroe. I get what you are saying. I guess I thought of it like events in a GUI program where (for example) a dashboard would execute in a continuous loop and then when the event happened of dropping off another record from the stream the dashboard would process it and then go back into the continuous loop waiting for the next event. Apparently this is not the correct thinking with streaming applications so this is good to know; it will help prevent me from trying to fit a square peg into a round hole according to "round" biases. I appreciate your feedback.

  • Agroe, On a bit of a side note, what strategies would you recommend to keep multiple consumers (dashboards) in sync with each other (as it seems multiple consumers pulling data on their own independently could get out of sync)?


Thank you Nathan!

The big thing I got was your suggestion not to use Kafka but only Kinesis. I will think about that.

Let me please clarify our scenario and see if that alters your other comments.

We currently persist our data as follows:

Web Application <-> API Server <-> SQL Server (loans, etc.)

This is "data based" to store loan data, but we also want something "event based" such that when an agent closes a new loan, a dashboard is updated (within a second or two, not within milliseconds). We currently have TV screens each pulling (querying the loans table) data independently, but they are not synced and there are sometimes delays from calculating data. We are looking for more of a PUSH scenario where events (e.g., when a new loan is created) are registered and streamed to something like Kinesis (probably KDS and KDA/Flink), which would then push totals and other data to dashboards on the TVs, web applications, and Android and iPhone mobile devices. So, the above flow would be altered to add/include something like:

Web Application New Loan Event -> Kinesis (KDS) -> Kinesis (KDA/Flink) -> Device Dashboards

Going back to Kafka, one of the reasons I thought this would be necessary is to have a database of events. For example, say Agent 1 closes a loan on Oct 1 for $1, then on Oct 2 for $2 and then on Oct 3 for $3. At the "moment" loan #1 closes on Oct 1 the dashboards report a cumulative monthly total of $1 and then $3 on Oct 2 and $6 on Oct 3, etc. All the while other cumulative totals are being generated for other agents. Therefore, any time there is a new loan event it has to be added to the previous monthly total and the dashboards updated with the new total. Can this be accomplished without a database such as Kafka?
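The cumulative-total scenario described above can in fact be handled without keeping the full event history around: a stream processor (or even a small Lambda with a cache/DynamoDB table) only needs the previous running total per agent, not every past event. A plain-Python sketch of that keyed state, assuming per-agent, per-month keys:

```python
from collections import defaultdict

class MonthlyAgentTotals:
    """Running total per (agent, month), the way Flink keyed state or a
    small cache table would hold it. Each new loan event only needs the
    previous total, not a database of all prior events."""

    def __init__(self):
        self.totals = defaultdict(float)

    def on_loan(self, agent, month, amount):
        key = (agent, month)
        self.totals[key] += amount
        # the new cumulative total is what gets pushed to the dashboards
        return self.totals[key]
```

Running the Oct 1/2/3 example through this produces the $1, $3, $6 sequence of dashboard updates; the stream's 90-day retention then matters only for replay/recovery, not for computing totals.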

New loans are only one example; there are others which are similar (such as application e-mail stats), but we may also want to include some system health indicators (perhaps from CloudWatch). The thought was that having a database (like Kafka) might be a more general or simple approach, but perhaps Kinesis and Lambdas are just as good or better?

Thanks again for taking the time to lend your opinion and perspective.

Tony
answered 7 months ago
  • If you look at the Flink blog post that @agroe posted, there is a pattern titled "Per-record asynchronous lookup from an external cache system" that might be the right pattern for you, because you're essentially requiring state in your stream to store cumulative running totals per loan agent.

    There could be a scale issue, and a state-management issue based on time, hence the pattern's use of a cache that is ultimately backed by a database (Aurora in that example, but it could be another DB). And I would think you ultimately need to flush this to a database for regulatory/personnel-management purposes anyway.

  • Thanks for pointing that out, Nathan.


Here are some core thoughts that might help you:

  1. I would just stick with Kinesis; I don't see any reason to use Kafka unless you want to due to prior familiarity.
  2. Flink is TYPICALLY for streaming analytics, i.e., when you need to do analytics on the data in the stream (in real time) versus landing the data in some storage mechanism, doing analytics, and then consuming the analytics. I didn't see anything in your writeup that indicated the need for Flink.
  3. You probably need to land the data either in a database or OpenSearch, but probably both since it looks like you are dealing with loan data. I'm assuming this database already exists, but you didn't indicate what the database is, whether it is in the cloud, etc.
  4. OpenSearch is perfectly suited to this type of use case, especially if you need "immediate" visibility into loan activity. You used the word "immediate" in your description, so I would challenge you to think about what that means. Do you mean sub-second, or within a few seconds? Those requirements are important here. If it only needs to be within a few seconds or a minute, then just store it in whatever DB you are using and hook up a QuickSight dashboard, or a webpage that refreshes every so often (polling), versus being event driven (streaming).
  5. You should also look at Kinesis Enhanced Fan-Out (https://docs.aws.amazon.com/streams/latest/dev/building-enhanced-consumers-api.html). This might be important if you end up needing to scale out to multiple stream subscribers.
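If you do go the Kinesis route, producing an event is a single API call. A sketch using boto3's `put_record` (the stream name is hypothetical, and the client is passed in rather than created so the sketch doesn't need AWS credentials to illustrate the shape of the call):

```python
import json

def partition_key_for(event):
    """Partition by agent so each agent's events land on one shard in
    order (the 'agent' field comes from the question's loan example)."""
    return str(event["agent"])

def publish_to_kinesis(client, stream_name, event):
    """Send one event to Kinesis Data Streams. `client` would normally be
    boto3.client('kinesis'); PutRecord takes bytes plus a partition key."""
    return client.put_record(
        StreamName=stream_name,
        Data=json.dumps(event).encode("utf-8"),
        PartitionKey=partition_key_for(event),
    )
```

With around 80 dashboards, Enhanced Fan-Out matters on the consumer side of this same stream: each registered consumer gets its own 2 MB/s read pipe per shard instead of sharing the standard shared-throughput limit.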


answered 7 months ago
  • Thank you Nathan for taking the time to give your perspective. Please see my clarifications to your questions below...
