A few comments to add some perspective based on your response to the previous answer, and hopefully to provide some insight on the streaming side of things:
First, when considering streaming architectures, this is just semantics, but it's helpful not to think of services like Kinesis Data Streams or Amazon Managed Streaming for Apache Kafka (MSK) as databases, but instead as stream storage layers for your overall architecture, with configurable retention periods that hold your data in the stream bus for some amount of time before it expires. To that end, Kinesis and MSK provide similar functionality: a stream storage bus that accepts incoming events and holds them for consumers to pull from (it can be helpful to think of streams as a pub/sub mechanism in which consumers use a pull-based model). Here's a link to an AWS whitepaper on modern data streaming strategies on AWS for further reference.
To go back to your requirement for downstream data consistency in your dashboards: while push models can certainly deliver data consistently to subscribed consumers, those consumers can vary in processing time. Push models make it harder for systems to monitor consumer capacity and handle back pressure, so they often come at the expense of throughput. This is why streaming architectures usually aim to be asynchronous and use consumers in a pull model, where your consumers poll and interact with the stream to process events in parallel and maximize throughput. Here is an interesting read from the Kafka documentation on the push-versus-pull trade-offs they considered when implementing Kafka (though Kafka consumers do support a feature known as long polling, mentioned in the previous link, which provides a push-like interaction).
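To make the pull model concrete, here is a minimal sketch using a plain in-memory list as a stand-in for the stream (no real Kinesis or Kafka APIs involved): each consumer keeps its own read position and pulls batches at whatever pace it can handle, which is exactly the natural back-pressure property described above.

```python
# Toy in-memory "stream" illustrating the pull model. The stream just
# stores records; each consumer tracks its own position and pulls
# batches at its own pace, so a slow consumer never blocks a fast one.
class ToyStream:
    def __init__(self):
        self.records = []

    def put(self, record):
        self.records.append(record)

    def get_records(self, position, limit):
        """Return up to `limit` records starting at `position`, plus the
        next position (loosely analogous to a shard iterator)."""
        batch = self.records[position:position + limit]
        return batch, position + len(batch)

stream = ToyStream()
for i in range(10):
    stream.put({"event": "new_loan", "seq": i})

# A fast consumer pulls a large batch; a slow one pulls a small batch.
fast_pos, slow_pos = 0, 0
fast_batch, fast_pos = stream.get_records(fast_pos, limit=10)
slow_batch, slow_pos = stream.get_records(slow_pos, limit=2)
print(len(fast_batch), len(slow_batch))  # 10 2
```

The point of the sketch is only the shape of the interaction: the producer writes without waiting on anyone, and each consumer decides when and how much to read.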
As far as recommendations for your use case, there are several ways to go about building your first streaming application. For the stream storage layer, since you mentioned being a smaller team, Kinesis tends to be easier to spin up quickly because it has fewer configurations to manage. However, both Kinesis Data Streams and Amazon MSK have serverless offerings (KDS On-Demand and MSK Serverless), so if you are partial to trying Kafka, MSK may still have what you're looking for.
If you are looking for a stream processing platform to do transformations or otherwise enrich data that is in the stream, you can certainly make use of Amazon Managed Service for Apache Flink (Amazon MSF, previously named Kinesis Data Analytics, or KDA). Here you can connect to your SQL database and enrich your streaming event data, then move that data back into a stream storage layer for consumption by your consumer applications. There are several interesting ways to use Apache Flink to enrich your streaming data. If you are particularly interested in learning more about Flink, here is a great blog that covers some common patterns.
Hope this helps!
Thank you Nathan!
The big thing I got was your suggestion not to use Kafka but only Kinesis. I will think about that.
Let me please clarify our scenario and see if that alters your other comments.
We currently persist our data as follows:
Web Application <-> API Server <-> SQL Server (loans, etc.)
This is "data based" to store loan data, but we also want something "event based" such that when an agent closes a new loan a dashboard is updated (within a second or two, not within milliseconds). We currently have TV screens each pulling (querying the loans table) data independently, but they are not synced and there are sometimes delays from calculating data. We are looking for more of a PUSH scenario where events (e.g. when a new loan is created) are registered and streamed to something like Kinesis (probably KDS and KDA/Flink), which would then push totals and other data to dashboards on the TVs, web applications, and Android and iPhone mobile devices. So, the above flow would be altered to add/include something like:
Web Application New Loan Event -> Kinesis (KDS) -> Kinesis (KDA/Flink) -> Device Dashboards
Going back to Kafka, one of the reasons I thought this would be necessary is to have a database of events. For example, say Agent 1 closes a loan on Oct 1 for $1, then on Oct 2 for $2 and then on Oct 3 for $3. At the "moment" loan #1 closes on Oct 1 the dashboards report a cumulative monthly total of $1 and then $3 on Oct 2 and $6 on Oct 3, etc. All the while other cumulative totals are being generated for other agents. Therefore, any time there is a new loan event it has to be added to the previous monthly total and the dashboards updated with the new total. Can this be accomplished without a database such as Kafka?
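The running-total logic in that example doesn't actually require a retained event log; it only requires a small amount of state keyed by agent and month, which a stateful consumer (Flink keyed state, a Lambda backed by a table, etc.) can maintain as events arrive. A rough sketch, with the state held in a plain dict for illustration and with event field names and the year (2023) being assumptions, not anything from the thread:

```python
from collections import defaultdict

# Sketch: maintain cumulative monthly totals per agent as loan events
# arrive. In a real system this state would live in Flink keyed state
# or a database table, not a Python dict.
totals = defaultdict(float)  # key: (agent, "YYYY-MM") -> running total

def on_loan_closed(event):
    key = (event["agent"], event["closed_on"][:7])  # month bucket
    totals[key] += event["amount"]
    return totals[key]  # the new total to push to the dashboards

# The Oct 1 / Oct 2 / Oct 3 example from the question:
events = [
    {"agent": "agent1", "closed_on": "2023-10-01", "amount": 1.0},
    {"agent": "agent1", "closed_on": "2023-10-02", "amount": 2.0},
    {"agent": "agent1", "closed_on": "2023-10-03", "amount": 3.0},
]
results = [on_loan_closed(e) for e in events]
print(results)  # [1.0, 3.0, 6.0]
```

Each event yields the updated cumulative total ($1, then $3, then $6) without ever replaying history, which is why this can work without Kafka's retained log as long as the consumer's state is durable.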
New loans are only one example; there are others which are similar (such as application e-mail stats) but we may also want to include some system health indicators (perhaps from Cloud Watch). The thought was that having a database (like Kafka) might be a more general or simple approach but perhaps Kinesis and Lambdas are just as good or better?
Thanks again for taking the time to lend your opinion and perspective.
If you look at the Flink blog post that @agroe posted, there is a pattern titled "Per-record asynchronous lookup from an external cache system" that might be the right pattern for you, because you're essentially requiring state in your stream to store cumulative running totals by loan agent.
There could be a scale issue, and a state-management issue based on time, so the pattern uses a cache that is ultimately backed by a database (Aurora in that instance, but it could be other DBs). And I would think you ultimately need to flush this to a database for regulatory/personnel-management purposes.
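The cache-backed-by-a-database idea in that pattern can be sketched in a few lines. Both stores are plain dicts here as stand-ins for something like ElastiCache and Aurora, and the keys/values mirror the running-totals example from the question; none of the names are real APIs:

```python
# Sketch of cache-backed stream state: per-record lookups hit the fast
# cache first, fall back to the durable database on a miss, and updated
# totals are periodically flushed back to the database.
database = {("agent1", "2023-10"): 6.0}  # durable running totals
cache = {}                               # fast lookup layer

def apply_loan(agent, month, amount):
    key = (agent, month)
    if key not in cache:                 # cache miss: hydrate from DB
        cache[key] = database.get(key, 0.0)
    cache[key] += amount
    return cache[key]

def flush():
    """Write cache entries back to the database for durability."""
    database.update(cache)

new_total = apply_loan("agent1", "2023-10", 4.0)
flush()
print(new_total, database[("agent1", "2023-10")])  # 10.0 10.0
```

The cache absorbs the per-record read/write load, while the flush step gives you the durable copy mentioned above for regulatory or reporting needs.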
Thanks for pointing that out, Nathan.
Here are some core thoughts that might help you:
- I would just stick with Kinesis; I don't see any reason to use Kafka unless you want to due to prior familiarity.
- Flink is TYPICALLY for streaming analytics, i.e. when you need to do analytics on the data in the stream (in real time) versus landing the data in some storage mechanism, doing analytics there, and then consuming the results. I didn't see anything in your write-up that indicated the need for Flink.
- You probably need to land the data in a database or OpenSearch, and probably both, since it looks like you are dealing with loan data. I'm assuming this database already exists, but you didn't indicate what the database is, whether it's in the cloud, etc.
- OpenSearch is well suited to this type of use case, especially if you need "immediate" visibility into loan activity. You used the word immediate in your description, so I would challenge you to think about what that means: do you mean sub-second, or within a few seconds? Those requirements are important here. If it only needs to be within a few seconds or a minute, then just store it in whatever DB you are using and hook up a QuickSight dashboard, or a webpage that refreshes every so often (polling) versus being event driven (streaming).
- You should also look at Kinesis Enhanced Fan-Out (https://docs.aws.amazon.com/streams/latest/dev/building-enhanced-consumers-api.html). This might be important if you end up needing to scale out to multiple stream subscribers.
Here are a few links to consider:
- https://docs.aws.amazon.com/opensearch-service/latest/developerguide/integrations.html
- https://aws.plainenglish.io/using-aws-opensearch-with-dynamodb-and-lambda-fd036add8c88
- https://medium.com/rahasak/index-dynamodb-data-on-elasticsearch-kibana-using-aws-lambda-e12ac2b82f81
- https://reflectoring.io/processing-streams-with-aws-kinesis/
Thank you Nathan for taking the time to give your perspective. Please see my clarifications to your questions below...
Thanks agroe! Yes, I am open to learning more about Flink. Do you have a personal take on the scenarios for which Flink shines and scenarios for which Flink may not be the best choice?
Hi agroe, I am sure my thinking needs an adjustment here, but pertaining to the pull/push scenarios... My picture of how streaming applications generally work is like dropping a ball in a river upstream (initial producer): the current takes the ball downstream through bends and rapids (data enrichment, etc.) and someone picks up the ball downstream (final consumer). In this visual the ball is transported (processed) continuously without needing to wait and be "pulled" at steps along the way. How is my thinking flawed in terms of expecting application events to be processed and sent to a dashboard? Thanks
I wouldn't say your thinking is flawed, but the key thing to note is that stream consumers should be persistent, running 24/7 alongside the stream, using API calls to get records (pull) from the stream (be it Kinesis Data Streams or a Kafka topic) before pushing them either back to a stream or on to another downstream application, like your dashboards. It's understandable to visualize one ball (record) moving through the stream to simplify things, but then you naturally get concerned about consumers having to "wait" for events. In reality, streaming applications are best suited to workloads that need high throughput and low latency, so consumers are rarely waiting for a single record to appear (though of course there may be specific use cases where this happens). Instead, streams ideally carry a high volume of data per second, from which consumers make API calls to acquire batches of records that they can process in parallel to meet those throughput and latency requirements.
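The persistent consumer loop described above can be sketched roughly like this, again with an in-memory list standing in for the stream and `fetch_batch` as a hypothetical stand-in for a GetRecords-style call (not a real AWS API):

```python
import time

# Sketch of a continuous pull loop: the consumer runs alongside the
# stream and repeatedly asks for the next batch. When the stream is
# quiet it backs off briefly and polls again, rather than being pushed
# to. (For the demo it exits after a couple of empty polls; a real
# consumer would loop forever.)
stream_data = [{"loan_id": i} for i in range(25)]

def fetch_batch(cursor, limit=10):
    batch = stream_data[cursor:cursor + limit]
    return batch, cursor + len(batch)

def run_consumer(max_idle_polls=2):
    cursor, processed, idle = 0, [], 0
    while idle < max_idle_polls:
        batch, cursor = fetch_batch(cursor)
        if batch:
            idle = 0
            processed.extend(batch)   # e.g. update dashboard totals
        else:
            idle += 1
            time.sleep(0.01)          # short back-off, then poll again
    return processed

processed = run_consumer()
print(len(processed))  # 25
```

This is close to your GUI-event-loop intuition: the loop is always running, but instead of being handed one event at a time, it asks for whatever batch has accumulated since its last pull.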
Thanks agroe. I get what you are saying. I guess I thought of it like events in a GUI program where (for example) a dashboard would execute in a continuous loop and then when the event happened of dropping off another record from the stream the dashboard would process it and then go back into the continuous loop waiting for the next event. Apparently this is not the correct thinking with streaming applications so this is good to know; it will help prevent me from trying to fit a square peg into a round hole according to "round" biases. I appreciate your feedback.
Agroe, On a bit of a side note, what strategies would you recommend to keep multiple consumers (dashboards) in sync with each other (as it seems multiple consumers pulling data on their own independently could get out of sync)?