Hello,
To read from multiple Kinesis sources, you can create a DataFrame for each stream and combine them with union() before passing the result to forEachBatch. If you want to process the streams separately within the same job, each stream needs its own thread and the threads must be coordinated. That is complex to implement, so it is not recommended.
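A minimal sketch of the union approach described above, for illustration only. It assumes the Glue job passes in its own `GlueContext` and that all streams share the same schema, since `union()` matches columns by position. The helper names (`read_stream`, `build_unioned_stream`, `run`), the connection option values, and the window size are assumptions, not taken from this thread; check them against the Glue streaming documentation for your setup.

```python
from functools import reduce


def read_stream(glue_context, stream_arn):
    """Create one streaming DataFrame for a single Kinesis stream."""
    return glue_context.create_data_frame.from_options(
        connection_type="kinesis",
        connection_options={
            # Illustrative option values (assumptions); adjust to your streams.
            "streamARN": stream_arn,
            "classification": "json",
            "startingPosition": "TRIM_HORIZON",
        },
    )


def build_unioned_stream(glue_context, stream_arns):
    """Read every stream and fold the DataFrames into one with union()."""
    frames = [read_stream(glue_context, arn) for arn in stream_arns]
    return reduce(lambda left, right: left.union(right), frames)


def process_batch(data_frame, batch_id):
    """Handle one micro-batch that may contain records from any stream."""
    record_count = data_frame.count()
    if record_count > 0:
        # Placeholder for the real transform and sink logic.
        print(f"batch {batch_id}: {record_count} records")


def run(glue_context, stream_arns, checkpoint_location):
    """Wire the unioned stream into a single forEachBatch call."""
    unioned = build_unioned_stream(glue_context, stream_arns)
    glue_context.forEachBatch(
        frame=unioned,
        batch_function=process_batch,
        options={
            "windowSize": "100 seconds",  # assumed batch window
            "checkpointLocation": checkpoint_location,
        },
    )
```

In a Glue job script you would call something like `run(glueContext, [arn_a, arn_b], "s3://your-bucket/checkpoints/")`, where the ARNs and the S3 path are placeholders. Because there is only one forEachBatch call, there is only one streaming query to monitor and restart.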
You can also refer to the following documentation for more guidance on Streaming ETL jobs in AWS Glue: https://docs.aws.amazon.com/glue/latest/dg/add-job-streaming.html
If you need specific guidance for your use-case, please open a support case with AWS using the following link: https://console.aws.amazon.com/support/home#/case/create
Thanks! I ended up using a separate thread for each stream. Why is it not recommended?
Yes, you just need to create a DataFrame for each stream and union() them before passing the result to forEachBatch.
Note that this assumes your batch function can process data coming from any of the streams.
If you mean processing them separately in the same job, that requires calling forEachBatch on separate threads and coordinating them. That is much more complex to operate and is not recommended.
The threads can interfere with each other (e.g., competing for driver memory), and the job as a whole becomes much harder to monitor and operate (e.g., if one of them fails, do you restart the whole job?).
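To make the failure question concrete, here is a rough sketch in plain Python of the coordination a thread-per-stream design needs. It does not use Glue at all: the two workers are hypothetical stand-ins for per-stream loops, and `run_stream` and `run_all` are names invented for this example. The stand-ins poll a shared stop event; a real blocking forEachBatch call would not observe such an event by itself, so stopping it would need some other mechanism.

```python
import threading


def run_stream(name, work, stop_event, errors):
    """Run one stream's loop; on failure, record it and tell the others to stop."""
    try:
        work(stop_event)
    except Exception as exc:
        errors[name] = exc
        stop_event.set()


def run_all(workers):
    """Start one thread per stream and wait until all of them have finished."""
    stop_event = threading.Event()
    errors = {}
    threads = [
        threading.Thread(
            target=run_stream, args=(name, work, stop_event, errors), name=name
        )
        for name, work in workers.items()
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return errors


def failing_stream(stop_event):
    # Stand-in for a stream whose processing crashes.
    raise RuntimeError("boom")


def healthy_stream(stop_event):
    # Stand-in for a stream that keeps running until asked to stop.
    while not stop_event.wait(0.01):
        pass


errors = run_all({"stream_a": failing_stream, "stream_b": healthy_stream})
print(sorted(errors))
```

Even in this toy version you have to decide what a failure means: here one crash stops every stream, which amounts to restarting the whole job. Keeping the healthy stream alive instead would require per-stream restart and monitoring logic on top of this.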