Does AWS Glue have an issue with multi-threaded loads?


I'm experiencing a strange issue with AWS Glue. In order to speed up loads, I'm running multiple threads (spark.scheduler.mode=FAIR, Python multiprocessing.pool.ThreadPool(processes=5)). Each thread loads a specific JDBC database table (glueContext.create_dynamic_frame.from_options(**options)) and uses job bookmarks to handle deltas.
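For reference, here is a minimal sketch of the setup described above; the table names, connection options, and S3 target are hypothetical placeholders, not the actual job code:

```python
from multiprocessing.pool import ThreadPool

from awsglue.context import GlueContext
from pyspark.context import SparkContext

# spark.scheduler.mode=FAIR is assumed to be set via the job parameters
glue_context = GlueContext(SparkContext.getOrCreate())

def load_table(table_name):
    # Hypothetical JDBC options; the real job builds its own **options
    options = {
        "useConnectionProperties": "true",
        "connectionName": "my-jdbc-connection",  # hypothetical Glue connection
        "dbtable": table_name,
    }
    print(f"Loading {table_name}")
    frame = glue_context.create_dynamic_frame.from_options(
        connection_type="mysql",  # assumption: the actual engine may differ
        connection_options=options,
        transformation_ctx=f"load_{table_name}",  # needed for job bookmarks
    )
    # The next step writes the result to S3 (hypothetical bucket/path)
    glue_context.write_dynamic_frame.from_options(
        frame=frame,
        connection_type="s3",
        connection_options={"path": f"s3://my-bucket/raw/{table_name}/"},
        format="parquet",
    )

tables = ["table_a", "table_b", "table_c"]  # hypothetical table list
with ThreadPool(processes=5) as pool:
    pool.map(load_table, tables)
```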

What happens is that each thread starts and logs the table it should be loading. After the log entry comes the create_dynamic_frame.from_options() call, and all loads seem to stop there. Nothing happens from that point on, and ultimately the job times out. The next step would be to write the result to an S3 bucket, but that never happens. Sometimes, when the job is re-deployed or executed several times manually, it completes, but that's rare. This looks like a race condition of some sort...

Does Glue have any limitations or issues with using Spark threading? Does anyone have a properly functioning JDBC load running in multiple threads?

por
asked 2 years ago · 209 views
3 Answers

Hi, if the issue seems limited to JDBC loads, have you also tried monitoring the source database?

Are you sure you are not experiencing query timeouts while reading from the database? Depending on the options you are using and the number of concurrent loads (is it 5, or more?), you might be submitting more queries than you expect to the database, and it might start to slow down.

Try monitoring the jobs through the Glue metrics and the Spark UI, and at the same time monitor the DB you read from, to understand where the slowdown is actually occurring.
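For illustration, here is a minimal sketch of how per-table read options can multiply the concurrent query count; the connection name, table, and hash settings are hypothetical, not taken from the question:

```python
from awsglue.context import GlueContext
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())

# Hypothetical per-table JDBC options. With hashfield set, Glue splits the
# read into hashpartitions parallel queries (7 is Glue's default), so
# 5 threads x 7 partitions can mean up to 35 concurrent queries on the DB.
connection_options = {
    "useConnectionProperties": "true",
    "connectionName": "my-jdbc-connection",  # hypothetical Glue connection
    "dbtable": "orders",                     # hypothetical table
    "hashfield": "id",
    "hashpartitions": "7",
}
frame = glue_context.create_dynamic_frame.from_options(
    connection_type="mysql",  # assumption: the actual engine may differ
    connection_options=connection_options,
)
```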

Hope it helps,

AWS
EXPERT
answered 2 years ago
  • Hi,

    Thanks for the tips! We've monitored the source database and there's nothing of significance there. At the beginning of the loads the queries execute and we can see them, but then they stop. It looks like the database returns the results, but Glue never processes them.

    We're now in the process of dropping Glue and handling the bookmarking ourselves.


I just realized we do have working multi-threaded processing jobs. They load data from S3 and, after transformations, dump the data back into S3 in a new format. Those are running nicely, so the issue seems to be specific to JDBC loads.
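For comparison, a minimal sketch of the S3-to-S3 pattern that does work, reusing the hypothetical glue_context and thread pool from the sketch in the question; the paths and formats are placeholders:

```python
def process_prefix(prefix):
    # Read the raw data from S3 (hypothetical source path and format)
    frame = glue_context.create_dynamic_frame.from_options(
        connection_type="s3",
        connection_options={"paths": [f"s3://my-bucket/in/{prefix}/"]},
        format="json",
    )
    # ... transformations would go here ...
    # Dump the data back to S3 in a new format (hypothetical target)
    glue_context.write_dynamic_frame.from_options(
        frame=frame,
        connection_type="s3",
        connection_options={"path": f"s3://my-bucket/out/{prefix}/"},
        format="parquet",
    )
```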

por
answered 2 years ago

Another note: this seems to be related to the number of tables. The failing job loads around 850 tables. The same code works fine with loads of around 100 tables or fewer. This might support the race condition theory. A possible batching workaround is sketched below.
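If the table count is indeed the trigger, one possible workaround (a sketch only, reusing the hypothetical load_table helper and thread pool from the question) is to submit the tables in bounded batches of roughly the size that is known to work:

```python
from multiprocessing.pool import ThreadPool

def chunks(items, size):
    # Yield successive fixed-size slices of a list
    for i in range(0, len(items), size):
        yield items[i:i + size]

# all_tables is the hypothetical full list of ~850 table names
for batch in chunks(all_tables, 100):
    with ThreadPool(processes=5) as pool:
        pool.map(load_table, batch)
```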

por
answered 2 years ago
