Does AWS Glue have an issue with multi-threaded loads?


I'm experiencing a strange issue with AWS Glue. To speed up loads, I'm running multiple threads (spark.scheduler.mode=FAIR, Python multiprocessing.pool.ThreadPool with 5 threads). Each thread loads a specific JDBC database table (glueContext.create_dynamic_frame.from_options(**options)) and uses job bookmarks to handle deltas.
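For context, the setup described above is roughly the following (a minimal sketch; the table names and the body of `load_table` are placeholders, since in the real job it would call `glueContext.create_dynamic_frame.from_options()` with the JDBC connection options and then write to S3):

```python
from multiprocessing.pool import ThreadPool

# Hypothetical table list; the real job reads these from configuration.
TABLES = ["customers", "orders", "invoices"]

def load_table(table_name):
    # Placeholder for the per-table work in the real Glue job:
    #   glueContext.create_dynamic_frame.from_options(**options)
    # followed by a write to an S3 bucket.
    return (table_name, "loaded")

# One pool worker per concurrent load; the job above uses 5 threads.
with ThreadPool(5) as pool:
    results = pool.map(load_table, TABLES)

print(results)
```

`pool.map` blocks until every table has been processed and returns the results in the original table order.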

What happens is that each thread starts and logs the table it should be loading. After the log entry comes the create_dynamic_frame.from_options() call, and all loads seem to stop there. Nothing happens from that point on, and ultimately the job times out. The next step would be to write the result to an S3 bucket, but that never happens. Sometimes, when the job is re-deployed or executed several times manually, it completes, but that's really rare. This seems like a race condition of some sort...

Does Glue have any limitations or issues with Spark threading? Does anyone have a properly functioning JDBC load running in multiple threads?

Asked 2 years ago · 213 views

3 Answers

Hi, if the issue seems limited to JDBC loads, have you tried monitoring the source database as well?

Are you sure you are not experiencing query timeouts while reading from the database? Depending on the options you are using and the number of concurrent loads (is it 5, or more?), you might be submitting more queries than you expect to the database, and it might be starting to slow down.

Try to monitor the jobs by looking at the Glue metrics and the Spark UI, and at the same time monitor the DB you read from, to understand where the slowdown may actually be occurring.

Hope it helps,

AWS
EXPERT
answered 2 years ago
  • Hi,

    Thanks for the tips! We've monitored the source database and there's nothing of significance there. At the beginning of the loads the queries execute and we can see them, but then they stop. It looks like the database returns the result but Glue never processes it.

We're now in the process of dropping Glue and handling the bookmarking ourselves.


I just realized we have working multi-threaded processing jobs. They load data from S3 and, after transformations etc., dump the data back into S3 in a new format. Those run nicely, so the issue seems to be specific to JDBC loads.

answered 2 years ago

Another note: this seems to be related to the number of tables. The failing job loads around 850 tables. The same code works fine with loads of around 100 tables or fewer. This might support the race condition theory.

answered 2 years ago
