Glue job does not sort data unless "Automatically scale the number of workers" is checked

I created an ETL job in AWS Glue Studio that performs the following steps.

  1. Reading a data source from an Oracle database table through a Glue Data Catalog table.
  2. Executing the SQL statement "select * from tableA order by col1".
  3. Repartitioning the DynamicFrame to 1 output.
  4. Writing the DynamicFrame to a CSV file.

With this job, if I check "Automatically scale the number of workers", the output data is sorted.

But if I uncheck the option, the output data is NOT sorted (the "order by" clause appears to have no effect).

What is the cause of this phenomenon?

Thank you.

Asked a year ago · 363 views
1 Answer
Accepted Answer

Hi,

Small disclaimer: I have not tested this, so my theory is unproven.

My understanding is that you are repartitioning the data to 1 partition (to have 1 file) using the repartition or coalesce command.

Keep in mind that Spark runs on a distributed cluster, and the partitions are spread across different executors. In a normal run, even if the data is sorted when it is read from Oracle, it may be split across partitions and merged again later without preserving the sort order. This would explain why the data is not sorted when autoscaling is unchecked.
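To make the idea concrete, here is a toy model in plain Python (no Spark involved). It assumes, purely for illustration, that a sort leaves each partition holding one contiguous, sorted range, and that merging into a single partition concatenates those blocks in whatever order they arrive. The function names and the arrival order are made up for this sketch and do not correspond to any Spark or Glue API.

```python
def range_partition_sorted(rows, num_partitions):
    """Toy model of a distributed sort: sorted rows split into contiguous blocks."""
    ordered = sorted(rows)
    size = -(-len(ordered) // num_partitions)  # ceiling division
    return [ordered[i:i + size] for i in range(0, len(ordered), size)]


def merge_in_arrival_order(partitions, arrival_order):
    """Toy model of merging into one partition: blocks are appended as they arrive."""
    return [row for index in arrival_order for row in partitions[index]]


rows = [5, 3, 9, 1, 7, 2, 8, 6, 4, 0]

# One partition (one task): there is nothing to interleave, so order survives.
single = merge_in_arrival_order(range_partition_sorted(rows, 1), [0])

# Four partitions whose blocks happen to arrive as 2, 0, 3, 1 (hypothetical order).
multi = merge_in_arrival_order(range_partition_sorted(rows, 4), [2, 0, 3, 1])

print(single)  # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
print(multi)   # [6, 7, 8, 0, 1, 2, 9, 3, 4, 5]
```

Each block is still sorted internally, but the merged result is only globally sorted if the blocks happen to arrive in range order.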

When autoscaling is enabled, you are telling Glue to start only as many executors as are actually needed. Combined with Spark's lazy evaluation and your repartition(1), this could lead Glue to start only one executor, which would then read and write the data in your sorted order.

To validate this, you could look at the Spark UI for the two job runs and check how many executors are running at any time during each job.
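If the theory holds, one possible way to make the output order independent of the executor count is to sort after the data has been moved into a single partition, rather than before. The following is only a sketch and has not been run against this job: it assumes the DynamicFrame is converted to a Spark DataFrame first (for example with toDF()) and that the job uses a custom script instead of the visual nodes. The helper name write_sorted_single_csv and the paths are hypothetical.

```python
from __future__ import annotations

from typing import TYPE_CHECKING

if TYPE_CHECKING:  # pyspark is only needed for the type hint
    from pyspark.sql import DataFrame


def write_sorted_single_csv(df: DataFrame, sort_col: str, output_path: str) -> None:
    """Write df as a single CSV part file with rows ordered by sort_col."""
    (
        df.repartition(1)                  # shuffle all rows into one partition
          .sortWithinPartitions(sort_col)  # sort that partition after the shuffle
          .write.mode("overwrite")
          .option("header", "true")
          .csv(output_path)
    )
```

In a Glue script this might be called as write_sorted_single_csv(dyf.toDF(), "col1", "s3://example-bucket/output/"), where dyf is the DynamicFrame read from the Data Catalog and the bucket name is a placeholder. Because the sort runs inside the one remaining partition, the result should not depend on how many executors handled the earlier stages.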

Hope this helps,

AWS
Expert
Answered a year ago
