Parallelize writing to Iceberg tables in Glue


I am creating my Iceberg table and inserting full DataFrames into it, following the instructions at https://iceberg.apache.org/docs/latest/aws/#object-store-file-layout. I observe that during the long write phase (starting at ~13:30), only one executor remains active:

Screenshot from Glue Job Run Metrics

Is there a way to parallelize the write to speed it up, so that it does not take longer than the rest of the Glue job?

  • Check in the Spark UI which operation is running. My impression is that the writing is done by 13:30 and that the job is then doing some table maintenance.

  • @Gonzalo Herreros, no, I checked this. The table was empty before writing and at 13:30 (and even at 14h30) I do not yet see any data on S3.

asked a year ago, 787 views
1 Answer

Hi,

Have you tried repartitioning your DataFrame before writing? I have seen this in the past, and it had more to do with Spark and the number of partitions than with Iceberg.

Spark repartition docs
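To make the suggestion concrete, here is a minimal sketch of repartitioning before an append. The helper name `append_repartitioned` and its arguments are hypothetical; `repartition` and `writeTo(...).append()` are the standard Spark DataFrame calls (the latter needs Spark 3.0 or newer). Treat it as a starting point, not a guaranteed fix.

```python
def append_repartitioned(df, table_name, num_partitions):
    """Spread the rows over num_partitions tasks, then append to the table.

    df is expected to be a Spark DataFrame; table_name and num_partitions
    are placeholders that depend on your catalog and cluster size.
    """
    # DataFrame.repartition(n) shuffles the rows into n partitions.
    repartitioned = df.repartition(num_partitions)
    # writeTo(...).append() uses the DataFrameWriterV2 API (Spark 3.0+).
    repartitioned.writeTo(table_name).append()
    return repartitioned
```

In a Glue job this could be called as `append_repartitioned(df, "glue_catalog.my_db.my_table", 32)`, where the catalog, database, table name, and partition count are all assumptions to adapt.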

Bests

AWS
answered a year ago
  • While that is the solution for plain tables, Iceberg performs additional operations on write: it decides how to distribute the data based on the files it needs to update and on the file size configuration, so it is no longer the case that one Spark partition goes to one file.

  • Actually, this may be true. I plan to extend my table once per day, and it is partitioned by the column indicating when a row was inserted or modified. Thus, I get only one partition per day, that is, per run of the Glue job. In my case this is a client's requirement. Can I improve something else in this case?

  • It seems that you do not have many options. The repartition route is not going to be useful unless you change the distribution mode parameter, and changing it brings its own set of problems.

    https://github.com/apache/iceberg/issues/7406
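To illustrate the comments above, here is a toy model in plain Python of why a load with a single partition value can end up on one task. It is not Iceberg's or Spark's actual code: the function names are made up, and Python's `hash` stands in for the real hash partitioner. The assumption is that the table's `write.distribution-mode` property is `hash`, so rows are shuffled by partition value before writing.

```python
from collections import Counter

def tasks_used_hash(partition_values, num_tasks):
    """Toy hash distribution: the partition value alone picks the task."""
    return Counter(hash(value) % num_tasks for value in partition_values)

def tasks_used_round_robin(partition_values, num_tasks):
    """Toy round-robin distribution: rows are dealt out evenly across tasks."""
    return Counter(i % num_tasks for i, _ in enumerate(partition_values))

# One daily load: every row carries the same partition value.
rows = ["2024-01-15"] * 1000

hash_tasks = tasks_used_hash(rows, num_tasks=8)
round_robin_tasks = tasks_used_round_robin(rows, num_tasks=8)

# Number of tasks that receive at least one row under each scheme.
print(len(hash_tasks), len(round_robin_tasks))  # prints "1 8"
```

Under this model, hashing by partition value sends all 1000 rows to a single task, while round-robin uses all 8. Iceberg's `write.distribution-mode` accepts `none`, `hash`, and `range`; switching to `none` would let an explicit repartition take effect, but, as noted above, that comes with trade-offs of its own, such as more and smaller files.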
