Hi,
Have you tried repartitioning your DataFrame before writing? I have seen this in the past, and it was more a matter of Spark and the number of partitions than an Iceberg thing.
Best,
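For illustration, here is a minimal PySpark sketch of the repartition-before-write approach suggested above. It assumes a Glue job already configured for Iceberg; the source path and the table name `glue_catalog.db.events` are hypothetical placeholders.

```python
from pyspark.sql import SparkSession

# In a Glue job, a preconfigured Spark session is already available.
spark = SparkSession.builder.getOrCreate()

# Hypothetical input; replace with your actual source.
df = spark.read.parquet("s3://my-bucket/input/")

# Increase write parallelism by controlling the number of partitions.
# For plain (non-Iceberg) tables this also controls the number of output files.
df.repartition(16).writeTo("glue_catalog.db.events").append()
```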
While that is the solution for plain tables, in the case of Iceberg the writer performs many additional operations: based on the files it needs to update and the file-size configuration, it decides how to lay the data out, so it is no longer the case that one partition goes to one file.
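For context, the file-size configuration mentioned here is Iceberg's `write.target-file-size-bytes` table property (default 512 MB). A minimal sketch of adjusting it, again with a hypothetical table name:

```python
# Set the target data-file size Iceberg aims for when writing (here 128 MB).
# 'glue_catalog.db.events' is a hypothetical table name.
spark.sql("""
    ALTER TABLE glue_catalog.db.events
    SET TBLPROPERTIES ('write.target-file-size-bytes' = '134217728')
""")
```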
Actually, this can be true. I plan to extend my table once per day, and it is partitioned by a column indicating when a row was inserted/modified. Thus, I get only one partition per day, i.e. per run of the Glue job. In my case this is a client's requirement. Can I improve something else in this case?
It seems that you do not have many options... The repartition route is not going to be useful (unless you change the distribution-mode parameter, and that is going to bring its own set of problems).
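For reference, the parameter mentioned here is Iceberg's `write.distribution-mode` table property (valid values: `none`, `hash`, `range`). A minimal sketch of changing it, using a hypothetical table name and bearing in mind the trade-offs noted above:

```python
# Switch the write distribution mode; 'none' keeps Spark's incoming
# partitioning instead of redistributing rows by the table's partition spec.
# 'glue_catalog.db.events' is a hypothetical table name.
spark.sql("""
    ALTER TABLE glue_catalog.db.events
    SET TBLPROPERTIES ('write.distribution-mode' = 'none')
""")
```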
Check in the Spark UI what operation it is running; my impression is that the write is done by 13:30 and that after that it is doing some table maintenance.
@Gonzalo Herreros, no, I checked this. The table was empty before writing, and at 13:30 (and even at 14:30) I did not yet see any data on S3.