An "ICEBERG" table works with snapshots and metadatas files.
Globally, it distributes the files differently than the standard EXTERNAL tables, even duplicates the data, and thanks to a set of metadatas in json+avro knows how to query the data correctly. (so there is more complexity to read all these files than reading a simple EXTERNAL TABLE through a simple Parquet Serde (that only read the Parquet file internal metadata for queries))
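If you want to see that extra layer for yourself, Athena exposes Iceberg metadata tables. A minimal sketch, assuming a hypothetical table named `my_db.my_iceberg_table`:

```sql
-- Inspect the snapshot/file layer of an Iceberg table in Athena
-- (my_db.my_iceberg_table is a placeholder name).
SELECT * FROM "my_db"."my_iceberg_table$history";  -- snapshot lineage over time
SELECT * FROM "my_db"."my_iceberg_table$files";    -- data files tracked by the current snapshot
```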
With a single example dataset this is not very noticeable, but keep in mind that S3 storage usage can grow quickly depending on how you use ICEBERG. If you perform a MERGE that applies many updates and some deletes, the disk space used on S3 can quickly multiply, and performance will degrade further (because the data gets split across multiple snapshots/files).
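For example, every run of a MERGE like the sketch below (table and column names are made up) creates a new snapshot and writes new data/delete files to S3, which is where the extra storage comes from:

```sql
-- Hypothetical example: each run adds a snapshot plus new data/delete files on S3.
MERGE INTO my_db.my_iceberg_table t
USING my_db.staging_updates s
  ON t.id = s.id
WHEN MATCHED AND s.is_deleted = true THEN DELETE
WHEN MATCHED THEN UPDATE SET value = s.value, updated_at = s.updated_at
WHEN NOT MATCHED THEN INSERT (id, value, updated_at) VALUES (s.id, s.value, s.updated_at);
```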
(However, running the OPTIMIZE command can bring performance back close to the initial state of the ICEBERG table, by compacting files into new ones and updating the metadata so the old files are ignored.)
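In Athena that compaction looks like this (the table name is a placeholder, and the WHERE clause is optional if you only want to compact part of the table):

```sql
-- Rewrite small files into larger ones; old files stay on S3 but new snapshots ignore them.
OPTIMIZE my_db.my_iceberg_table REWRITE DATA USING BIN_PACK
WHERE partition_date >= DATE '2024-01-01';  -- hypothetical partition filter
```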
There are advantages to ICEBERG (much simpler maintenance, querying partitioned data without writing explicit conditions on the partition columns, ...) and disadvantages (lower read performance, storage space that grows quickly, ...).
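The "no explicit partition conditions" part comes from Iceberg's hidden partitioning. A sketch with made-up names:

```sql
-- Partition by a transform of a regular column; queries only need to filter on event_ts.
CREATE TABLE my_db.events (
  id       bigint,
  event_ts timestamp,
  payload  string
)
PARTITIONED BY (day(event_ts))
LOCATION 's3://my-bucket/iceberg/events/'
TBLPROPERTIES ('table_type' = 'ICEBERG');

-- Partition pruning happens without referencing any partition column explicitly.
SELECT count(*)
FROM my_db.events
WHERE event_ts BETWEEN TIMESTAMP '2024-06-01 00:00:00' AND TIMESTAMP '2024-06-02 00:00:00';
```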
Personally, I have not managed to see an automatic disk space optimization at work when using VACUUM and OPTIMIZE; I just know that the table uses fewer files afterwards and that deleting files that are no longer referenced does not break my Iceberg table.
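For reference, the cleanup I am talking about looks like the sketch below; the retention values are just an assumption, and VACUUM only removes snapshots/files older than the retention configured on the table:

```sql
-- Shorten snapshot retention (values here are arbitrary), then remove expired/orphaned files.
ALTER TABLE my_db.my_iceberg_table SET TBLPROPERTIES (
  'vacuum_max_snapshot_age_seconds' = '86400',  -- keep roughly one day of snapshots
  'vacuum_min_snapshots_to_keep'    = '1'
);

VACUUM my_db.my_iceberg_table;
```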
If the goal is to gain some performance, I would advise setting the Parquet compression differently... but this will probably not change everything. (Also, if you already perform multiple deletes/inserts on your table, try an OPTIMIZE to restore performance.)
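Changing the Parquet compression can be done through a table property; ZSTD below is just an example choice, and it only affects newly written files:

```sql
-- Switch the compression codec used for newly written Parquet files (existing files are unchanged).
ALTER TABLE my_db.my_iceberg_table SET TBLPROPERTIES (
  'write_compression' = 'ZSTD'
);
```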