Delete old parquet files of an overwritten Iceberg table


I am trying to write a PySpark dataframe to S3 and the AWS Glue Data Catalog in the Iceberg format, using pyspark.sql.DataFrameWriterV2 with the createOrReplace function. When I write the same dataframe twice in a row, every parquet file on S3 exists twice, with slightly different names (hashes), in each partition. However, when I read the table with SQL, I get the expected number of rows, matching the number of rows in the dataframe. Is there a way to automatically delete the overwritten/superseded parquet files?
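For reference, a minimal sketch of the write pattern described above (not the asker's actual code). `glue_catalog`, `db.my_table`, the warehouse path, and the partition column `event_date` are placeholders, and the catalog settings are only a typical Iceberg-on-Glue configuration that may need adjusting to your environment:

```python
from pyspark.sql import SparkSession, functions as F

# Typical Iceberg + Glue Data Catalog configuration (placeholder names/paths).
spark = (
    SparkSession.builder
    .config("spark.sql.catalog.glue_catalog", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.glue_catalog.catalog-impl",
            "org.apache.iceberg.aws.glue.GlueCatalog")
    .config("spark.sql.catalog.glue_catalog.io-impl",
            "org.apache.iceberg.aws.s3.S3FileIO")
    .config("spark.sql.catalog.glue_catalog.warehouse", "s3://my-bucket/warehouse/")
    .getOrCreate()
)

df = spark.createDataFrame(
    [("2014-01-01", 1), ("2024-01-01", 2)], ["event_date", "value"]
)

# createOrReplace commits a new snapshot; the parquet files of the previous
# snapshot stay on S3 until snapshots are expired.
df.writeTo("glue_catalog.db.my_table").partitionedBy(F.col("event_date")).createOrReplace()
```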

Asked 4 months ago · 605 views
1 answer

That's normal Iceberg behavior: it keeps the old files so you can still read the data as it was in the past (time travel).
In the configuration you can tell it when snapshots should expire, or you can force the cleanup yourself; see: https://iceberg.apache.org/docs/latest/maintenance/ and https://iceberg.apache.org/docs/latest/configuration/
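To make that concrete, here is a hedged sketch of both options, reusing the placeholder catalog and table names from above and assuming an existing SparkSession `spark`; the retention values are examples only:

```python
# Option 1: force cleanup now. The expire_snapshots procedure removes the
# expired snapshots from table metadata and deletes data files that are no
# longer referenced by any remaining snapshot.
spark.sql("""
    CALL glue_catalog.system.expire_snapshots(
        table => 'db.my_table',
        older_than => TIMESTAMP '2024-01-01 00:00:00',
        retain_last => 1
    )
""")

# Option 2: set table-level defaults for snapshot retention. These are used
# when snapshot expiration runs; they do not by themselves schedule it.
# 604800000 ms = 7 days.
spark.sql("""
    ALTER TABLE glue_catalog.db.my_table SET TBLPROPERTIES (
        'history.expire.max-snapshot-age-ms' = '604800000',
        'history.expire.min-snapshots-to-keep' = '1'
    )
""")
```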

AWS
EXPERT
Answered 4 months ago
  • Thanks for your answer, Gonzalo!

    My problem is that I need to physically delete data older than 10 years due to legal regulations. The age is defined by the value of a certain column in my tables, because I processed a large chunk of old data at once, so the time of writing the data does not correspond to its real age. Do you have a hint on how to implement that? My impression is that the snapshot-expiry mechanism works only on the physical age of the objects on S3.
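One possible approach to this follow-up, as a hedged sketch rather than a confirmed solution: run a row-level DELETE keyed on the date column, then expire the snapshots that still reference the deleted rows so the underlying files are physically removed. `glue_catalog`, `db.my_table`, and `event_date` are placeholder names, and an existing SparkSession `spark` is assumed.

```python
from datetime import datetime, timedelta, timezone

# Cutoff based on the column value (the "real" age of the data), not on the
# physical age of the objects on S3.
cutoff = (datetime.now(timezone.utc) - timedelta(days=365 * 10)).strftime("%Y-%m-%d")
now = datetime.now(timezone.utc).strftime("%Y-%m-%d %H:%M:%S")

# 1) Row-level delete driven by the column value.
spark.sql(f"""
    DELETE FROM glue_catalog.db.my_table
    WHERE event_date < DATE '{cutoff}'
""")

# 2) Expire every snapshot except the current one, so parquet files that are
#    only referenced by older snapshots are physically deleted from S3.
spark.sql(f"""
    CALL glue_catalog.system.expire_snapshots(
        table => 'db.my_table',
        older_than => TIMESTAMP '{now}',
        retain_last => 1
    )
""")

# 3) Optionally remove files that are not referenced by any snapshot at all.
spark.sql("CALL glue_catalog.system.remove_orphan_files(table => 'db.my_table')")
```

Two caveats: if the table uses merge-on-read deletes, the deleted rows may still sit in the original parquet files until they are rewritten (e.g. via the rewrite_data_files procedure) before expiration actually removes them; and remove_orphan_files by default only considers files older than a few days.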
