I came across the following useful documentation:
https://docs.aws.amazon.com/prescriptive-guidance/latest/apache-iceberg-on-aws/best-practices-read.html#read-sort-order
I have a large table that I planned to partition on certain columns and sort on another set of columns. The end use case is a QuickSight dashboard where those sorted columns will be frequently used for count distinct operations.
I have the following code.
## Creates ICEBERG table in Athena and Glue Catalog first
final_df.limit(0).createOrReplaceTempView("temp_iceberg_table")
spark.sql(
    f"""
    CREATE TABLE IF NOT EXISTS glue_catalog.sandbox.test2
    USING iceberg
    PARTITIONED BY (ga_publication, ga_visit_date, ga_platform)
    SORTED BY (ga_fullvisitorid, ga_session_key, ga_timestamp_hit)
    LOCATION 's3://test2'
    TBLPROPERTIES (
        'write.format.default' = 'parquet',
        'write.metadata.delete-after-commit.enabled' = 'true',
        'write.metadata.previous-versions-max' = '5',
        'history.expire.max-snapshot-age-ms' = '86400000',
        'write.distribution-mode' = 'range',
        'write.parquet.compression-codec' = 'zstd'
    )
    AS SELECT * FROM temp_iceberg_table;
    """
)
## Writes dataframe into created ICEBERG table, bulk backfill
(
    final_ga_mysph_crsm_df
    .sortWithinPartitions("ga_fullvisitorid", "ga_session_key", "ga_timestamp_hit")
    .writeTo("glue_catalog.consumerdiv_sandbox_kaiwei.hub_test2")
    .using("iceberg")
    .tableProperty("location", "s3://test2")
    .tableProperty("write.format.default", "parquet")
    .partitionedBy("ga_publication", "ga_visit_date", "ga_platform")
    .overwritePartitions()
)
Does my code meet the recommendations of that sorting guideline? The prescriptive guidance section on setting the sort order seems to imply that another DDL statement is needed:
"You can set a hierarchical sort order at the table level by running a data definition language (DDL) statement with Spark. For available options, see the Iceberg documentation. After you set the sort order, writers will apply this sorting to subsequent data write operations in the Iceberg table. For example, in tables that are partitioned by date (yyyy-mm-dd) where most of the queries filter by uuid, you can use the DDL option Write Distributed By Partition Locally Ordered to make sure that Spark writes files with non-overlapping ranges."
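If I am reading that correctly, the guide seems to be pointing at the Iceberg Spark SQL extension syntax ALTER TABLE ... WRITE DISTRIBUTED BY PARTITION LOCALLY ORDERED BY ... described in the Iceberg Spark DDL docs. My rough guess at what that would look like for my table is below; this is only a sketch I have not actually run, and it assumes the Iceberg Spark session extensions are enabled on my cluster.
## My guess at the table-level sort order DDL the guide refers to (unverified sketch;
## requires the Iceberg Spark session extensions to be configured)
spark.sql(
    """
    ALTER TABLE glue_catalog.sandbox.test2
    WRITE DISTRIBUTED BY PARTITION
    LOCALLY ORDERED BY ga_fullvisitorid, ga_session_key, ga_timestamp_hit
    """
)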
I can't quite tell whether my code already does this, and I hope someone can enlighten me on whether I am doing it right. If I understand correctly, the second block of my code uses a different type of syntax (the DataFrame writeTo API) and not spark.sql.
I hope any help provided can be in the same format as my current code. I am still learning, so I hope to keep things simple and prioritise delivery on tight deadlines :D