AWS Prescriptive Guide on Iceberg, how to write to S3 sorted within partition?

0

Came across the following useful documentation: https://docs.aws.amazon.com/prescriptive-guidance/latest/apache-iceberg-on-aws/best-practices-read.html#read-sort-order

I have a large table where I planned to have certain columns partitioned and another set of columns sorted, according to my end use case for a Quicksight dashboard where those sorted columns will be frequently used for count distinct operations on the dashboard.

I have the following code.

## Creates ICEBERG table in Athena and Glue Catalog first
final_df.limit(0).createOrReplaceTempView("temp_iceberg_table")
spark.sql(
            f"""
            CREATE TABLE IF NOT EXISTS glue_catalog.sandbox.test2
            USING iceberg
            PARTITIONED BY (ga_publication, ga_visit_date, ga_platform)
            SORTED BY (ga_fullvisitorid, ga_session_key, ga_timestamp_hit)
            LOCATION 's3://test2'
            TBLPROPERTIES (
                'write.format.default'='parquet',
                'write.metadata.delete-after-commit.enabled'='true',
                'write.metadata.previous-versions-max'='5',
                'history.expire.max-snapshot-age-ms'='86400000',
                'write.distribution-mode' = 'range',
                'write.parquet.compression-codec' = 'zstd'
                )
            AS SELECT * from temp_iceberg_table;
            """
        )

## Writes dataframe into created ICEBERG table, bulk backfill
final_ga_mysph_crsm_df.sortWithinPartitions("ga_fullvisitorid", "ga_session_key", "ga_timestamp_hit"
        ).writeTo(
            "glue_catalog.consumerdiv_sandbox_kaiwei.hub_test2"
        ).using(
            "iceberg"
        ).tableProperty(
            "location", f"s3://test2"
        ).tableProperty(
            "write.format.default", "parquet"
        ).partitionedBy(
            "ga_publication", "ga_visit_date", "ga_platform"
        ).overwritePartitions()

Does my code meet the recommendations of that Sorting guideline?

That prescriptive guide section on setting sort order seem to imply use of another line of code.

You can set a hierarchical sort order at the table level by running a data definition language (DDL) statement with Spark. For available options, see the Iceberg documentation. After you set the sort order, writers will apply this sorting to subsequent data write operations in the Iceberg table.

For example, in tables that are partitioned by date (yyyy-mm-dd) where most of the queries filter by uuid, you can use the DDL option Write Distributed By Partition Locally Ordered to make sure that Spark writes files with non-overlapping ranges.

Can't understand, hope someone can enlighten me if my code is doing it right. If I understand correctly, my second block in my code is using a different type of syntax and not spark.sql.

Hope whatever help provided can be in the same format as my current. I am still learning, hope to keep things simple and prioritise delivery on tight deadlines :D

Kai
asked 15 days ago30 views
1 Answer
1

Hello,

checking the first part of code, it indicates you're trying to use spark.sql to create a sorted iceberg table in glue catalog and athena and I believe it should work considering the fact that supported method to create a sorted table is to use sparksql since athena doesn't support creating iceberg table using sorted_by property directly as of now.

The next part of code implicates, inserting the dataframe/data into the created sorted table directly using pyspark. Going by the code, it seems fine with the properties that are defined, although, I would recommend you testing on a sample data to check if the final use case is achieved.

Would also recommend going through the docs- https://iceberg.apache.org/docs/1.5.1/spark-ddl/#alter-table-write-ordered-by to check if they are useful if you want to use spark.sql to directly write/update the tables.

Thanks!

AWS
SUPPORT ENGINEER
answered 15 days ago
profile picture
EXPERT
reviewed 14 days ago
profile picture
EXPERT
reviewed 15 days ago

You are not logged in. Log in to post an answer.

A good answer clearly answers the question and provides constructive feedback and encourages professional growth in the question asker.

Guidelines for Answering Questions