There are several ways to create an Iceberg table with partitioning, depending on which tool you're using:
Using Amazon Athena SQL:
CREATE TABLE iceberg_table (
time timestamp,
foo string
)
PARTITIONED BY (foo)
LOCATION 's3://your-bucket/your-path/'
TBLPROPERTIES ('table_type' = 'ICEBERG')
Note that when using the PARTITIONED BY clause in Athena, the partition columns must first be declared in the column list, and the data type must not be repeated in the PARTITIONED BY clause.
You can also use hidden partitioning with transforms in Athena:
CREATE TABLE iceberg_table (
time timestamp,
foo string
)
PARTITIONED BY (day(time), foo)
LOCATION 's3://your-bucket/your-path/'
TBLPROPERTIES ('table_type' = 'ICEBERG')
Using PyIceberg:
from pyiceberg.schema import Schema
from pyiceberg.types import NestedField, StringType, TimestampType
from pyiceberg.partitioning import PartitionSpec, PartitionField
from pyiceberg.transforms import IdentityTransform

# Define schema
schema = Schema(
    NestedField(field_id=1, name="time", field_type=TimestampType(), required=True),
    NestedField(field_id=2, name="foo", field_type=StringType(), required=True)
)

# Define partition spec
partition_spec = PartitionSpec(
    PartitionField(
        source_id=2,  # Refers to "foo"
        field_id=1000,
        transform=IdentityTransform(),
        name="foo"
    )
)

# Create table
glue_catalog.create_table(
    identifier="database_name.table_name",
    schema=schema,
    location="s3://your-bucket/your-path/",
    partition_spec=partition_spec
)
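The glue_catalog object above is assumed to already exist. A minimal sketch of how it could be obtained with load_catalog, with AWS credentials and region picked up from the standard environment:

from pyiceberg.catalog import load_catalog

# "glue" is just a catalog name for this session; "type": "glue" selects
# the AWS Glue catalog implementation.
glue_catalog = load_catalog("glue", **{"type": "glue"})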
Using Apache Spark:
import org.apache.spark.sql.functions.col

dataFrame.writeTo("catalog_name.database_name.table_name")
.tableProperty("format-version", "2")
.partitionedBy(col("foo")) // partitionedBy takes Column arguments, not plain strings
.createOrReplace()
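Hidden partitioning works here too, since the V2 writer accepts transform functions. A PySpark sketch, assuming a SparkSession that already has an Iceberg catalog named catalog_name configured and a DataFrame df with the columns above:

from pyspark.sql.functions import col, days

(df.writeTo("catalog_name.database_name.table_name")
    .tableProperty("format-version", "2")
    .partitionedBy(days(col("time")), col("foo"))  # day-granularity hidden partition on "time"
    .createOrReplace())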
The error in your example is that you're including the column type "string" in the PARTITIONED BY clause, which is not allowed: PARTITIONED BY (foo string) fails, while PARTITIONED BY (foo) succeeds. The partition specification must reference the column by name only, without the data type.
Sources
Create Iceberg tables - Amazon Athena
Getting started with Iceberg tables in Amazon Athena SQL - AWS Prescriptive Guidance
Working with Iceberg tables by using PyIceberg - AWS Prescriptive Guidance
Working with Iceberg tables by using Apache Spark - AWS Prescriptive Guidance
The PyIceberg implementation is quite unfortunate. Having to define the schema first in order to sort and/or partition a table is a suboptimal solution. I know it's good practice to define the schema first, but this is not possible for many source systems when dealing with hundreds or thousands of tables.
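One possible workaround, sketched below assuming a recent PyIceberg release (0.6 or later): create_table accepts a PyArrow schema directly, and table.update_spec() can add partition fields by column name afterwards, so the Iceberg schema never has to be written out by hand.

import pyarrow as pa
from pyiceberg.catalog import load_catalog
from pyiceberg.transforms import IdentityTransform

catalog = load_catalog("glue", **{"type": "glue"})

# In practice arrow_table would come from the source system; PyIceberg
# assigns Iceberg field IDs when it converts the Arrow schema.
arrow_table = pa.table({
    "time": pa.array([], type=pa.timestamp("us")),
    "foo": pa.array([], type=pa.string()),
})

table = catalog.create_table(
    identifier="database_name.table_name",
    schema=arrow_table.schema,
)

# Add the partition field after the fact, referencing the column by name.
with table.update_spec() as update:
    update.add_field("foo", IdentityTransform())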