Glue 4.0 Iceberg issues

0

Hello,

I have two issues:

spark = (
    SparkSession.builder
    .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.glue_catalog", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.glue_catalog.warehouse", "s3://co-raw-sales-dev")
    .config("spark.sql.catalog.glue_catalog.catalog-impl", "org.apache.iceberg.aws.glue.GlueCatalog")
    .config("spark.sql.catalog.glue_catalog.io-impl", "org.apache.iceberg.aws.s3.S3FileIO")
    .enableHiveSupport()
    .getOrCreate()
)

df.writeTo("glue_catalog.co_raw_sales_dev.new_test").using("iceberg").create()

DDL of the created table:

CREATE TABLE co_raw_sales_dev.new_test (
  id bigint,
  name string,
  points bigint)
LOCATION 's3://co-raw-sales-dev//new_test'
TBLPROPERTIES (
  'table_type'='iceberg'
);

The problem is the double slash ("//") in the S3 location, between the bucket name and the table name.

This one works: df.writeTo("glue_catalog.co_raw_sales_dev.new_test2").using("iceberg").create()

but if I remove "glue_catalog", like df.writeTo("co_raw_sales_dev.new_test2").using("iceberg").create(),

I get the error: An error occurred while calling o339.create. Table implementation does not support writes: co_raw_sales_dev.new_test2

Am I missing some parameter in the SparkSession config?

Thank you, Adas.

asked a year ago · 1,131 views
1 Answer
0
Accepted Answer
  1. I doubt you can make that work correctly. S3 allows a double slash, but to a filesystem it means a directory with an empty name. I would move the files to avoid issues in the future (even if you can work around it now).
  2. You need to specify "glue_catalog" so Spark knows the table belongs to the Iceberg catalog; otherwise it is treated as a regular table.
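On point 1, the double slash usually comes from concatenating a warehouse root that already ends in "/" with a path segment. A minimal sketch of defensive path joining (plain Python; `s3_join` is a hypothetical helper, not part of Iceberg or Glue):

```python
def s3_join(*parts: str) -> str:
    """Join S3 path segments, collapsing stray slashes at the seams.

    Hypothetical helper: trimming the trailing slash from the warehouse
    root before appending the table name is one way to avoid locations
    like 's3://bucket//table'.
    """
    cleaned = [parts[0].rstrip("/")] + [p.strip("/") for p in parts[1:]]
    return "/".join(cleaned)

# Normalizing each segment first keeps the location clean even when the
# warehouse root carries a trailing slash:
location = s3_join("s3://co-raw-sales-dev/", "new_test")
```

On point 2, if typing the prefix everywhere is the concern, setting Spark's `spark.sql.defaultCatalog` to `glue_catalog` in the session config should let the prefix be omitted, though the explicit qualifier is unambiguous.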
AWS
EXPERT
answered a year ago
  • Thank you, Gonzalo, for the explanation.

    One more question: I am running the query below from Glue:

    query = f"""
    CREATE TABLE IF NOT EXISTS glue_catalog.{std_database}.{std_table}
    USING iceberg
    LOCATION 's3://{std_bucket}/{std_table}'
    PARTITIONED BY (id)
    TBLPROPERTIES (
      'format'='parquet',
      'write_compression'='snappy'
    )
    AS SELECT * FROM source_df
    """
    spark.sql(query)
    

    I inserted data and I can query it; everything seems fine. But when I run "SHOW CREATE TABLE {std_database}.{std_table}" in Athena, I get the error: CREATE TABLE statement cannot be generated because table has unsupported properties.

    Both properties I added are described in https://docs.aws.amazon.com/athena/latest/ug/querying-iceberg-creating-tables.html. What might be wrong?

  • Maybe instead of "USING iceberg", use the table property 'table_type'='ICEBERG'; otherwise it works for me.
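A hedged aside on the properties themselves: 'format' and 'write_compression' are the names Athena's own CREATE TABLE syntax uses; Iceberg documents its equivalents as 'write.format.default' and 'write.parquet.compression-codec'. A sketch of the CTAS with those Iceberg-native names (the placeholder values mirror the thread and are illustrative only; spark.sql(query) would run it in the same Glue session):

```python
# Illustrative placeholder values mirroring the thread.
std_database, std_table, std_bucket = "co_raw_sales_dev", "new_test", "co-raw-sales-dev"

# Iceberg's own property names for file format and compression, instead
# of Athena's 'format' / 'write_compression'.
query = f"""
CREATE TABLE IF NOT EXISTS glue_catalog.{std_database}.{std_table}
USING iceberg
LOCATION 's3://{std_bucket}/{std_table}'
TBLPROPERTIES (
  'write.format.default' = 'parquet',
  'write.parquet.compression-codec' = 'snappy'
)
AS SELECT * FROM source_df
"""
# spark.sql(query)  # run inside the Glue/Spark session from the question
```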

  • Hi Gonzalo,

    I tested various scenarios, and the only way "SHOW CREATE TABLE" works in Athena is when I create the table from PySpark without any TBLPROPERTIES.

    Also, the "SHOW CREATE TABLE" output differs between Athena and PySpark:

    Athena:

    CREATE TABLE co_raw_sales_dev.test1(
      id bigint,
      name string,
      points bigint,
      created string,
      updated string)
    PARTITIONED BY (`id`)
    LOCATION 's3://co-raw-sales-dev/test1'
    TBLPROPERTIES (
      'table_type'='iceberg'
    );
    

    PySpark:

    CREATE TABLE glue_catalog.co_raw_sales_dev.test1(
    id BIGINT,
    name STRING,
    points BIGINT,
    created STRING,
    updated STRING)
    USING iceberg
    PARTITIONED BY (id)
    LOCATION 's3://co-raw-sales-dev/test1'
    TBLPROPERTIES (
      'current-snapshot-id' = '5704046200302329156',
      'format' = 'iceberg/parquet',
      'format-version' = '1'
    )
    

    I think the problem is that Glue 4.0 creates Iceberg format-version 1 tables while Athena uses format-version 2.

  • Spark defaults to format-version 1, but it should work with 2.
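A follow-up sketch: per the Iceberg docs, the format version can be requested at create time via the 'format-version' table property. The DDL below is illustrative only (same hypothetical names as in the thread); spark.sql(ddl) would run it in the Glue session:

```python
# Sketch: request Iceberg format-version 2 explicitly at CREATE time.
# 'format-version' is a documented Iceberg table property; the names
# below (glue_catalog, co_raw_sales_dev, test1) mirror the thread.
ddl = """
CREATE TABLE IF NOT EXISTS glue_catalog.co_raw_sales_dev.test1 (
  id bigint,
  name string)
USING iceberg
TBLPROPERTIES ('format-version' = '2')
""".strip()

# spark.sql(ddl)  # run inside the Glue/Spark session from the question
```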
