Glue delete not working on Iceberg


Hello!

I am using Glue 4.0 and would like to delete rows from an Iceberg table. To build the deletion condition I need to fetch data from another table, which is a dblink table and not in Iceberg format.

Relevant Spark config info:

    .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.glue_catalog", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.glue_catalog.warehouse", f"s3://[REDACTED]/")
    .config("spark.sql.catalog.glue_catalog.catalog-impl", "org.apache.iceberg.aws.glue.GlueCatalog")
    .config("spark.sql.catalog.glue_catalog.io-impl", "org.apache.iceberg.aws.s3.S3FileIO")
    .enableHiveSupport()

Additional Job Parameters:

--datalake-formats: iceberg

When I issue this kind of DELETE command:

spark.sql("DELETE FROM `glue_catalog.ice_db`.`ice_table` AS T WHERE EXISTS (SELECT col1 FROM `another_db_link`.`db_link_table` AS R WHERE T.col1 = R.col1)")

I get this error message:

An error occurred while calling o97.showString. Cannot support vectorized reads for column [uuid] optional binary uuid (STRING) = 1 with encoding DELTA_BYTE_ARRAY. Disable vectorized reads to read this table/file

A simple deletion works fine, like this:

spark.sql("DELETE FROM `glue_catalog`.`ice_db.ice_table` AS T WHERE col1 = 1")

Do you have any idea?

Asked 1 year ago, viewed 648 times
1 Answer
Accepted Answer

The error you are getting is "An error occurred while calling o97.showString. Cannot support vectorized reads for column [uuid] optional binary uuid (STRING) = 1 with encoding DELTA_BYTE_ARRAY. Disable vectorized reads to read this table/file". Because Iceberg does not use Spark's vectorized reader, a solution is to set the property "read.parquet.vectorization.enabled" to false in the Glue table's Table properties, which avoids vectorized reads for that table.

Could you please try this on your end? Navigate to the Glue Tables page and choose the Glue table that the job accesses. Then click Actions > Edit table and add a new Table property with key: read.parquet.vectorization.enabled value: false
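
If you would rather script the console steps above, here is a minimal sketch using boto3. It is not an official recipe: the helper names (with_vectorization_disabled, disable_vectorized_reads) and the TABLE_INPUT_KEYS set are my own, and the set is an assumption about which GetTable fields UpdateTable accepts in TableInput, so please check it against the Glue API reference for your SDK version. The database and table names are the ones from the question.

```python
# Assumption: these are the GetTable fields that UpdateTable accepts inside
# TableInput. Read-only fields such as CreateTime or DatabaseName are dropped.
TABLE_INPUT_KEYS = {
    "Name", "Description", "Owner", "LastAccessTime", "LastAnalyzedTime",
    "Retention", "StorageDescriptor", "PartitionKeys", "ViewOriginalText",
    "ViewExpandedText", "TableType", "Parameters", "TargetTable",
}


def with_vectorization_disabled(table: dict) -> dict:
    """Build a TableInput from a GetTable 'Table' dict with the property added.

    Existing parameters (for example Iceberg's metadata_location) are kept.
    The input dict is not modified.
    """
    table_input = {k: v for k, v in table.items() if k in TABLE_INPUT_KEYS}
    params = dict(table_input.get("Parameters", {}))
    params["read.parquet.vectorization.enabled"] = "false"
    table_input["Parameters"] = params
    return table_input


def disable_vectorized_reads(database: str, table_name: str) -> None:
    """Fetch the Glue table, add the property, and write the table back."""
    import boto3  # imported here so the helper above works without boto3

    glue = boto3.client("glue")
    table = glue.get_table(DatabaseName=database, Name=table_name)["Table"]
    glue.update_table(
        DatabaseName=database,
        TableInput=with_vectorization_disabled(table),
    )
```

With AWS credentials configured, calling disable_vectorized_reads("ice_db", "ice_table") should have the same effect as adding the property under Actions > Edit table. It may be worth trying it on a test table first, since UpdateTable replaces the table definition with whatever TableInput contains.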

Sahil_S (AWS), answered 1 year ago
Reviewed by an AWS Expert 1 year ago
  • Thanks! One remark: it works for me only if I add the table property via Glue.
