Glue delete not working on Iceberg


Hello!

I am using Glue 4.0 and would like to delete rows from an Iceberg table. To build the deletion condition I need to fetch data from another table, which is a dblink table and not in Iceberg format.

Relevant Spark config info:

    .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.glue_catalog", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.glue_catalog.warehouse", f"s3://[REDACTED]/")
    .config("spark.sql.catalog.glue_catalog.catalog-impl", "org.apache.iceberg.aws.glue.GlueCatalog")
    .config("spark.sql.catalog.glue_catalog.io-impl", "org.apache.iceberg.aws.s3.S3FileIO")
    .enableHiveSupport()
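
For context, the fragment above would normally sit inside a session builder along these lines (a minimal sketch only; the `spark` variable name, the import, and the final `getOrCreate()` call are assumptions added here, not taken from the post):

    from pyspark.sql import SparkSession

    # Sketch of a complete builder using the same Iceberg/Glue catalog settings.
    spark = (
        SparkSession.builder
        .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
        .config("spark.sql.catalog.glue_catalog", "org.apache.iceberg.spark.SparkCatalog")
        .config("spark.sql.catalog.glue_catalog.warehouse", "s3://[REDACTED]/")
        .config("spark.sql.catalog.glue_catalog.catalog-impl", "org.apache.iceberg.aws.glue.GlueCatalog")
        .config("spark.sql.catalog.glue_catalog.io-impl", "org.apache.iceberg.aws.s3.S3FileIO")
        .enableHiveSupport()
        .getOrCreate()
    )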

Additional Job Parameters:

    --datalake-formats: iceberg

When I issue this kind of DELETE command:

spark.sql("DELETE FROM `glue_catalog.ice_db`.`ice_table` AS T WHERE EXISTS (SELECT col1 FROM `another_db_link`.`db_link_table` AS R WHERE T.col1 = R.col1)")

I get this error message:

    An error occurred while calling o97.showString. Cannot support vectorized reads for column [uuid] optional binary uuid (STRING) = 1 with encoding DELTA_BYTE_ARRAY. Disable vectorized reads to read this table/file

A simple deletion works fine, like this:

spark.sql("DELETE FROM `glue_catalog`.`ice_db.ice_table` AS T WHERE col1 = 1")

Do you have any idea?

asked a year ago · 621 views
1 Answer
Accepted Answer

The error "Cannot support vectorized reads for column [uuid] optional binary uuid (STRING) = 1 with encoding DELTA_BYTE_ARRAY. Disable vectorized reads to read this table/file" occurs because Iceberg does not use Spark's vectorized reader. A solution is to set the property "read.parquet.vectorization.enabled" to false in the Glue table's Table properties itself, so that vectorized reads are avoided.

Could you please try this at your end? Navigate to your Glue tables page and choose the Glue table that is being accessed in the job, then click Actions > Edit table and add a new Table property with key read.parquet.vectorization.enabled and value false.
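If you prefer to script this instead of using the console, a minimal boto3 sketch along the following lines should make the same change. The database and table names are taken from the question; the boto3 approach itself and the choice of which TableInput fields to copy over are assumptions, so adjust them to your setup:

    import boto3

    glue = boto3.client("glue")

    # Read the current table definition so existing properties are preserved.
    table = glue.get_table(DatabaseName="ice_db", Name="ice_table")["Table"]

    params = dict(table.get("Parameters", {}))
    params["read.parquet.vectorization.enabled"] = "false"

    # update_table only accepts TableInput fields, so copy just those keys.
    allowed = {
        "Name", "Description", "Owner", "Retention", "StorageDescriptor",
        "PartitionKeys", "TableType", "Parameters",
    }
    table_input = {k: v for k, v in table.items() if k in allowed}
    table_input["Parameters"] = params

    glue.update_table(DatabaseName="ice_db", TableInput=table_input)

As the comment below notes, adding the property on the Glue table itself is what resolved the error for the asker.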

Sahil_S (AWS) · answered a year ago
Reviewed a year ago by an AWS EXPERT
  • Thanks! One remark: it works for me only if I add the table property via Glue.
