Glue delete not working on Iceberg


Hello!

I am using Glue 4.0 and would like to delete rows from an Iceberg table. To build the deletion condition I need to fetch data from another table, which is a dblink table and not in Iceberg format.

Relevant Spark config info:

    .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.glue_catalog", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.glue_catalog.warehouse", f"s3://[REDACTED]/")
    .config("spark.sql.catalog.glue_catalog.catalog-impl", "org.apache.iceberg.aws.glue.GlueCatalog")
    .config("spark.sql.catalog.glue_catalog.io-impl", "org.apache.iceberg.aws.s3.S3FileIO")
    .enableHiveSupport()

Additional job parameters:

--datalake-formats: iceberg

When I issue a DELETE command like this:

spark.sql("DELETE FROM `glue_catalog.ice_db`.`ice_table` AS T WHERE EXISTS (SELECT col1 FROM `another_db_link`.`db_link_table` AS R WHERE T.col1 = R.col1)")

I get this error message:

An error occurred while calling o97.showString. Cannot support vectorized reads for column [uuid] optional binary uuid (STRING) = 1 with encoding DELTA_BYTE_ARRAY. Disable vectorized reads to read this table/file

A simple deletion works fine, like this:

spark.sql("DELETE FROM `glue_catalog`.`ice_db.ice_table` AS T WHERE col1 = 1")

Do you have any idea?

asked a year ago, 648 views
1 Answer
Accepted Answer

The error "Cannot support vectorized reads for column [uuid] optional binary uuid (STRING) = 1 with encoding DELTA_BYTE_ARRAY. Disable vectorized reads to read this table/file" comes from Iceberg's own vectorized Parquet reader: Iceberg does not use Spark's vectorized reader, and its reader cannot decode the DELTA_BYTE_ARRAY encoding used for this column. A solution is to set the parameter "read.parquet.vectorization.enabled" to false in the Glue table's table properties, so that the table is read without vectorization.

Could you please try this on your end? Navigate to the Glue Tables page and choose the Glue table that the job accesses. Then click Actions > Edit table and add a new table property with key: read.parquet.vectorization.enabled, value: false
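If you would rather script the console steps above, something like the following might work. It is only a sketch, not a verified procedure for Iceberg tables: it assumes boto3 is available and that the job role may call glue:GetTable and glue:UpdateTable. The helper names build_table_input and disable_vectorized_reads are made up for illustration. Glue's UpdateTable accepts only a subset of the fields that GetTable returns, so the sketch filters the table definition first and keeps all existing parameters.

```python
from copy import deepcopy

# Fields accepted in TableInput by Glue's UpdateTable. GetTable returns extra
# read-only fields (DatabaseName, CreateTime, ...) that must be dropped.
TABLE_INPUT_KEYS = {
    "Name", "Description", "Owner", "LastAccessTime", "LastAnalyzedTime",
    "Retention", "StorageDescriptor", "PartitionKeys", "ViewOriginalText",
    "ViewExpandedText", "TableType", "Parameters", "TargetTable",
}


def build_table_input(table, key, value):
    """Return a TableInput dict copied from `table` with one property set.

    Hypothetical helper: `table` is the "Table" dict from glue.get_table().
    The input dict is not modified.
    """
    table_input = {
        k: deepcopy(v) for k, v in table.items() if k in TABLE_INPUT_KEYS
    }
    table_input.setdefault("Parameters", {})[key] = value
    return table_input


def disable_vectorized_reads(database, table_name):
    """Set read.parquet.vectorization.enabled=false on a Glue table."""
    import boto3  # imported lazily so build_table_input works without AWS

    glue = boto3.client("glue")
    table = glue.get_table(DatabaseName=database, Name=table_name)["Table"]
    glue.update_table(
        DatabaseName=database,
        TableInput=build_table_input(
            table, "read.parquet.vectorization.enabled", "false"
        ),
    )
```

With the names from the question, the call would be disable_vectorized_reads("ice_db", "ice_table"). Try it on a non-production table first, since it rewrites the whole table definition.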

AWS
Sahil_S
answered a year ago
AWS
EXPERT
reviewed a year ago
  • Thanks! One remark: it works for me only if I add the table property via Glue.
