Glue delete not working on Iceberg

Hello!

I am using Glue 4.0 and would like to delete rows from an Iceberg table. To build the deletion condition, I need to fetch data from another table, which is a dblink table and not in Iceberg format.

Relevant Spark config info:

    .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.glue_catalog", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.glue_catalog.warehouse", f"s3://[REDACTED]/")
    .config("spark.sql.catalog.glue_catalog.catalog-impl", "org.apache.iceberg.aws.glue.GlueCatalog")
    .config("spark.sql.catalog.glue_catalog.io-impl", "org.apache.iceberg.aws.s3.S3FileIO")
    .enableHiveSupport()

Additional job parameters:

--datalake-formats: iceberg

When I issue a DELETE command like this:

spark.sql("DELETE FROM `glue_catalog.ice_db`.`ice_table` AS T WHERE EXISTS (SELECT col1 FROM `another_db_link`.`db_link_table` AS R WHERE T.col1 = R.col1)")

I get this error message:

An error occurred while calling o97.showString. Cannot support vectorized reads for column [uuid] optional binary uuid (STRING) = 1 with encoding DELTA_BYTE_ARRAY. Disable vectorized reads to read this table/file

A simple deletion works fine, like this:

spark.sql("DELETE FROM `glue_catalog`.`ice_db.ice_table` AS T WHERE col1 = 1")

Do you have any idea?

asked a year ago · 648 views
1 answer
Accepted answer

The error you are getting is "An error occurred while calling o97.showString. Cannot support vectorized reads for column [uuid] optional binary uuid (STRING) = 1 with encoding DELTA_BYTE_ARRAY. Disable vectorized reads to read this table/file". Iceberg does not use Spark's vectorized reader; it uses its own vectorized Parquet reader, and the message indicates that this reader cannot handle the DELTA_BYTE_ARRAY encoding of the uuid column. A solution is to set the parameter "read.parquet.vectorization.enabled" to false in the Glue table's table properties, so that vectorized reads are avoided for this table.

Could you please try this on your end? Navigate to the Glue tables page and choose the Glue table that the job accesses. Then click Actions > Edit table and add a new table property with key: read.parquet.vectorization.enabled and value: false
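
For anyone who prefers to script the console steps rather than click through them, here is a minimal sketch using boto3. It is an illustration under assumptions, not part of the original answer: the helper names `build_table_input` and `disable_vectorized_reads` are hypothetical, the database and table names are taken from the question, and the key filter reflects the fields that Glue's UpdateTable accepts in `TableInput` (GetTable returns additional read-only fields that have to be dropped). It also assumes that nothing else commits to the table between the read and the update.

```python
from typing import Any, Dict

# Fields accepted inside TableInput by Glue's UpdateTable call. GetTable also
# returns read-only fields (DatabaseName, CreateTime, ...) that must be removed.
TABLE_INPUT_KEYS = {
    "Name", "Description", "Owner", "LastAccessTime", "LastAnalyzedTime",
    "Retention", "StorageDescriptor", "PartitionKeys", "ViewOriginalText",
    "ViewExpandedText", "TableType", "Parameters", "TargetTable",
}


def build_table_input(table: Dict[str, Any], key: str, value: str) -> Dict[str, Any]:
    """Copy a GetTable result into a TableInput dict and set one table property.

    Hypothetical helper: existing parameters are preserved and the input
    dict is left unmodified.
    """
    table_input = {k: v for k, v in table.items() if k in TABLE_INPUT_KEYS}
    params = dict(table.get("Parameters", {}))
    params[key] = value
    table_input["Parameters"] = params
    return table_input


def disable_vectorized_reads(database: str, table_name: str) -> None:
    """Set read.parquet.vectorization.enabled=false on a Glue table (hypothetical helper)."""
    import boto3  # imported lazily so build_table_input is usable without boto3

    glue = boto3.client("glue")
    table = glue.get_table(DatabaseName=database, Name=table_name)["Table"]
    table_input = build_table_input(
        table, "read.parquet.vectorization.enabled", "false"
    )
    glue.update_table(DatabaseName=database, TableInput=table_input)


# Example call, using the names from the question (requires AWS credentials):
# disable_vectorized_reads("ice_db", "ice_table")
```

Copying the existing `Parameters` back unchanged matters here, because the Glue entry of an Iceberg table carries Iceberg's own bookkeeping properties alongside any property you add.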

AWS
Sahil_S
answered a year ago
AWS
EXPERT
reviewed a year ago
  • Thanks! One remark: it works for me only if I add the table property via Glue.
