
[Bug Report] Bug found on Sagemaker Visual ETL flows


**Bug Description:** When creating the following flow, SageMaker produces a Python script with a syntax error.

Steps to reproduce:

  1. Create a source S3 bucket containing a CSV file and an Iceberg destination table in LakeHouse: Catalog name: s3tablescatalog/iceberg, Database: dwh, Table: test
  2. Create a visual ETL flow in "spark.compatibility" mode:
  • Read from a table in your Lakehouse catalogs (S3 Bucket with the raw file)
  • Filter data based on set conditions.
  • Choose specific columns to keep in dataset.
  • Change data types or add custom expressions to modify column values.
  • Write into a table in your Lakehouse catalogs (Iceberg table)
  3. Run the script

Output: "SyntaxError: invalid syntax. Perhaps you forgot a comma? (filename.py, line 38)"

Root cause analysis: the visual editor generates the following line of code, where the catalog name is not quoted:

all_columns = get_table_columns(spark, s3tablescatalog_iceberg, "dwh", "test") 

Best,

Wolfgang.

Asked 3 months ago · 40 views

1 Answer

Thank you for reporting this bug in SageMaker Visual ETL flows. I can confirm this is indeed a syntax error in the generated Python code.

The issue occurs when working with catalog names that contain special characters like underscores. The Visual ETL flow is incorrectly generating Python code that treats the catalog name as a variable rather than a string when referencing the S3 Tables Catalog for Apache Iceberg.

In the generated code, the catalog name s3tablescatalog_iceberg is not properly quoted, causing Python to interpret it as a variable name rather than a string literal. The correct syntax should have the catalog name in quotes, something like:

all_columns = get_table_columns(spark, "s3tablescatalog_iceberg", "dwh", "test")

When working with Apache Iceberg tables in Spark SQL, catalog names, namespaces, and table names should be properly quoted. This is especially important when the names contain special characters like underscores.
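As a rough illustration of the quoting rule (not the actual SageMaker code generator, whose internals aren't public), a small helper that backtick-quotes each part of a fully qualified table reference before it is embedded in a Spark SQL statement might look like this; the helper names are hypothetical:

```python
def quote_ident(name: str) -> str:
    """Wrap an identifier in backticks, escaping any embedded backticks.

    Spark SQL uses backticks to quote identifiers containing special
    characters such as underscores or slashes.
    """
    return "`" + name.replace("`", "``") + "`"


def qualified_table(catalog: str, database: str, table: str) -> str:
    """Build a fully qualified, safely quoted table reference."""
    return ".".join(quote_ident(part) for part in (catalog, database, table))


# A catalog name with an underscore is quoted rather than left bare:
ref = qualified_table("s3tablescatalog_iceberg", "dwh", "test")
print(ref)  # `s3tablescatalog_iceberg`.`dwh`.`test`
```

Quoting every part unconditionally is the safest choice: it is harmless for plain identifiers and necessary for ones containing special characters.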

As a temporary workaround, you might want to avoid using underscores or special characters in your catalog names until this bug is fixed. Alternatively, if you have access to edit the generated Python script before execution, you could manually add the necessary quotes around the catalog name.
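If you need to automate that manual edit, a hedged sketch of a patch step is shown below. It assumes the generated script contains the exact `get_table_columns(spark, ...)` call quoted in the question; the function `quote_catalog_arg` is hypothetical:

```python
import re


def quote_catalog_arg(script: str, catalog: str) -> str:
    """Replace an unquoted catalog argument to get_table_columns
    with a quoted string literal."""
    pattern = rf"get_table_columns\(spark,\s*{re.escape(catalog)}\b"
    return re.sub(pattern, f'get_table_columns(spark, "{catalog}"', script)


broken = 'all_columns = get_table_columns(spark, s3tablescatalog_iceberg, "dwh", "test")'
fixed = quote_catalog_arg(broken, "s3tablescatalog_iceberg")
print(fixed)
# all_columns = get_table_columns(spark, "s3tablescatalog_iceberg", "dwh", "test")
```

Keep in mind this only patches the one call site reported here; regenerating the flow after the underlying bug is fixed remains the proper solution.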

I'll make sure this bug report is forwarded to the appropriate team for resolution.
Sources
Accessing Amazon S3 tables with the Amazon S3 Tables Catalog for Apache Iceberg - Amazon Simple Storage Service
Using both external data and fine-grained data in Amazon SageMaker Unified Studio visual ETL jobs - Amazon SageMaker Unified Studio

Answered 3 months ago
