Glue ETL converts NULL values of Snowflake columns to 0

0

Hi,

I am using GlueETL version Spark 3.0 with Python version Glue Job Details

The ETL job has only 2 steps. I am using CodeGenConfiguration to auto-create the Spark script from my service backend.

"{\"sink-node-1\":{\"nodeId\":\"sink-node-1\",\"dataPreview\":false,\"previewAmount\":0,\"inputs\":[\"source-node-1\"],\"name\":\"organization_id=47e04d28-24d3-47e7-9911-bbfc071c754e__id=62789a5b-78dd-4d41-ae96-1447674861a6__type=GDL\",\"generatedNodeName\":\"organization_id47e04d2824d347e79911bbfc071c754e__id62789a5b78dd4d41ae961447674861a6__typeGDL_sinknode1\",\"classification\":\"DataSink\",\"type\":\"S3\",\"streamingBatchInterval\":100,\"format\":\"parquet\",\"compression\":\"snappy\",\"path\":\"s3://x-bucket/event_etl_data/source_id=glueetl/schema_id=etl_raw_event/pipeline_id=fcf172f2-1cd1-4f9d-bdce-62b3b0c26696/organization_id=47e04d28-24d3-47e7-9911-bbfc071c754e/model_name=test-sql-with-database-namez__version=None/\",\"partitionKeys\":[[\"year\"],[\"month\"],[\"day\"],[\"hour\"]],\"schemaChangePolicy\":{\"enableUpdateCatalog\":false,\"updateBehavior\":null,\"database\":null,\"table\":null},\"updateCatalogOptions\":\"none\",\"calculatedType\":\"\"},\"source-node-1\":{\"nodeId\":\"source-node-1\",\"dataPreview\":false,\"previewAmount\":0,\"inputs\":[],\"name\":\"organization_id=47e04d28-24d3-47e7-9911-bbfc071c754e__id=9cf97c7b-ced8-4096-a7c3-2ca3560e0fd0__type=SNOWFLAKE\",\"generatedNodeName\":\"organization_id47e04d2824d347e79911bbfc071c754e__id9cf97c7bced84096a7c32ca3560e0fd0__typeSNOWFLAKE_sourcenode1\",\"classification\":\"DataSource\",\"type\":\"Connector\",\"isCatalog\":false,\"connectorName\":\"SNOWFLAKE\",\"connectionName\":\"organization_id=47e04d28-24d3-47e7-9911-bbfc071c754e__id=9cf97c7b-ced8-4096-a7c3-2ca3560e0fd0__type=SNOWFLAKE\",\"connectionType\":\"custom.jdbc\",\"outputSchemas\":[],\"connectionTable\":null,\"query\":\"SELECT \\\"ETL_DEMO\\\".\\\"PUBLIC\\\".\\\"EXAMPLE_TABLE\\\".\\\"A\\\" AS \\\"inputs__A\\\", \\\"ETL_DEMO\\\".\\\"PUBLIC\\\".\\\"EXAMPLE_TABLE\\\".\\\"B\\\" AS \\\"inputs__B\\\", \\\"ETL_DEMO\\\".\\\"PUBLIC\\\".\\\"EXAMPLE_TABLE\\\".\\\"C\\\" AS \\\"outputs__C\\\", \\\"ETL_DEMO\\\".\\\"PUBLIC\\\".\\\"EXAMPLE_TABLE\\\".\\\"ID\\\" AS \\\"feedback_id\\\", \\\"ETL_DEMO\\\".\\\"PUBLIC\\\".\\\"EXAMPLE_TABLE\\\".\\\"D\\\" AS \\\"timestamp\\\", year(SYSDATE()) AS \\\"year\\\", month(SYSDATE()) AS \\\"month\\\", day(SYSDATE()) AS \\\"day\\\", hour(SYSDATE()) AS \\\"hour\\\", SYSDATE() AS \\\"log_timestamp\\\" FROM \\\"ETL_DEMO\\\".\\\"PUBLIC\\\".\\\"EXAMPLE_TABLE\\\" ORDER BY \\\"D\\\"\",\"additionalOptions\":{\"filterPredicate\":\"\",\"partitionColumn\":null,\"lowerBound\":null,\"upperBound\":null,\"numPartitions\":null,\"jobBookmarkKeys\":[],\"jobBookmarkKeysSortOrder\":\"ASC\",\"dataTypeMapping\":{},\"filterPredicateArg\":[],\"dataTypeMappingArg\":[]},\"calculatedType\":\"\"}}"

Enter image description here

As you can see, I am using the Snowflake JDBC connector, and simply using S3DirectTarget to write the parquet files to S3 destination. However, any NULL values of numeric columns in the source table ends up with 0.0, and there is no way for me to tell whether these are actual 0.0s or falsely converted 0.0s. Without modifying the PySpark script since my backend service is dependent on CodeGenConfiguration, is there a way to make sure the NULL values do not get falsely converted?

Thanks, Kyle

feita há um ano203 visualizações
1 Resposta
0

Can you please check the data preview of the "Data source" components, they probably come already as "0.0" from the connector in which case you would need to update a support ticket with the connector provider on the marketplace.

profile pictureAWS
ESPECIALISTA
respondido há um ano

Você não está conectado. Fazer login para postar uma resposta.

Uma boa resposta responde claramente à pergunta, dá feedback construtivo e incentiva o crescimento profissional de quem perguntou.

Diretrizes para responder a perguntas