Glue ETL converts NULL values of Snowflake columns to 0

0

Hi,

I am using GlueETL version Spark 3.0 with Python version Glue Job Details

The ETL job has only 2 steps. I am using CodeGenConfiguration to auto-create the Spark script from my service backend.

"{\"sink-node-1\":{\"nodeId\":\"sink-node-1\",\"dataPreview\":false,\"previewAmount\":0,\"inputs\":[\"source-node-1\"],\"name\":\"organization_id=47e04d28-24d3-47e7-9911-bbfc071c754e__id=62789a5b-78dd-4d41-ae96-1447674861a6__type=GDL\",\"generatedNodeName\":\"organization_id47e04d2824d347e79911bbfc071c754e__id62789a5b78dd4d41ae961447674861a6__typeGDL_sinknode1\",\"classification\":\"DataSink\",\"type\":\"S3\",\"streamingBatchInterval\":100,\"format\":\"parquet\",\"compression\":\"snappy\",\"path\":\"s3://x-bucket/event_etl_data/source_id=glueetl/schema_id=etl_raw_event/pipeline_id=fcf172f2-1cd1-4f9d-bdce-62b3b0c26696/organization_id=47e04d28-24d3-47e7-9911-bbfc071c754e/model_name=test-sql-with-database-namez__version=None/\",\"partitionKeys\":[[\"year\"],[\"month\"],[\"day\"],[\"hour\"]],\"schemaChangePolicy\":{\"enableUpdateCatalog\":false,\"updateBehavior\":null,\"database\":null,\"table\":null},\"updateCatalogOptions\":\"none\",\"calculatedType\":\"\"},\"source-node-1\":{\"nodeId\":\"source-node-1\",\"dataPreview\":false,\"previewAmount\":0,\"inputs\":[],\"name\":\"organization_id=47e04d28-24d3-47e7-9911-bbfc071c754e__id=9cf97c7b-ced8-4096-a7c3-2ca3560e0fd0__type=SNOWFLAKE\",\"generatedNodeName\":\"organization_id47e04d2824d347e79911bbfc071c754e__id9cf97c7bced84096a7c32ca3560e0fd0__typeSNOWFLAKE_sourcenode1\",\"classification\":\"DataSource\",\"type\":\"Connector\",\"isCatalog\":false,\"connectorName\":\"SNOWFLAKE\",\"connectionName\":\"organization_id=47e04d28-24d3-47e7-9911-bbfc071c754e__id=9cf97c7b-ced8-4096-a7c3-2ca3560e0fd0__type=SNOWFLAKE\",\"connectionType\":\"custom.jdbc\",\"outputSchemas\":[],\"connectionTable\":null,\"query\":\"SELECT \\\"ETL_DEMO\\\".\\\"PUBLIC\\\".\\\"EXAMPLE_TABLE\\\".\\\"A\\\" AS \\\"inputs__A\\\", \\\"ETL_DEMO\\\".\\\"PUBLIC\\\".\\\"EXAMPLE_TABLE\\\".\\\"B\\\" AS \\\"inputs__B\\\", \\\"ETL_DEMO\\\".\\\"PUBLIC\\\".\\\"EXAMPLE_TABLE\\\".\\\"C\\\" AS \\\"outputs__C\\\", \\\"ETL_DEMO\\\".\\\"PUBLIC\\\".\\\"EXAMPLE_TABLE\\\".\\\"ID\\\" AS \\\"feedback_id\\\", \\\"ETL_DEMO\\\".\\\"PUBLIC\\\".\\\"EXAMPLE_TABLE\\\".\\\"D\\\" AS \\\"timestamp\\\", year(SYSDATE()) AS \\\"year\\\", month(SYSDATE()) AS \\\"month\\\", day(SYSDATE()) AS \\\"day\\\", hour(SYSDATE()) AS \\\"hour\\\", SYSDATE() AS \\\"log_timestamp\\\" FROM \\\"ETL_DEMO\\\".\\\"PUBLIC\\\".\\\"EXAMPLE_TABLE\\\" ORDER BY \\\"D\\\"\",\"additionalOptions\":{\"filterPredicate\":\"\",\"partitionColumn\":null,\"lowerBound\":null,\"upperBound\":null,\"numPartitions\":null,\"jobBookmarkKeys\":[],\"jobBookmarkKeysSortOrder\":\"ASC\",\"dataTypeMapping\":{},\"filterPredicateArg\":[],\"dataTypeMappingArg\":[]},\"calculatedType\":\"\"}}"

Enter image description here

As you can see, I am using the Snowflake JDBC connector, and simply using S3DirectTarget to write the parquet files to S3 destination. However, any NULL values of numeric columns in the source table ends up with 0.0, and there is no way for me to tell whether these are actual 0.0s or falsely converted 0.0s. Without modifying the PySpark script since my backend service is dependent on CodeGenConfiguration, is there a way to make sure the NULL values do not get falsely converted?

Thanks, Kyle

已提問 1 年前檢視次數 203 次
1 個回答
0

Can you please check the data preview of the "Data source" components, they probably come already as "0.0" from the connector in which case you would need to update a support ticket with the connector provider on the marketplace.

profile pictureAWS
專家
已回答 1 年前

您尚未登入。 登入 去張貼答案。

一個好的回答可以清楚地回答問題並提供建設性的意見回饋,同時有助於提問者的專業成長。

回答問題指南