Partitioning creates a duplicate column


Hello Team,

We are working on data archival.

We are streaming data from Oracle to S3 via Kafka. We have a source connector (Debezium) and a sink connector (S3 Sink), and the data is stored in S3 using the field partitioner on a field in the Kafka record called template_name.
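For reference, a minimal sketch of the sink configuration described here, submitted through the Kafka Connect REST API (the connector name, topic, bucket, and region are hypothetical, and JSON output format is assumed):

```python
import requests

CONNECT_URL = "http://localhost:8083"  # hypothetical Kafka Connect endpoint

s3_sink_config = {
    "connector.class": "io.confluent.connect.s3.S3SinkConnector",
    "topics": "oracle.archive.records",                # hypothetical topic
    "s3.bucket.name": "my-archive-bucket",             # hypothetical bucket
    "s3.region": "us-east-1",
    "storage.class": "io.confluent.connect.s3.storage.S3Storage",
    "format.class": "io.confluent.connect.s3.format.json.JsonFormat",
    # FieldPartitioner writes each record under .../template_name=<value>/,
    # but the field also stays in the record payload written to the file.
    "partitioner.class": "io.confluent.connect.storage.partitioner.FieldPartitioner",
    "partition.field.name": "template_name",
    "flush.size": "1000",
}

resp = requests.put(f"{CONNECT_URL}/connectors/oracle-archive-s3-sink/config",
                    json=s3_sink_config)
resp.raise_for_status()
```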

In AWS Glue we created a crawler to create tables from the data stored in S3. The crawler creates the table with a partition column named 'template_name', but the data files also contain a 'template_name' field, so the table ends up with two columns named 'template_name'. Because of this we are not able to query the table and get a duplicate column error.

Asked 9 months ago · 505 views

1 Answer

Hello,

This is a known issue with the Kafka S3 sink connector: when you choose an existing field in the record for partitioning via the partition.field.name property, the connector writes your files into S3 partitions based on that field, but the same field is also present in the output data files.

So when a Glue crawler crawls the partitioned S3 output location, the resulting table has duplicate columns (one derived from the data files and one from the partition path) and cannot be queried from Athena or Hive. Please refer to the links below for more information.

Known Issues with the S3 sink connector

https://github.com/confluentinc/kafka-connect-hdfs/issues/221

https://github.com/confluentinc/kafka-connect-hdfs/issues/238

https://github.com/confluentinc/kafka-connect-storage-cloud/issues/387
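To confirm this is what happened, you can compare the table's data columns with its partition keys. A minimal sketch using boto3 (the database and table names are hypothetical):

```python
import boto3

glue = boto3.client("glue")

# Hypothetical names -- use the database/table your crawler created.
db, table = "archive_db", "template_records"

t = glue.get_table(DatabaseName=db, Name=table)["Table"]

data_columns = [c["Name"] for c in t["StorageDescriptor"]["Columns"]]
partition_keys = [k["Name"] for k in t.get("PartitionKeys", [])]

print("data columns:  ", data_columns)
print("partition keys:", partition_keys)
print("duplicates:    ", sorted(set(data_columns) & set(partition_keys)))
```

If 'template_name' shows up in both lists, the table is affected by this issue.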

Can you try the following steps on the AWS Glue crawler side?

  1. Delete the duplicate column from the Glue table: AWS Glue console -> Data Catalog tables -> choose your table -> Edit schema -> delete the duplicate column from the schema (see the scripted sketch after this list).

  2. Update your crawler properties as shown below:

[Screenshot: crawler output configuration, with the behavior for detected schema changes set to "Ignore the change and don't update the table in the data catalog"]
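If you prefer to script both steps, here is a minimal boto3 sketch (the database, table, and crawler names are hypothetical). Setting UpdateBehavior to LOG is the API equivalent of the console option in the screenshot: the crawler logs schema changes instead of applying them.

```python
import boto3

glue = boto3.client("glue")
db, table, crawler = "archive_db", "template_records", "archive-crawler"  # hypothetical

# Step 1: drop data columns that duplicate a partition key.
t = glue.get_table(DatabaseName=db, Name=table)["Table"]
partition_keys = {k["Name"] for k in t.get("PartitionKeys", [])}
t["StorageDescriptor"]["Columns"] = [
    c for c in t["StorageDescriptor"]["Columns"] if c["Name"] not in partition_keys
]

# update_table accepts only TableInput fields, so strip the read-only
# attributes that get_table returns before writing the definition back.
read_only = {"DatabaseName", "CreateTime", "UpdateTime", "CreatedBy",
             "IsRegisteredWithLakeFormation", "CatalogId", "VersionId"}
glue.update_table(DatabaseName=db,
                  TableInput={k: v for k, v in t.items() if k not in read_only})

# Step 2: stop the crawler from re-adding the column on its next run.
glue.update_crawler(
    Name=crawler,
    SchemaChangePolicy={"UpdateBehavior": "LOG", "DeleteBehavior": "LOG"},
)
```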

In the above process we are fixing the table schema manually and forcing the crawler not to update the schema of your Glue catalog table. Please refer to the AWS Glue documentation on setting crawler configuration options for more details.

AWS
Support Engineer
Answered 9 months ago
