Partitioning creates a duplicate column


Hello Team,

We are working on data archival.

We are streaming data from Oracle to S3 via Kafka. We have a source connector (Debezium) and a sink connector (S3 Sink), and the data is stored in S3 using the field partitioner on a field in the Kafka record called template_name.
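For reference, a minimal sketch of the sink configuration described here, submitted through the Kafka Connect REST API (the connector name, topic, bucket, and region are hypothetical, and JSON output format is assumed):

```python
import requests

CONNECT_URL = "http://localhost:8083"  # hypothetical Kafka Connect endpoint

s3_sink_config = {
    "connector.class": "io.confluent.connect.s3.S3SinkConnector",
    "topics": "oracle.archive.records",                # hypothetical topic
    "s3.bucket.name": "my-archive-bucket",             # hypothetical bucket
    "s3.region": "us-east-1",
    "storage.class": "io.confluent.connect.s3.storage.S3Storage",
    "format.class": "io.confluent.connect.s3.format.json.JsonFormat",
    # FieldPartitioner writes each record under .../template_name=<value>/,
    # but the field also stays in the record payload written to the file.
    "partitioner.class": "io.confluent.connect.storage.partitioner.FieldPartitioner",
    "partition.field.name": "template_name",
    "flush.size": "1000",
}

resp = requests.put(f"{CONNECT_URL}/connectors/oracle-archive-s3-sink/config",
                    json=s3_sink_config)
resp.raise_for_status()
```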

In AWS Glue we created a crawler to create tables from the data stored in S3. The crawler creates the table with a partition column named 'template_name', but the data files also contain a 'template_name' field, so the table ends up with two columns named 'template_name'. Because of this we are not able to query the table and get a duplicate column error.

Asked 9 months ago · 505 views

1 Answer

Hello,

This is a known issue with the Kafka S3 sink connector: when you choose an existing field in the record for partitioning via the partition.field.name property, the connector writes your files into S3 partitions based on that field, but the same field is also present in the output data files.

So when a Glue crawler crawls the partitioned S3 output location, the resulting table has duplicate columns (one derived from the data files and one from the partition path) and cannot be queried from Athena or Hive. Please refer to the links below for more information.

Known Issues with the S3 sink connector

https://github.com/confluentinc/kafka-connect-hdfs/issues/221

https://github.com/confluentinc/kafka-connect-hdfs/issues/238

https://github.com/confluentinc/kafka-connect-storage-cloud/issues/387
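To confirm this is what happened, you can compare the table's data columns with its partition keys. A minimal sketch using boto3 (the database and table names are hypothetical):

```python
import boto3

glue = boto3.client("glue")

# Hypothetical names -- use the database/table your crawler created.
db, table = "archive_db", "template_records"

t = glue.get_table(DatabaseName=db, Name=table)["Table"]

data_columns = [c["Name"] for c in t["StorageDescriptor"]["Columns"]]
partition_keys = [k["Name"] for k in t.get("PartitionKeys", [])]

print("data columns:  ", data_columns)
print("partition keys:", partition_keys)
print("duplicates:    ", sorted(set(data_columns) & set(partition_keys)))
```

If 'template_name' shows up in both lists, the table is affected by this issue.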

Can you try the following steps on the AWS Glue crawler side?

  1. Delete the duplicate column from the Glue table: AWS Glue console -> Data Catalog tables -> choose your table -> Edit schema -> delete the duplicate column from the schema (see the scripted sketch after this list).

  2. Update your crawler properties as shown below:

[Screenshot: crawler output configuration, with the behavior for detected schema changes set to "Ignore the change and don't update the table in the data catalog"]
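If you prefer to script both steps, here is a minimal boto3 sketch (the database, table, and crawler names are hypothetical). Setting UpdateBehavior to LOG is the API equivalent of the console option in the screenshot: the crawler logs schema changes instead of applying them.

```python
import boto3

glue = boto3.client("glue")
db, table, crawler = "archive_db", "template_records", "archive-crawler"  # hypothetical

# Step 1: drop data columns that duplicate a partition key.
t = glue.get_table(DatabaseName=db, Name=table)["Table"]
partition_keys = {k["Name"] for k in t.get("PartitionKeys", [])}
t["StorageDescriptor"]["Columns"] = [
    c for c in t["StorageDescriptor"]["Columns"] if c["Name"] not in partition_keys
]

# update_table accepts only TableInput fields, so strip the read-only
# attributes that get_table returns before writing the definition back.
read_only = {"DatabaseName", "CreateTime", "UpdateTime", "CreatedBy",
             "IsRegisteredWithLakeFormation", "CatalogId", "VersionId"}
glue.update_table(DatabaseName=db,
                  TableInput={k: v for k, v in t.items() if k not in read_only})

# Step 2: stop the crawler from re-adding the column on its next run.
glue.update_crawler(
    Name=crawler,
    SchemaChangePolicy={"UpdateBehavior": "LOG", "DeleteBehavior": "LOG"},
)
```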

In the above process we are fixing the table schema manually and forcing the crawler not to update the schema of your Glue catalog table. Please refer to the AWS Glue documentation on setting crawler configuration options for more details.

AWS
Support Engineer
Answered 9 months ago
