Upgrading Glue Version from 3.0 to 4.0


Hello, I am upgrading the Glue version from 3.0 to 4.0, using Hudi as the data lake format. I am getting the error below:

py4j.protocol.Py4JJavaError: An error occurred while calling o973.pyWriteDynamicFrame. : org.apache.hudi.exception.HoodieException: Config conflict(key current value existing value):

These are the write options from the failing job, reformatted for readability. Note that `hoodie.datasource.write.hive_style_partitioning` appears twice, once as "false" and once as "true" — a duplicate/changed value like this is exactly what Hudi's config-conflict check reports:

```python
{
    "hoodie.datasource.hive_sync.table": tablename,  # Hudi table name cataloged in Glue
    "className": "org.apache.hudi",
    "hoodie.datasource.hive_sync.use_jdbc": "false",
    "hoodie.datasource.write.recordkey.field": recordkey,  # Hudi key for identifying unique records
    "hoodie.table.name": tablename,  # Hudi table name cataloged in Glue
    "hoodie.consistency.check.enabled": "true",
    "path": f"{tablepath}/{tablename}",
    "hoodie.datasource.write.hive_style_partitioning": "false",
    "hoodie.datasource.write.precombine.field": precombinekey,  # when two records share a key, the one with the larger precombine value wins
    "hoodie.upsert.shuffle.parallelism": "10",
    "hoodie.datasource.hive_sync.enable": "true",
    "hoodie.datasource.hive_sync.database": database,  # Hudi database name cataloged in Glue
    "hoodie.datasource.write.operation": "upsert",
    "hoodie.datasource.write.row.writer.enable": "true",
    "hoodie.enable.data.skipping": "true",
    "hoodie.metadata.enable": "false",
    "hoodie.metadata.index.column.stats.enable": "true",
    "hoodie.datasource.write.table.type": "COPY_ON_WRITE",
    "hoodie.datasource.hive_sync.partition_extractor_class": "org.apache.hudi.hive.MultiPartKeysValueExtractor",
    "hoodie.datasource.write.keygenerator.class": "org.apache.hudi.keygen.ComplexKeyGenerator",
    "hoodie.datasource.write.hive_style_partitioning": "true",  # duplicate of the "false" entry above
    "hoodie.datasource.hive_sync.partition_fields": partitionkey,
    "hoodie.datasource.write.partitionpath.field": partitionkey,
}
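Hudi 0.12.x (bundled with Glue 4.0) validates the options you pass against the table's existing configuration and raises this `HoodieException` when a key is given conflicting values. A quick, hypothetical way to spot such duplicates before writing — `find_conflicts` is my own helper, not a Hudi API:

```python
# Sketch: detect conflicting Hudi write options before calling save().
# find_conflicts() is a hypothetical helper, not part of Hudi or Glue.
def find_conflicts(pairs):
    """Return {key: [values...]} for keys given more than one distinct value."""
    seen = {}
    for key, value in pairs:
        seen.setdefault(key, []).append(value)
    return {k: v for k, v in seen.items() if len(set(v)) > 1}

# A subset of the option list from the error above, as (key, value) pairs:
options = [
    ("hoodie.datasource.write.hive_style_partitioning", "false"),
    ("hoodie.datasource.write.operation", "upsert"),
    ("hoodie.datasource.write.hive_style_partitioning", "true"),
]

print(find_conflicts(options))
# {'hoodie.datasource.write.hive_style_partitioning': ['false', 'true']}
```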

Asmita
asked 5 months ago · 249 views
2 Answers

Hello,

I understand that you are trying to use AWS Glue to write a Hudi table, but it is failing on Glue version 4.0.

I followed the AWS documentation to create a Hudi table from sample data in S3 that is already cataloged: [+] https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-format-hudi.html

Job parameters provided:

--> Key: --conf
    Value: spark.serializer=org.apache.spark.serializer.KryoSerializer --conf spark.sql.hive.convertMetastoreParquet=false

--> Key: --datalake-formats
    Value: hudi
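For reference, the same parameters can be expressed as a `DefaultArguments` dict if you create the job programmatically (e.g. via boto3 `glue.create_job(..., GlueVersion="4.0", DefaultArguments=...)`); this is a sketch, not part of the original answer:

```python
# Sketch: the job parameters above as a Glue DefaultArguments dict.
default_arguments = {
    # Kryo serialization is required by Hudi; the second --conf is chained
    # into the same value because Glue accepts only one --conf key.
    "--conf": ("spark.serializer=org.apache.spark.serializer.KryoSerializer "
               "--conf spark.sql.hive.convertMetastoreParquet=false"),
    "--datalake-formats": "hudi",
}
```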


```python
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

args = getResolvedOptions(sys.argv, ['JOB_NAME'])

sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

AmazonS3_node = glueContext.create_dynamic_frame.from_catalog(
    database="<db_name_in_catalog>",
    table_name="<table_name_in_catalog>",
    transformation_ctx="AmazonS3_node",
)

dataFrame = AmazonS3_node.toDF()

additional_options = {
    "hoodie.table.name": "<your_table_name>",
    "hoodie.datasource.write.storage.type": "COPY_ON_WRITE",
    "hoodie.datasource.write.operation": "upsert",
    "hoodie.datasource.write.recordkey.field": "<your_recordkey_field>",
    "hoodie.datasource.write.precombine.field": "<your_precombine_field>",
    "hoodie.datasource.write.partitionpath.field": "<your_partitionkey_field>",
    "hoodie.datasource.write.hive_style_partitioning": "true",
    "hoodie.datasource.hive_sync.enable": "true",
    "hoodie.datasource.hive_sync.database": "<your_database_name>",
    "hoodie.datasource.hive_sync.table": "<your_table_name>",
    "hoodie.datasource.hive_sync.partition_fields": "<your_partitionkey_field>",
    "hoodie.datasource.hive_sync.partition_extractor_class": "org.apache.hudi.hive.MultiPartKeysValueExtractor",
    "hoodie.datasource.hive_sync.use_jdbc": "false",
    "hoodie.datasource.hive_sync.mode": "hms",
    "path": "s3://<s3Path/>",
}

dataFrame.write.format("hudi") \
    .options(**additional_options) \
    .mode("overwrite") \
    .save()

job.commit()
```
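One detail worth flagging in the script above: `mode("overwrite")` recreates the table on every run. For an `upsert` operation, subsequent incremental runs normally use `mode("append")` so Hudi can merge incoming rows by record key and precombine field. A small, hypothetical helper illustrating the choice (not part of any Hudi or Glue API):

```python
# Sketch: pick the Spark save mode for a Hudi write.
# The first run bootstraps the table with "overwrite"; later runs use
# "append" so hoodie.datasource.write.operation=upsert can merge by key.
def hudi_save_mode(first_run: bool) -> str:
    return "overwrite" if first_run else "append"
```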


--> Alternative way

Additionally, you can use AWS Glue Studio to automatically generate the code for writing a Hudi table to S3.

Thank You!

AWS
Sahil_S
answered 5 months ago

Hi, I am able to work with this solution, thank you. The error is resolved and the table structure is created, but the records are not getting inserted into the Hudi table. I cannot see any exception in the logs either. Can you give me a hint?

Asmita
answered 5 months ago
