Upgrading Glue Version from 3.0 to 4.0


Hello, I am upgrading the Glue version from 3.0 to 4.0, using Hudi as the data lake format. I am getting the error below:

py4j.protocol.Py4JJavaError: An error occurred while calling o973.pyWriteDynamicFrame. : org.apache.hudi.exception.HoodieException: Config conflict(key current value existing value):

Config:

"hoodie.datasource.hive_sync.table": tablename,  # Hudi table name cataloged in Glue
"className": "org.apache.hudi",
"hoodie.datasource.hive_sync.use_jdbc": "false",
"hoodie.datasource.write.recordkey.field": recordkey,  # Hudi key for identifying unique records
"hoodie.table.name": tablename,  # Hudi table name cataloged in Glue
"hoodie.consistency.check.enabled": "true",
"path": f"{tablepath}/{tablename}",
"hoodie.datasource.write.hive_style_partitioning": "false",
"hoodie.datasource.write.precombine.field": precombinekey,  # When two records have the same key, the one with the largest precombine value is kept
"hoodie.upsert.shuffle.parallelism": "10",
"hoodie.datasource.hive_sync.enable": "true",
"hoodie.datasource.hive_sync.database": database,  # Hudi database name cataloged in Glue
"hoodie.datasource.write.operation": "upsert",
"hoodie.datasource.write.row.writer.enable": "true",
"hoodie.enable.data.skipping": "true",
"hoodie.metadata.enable": "false",
"hoodie.metadata.index.column.stats.enable": "true",
"hoodie.datasource.write.table.type": "COPY_ON_WRITE",
"hoodie.datasource.hive_sync.partition_extractor_class": "org.apache.hudi.hive.MultiPartKeysValueExtractor",
"hoodie.datasource.write.keygenerator.class": "org.apache.hudi.keygen.ComplexKeyGenerator",
"hoodie.datasource.write.hive_style_partitioning": "true",
"hoodie.datasource.hive_sync.partition_fields": partitionkey,
"hoodie.datasource.write.partitionpath.field": partitionkey,
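Note that the options above set "hoodie.datasource.write.hive_style_partitioning" twice, once to "false" and once to "true"; a Python dict literal silently keeps only the last value for a duplicated key, which can hide exactly the kind of config conflict Hudi reports at write time. A small, hedged helper (a sketch, not part of any Hudi API) that merges option dicts and fails loudly on conflicts:

```python
def merge_hudi_options(*option_dicts):
    """Merge write-option dicts, raising on conflicting values for the same key."""
    merged = {}
    for opts in option_dicts:
        for key, value in opts.items():
            if key in merged and merged[key] != value:
                raise ValueError(
                    f"Config conflict for {key!r}: {merged[key]!r} vs {value!r}"
                )
            merged[key] = value
    return merged

# The duplicated key from the failing job, split into two dicts for illustration:
base_options = {"hoodie.datasource.write.hive_style_partitioning": "false"}
sync_options = {"hoodie.datasource.write.hive_style_partitioning": "true"}

try:
    merge_hudi_options(base_options, sync_options)
except ValueError as e:
    print(e)  # flags the duplicated key instead of silently keeping one value
```

Running such a check before the write surfaces the conflict in your own code rather than as a HoodieException from the Glue job.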

Asmita
asked 5 months ago · 275 views
2 answers

Hello,

I understand that you are trying to use Glue to write a Hudi table, but it is failing with Glue version 4.0.

I followed the AWS documentation to create a Hudi table from sample data in S3 that has been catalogued: [+] https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-format-hudi.html

Job parameters provided:

--> Key: --conf
Value: spark.serializer=org.apache.spark.serializer.KryoSerializer --conf spark.sql.hive.convertMetastoreParquet=false

--> Key: --datalake-formats
Value: hudi
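The same parameters can also be expressed programmatically as the job's default arguments, for example when creating the job with the AWS CLI or boto3. A minimal sketch of just the arguments dict (key names mirror what the console expects; the job itself is not created here):

```python
# Glue job default arguments equivalent to the console job parameters above.
default_arguments = {
    "--conf": (
        "spark.serializer=org.apache.spark.serializer.KryoSerializer "
        "--conf spark.sql.hive.convertMetastoreParquet=false"
    ),
    "--datalake-formats": "hudi",
}

print(default_arguments["--datalake-formats"])  # hudi
```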


import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

args = getResolvedOptions(sys.argv, ['JOB_NAME'])

sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

AmazonS3_node = glueContext.create_dynamic_frame.from_catalog(
    database="<db_name_in_catalog>",
    table_name="<table_name_in_catalog>",
    transformation_ctx="AmazonS3_node",
)

dataFrame=AmazonS3_node.toDF()

additional_options = {
    "hoodie.table.name": "<your_table_name>",
    "hoodie.datasource.write.storage.type": "COPY_ON_WRITE",
    "hoodie.datasource.write.operation": "upsert",
    "hoodie.datasource.write.recordkey.field": "<your_recordkey_field>",
    "hoodie.datasource.write.precombine.field": "<your_precombine_field>",
    "hoodie.datasource.write.partitionpath.field": "<your_partitionkey_field>",
    "hoodie.datasource.write.hive_style_partitioning": "true",
    "hoodie.datasource.hive_sync.enable": "true",
    "hoodie.datasource.hive_sync.database": "<your_database_name>",
    "hoodie.datasource.hive_sync.table": "<your_table_name>",
    "hoodie.datasource.hive_sync.partition_fields": "<your_partitionkey_field>",
    "hoodie.datasource.hive_sync.partition_extractor_class": "org.apache.hudi.hive.MultiPartKeysValueExtractor",
    "hoodie.datasource.hive_sync.use_jdbc": "false",
    "hoodie.datasource.hive_sync.mode": "hms",
    "path": "s3://<s3Path/>",
}

dataFrame.write.format("hudi") \
    .options(**additional_options) \
    .mode("overwrite") \
    .save()

job.commit()


--> Alternative way

Additionally, you can use the AWS Glue Studio visual editor to automatically generate the code for writing a Hudi table to S3.

Thank You!

AWS
Sahil_S
answered 5 months ago

Hi, I was able to work with this solution, thank you. The error is resolved and the table structure is created, but unfortunately the records are not getting inserted into the Hudi table. I cannot see any exception in the logs either. Can you give me a hint?
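One possible cause worth checking (an assumption, not confirmed from the logs above): with "hoodie.datasource.write.operation" set to "upsert", rows that share the same record key are collapsed to the one with the largest precombine value, and rows with a null record key typically fail key generation. A quick sanity check on a sample of the source rows, sketched in plain Python with hypothetical field names "id" and "updated_at":

```python
from collections import Counter

# Hypothetical sample rows pulled from the source DataFrame,
# e.g. via dataFrame.select("id", "updated_at").collect().
rows = [
    {"id": "a", "updated_at": 1},
    {"id": "a", "updated_at": 2},  # same record key: upsert keeps only one row
    {"id": "b", "updated_at": 1},
    {"id": None, "updated_at": 3},  # null record key: Hudi key generation will reject this
]

key_counts = Counter(r["id"] for r in rows)
duplicates = {k: n for k, n in key_counts.items() if k is not None and n > 1}
null_keys = key_counts.get(None, 0)

print(f"duplicate record keys: {duplicates}")  # {'a': 2}
print(f"rows with null record key: {null_keys}")  # 1
```

It may also be worth checking whether mode("overwrite") versus mode("append") matches the intended upsert behavior on repeated runs.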

Asmita
answered 5 months ago
