Glue 4 Hudi support


I am trying to store a data stream from Kafka in the Hudi format. I am following this doc, https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-format-hudi.html, and I even tried creating a visual job. Whenever the job attempts to write a batch, I get the following error:

py4j.protocol.Py4JJavaError: An error occurred while calling o372.pyWriteDynamicFrame. : java.lang.ClassNotFoundException: Failed to load format with name hudi

I am using Glue 4 but I get the same error with Glue 3. I have double-checked that the job parameters are exactly like those in the doc above.

Any suggestions?

  • Check the classpath in the Spark UI. It sounds like the Hudi libraries were not added, which should happen automatically if you set the argument --datalake-formats=hudi.
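A quick way to rule out a parameter typo is to check the job's default arguments programmatically. The sketch below is only an illustration: the helper name `check_hudi_job_arguments` is hypothetical, and the Kryo serializer check assumes the `--conf` value recommended in the linked AWS doc. The dictionary could come, for example, from boto3's `glue.get_job(JobName=...)["Job"]["DefaultArguments"]`.

```python
def check_hudi_job_arguments(default_arguments):
    """Return a list of likely problems in a Glue job's default arguments.

    Hypothetical helper: it only inspects the dictionary passed in and
    does not call AWS.
    """
    problems = []

    # --datalake-formats is assumed to hold one value or a comma-separated list.
    raw_formats = default_arguments.get("--datalake-formats", "")
    formats = [name.strip() for name in raw_formats.split(",")]
    if "hudi" not in formats:
        problems.append("--datalake-formats does not include hudi")

    # The linked doc also asks for the Kryo serializer in --conf.
    conf = default_arguments.get("--conf", "")
    if "spark.serializer=org.apache.spark.serializer.KryoSerializer" not in conf:
        problems.append("--conf does not set the Kryo serializer")

    return problems


if __name__ == "__main__":
    example_arguments = {
        "--datalake-formats": "hudi",
        "--conf": "spark.serializer=org.apache.spark.serializer.KryoSerializer "
                  "--conf spark.sql.hive.convertMetastoreParquet=false",
    }
    print(check_hudi_job_arguments(example_arguments))
```

An empty list only means the arguments look right. As the rest of this thread shows, the error can still occur when the arguments are correct.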

Asked 9 months ago · 270 views
3 answers

Looks like the parameters are correct.


Answered 9 months ago
  • That does look correct, so there must be something odd in your job (lots of people use Hudi like that). Do you have any connections? Maybe the issue is in the write call in the code. Have you tried using a DataFrame?


Using a DataFrame works. If you can, please report that the sample generated by the Visual tool, which uses a DynamicFrame, is buggy. Thanks.
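For anyone landing here with the same error, this is roughly what the DataFrame-based write looks like. It is a minimal sketch rather than the exact code from my job: the helper names, table name, key fields, and S3 path are placeholders, and the option keys are the standard Hudi Spark datasource write options.

```python
def build_hudi_options(table_name, record_key, precombine_field, partition_field=None):
    """Return Hudi writer options for an upsert into a copy-on-write table."""
    options = {
        "hoodie.table.name": table_name,
        "hoodie.datasource.write.recordkey.field": record_key,
        "hoodie.datasource.write.precombine.field": precombine_field,
        "hoodie.datasource.write.operation": "upsert",
        "hoodie.datasource.write.table.type": "COPY_ON_WRITE",
    }
    if partition_field:
        options["hoodie.datasource.write.partitionpath.field"] = partition_field
    return options


def write_batch(data_frame, target_path, hudi_options):
    """Append one micro-batch through the Spark DataFrame writer."""
    (
        data_frame.write.format("hudi")
        .options(**hudi_options)
        .mode("append")
        .save(target_path)
    )


def make_batch_function(target_path, hudi_options):
    """Build a callback with the (data_frame, batch_id) shape used per micro-batch."""
    def process_batch(data_frame, batch_id):
        # Skip empty micro-batches so no empty commit is attempted.
        if data_frame.count() > 0:
            write_batch(data_frame, target_path, hudi_options)
    return process_batch


# Placeholder values, for illustration only.
hudi_options = build_hudi_options("events", "event_id", "event_time", "event_date")
process_batch = make_batch_function("s3://my-bucket/hudi/events/", hudi_options)
```

In the Glue script, `process_batch` would then be passed as the `batch_function` argument of `glueContext.forEachBatch`, together with the streaming DataFrame read from the Kafka connection and the `windowSize` and `checkpointLocation` options.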

Answered 9 months ago
  • You are right, it doesn't handle the S3 sink correctly for streaming visual jobs. Reported.


I also checked the classpath and it looks OK to me.

  • Yes, the example uses a Kafka connection as the data source.
  • The job is generated by the Visual tool.
  • I will try using a DataFrame.

Answered 9 months ago