Glue 4 Hudi support


I am trying to store a data stream from Kafka in the Hudi format. I am following this doc: https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-format-hudi.html and I also tried creating a visual job. Whenever the job attempts to write a batch, I get the following error:

py4j.protocol.Py4JJavaError: An error occurred while calling o372.pyWriteDynamicFrame. : java.lang.ClassNotFoundException: Failed to load format with name hudi

I am using Glue 4 but I get the same error with Glue 3. I have double-checked that the job parameters are exactly like those in the doc above.

Any suggestions?

  • Check the classpath in the Spark UI. It sounds like the Hudi libraries are not on it; they should be added when the job has the argument --datalake-formats=hudi
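
The same classpath check can also be attempted from inside the job instead of the Spark UI. What follows is a minimal sketch, not an official Glue diagnostic: `find_hudi_entries` and `driver_classpath` are hypothetical helper names, the sample paths are made up, and it assumes the Hudi jars show up on the driver's `java.class.path`, which may not hold if they are loaded some other way.

```python
def find_hudi_entries(classpath, separator=":"):
    """Return the classpath entries whose file name mentions 'hudi'.

    Hypothetical helper for illustration. ':' is the classpath separator
    on Linux, which is an assumption about the job's runtime.
    """
    entries = [e for e in classpath.split(separator) if e]
    return [e for e in entries if "hudi" in e.rsplit("/", 1)[-1].lower()]


def driver_classpath(spark):
    """Read java.class.path from the driver JVM of a live SparkSession."""
    # _jvm is the py4j view that PySpark exposes; it is an internal attribute.
    return spark.sparkContext._jvm.java.lang.System.getProperty("java.class.path")


if __name__ == "__main__":
    # Sample string only, so the sketch runs without a Spark session.
    sample = "/opt/lib/spark-sql.jar:/opt/lib/hudi-spark-bundle.jar:/tmp/job.py"
    print(find_hudi_entries(sample))
```

Inside a job, `find_hudi_entries(driver_classpath(spark))` returning an empty list would hint that the libraries were not added, subject to the assumptions above.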

Asked 9 months ago, 270 views
3 answers

Looks like the parameters are correct.


Answered 9 months ago
  • That does look correct, so there must be something odd in your job (lots of people use Hudi like that). Do you have any connections? Maybe the issue is in the write call in the code; have you tried using a DataFrame?


Using a DataFrame works. If you can, please report that the sample generated by the Visual tool, which uses a DynamicFrame, is buggy. Thanks.
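
For readers hitting the same error, the workaround amounts to writing each micro-batch through Spark's DataFrame writer rather than converting it to a DynamicFrame first. The sketch below illustrates that idea under stated assumptions and is not the script the Visual tool generates: `hudi_options` and `write_batch` are hypothetical helper names, and the table name, key fields, and output path are placeholders the caller must supply. The `hoodie.*` keys are standard Hudi write options.

```python
def hudi_options(table_name, record_key, precombine_field):
    """Build a minimal set of Hudi write options from caller-supplied values."""
    return {
        "hoodie.table.name": table_name,
        "hoodie.datasource.write.recordkey.field": record_key,
        "hoodie.datasource.write.precombine.field": precombine_field,
        "hoodie.datasource.write.operation": "upsert",
    }


def write_batch(data_frame, batch_id, path, options):
    """Write one micro-batch as Hudi using the DataFrame API.

    data_frame is expected to be a Spark DataFrame, not a DynamicFrame.
    Returns True if a write was issued, False for an empty batch.
    """
    if data_frame.count() == 0:
        return False  # nothing to write for an empty batch
    data_frame.write.format("hudi").options(**options).mode("append").save(path)
    return True
```

In a Glue streaming job this could be passed to `glueContext.forEachBatch` through a small wrapper such as `lambda df, batch_id: write_batch(df, batch_id, path, opts)`, keeping the window size and checkpoint location from the existing script.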

Answered 9 months ago
  • You are right, it doesn't handle the S3 sink correctly for streaming visual jobs. I have reported it.


I also checked the classpath and it looks OK to me.

  • Yes, the example uses a Kafka connection as the data source
  • The job is generated by the Visual tool
  • I will try using a DataFrame
Answered 9 months ago
