Glue ETL Not Updating Data Catalog

Hi, I am creating a Glue job that transforms CSV files into partitioned Parquet files, and I want to update the Data Catalog from the ETL. I am using this snippet of code to do that:

    from awsglue.dynamicframe import DynamicFrame

    # Convert the transformed Spark DataFrame into a Glue DynamicFrame.
    dynamic_frame: DynamicFrame = DynamicFrame.fromDF(
        final_data, glue_context, f"{file_type}_dataset")

    # S3 sink that should also create/update the catalog table on write.
    sink = glue_context.getSink(
        connection_type="s3",
        enableUpdateCatalog=True,
        updateBehavior="UPDATE_IN_DATABASE",
        path=target,
        partitionKeys=partition_cols)

    sink.setFormat("glueparquet")
    sink.setCatalogInfo(catalogDatabase=conf.get_db_name(),
                        catalogTableName=conf.get_table_name_by_source(file_type))
    sink.writeFrame(dynamic_frame)

As you can see, I am converting a Spark DataFrame to a Glue DynamicFrame and then writing it out as Parquet.
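For context, the snippet assumes the usual Glue job setup; here is a minimal sketch of it (job-specific variables such as `final_data`, `conf`, `target`, and `partition_cols` are produced elsewhere in my job):

    import sys

    from awsglue.context import GlueContext
    from awsglue.job import Job
    from awsglue.utils import getResolvedOptions
    from pyspark.context import SparkContext

    # Standard Glue job initialization.
    args = getResolvedOptions(sys.argv, ["JOB_NAME"])
    sc = SparkContext()
    glue_context = GlueContext(sc)
    spark = glue_context.spark_session
    job = Job(glue_context)
    job.init(args["JOB_NAME"], args)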

The output Parquet files are written, but I get the error below and the tables are not created in the Data Catalog:

Exception: Problem processing file type cdr_cs because An error occurred while calling o477.pyWriteDynamicFrame.
: scala.MatchError: (null,false) (of class scala.Tuple2)
	at com.amazonaws.services.glue.DataSink.forwardPotentialDynamicFrameToCatalog(DataSink.scala:177)
	at com.amazonaws.services.glue.DataSink.forwardPotentialDynamicFrameToCatalog(DataSink.scala:135)
	at com.amazonaws.services.glue.sinks.HadoopDataSink.$anonfun$writeDynamicFrame$2(HadoopDataSink.scala:302)
	at com.amazonaws.services.glue.util.FileSchemeWrapper.$anonfun$executeWithQualifiedScheme$1(FileSchemeWrapper.scala:77)
	at com.amazonaws.services.glue.util.FileSchemeWrapper.executeWith(FileSchemeWrapper.scala:70)
	at com.amazonaws.services.glue.util.FileSchemeWrapper.executeWithQualifiedScheme(FileSchemeWrapper.scala:77)
	at com.amazonaws.services.glue.sinks.HadoopDataSink.$anonfun$writeDynamicFrame$1(HadoopDataSink.scala:157)
	at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:253)
	at com.amazonaws.services.glue.sinks.HadoopDataSink.writeDynamicFrame(HadoopDataSink.scala:151)
	at com.amazonaws.services.glue.DataSink.pyWriteDynamicFrame(DataSink.scala:64)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:238)
	at java.lang.Thread.run(Thread.java:750)
Rafael
Asked 2 years ago · 1,644 views

1 Answer

Accepted Answer

The stack trace alone does not give all the information. Since the error happens while writing the DynamicFrame, there may be a mismatch between the file layout and the table definition in the Glue Data Catalog.

In some situations this can be an access issue as well. Check the job role's IAM permissions, and if Lake Formation is turned on, check the Lake Formation permissions too.
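If Lake Formation is governing the Data Catalog, the job role needs explicit grants on the target database before the sink can create or alter tables. A minimal sketch with boto3, assuming a hypothetical role ARN and database name (substitute your own values):

    import boto3

    lf = boto3.client("lakeformation")

    # Hypothetical values; use your Glue job role and target database.
    job_role_arn = "arn:aws:iam::123456789012:role/my-glue-job-role"

    lf.grant_permissions(
        Principal={"DataLakePrincipalIdentifier": job_role_arn},
        Resource={"Database": {"Name": "my_database"}},
        Permissions=["CREATE_TABLE", "ALTER", "DESCRIBE"],
    )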

AWS
Answered 2 years ago
  • Thanks for your answer. There are no tables in the database in the Data Catalog, so maybe the problem is a connectivity or configuration issue. The trouble is that the error above is quite difficult to troubleshoot because it carries no meaning at all.

  • In the end, the problem was that Lake Formation was not activated. Thanks @ananthtm for your help.
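For anyone landing here with the same MatchError: once permissions are sorted out, a quick way to confirm the catalog update went through is to look the table up after the job run. A minimal sketch with boto3 (database and table names are placeholders for whatever you pass to setCatalogInfo):

    import boto3

    glue = boto3.client("glue")

    # Placeholder names; use the values passed to setCatalogInfo.
    table = glue.get_table(DatabaseName="my_database", Name="my_table")
    print(table["Table"]["StorageDescriptor"]["Location"])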
