Hi, I am creating a Glue job that transforms CSV files into partitioned Parquet files, and I want to update the Data Catalog from the ETL script. I am using this snippet of code to do that:
dynamic_frame: DynamicFrame = DynamicFrame.fromDF(final_data, glue_context, f"{file_type}_dataset")
sink = glue_context.getSink(
    connection_type="s3",
    enableUpdateCatalog=True,
    updateBehavior="UPDATE_IN_DATABASE",
    path=f"{target}",
    partitionKeys=partition_cols)
sink.setFormat("glueparquet")
sink.setCatalogInfo(catalogDatabase=conf.get_db_name(),
                    catalogTableName=conf.get_table_name_by_source(file_type))
sink.writeFrame(dynamic_frame)
As you can see, I am converting a Spark DataFrame to a Glue DynamicFrame and then writing it out as Parquet.
The output Parquet files are written, but I am getting this error and the tables are not created in the Data Catalog:
Exception: Problem processing file type cdr_cs because An error occurred while calling o477.pyWriteDynamicFrame.
: scala.MatchError: (null,false) (of class scala.Tuple2)
at com.amazonaws.services.glue.DataSink.forwardPotentialDynamicFrameToCatalog(DataSink.scala:177)
at com.amazonaws.services.glue.DataSink.forwardPotentialDynamicFrameToCatalog(DataSink.scala:135)
at com.amazonaws.services.glue.sinks.HadoopDataSink.$anonfun$writeDynamicFrame$2(HadoopDataSink.scala:302)
at com.amazonaws.services.glue.util.FileSchemeWrapper.$anonfun$executeWithQualifiedScheme$1(FileSchemeWrapper.scala:77)
at com.amazonaws.services.glue.util.FileSchemeWrapper.executeWith(FileSchemeWrapper.scala:70)
at com.amazonaws.services.glue.util.FileSchemeWrapper.executeWithQualifiedScheme(FileSchemeWrapper.scala:77)
at com.amazonaws.services.glue.sinks.HadoopDataSink.$anonfun$writeDynamicFrame$1(HadoopDataSink.scala:157)
at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:253)
at com.amazonaws.services.glue.sinks.HadoopDataSink.writeDynamicFrame(HadoopDataSink.scala:151)
at com.amazonaws.services.glue.DataSink.pyWriteDynamicFrame(DataSink.scala:64)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:750)
Thanks for your answer. There are no tables in the database in the Data Catalog, so the problem may be related to connectivity or configuration. The error above is quite difficult to troubleshoot because the message gives no indication of the actual cause.
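To rule out a catalog-side problem, it can help to query the Data Catalog directly from outside the job. The following is only a minimal sketch, assuming boto3 is installed and the caller has glue:GetDatabase and glue:GetTables permissions. The helper name catalog_status and the database name in the usage note are hypothetical and not part of my job.

```python
def catalog_status(database_name, glue_client=None):
    """Return (database_exists, table_names) for a Glue Data Catalog database.

    catalog_status is a hypothetical helper, not part of the Glue API.
    """
    if glue_client is None:
        # Imported lazily so the helper can also be exercised with a stub client.
        import boto3
        glue_client = boto3.client("glue")

    try:
        glue_client.get_database(Name=database_name)
    except glue_client.exceptions.EntityNotFoundException:
        # The database itself is missing, so no tables can exist in it.
        return False, []

    # get_tables is paginated, so collect table names across all pages.
    table_names = []
    paginator = glue_client.get_paginator("get_tables")
    for page in paginator.paginate(DatabaseName=database_name):
        table_names.extend(table["Name"] for table in page["TableList"])
    return True, table_names
```

Calling catalog_status("my_database") with a placeholder database name returns whether the database exists and which tables it currently holds. An existing database with an empty table list would match what I am seeing: the Parquet files are written, but the catalog update never happens.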
In the end, the problem was that Lake Formation was not activated. Thanks @ananthtm for your help.
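For anyone hitting the same MatchError, one way to look at the Lake Formation side is to read the account's data lake settings. This is a rough sketch only, assuming boto3 is installed and the caller is allowed to call lakeformation:GetDataLakeSettings. The helper name summarize_lake_formation is hypothetical, and which of these settings matters will depend on your account setup.

```python
def summarize_lake_formation(lf_client=None):
    """Return the data lake admins and default permissions for this account.

    summarize_lake_formation is a hypothetical helper for illustration.
    """
    if lf_client is None:
        # Imported lazily so the helper can also be exercised with a stub client.
        import boto3
        lf_client = boto3.client("lakeformation")

    settings = lf_client.get_data_lake_settings()["DataLakeSettings"]
    admins = [
        admin["DataLakePrincipalIdentifier"]
        for admin in settings.get("DataLakeAdmins", [])
    ]
    return {
        "admins": admins,
        "create_database_defaults": settings.get("CreateDatabaseDefaultPermissions", []),
        "create_table_defaults": settings.get("CreateTableDefaultPermissions", []),
    }
```

An empty admins list, or a job role with no permissions on the target database, would be a reasonable first thing to check before digging further into the stack trace.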