Hi, I'm building a Glue job that converts CSV files into partitioned Parquet files, and I want the ETL job to update the Data Catalog. I'm using the following code to do that:
    from awsglue.dynamicframe import DynamicFrame

    # final_data, glue_context, conf, file_type, target and
    # partition_cols are defined earlier in the job.
    dynamic_frame: DynamicFrame = DynamicFrame.fromDF(
        final_data, glue_context, f"{file_type}_dataset")

    sink = glue_context.getSink(
        connection_type="s3",
        enableUpdateCatalog=True,
        updateBehavior="UPDATE_IN_DATABASE",
        path=f"{target}",
        partitionKeys=partition_cols)
    sink.setFormat("glueparquet")
    sink.setCatalogInfo(
        catalogDatabase=conf.get_db_name(),
        catalogTableName=conf.get_table_name_by_source(file_type))
    sink.writeFrame(dynamic_frame)
As you can see, I convert the Spark DataFrame to a Glue DynamicFrame and write it out as Parquet.
The Parquet files are written to S3 as expected, but no table appears in the Data Catalog, and I get this error:
Exception: Problem processing file type cdr_cs because An error occurred while calling o477.pyWriteDynamicFrame.
: scala.MatchError: (null,false) (of class scala.Tuple2)
at com.amazonaws.services.glue.DataSink.forwardPotentialDynamicFrameToCatalog(DataSink.scala:177)
at com.amazonaws.services.glue.DataSink.forwardPotentialDynamicFrameToCatalog(DataSink.scala:135)
at com.amazonaws.services.glue.sinks.HadoopDataSink.$anonfun$writeDynamicFrame$2(HadoopDataSink.scala:302)
at com.amazonaws.services.glue.util.FileSchemeWrapper.$anonfun$executeWithQualifiedScheme$1(FileSchemeWrapper.scala:77)
at com.amazonaws.services.glue.util.FileSchemeWrapper.executeWith(FileSchemeWrapper.scala:70)
at com.amazonaws.services.glue.util.FileSchemeWrapper.executeWithQualifiedScheme(FileSchemeWrapper.scala:77)
at com.amazonaws.services.glue.sinks.HadoopDataSink.$anonfun$writeDynamicFrame$1(HadoopDataSink.scala:157)
at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:253)
at com.amazonaws.services.glue.sinks.HadoopDataSink.writeDynamicFrame(HadoopDataSink.scala:151)
at com.amazonaws.services.glue.DataSink.pyWriteDynamicFrame(DataSink.scala:64)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl
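For context, the snippet above runs inside an otherwise standard Glue job. This is a minimal sketch of the surrounding scaffolding, assuming the usual GlueContext/Job boilerplate; the CSV read and the transforms that produce `final_data`, `conf`, `file_type`, `target` and `partition_cols` are elided:

    import sys

    from awsglue.context import GlueContext
    from awsglue.job import Job
    from awsglue.utils import getResolvedOptions
    from pyspark.context import SparkContext

    # Standard Glue job initialization.
    args = getResolvedOptions(sys.argv, ["JOB_NAME"])
    spark_context = SparkContext()
    glue_context = GlueContext(spark_context)
    job = Job(glue_context)
    job.init(args["JOB_NAME"], args)

    # ... CSV read and transformations that produce final_data,
    # followed by the getSink/writeFrame snippet above ...

    job.commit()

This only runs inside the Glue runtime, so it is shown for completeness rather than as something reproducible locally.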