AWS Glue: input_file_name() returns an empty string when reading from the Data Catalog

We are using a crawler with a custom classifier to parse fixed-length files. As part of our requirement, we need to extract the input file name for each record. The input files are stored in an S3 folder.

S3 folder ---> Crawler (custom classifier) ---> Data Catalog <--- AWS Glue job (ETL) ---> Store into S3

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.sql.functions import input_file_name
from awsglue.dynamicframe import DynamicFrame

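# Resolve job arguments (args is used by job.init below)
args = getResolvedOptions(sys.argv, ['JOB_NAME'])

sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)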

datasource0 = glueContext.create_dynamic_frame.from_catalog(
    database="test-poc",
    table_name="test-raw",
    transformation_ctx="datasource0",
    groupFiles='none',
)

# Create a DataFrame and add a new column containing the source file name of every record

dataframe1 = datasource0.toDF().withColumn("filename", input_file_name())
dataframe1.show()

input_file_name() returns an empty string for every row.

Asked 5 months ago, 228 views
1 answer
input_file_name() is a Spark DataFrame function. You are creating a DynamicFrame and then converting it to a DataFrame, and I don't think the source files can be tracked once you do that.
Why not read the data as a DataFrame directly, using spark.table(), spark.sql(), or the GlueContext method that creates DataFrames?
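
As a rough illustration of the spark.table() suggestion, here is a minimal sketch. It assumes the job runs with the Glue Data Catalog enabled as the Spark metastore (the --enable-glue-datacatalog job parameter) and that Spark can read the table produced by the custom classifier, which depends on the table's SerDe and is not confirmed in this thread. The helper names qualified_name and read_with_filename and the extra "filepath" column are hypothetical, introduced only for this example.

```python
import re

# Matches the last path component of a URI such as s3://bucket/dir/file.txt.
BASENAME_PATTERN = r"[^/]+$"


def qualified_name(database, table):
    """Backtick-quote both parts, because names containing hyphens
    (such as test-poc) are not valid bare Spark SQL identifiers."""
    return f"`{database}`.`{table}`"


def read_with_filename(spark, database, table):
    """Read a catalog table as a Spark DataFrame and attach source file columns.

    Hypothetical helper: assumes the Glue Data Catalog is available to Spark.
    """
    # Imported here so the helpers above stay importable without Spark installed.
    from pyspark.sql.functions import input_file_name, regexp_extract

    df = spark.table(qualified_name(database, table))
    # Full URI of the file each row was read from.
    df = df.withColumn("filepath", input_file_name())
    # File name only: the last path component of that URI.
    return df.withColumn("filename", regexp_extract("filepath", BASENAME_PATTERN, 0))
```

In the job above, this would replace the DynamicFrame read, for example dataframe1 = read_with_filename(spark, "test-poc", "test-raw") followed by dataframe1.show(). If later steps need a DynamicFrame, the result can be converted back with DynamicFrame.fromDF(dataframe1, glueContext, "dataframe1").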

AWS Expert
Answered 5 months ago
