AWS Glue: input_file_name() returns an empty string when reading from the Data Catalog


We are using a crawler with a custom classifier to parse a fixed-length file. As part of our requirement, we need to extract the input file name. The input files are stored in an S3 folder.

S3 Folder ----> Crawler (custom classifier) ----> Data Catalog <---- AWS Glue job (ETL) ----> Store into S3

import sys

from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.sql.functions import input_file_name
from awsglue.dynamicframe import DynamicFrame

# Resolve the job arguments passed in by Glue
args = getResolvedOptions(sys.argv, ['JOB_NAME'])

sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

datasource0 = glueContext.create_dynamic_frame.from_catalog(
    database="test-poc",
    table_name="test-raw",
    transformation_ctx="datasource0",
    groupFiles='none',
)

Create a DataFrame and add a new column containing the source file name of every record:

dataframe1 = datasource0.toDF().withColumn("filename", input_file_name())
dataframe1.show()

input_file_name() returns an empty string for every row.

asked 4 months ago, 190 views
1 Answer

input_file_name() is a DataFrame feature. You are creating a DynamicFrame and then converting it, and I don't think Spark can track the source files if you do that.
Why not read the table as a DataFrame directly, using spark.table(), spark.sql(), or the GlueContext method for creating DataFrames?
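
As a rough illustration of the suggestion above, here is a minimal sketch that reads the catalog table straight into a DataFrame with spark.table() and only then adds the file name column. It assumes the job is configured to use the Glue Data Catalog as Spark's metastore (the --enable-glue-datacatalog job parameter) and that a table built by a custom classifier is readable this way, which has not been verified here. The helper names qualified_table_name and read_with_file_name are hypothetical, not part of any Glue or Spark API.

```python
def qualified_table_name(database, table):
    """Backtick-quote both parts so names containing hyphens parse in Spark SQL."""
    def quote(name):
        # A literal backtick inside a quoted identifier is escaped by doubling it.
        return "`" + name.replace("`", "``") + "`"

    return quote(database) + "." + quote(table)


def read_with_file_name(spark, database, table):
    """Read a catalog table as a DataFrame and add a 'filename' column.

    Assumes `spark` is a SparkSession that can resolve Glue Data Catalog tables.
    """
    # Imported here so the quoting helper above works without PySpark installed.
    from pyspark.sql.functions import input_file_name

    df = spark.table(qualified_table_name(database, table))
    # No DynamicFrame conversion happens before this call.
    return df.withColumn("filename", input_file_name())


# Inside the Glue job from the question, usage might look like:
#   dataframe1 = read_with_file_name(spark, "test-poc", "test-raw")
#   dataframe1.show(truncate=False)
```

The backtick quoting matters for this particular job because both test-poc and test-raw contain hyphens, which Spark SQL would otherwise fail to parse as identifiers.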

AWS EXPERT
answered 4 months ago
