Adding the filename as a column in the output

0

Does anyone know of a way to add the source filename as a column in a Glue project? We just started using it, and have it crawling for files in S3, applying some simple transformations, and then writing to a clean file in S3. In our data set we would like to know the source file that each row of data originated from, so it would be ideal if we could somehow add in the filename as a column or something like that. I looked through the documentation and the aws-glue-libs source, but didn't see anything.

I'm still learning Glue, so apologies if I'm using the wrong terminology.

mwatson
asked 6 years ago, 2818 views
2 Answers
0

You could create folders within your data path; the Glue crawler creates a partition in the schema for each folder. If you name the folders in key=value form, the partition column is named key.

For instance, if your files are named source_a..., source_b..., you would create the following layout:

s3://your-bucket/data/source=A/source_a_...csv
s3://your-bucket/data/source=A/source_a_...csv
s3://your-bucket/data/source=B/source_b_...csv
s3://your-bucket/data/source=B/source_b_...csv

The schema will then have a source column populated from the source partition.
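As a rough sketch of that layout, the helper below derives a key=value folder from a file name and reads the partition values back out of a key. The function names, the "data" prefix, and the example file name source_a_2020.csv are all hypothetical, and the regular expression assumes names that start with source_ followed by an identifier. Note that this approach records a source group per folder rather than the exact file name.

```python
import re


def partitioned_key(filename, data_prefix="data"):
    """Build a key like data/source=A/<filename> from a 'source_a...' file name.

    Assumption: the file name starts with 'source_' followed by an
    alphanumeric identifier (hypothetical naming scheme).
    """
    match = re.match(r"source_([A-Za-z0-9]+)", filename)
    if match is None:
        raise ValueError(f"cannot derive a source from {filename!r}")
    return f"{data_prefix}/source={match.group(1).upper()}/{filename}"


def partition_values(key):
    """Return the key=value path segments of a key as a dict."""
    return dict(seg.split("=", 1) for seg in key.split("/") if "=" in seg)


key = partitioned_key("source_a_2020.csv")
print(key)
print(partition_values(key))
```

A crawler run over folders laid out this way would be expected to expose source as a partition column, as described above.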

answered 4 years ago
0

Apache Spark provides the input_file_name() function.
You can add a new column based on it. See the sample below.

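from pyspark.sql.functions import input_file_name
# Each row gets the full path of the file it was read from.
df = spark.read.parquet('path_to_file')\
        .withColumn('filepath', input_file_name())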

See details here: https://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=functions#pyspark.sql.functions.input_file_name
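input_file_name() yields the full path of the source file, so if only the bare file name is wanted, the path needs trimming. A minimal plain-Python sketch of that trimming is below; the helper name file_name_from_path and the example path are hypothetical, and how you apply it to the column (for example through a UDF or Spark's own string functions) is left open.

```python
import posixpath
from urllib.parse import urlparse


def file_name_from_path(full_path):
    """Return the last path component of a URI or plain path.

    Assumption: full_path looks like 's3://bucket/prefix/file.parquet'
    or a plain '/dir/file.parquet' path.
    """
    return posixpath.basename(urlparse(full_path).path)


print(file_name_from_path("s3://your-bucket/data/source=A/part-0000.parquet"))
```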

AWS
answered 4 years ago
