Adding the filename as a column in the output


Does anyone know of a way to add the source filename as a column in a Glue project? We just started using it, and have it crawling for files in S3, applying some simple transformations, and then writing to a clean file in S3. In our data set we would like to know the source file that each row of data originated from, so it would be ideal if we could somehow add in the filename as a column or something like that. I looked through the documentation and the aws-glue-libs source, but didn't see anything.

I'm still learning Glue, so apologies if I'm using the wrong terminology.

mwatson
Asked 6 years ago · Viewed 2,818 times
2 Answers

You could create folders within your data path, and the Glue crawler will create a partition in the schema for every folder. If you name the folders like key=value, the partition column will be named key.

For instance, if your filenames start with source_a..., source_b...,
you could create the following layout:

s3://your-bucket/data/source=A/source_a_...csv
s3://your-bucket/data/source=A/source_a_...csv
s3://your-bucket/data/source=B/source_b_...csv
s3://your-bucket/data/source=B/source_b_...csv

The schema will then have a source column populated from the partition, so each row carries the folder it came from.
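To illustrate the key=value convention the crawler follows, here is a minimal plain-Python sketch; parse_partitions is a hypothetical helper written for this example, not part of any AWS library:

```python
# Sketch of the Hive-style key=value partition convention that the
# Glue crawler recognizes. parse_partitions is a hypothetical helper.
def parse_partitions(s3_key):
    """Extract {column: value} pairs from key=value path segments."""
    partitions = {}
    # Walk every directory segment, skipping the filename itself.
    for segment in s3_key.split('/')[:-1]:
        if '=' in segment:
            name, _, value = segment.partition('=')
            partitions[name] = value
    return partitions

print(parse_partitions('data/source=A/source_a_2020.csv'))
# {'source': 'A'}
```

Any folder segment that does not follow the key=value pattern is simply ignored, which matches how the crawler treats plain folder names.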

Answered 4 years ago

Apache Spark has an input_file_name() function that returns the file a row was read from.
You can add a new column based on it. See the sample below.

from pyspark.sql.functions import input_file_name

df = spark.read.parquet('path_to_file') \
        .withColumn('filepath', input_file_name())

See details here: https://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=functions#pyspark.sql.functions.input_file_name
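Note that input_file_name() returns the full object URI (e.g. s3://bucket/prefix/file.parquet). If only the bare filename is wanted, the column can be trimmed afterwards; in Spark you could do this with regexp_extract or a UDF. A minimal plain-Python sketch of that trimming step, under the assumption that keys use POSIX-style '/' separators:

```python
import posixpath

# input_file_name() yields the full object URI; basename() on the
# POSIX-style path keeps only the final path component.
def filename_only(uri):
    return posixpath.basename(uri)

print(filename_only('s3://your-bucket/data/source_a_2020.csv'))
# source_a_2020.csv
```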

AWS
Answered 4 years ago
