You can create folders within your data path, and the Glue crawler will create a partition in the schema for every folder. If you name the folders key=value, the partition column will be named key.
For instance, if your files are named source_a..., source_b..., you should create the following layout:

s3://your-bucket/data/source=A/source_a_...csv
s3://your-bucket/data/source=B/source_b_...csv

so the schema will have a source column for the partition.
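To illustrate how a Hive-style key=value prefix encodes the partition value the crawler picks up, here is a minimal sketch in plain Python; `partition_values` is a hypothetical helper, not part of any AWS SDK:

```python
import re

def partition_values(path):
    """Extract Hive-style key=value partition segments from an S3 path.

    A crawler reading this layout would expose each key as a partition
    column with the matching value for every file under that prefix.
    """
    return dict(re.findall(r'([^/=]+)=([^/]+)/', path))

# The 'source' partition column would be 'A' for this object:
print(partition_values('s3://your-bucket/data/source=A/source_a_2024.csv'))
# → {'source': 'A'}
```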
Alternatively, Apache Spark provides the input_file_name() function, which returns the full path of the file each row was read from. You can add a new column based on it. See the sample below.

from pyspark.sql.functions import input_file_name

df = spark.read.parquet('path_to_file') \
    .withColumn('filepath', input_file_name())
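Once a filepath column exists, a regular expression can pull a source token out of the file name (in Spark you would do this with pyspark.sql.functions.regexp_extract on the filepath column). Below is a plain-Python sketch of the same pattern, assuming filenames of the form source_a_...csv; `source_from_path` is a hypothetical helper:

```python
import re

def source_from_path(filepath):
    # Take the file name after the last '/' and pull out the
    # 'source_<letter>' token; return '' on no match, mirroring
    # regexp_extract's behaviour when the pattern does not match.
    name = filepath.rsplit('/', 1)[-1]
    m = re.search(r'(source_[a-z])_', name)
    return m.group(1) if m else ''

print(source_from_path('s3://your-bucket/data/source_a_2024-01.csv'))
# → source_a
```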