2 Answers
You could create folders within your data path, and the Glue crawler will create a partition in the schema for every folder. If you name the folders like key=value, the partition column will be named key.
For instance, if your files are named source_a..., source_b...,
you could create the following layout:
s3://your-bucket/data/source=A/source_a_...csv
s3://your-bucket/data/source=A/source_a_...csv
s3://your-bucket/data/source=B/source_b_...csv
s3://your-bucket/data/source=B/source_b_...csv
The resulting table schema will then include a source partition column.
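As a quick illustration of the key=value convention (plain Python, not part of Glue; the path is made up), this is how each partition segment in an S3 path maps to a column name and value:

```python
def partition_values(path):
    """Extract key=value partition segments from an S3-style path."""
    parts = {}
    for segment in path.split("/"):
        if "=" in segment:
            key, _, value = segment.partition("=")
            parts[key] = value
    return parts

print(partition_values("s3://your-bucket/data/source=A/source_a_file.csv"))
# {'source': 'A'}
```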
answered 4 years ago
Apache Spark has an input_file_name() function, which returns the full path of the file each row was read from.
You can add a new column based on it. See the sample below.
from pyspark.sql.functions import input_file_name

df = spark.read.parquet('path_to_file') \
    .withColumn('filepath', input_file_name())
See details here: https://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=functions#pyspark.sql.functions.input_file_name
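Once you have the full path in a column, you typically want just a source label rather than the whole path. As a sketch (plain Python with an invented helper and a made-up path; in Spark you would do the same with regexp_extract on the filepath column), extracting the label from a filename like source_a_...csv could look like:

```python
import os
import re

def source_label(filepath):
    """Derive the source label from a filename such as 'source_a_2020.csv'.

    Mirrors what a regexp_extract over input_file_name() output would do
    in Spark. Returns None if the filename does not match the pattern.
    """
    m = re.match(r"source_([a-z]+)", os.path.basename(filepath))
    return m.group(1) if m else None

print(source_label("s3://your-bucket/data/source_a_2020.csv"))
# a
```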
answered 4 years ago