2 Answers
You can create folders within your data path, and the Glue crawler will create a partition in the schema for every folder. If you name the folders like key=value, the partition column will be named key.
For instance, if your files are named source_a..., source_b..., you could create the following layout:
s3://your-bucket/data/source=A/source_a_...csv
s3://your-bucket/data/source=A/source_a_...csv
s3://your-bucket/data/source=B/source_b_...csv
s3://your-bucket/data/source=B/source_b_...csv
The crawled schema will then include a source partition column.
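The folder naming above can be generated programmatically. Here is a minimal sketch of a hypothetical helper (the function name and the `source_x_` filename pattern are assumptions based on the example above) that maps a filename to its Hive-style partitioned S3 key:

```python
import re

def partitioned_key(filename, prefix="data"):
    """Map a filename like 'source_a_2020.csv' to a Hive-style
    partitioned key such as 'data/source=A/source_a_2020.csv'.
    Hypothetical helper; adjust the regex to your naming scheme."""
    m = re.match(r"source_([a-z])_", filename)
    if not m:
        raise ValueError(f"cannot derive partition from {filename!r}")
    # key=value prefix becomes the partition column in the Glue schema
    return f"{prefix}/source={m.group(1).upper()}/{filename}"

print(partitioned_key("source_a_2020.csv"))  # data/source=A/source_a_2020.csv
```

You would use the returned key as the object key when uploading each file to the bucket, so the crawler sees one folder per source value.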
answered 4 years ago
Apache Spark provides the input_file_name() function. You can use it to add a column containing each row's source file. See the sample below.
from pyspark.sql.functions import input_file_name

df = spark.read.parquet('path_to_file') \
    .withColumn('filepath', input_file_name())
See details here: https://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=functions#pyspark.sql.functions.input_file_name
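Once the filepath column exists, you can parse the source tag out of it. Since this page's first answer uses filenames like source_a..., here is a plain-Python sketch of that extraction (the path value is a hypothetical example of what input_file_name() might return; in Spark you would apply the same pattern per row with pyspark.sql.functions.regexp_extract):

```python
import re

# Hypothetical example of a value returned by input_file_name()
path = "s3://your-bucket/data/source_a_2020.csv"

# Pull the source tag ('a', 'b', ...) out of the filename.
# Spark equivalent (assumption, not from the original answer):
#   df.withColumn('source', regexp_extract('filepath', r'source_([a-z])_', 1))
source = re.search(r"source_([a-z])_", path).group(1)
print(source)  # a
```

This gives you the same per-file source column as the partition-folder approach, without reorganizing the data in S3.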
answered 4 years ago