2 Answers
You can create folders within your data path, and the Glue crawler will create a partition in the schema for every folder. If you name the folders key=value, the partition column will be named key.
For instance, if your files are named source_a..., source_b...,
you should create the following layout:
s3://your-bucket/data/source=A/source_a_...csv
s3://your-bucket/data/source=A/source_a_...csv
s3://your-bucket/data/source=B/source_b_...csv
s3://your-bucket/data/source=B/source_b_...csv
The schema will then include a source column for the partition.
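As a standalone sketch of how the Hive-style key=value path convention maps to partition columns (plain Python, no Glue required; the file names are hypothetical, following the layout above):

```python
import re

def partition_values(s3_key):
    """Extract Hive-style key=value partition segments from an S3 key.

    Each path segment of the form key=value (as created in the layout
    above) becomes one partition column in the crawled schema.
    """
    return dict(re.findall(r"([^/=]+)=([^/]+)(?=/)", s3_key))

# Hypothetical keys following the layout above
print(partition_values("data/source=A/source_a_2020.csv"))  # {'source': 'A'}
print(partition_values("data/source=B/source_b_2020.csv"))  # {'source': 'B'}
```

This is only an illustration of the naming convention; Glue itself performs this mapping when the crawler runs over the bucket.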
answered 4 years ago
Apache Spark has an input_file_name() function.
You can add a new column based on it. See the sample below.
from pyspark.sql.functions import input_file_name

df = spark.read.parquet('path_to_file') \
    .withColumn('filepath', input_file_name())
See details here: https://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=functions#pyspark.sql.functions.input_file_name
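If you only want the source identifier rather than the full path, the filepath column can be post-processed with a regular expression (in Spark this would be passed to regexp_extract). A minimal sketch of that pattern in plain Python, assuming the file-name convention source_a..., source_b... from the question; the pattern and helper name are illustrative, not from the original answer:

```python
import re

# Pattern that could be handed to Spark's regexp_extract to pull the
# source prefix (e.g. "source_a") out of the full file path.
SOURCE_PATTERN = r"(source_[a-z]+)_"

def extract_source(filepath):
    """Return the source prefix embedded in the file name, or None."""
    m = re.search(SOURCE_PATTERN, filepath)
    return m.group(1) if m else None

print(extract_source("s3://your-bucket/data/source_a_2020.csv"))  # source_a
```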
answered 4 years ago