How to write _SUCCESS files per partition instead of in the top-level directory in AWS Glue?
Hello,
I have a PySpark application that adds partitions (dynamic overwriting) to an AWS Glue table using the insertInto method. When the task completes, a global _SUCCESS file in the top-level directory in S3 is updated with the timestamp.
My desired behaviour would be to have _SUCCESS files with a timestamp inside each updated partition instead of in the top-level directory.
Is this possible?
Best,
N
Generally, the _SUCCESS marker is written once per job, not once per partition.
There are two options I can think of:
- Write a custom committer that records the partitions being written to, updates an accumulator, and then has the driver create those files. This could be complex and error-prone.
- Write the files directly to the partition directory (path/to/table/partition_key1=foo/partition_key2=bar) without telling the writer that the output is partitioned, so that the job's output directory, where the marker is written, is the partition itself.
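A simpler driver-side variation on these options is to put the marker objects into S3 yourself once the write has returned. The following is only a minimal sketch under stated assumptions: the helper names `partition_prefix` and `write_success_markers` are hypothetical, the S3 client is passed in (for example `boto3.client("s3")`), and the caller already knows which partitions were written.

```python
from datetime import datetime, timezone


def partition_prefix(table_prefix, partition):
    """Build a Hive-style prefix, e.g. path/to/table/partition_key1=foo/partition_key2=bar/.

    `partition` is a dict of partition column -> value, in partition-column order.
    """
    parts = [f"{key}={value}" for key, value in partition.items()]
    return "/".join([table_prefix.strip("/"), *parts]) + "/"


def write_success_markers(s3_client, bucket, table_prefix, partitions):
    """Put one _SUCCESS object, holding a UTC timestamp, into each partition prefix.

    `s3_client` is assumed to behave like boto3.client("s3"); only put_object is used.
    Returns the list of object keys that were written.
    """
    timestamp = datetime.now(timezone.utc).isoformat()
    keys = []
    for partition in partitions:
        key = partition_prefix(table_prefix, partition) + "_SUCCESS"
        s3_client.put_object(Bucket=bucket, Key=key, Body=timestamp.encode("utf-8"))
        keys.append(key)
    return keys
```

On the driver, after insertInto has completed, the partition list could be derived from the DataFrame that was written, for example with `[row.asDict() for row in df.select("partition_key1", "partition_key2").distinct().collect()]`. Note that this does not make the markers atomic with the data commit; it only records that the driver reached this point.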
A generally better option is to use a persistent metadata store (such as the Glue Data Catalog) and update the partition metadata only after the write is confirmed complete.
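As a rough illustration of the catalog approach, the sketch below registers a partition through the Glue API after the data has been written. It assumes a client that behaves like `boto3.client("glue")`; the function name `register_partition` and its arguments are hypothetical, and the partition inherits the table's storage descriptor with only the location changed.

```python
import copy


def register_partition(glue_client, database, table, values, location):
    """Add one partition to a Glue Data Catalog table after its data is written.

    `values` are the partition values in partition-key order; `location` is the
    S3 path of the partition. The table's StorageDescriptor is copied so the
    partition uses the same format and SerDe settings.
    """
    table_def = glue_client.get_table(DatabaseName=database, Name=table)["Table"]
    descriptor = copy.deepcopy(table_def["StorageDescriptor"])
    descriptor["Location"] = location
    partition_input = {
        "Values": [str(value) for value in values],
        "StorageDescriptor": descriptor,
    }
    glue_client.create_partition(
        DatabaseName=database,
        TableName=table,
        PartitionInput=partition_input,
    )
    return partition_input
```

`create_partition` fails with an AlreadyExistsException when the partition is already in the catalog, so a job that overwrites existing partitions would need to handle that case, for instance by updating the existing partition instead.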
Once the partition metadata is updated, readers can use predicate pushdown on the partition columns. The predicate can be any SQL expression or user-defined function, as long as it uses only the partition columns for filtering. Remember that the predicate is applied to the metadata stored in the catalog, so you don’t have access to other fields in the schema.
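For illustration, a reader in a Glue job could select only the completed partitions as sketched below. The helper names `partition_predicate` and `read_partitions` are hypothetical; `glue_context` is assumed to be a GlueContext from the Glue job environment, and the sketch assumes string-typed partition columns and does no escaping of the values.

```python
def partition_predicate(filters):
    """Build a SQL predicate over partition columns only.

    `filters` maps partition column -> required value. Values are quoted as
    strings, which assumes string-typed partition columns.
    """
    return " and ".join(f"{col} == '{val}'" for col, val in filters.items())


def read_partitions(glue_context, database, table, filters):
    """Read only the matching partitions of a catalog table into a DynamicFrame.

    The predicate is evaluated against partition metadata in the catalog,
    so non-matching partitions are never listed or read from S3.
    """
    return glue_context.create_dynamic_frame.from_catalog(
        database=database,
        table_name=table,
        push_down_predicate=partition_predicate(filters),
    )
```

Because a partition only appears in the catalog after the writer registers it, a reader using this pattern never sees partitions that are still being written.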