AWS Glue uses Apache Spark under the hood, and Spark relies on parallel processing to speed up data processing. When handling a large dataset, Spark automatically splits it into multiple partitions and writes each partition as a separate output file to improve performance. So it is expected that a Glue job produces multiple output files for large datasets. You can use Spark's coalesce() or repartition() operations to reduce the number of partitions before writing, which in turn reduces the number of output files.
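For example, at the Spark DataFrame level the two operations look like this. This is a minimal sketch, not a complete job: it assumes a DataFrame named `df` that your job has already loaded, and the bucket path and partition count are placeholders.

```python
# Minimal sketch (assumes `df` is a Spark DataFrame already loaded by the job).
# coalesce() merges existing partitions without a shuffle, so it is cheaper;
# repartition() performs a full shuffle and can also increase the partition count.
fewer_partitions_df = df.coalesce(8)      # at most 8 output files, no shuffle
rebalanced_df = df.repartition(8)         # exactly 8 roughly even partitions, with a shuffle

# Writing the coalesced DataFrame produces at most 8 files under the output path.
fewer_partitions_df.write.mode("overwrite").parquet("s3://your-bucket/output-path/")
```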
If you're using a Glue ETL job, you can call coalesce() on the DynamicFrame to reduce the number of partitions to 1 before writing the output:
```python
# Assume you have a dynamic_frame after reading data from BigQuery

# Coalesce to a single partition
single_partition_dyf = dynamic_frame.coalesce(1)

# Write to S3
glueContext.write_dynamic_frame.from_options(
    frame=single_partition_dyf,
    connection_type="s3",
    connection_options={"path": "s3://your-bucket/output-path/"},
    format="glueparquet",
    format_options={"compression": "snappy"}
)
```
The approach above will create a single file, but keep in mind that for very large datasets this might not be efficient, or even possible, because coalescing to one partition forces all the data through a single task and its memory.
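If a single file is too large to produce safely, a common compromise is to repartition to a small, fixed number of partitions so the job writes a handful of reasonably sized files instead of one. The sketch below assumes the same `dynamic_frame` and `glueContext` as above; the partition count of 10 is an arbitrary illustrative value, not a recommendation.

```python
from awsglue.dynamicframe import DynamicFrame

# Convert to a Spark DataFrame, repartition to a small number of partitions,
# then convert back to a DynamicFrame before writing.
df = dynamic_frame.toDF()
repartitioned_df = df.repartition(10)  # 10 is an example value; tune for your data volume

repartitioned_dyf = DynamicFrame.fromDF(repartitioned_df, glueContext, "repartitioned_dyf")

# Writes roughly 10 output files instead of one large file.
glueContext.write_dynamic_frame.from_options(
    frame=repartitioned_dyf,
    connection_type="s3",
    connection_options={"path": "s3://your-bucket/output-path/"},
    format="glueparquet",
    format_options={"compression": "snappy"}
)
```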