AWS Glue, BQ export to S3


Hi, I'm trying to import data from a BigQuery table into an S3 bucket in CSV format. When I do, the output gets split into 15-25 files, and I'm wondering if there is a way to write it out as a single CSV file, as that would be much more manageable. It's a very small dataset, under 10 MB.

asked 8 months ago · 126 views
1 Answer

Glue uses Apache Spark under the hood, which relies on parallel processing to speed things up. Spark automatically splits the data into multiple partitions and writes each partition out as a separate file to improve performance, so it is normal for a Glue job to produce multiple output files even for small datasets. You can use Spark's coalesce() or repartition() operations to reduce the number of partitions before writing the data, which in turn reduces the number of output files.

If you're using a Glue ETL job, you can use the coalesce() method to reduce the number of partitions to 1 before writing the output.

# Assume you have a DynamicFrame named dynamic_frame after reading data from BigQuery
# Coalesce to a single partition
single_partition_dyf = dynamic_frame.coalesce(1)

# Write to S3 as CSV (the question asks for CSV output)
glueContext.write_dynamic_frame.from_options(
    frame=single_partition_dyf,
    connection_type="s3",
    connection_options={"path": "s3://your-bucket/output-path/"},
    format="csv",
    format_options={"writeHeader": True}
)

The above approach will create a single file. Keep in mind that coalescing to one partition funnels all the data through a single executor, so for very large datasets this might not be efficient or even possible due to memory constraints; for a dataset under 10 MB it is not a concern.
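If you prefer the Spark DataFrame API, you can get the same result by converting the DynamicFrame to a DataFrame first. This is a minimal sketch assuming the same dynamic_frame variable and a placeholder bucket path:

# Convert the DynamicFrame to a Spark DataFrame, collapse it to a single
# partition, and write one CSV file under the given S3 prefix
df = dynamic_frame.toDF()

(df.coalesce(1)
   .write
   .mode("overwrite")          # replace any previous output at this path
   .option("header", "true")   # include a header row
   .csv("s3://your-bucket/output-path/"))

Note that with either approach Spark still writes the file under the prefix with a generated name such as part-00000-<id>.csv, so if you need an exact object key you would have to rename (copy) it afterwards, for example with boto3 or the AWS CLI.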

answered 8 months ago
