RDS Export to S3 produces too many parquet files.


I am using the "RDS export snapshot to S3" feature to extract some tables from our Postgres RDS cluster.

The largest table in the export has

  • 12 billion rows
  • 35 columns
  • 3TB in Postgres (not including indexes)

When exported this table shows up in S3 broken up into about 360,000 parquet files, and just under 1TB in storage size.
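For a sense of scale, the figures above can be turned into a rough per-file average. This is only a back-of-envelope sketch: it assumes 1 TB means 10**12 bytes and rounds "just under 1TB" and "about 360,000" to exact values, so the results are approximate.

```python
# Rough averages derived from the numbers quoted in the question.
# Assumption: 1 TB = 10**12 bytes; the post only says "just under 1TB".
TOTAL_BYTES = 10**12
TOTAL_ROWS = 12_000_000_000
FILE_COUNT = 360_000


def average_file_size_mb(total_bytes: int, file_count: int) -> float:
    """Mean object size in megabytes (1 MB = 10**6 bytes)."""
    return total_bytes / file_count / 10**6


def average_rows_per_file(total_rows: int, file_count: int) -> int:
    """Mean number of rows per file, rounded down."""
    return total_rows // file_count


if __name__ == "__main__":
    size_mb = average_file_size_mb(TOTAL_BYTES, FILE_COUNT)
    rows = average_rows_per_file(TOTAL_ROWS, FILE_COUNT)
    print(f"about {size_mb:.2f} MB and {rows} rows per file")
```

Under those assumptions each parquet file averages under 3 MB, which is small for files that will be queried with Athena.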

The fact that the export generates so many files is a huge problem. Beyond the obvious inefficiency and massive overhead of processing so many objects (https://docs.aws.amazon.com/athena/latest/ug/performance-tuning.html#performance-tuning-avoid-having-too-many-files), writing so many new files also throws off alarms regarding our S3 activity.

To me this is a flaw in the S3 Export implementation, but until the root cause is addressed, is there anything I can do in how I configure the export, or even the Postgres table itself to avoid creating so many tiny files?

Many thanks

1 Answer
Accepted Answer

Hello,

I have checked internally and, unfortunately, there is no way to control the file size. There is currently no parameter or setting that can be changed to produce larger files and avoid creating too many small parquet files.

The export automation allocates file sizes automatically in order to speed up the process, and this cannot currently be customized. Please be aware, however, that this behavior is under review by our internal teams. I would advise checking our What's New page [1] and Blog [2], where we announce new features and updates as they are released.

To reduce the number of files exported, you may consider using the "Partial" [3] option to export only the required databases, schemas, or tables. With a partial export, you move only the data needed for your analysis rather than the entire database.
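As an illustration of a partial export, here is a minimal sketch using the boto3 `start_export_task` call and its `ExportOnly` parameter. The helper names `build_export_request` and `start_partial_export` are hypothetical, as are all identifiers, ARNs, and the table name shown in the usage note; substitute your own values. Note that this only limits which tables are exported; it does not change the size of the files written.

```python
from typing import Any, Dict, List


def build_export_request(
    task_id: str,
    snapshot_arn: str,
    bucket: str,
    role_arn: str,
    kms_key_id: str,
    tables: List[str],
    prefix: str = "",
) -> Dict[str, Any]:
    """Assemble keyword arguments for rds.start_export_task.

    For PostgreSQL, ExportOnly entries take forms such as
    "database.schema.table".
    """
    params: Dict[str, Any] = {
        "ExportTaskIdentifier": task_id,
        "SourceArn": snapshot_arn,
        "S3BucketName": bucket,
        "IamRoleArn": role_arn,
        "KmsKeyId": kms_key_id,
        "ExportOnly": list(tables),
    }
    if prefix:
        params["S3Prefix"] = prefix
    return params


def start_partial_export(params: Dict[str, Any]) -> Dict[str, Any]:
    """Submit the export task. Requires boto3 and AWS credentials."""
    import boto3  # imported lazily so the builder works without AWS access

    return boto3.client("rds").start_export_task(**params)
```

A call might look like `start_partial_export(build_export_request("my-export", snapshot_arn, "my-bucket", role_arn, kms_key_id, ["mydb.public.events"]))`, where every argument is a placeholder.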

On behalf of AWS, I would like to apologize for any inconvenience caused by this.

References:

[1] https://aws.amazon.com/new/

[2] https://aws.amazon.com/blogs/aws/

[3] https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/USER_ExportSnapshot.html#USER_ExportSnapshot.Exporting

AWS
answered 7 months ago
EXPERT
reviewed a month ago
  • Thank you for the answer. I am in fact using the partial export filter. Exporting the entire database ran into even bigger problems. We have some partitioned tables in our DB and the S3 Export gets very confused by these and does a full export at every level of the hierarchy. It was creating literally millions of S3 objects.
