RDS Export to S3 produces too many parquet files.


I am using the "RDS export snapshot to S3" feature to extract some tables from our Postgres RDS cluster.

The largest table in the export has

  • 12 billion rows
  • 35 columns
  • 3TB in Postgres (not including indexes)

When exported, this table shows up in S3 broken up into about 360,000 parquet files, totaling just under 1TB of storage.
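For a sense of scale, here is a rough back-of-the-envelope calculation from the figures above. It assumes decimal units (1 TB = 10^12 bytes) and rounds "just under 1TB" up to exactly 1 TB, so the results are approximate upper bounds, not measured values.

```python
# Rough averages derived from the figures quoted in the question.
# Assumption: decimal units, and "just under 1TB" rounded up to 1 TB.
TOTAL_BYTES = 1 * 10**12   # exported size in S3
FILE_COUNT = 360_000       # approximate number of parquet files
ROW_COUNT = 12 * 10**9     # 12 billion rows


def average_file_size_mb(total_bytes: int, file_count: int) -> float:
    """Average object size in megabytes (10**6 bytes)."""
    return total_bytes / file_count / 10**6


def average_rows_per_file(row_count: int, file_count: int) -> float:
    """Average number of rows stored in each file."""
    return row_count / file_count


avg_mb = average_file_size_mb(TOTAL_BYTES, FILE_COUNT)
avg_rows = average_rows_per_file(ROW_COUNT, FILE_COUNT)
print(f"{avg_mb:.1f} MB per file, about {avg_rows:,.0f} rows per file")
```

That works out to under 3 MB and roughly 33,000 rows per file on average.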

The fact that the export generates so many files is a huge problem. Beyond the obvious inefficiency and massive overhead of processing so many objects (https://docs.aws.amazon.com/athena/latest/ug/performance-tuning.html#performance-tuning-avoid-having-too-many-files), writing so many new files also sets off alarms on our S3 activity.

To me this is a flaw in the S3 Export implementation, but until the root cause is addressed, is there anything I can do in how I configure the export, or even the Postgres table itself to avoid creating so many tiny files?

Many thanks

bakur
asked 7 months ago, viewed 611 times
1 Answer
Accepted Answer

Hello,

I have checked internally and, unfortunately, there is no way to control the file size. There is currently no parameter or setting that can be changed to produce larger files and avoid creating too many small parquet files.

The reason is that the file size is allocated automatically by the export automation in order to speed up the process, and there is currently no way to customize this. However, please be aware that this behavior is under review by our internal teams. I would advise checking our What's New page [1] and Blog [2], where we announce new features and updates as they are released.

To reduce the number of files being exported, you may consider using the "Partial" [3] option to export only the required databases, tables, etc. By selecting "Partial", you move only the data that is necessary for your analysis rather than the entire database.
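As an illustration of the "Partial" option, here is a minimal sketch using boto3's `start_export_task` with its `ExportOnly` parameter. The helper names `build_export_request` and `start_partial_export` are hypothetical, and every ARN, bucket, and table name below is a placeholder rather than a real resource. For PostgreSQL, `ExportOnly` entries are assumed to take the `database.schema.table` form. Note that this only limits which tables are exported; it does not change the size of the files produced.

```python
from typing import Any, Dict, List


def build_export_request(
    task_id: str,
    snapshot_arn: str,
    bucket: str,
    iam_role_arn: str,
    kms_key_id: str,
    tables: List[str],
    prefix: str = "",
) -> Dict[str, Any]:
    """Build keyword arguments for a partial snapshot export (hypothetical helper)."""
    if not tables:
        raise ValueError("tables must not be empty for a partial export")
    request: Dict[str, Any] = {
        "ExportTaskIdentifier": task_id,
        "SourceArn": snapshot_arn,
        "S3BucketName": bucket,
        "IamRoleArn": iam_role_arn,
        "KmsKeyId": kms_key_id,
        # Restrict the export to the listed tables only.
        "ExportOnly": list(tables),
    }
    if prefix:
        request["S3Prefix"] = prefix
    return request


def start_partial_export(**kwargs: Any) -> Dict[str, Any]:
    """Submit the export task. Requires boto3 and AWS credentials."""
    import boto3  # imported lazily so the builder can be used without boto3

    rds = boto3.client("rds")
    return rds.start_export_task(**build_export_request(**kwargs))


# Placeholder values for illustration only.
example = build_export_request(
    task_id="example-partial-export",
    snapshot_arn="arn:aws:rds:us-east-1:123456789012:snapshot:example-snapshot",
    bucket="example-export-bucket",
    iam_role_arn="arn:aws:iam::123456789012:role/example-export-role",
    kms_key_id="arn:aws:kms:us-east-1:123456789012:key/example-key-id",
    tables=["exampledb.public.large_table"],
)
print(example["ExportOnly"])
```

Calling `start_partial_export` with the same keyword arguments would submit the task; it is not invoked here because it needs real credentials and resources.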

On behalf of AWS, I would like to apologize for any inconvenience caused by this.

References:

[1] https://aws.amazon.com/new/

[2] https://aws.amazon.com/blogs/aws/

[3] https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/USER_ExportSnapshot.html#USER_ExportSnapshot.Exporting

AWS
answered 7 months ago
EXPERT
reviewed 2 months ago
  • Thank you for the answer. I am in fact using the partial export filter. Exporting the entire database ran into even bigger problems. We have some partitioned tables in our DB and the S3 Export gets very confused by these and does a full export at every level of the hierarchy. It was creating literally millions of S3 objects.
