RDS Export to S3 produces too many parquet files.


I am using the "RDS export snapshot to S3" feature to extract some tables from our Postgres RDS cluster.

The largest table in the export has

  • 12 billion rows
  • 35 columns
  • 3TB in Postgres (not including indexes)

When exported, this table shows up in S3 broken up into about 360,000 Parquet files, totaling just under 1TB of storage.
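As a rough back-of-the-envelope check on the figures above, the average file comes out at only a few megabytes. The sketch below assumes "just under 1TB" is about 10^12 bytes (an upper bound) and that rows are spread evenly across files, which the export does not guarantee.

```python
# Figures quoted in the question; TOTAL_BYTES is an assumed upper bound
# for "just under 1TB" (decimal terabyte).
TOTAL_BYTES = 1_000_000_000_000
FILE_COUNT = 360_000
ROW_COUNT = 12_000_000_000

# Average object size in decimal megabytes.
avg_file_mb = TOTAL_BYTES / FILE_COUNT / 1_000_000

# Average rows per file, assuming an even spread (an assumption).
rows_per_file = ROW_COUNT // FILE_COUNT

print(f"average file size: {avg_file_mb:.2f} MB")
print(f"average rows per file: {rows_per_file:,}")
```

Under those assumptions each file holds roughly 2.8 MB and about 33,000 rows.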

The fact that the export generates so many files is a huge problem. Beyond the obvious inefficiency and overhead of processing so many objects (https://docs.aws.amazon.com/athena/latest/ug/performance-tuning.html#performance-tuning-avoid-having-too-many-files), writing so many new files also sets off alarms on our S3 activity.

To me this is a flaw in the S3 Export implementation, but until the root cause is addressed, is there anything I can do in how I configure the export, or even in the Postgres table itself, to avoid creating so many tiny files?

Many thanks

1 Answer

Accepted Answer

Hello,

I have checked internally and, unfortunately, there is no way to control the file size. There is currently no parameter or setting that can be changed to produce larger files and avoid creating too many small Parquet files.

The reason is that the export automation allocates the file size automatically in order to speed up the process, and there is currently no way to customize this. Please be aware, however, that this behavior is under review by our internal teams. I would advise checking our What's New page [1] and blog [2], where we announce new features and updates as they are released.

To reduce the number of files being exported, you may consider using the "Partial" option [3] to export only the required databases, tables, etc. By selecting "Partial", you move only the data that is necessary for your analysis rather than the entire database.
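For reference, a partial export can also be requested programmatically. The following is a minimal sketch, assuming boto3's RDS `start_export_task` call and its `ExportOnly` parameter; the helper names and every identifier passed to them (task ID, ARNs, bucket, KMS key, table name) are hypothetical placeholders. It only limits which tables are exported and does not change the size of the files produced.

```python
def build_partial_export_request(task_id, source_arn, bucket, role_arn,
                                 kms_key_id, export_only, prefix=""):
    """Build keyword arguments for the RDS StartExportTask API.

    export_only lists what to export; for PostgreSQL the entries take the
    form "database.schema" or "database.schema.table".
    """
    if not export_only:
        raise ValueError("export_only must name at least one item")
    request = {
        "ExportTaskIdentifier": task_id,
        "SourceArn": source_arn,      # snapshot (or cluster) ARN
        "S3BucketName": bucket,
        "IamRoleArn": role_arn,       # role allowed to write to the bucket
        "KmsKeyId": kms_key_id,       # key used to encrypt the export
        "ExportOnly": list(export_only),
    }
    if prefix:
        request["S3Prefix"] = prefix
    return request


def start_partial_export(request):
    """Submit the request; needs boto3 and AWS credentials at call time."""
    import boto3  # imported lazily so the builder works without boto3
    return boto3.client("rds").start_export_task(**request)


# Hypothetical values, for illustration only.
example_request = build_partial_export_request(
    task_id="events-export-001",
    source_arn="arn:aws:rds:us-east-1:123456789012:snapshot:example-snapshot",
    bucket="example-export-bucket",
    role_arn="arn:aws:iam::123456789012:role/example-export-role",
    kms_key_id="arn:aws:kms:us-east-1:123456789012:key/example-key-id",
    export_only=["exampledb.public.events"],
)
```

Keeping the request builder separate from the API call makes it possible to inspect or log the filter list before any export is started.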

On behalf of AWS, I would like to apologize for any inconvenience caused by this.

References:

[1] https://aws.amazon.com/new/

[2] https://aws.amazon.com/blogs/aws/

[3] https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/USER_ExportSnapshot.html#USER_ExportSnapshot.Exporting

AWS
Answered 7 months ago
Expert
Reviewed 1 month ago
  • Thank you for the answer. I am in fact using the partial export filter. Exporting the entire database ran into even bigger problems. We have some partitioned tables in our DB and the S3 Export gets very confused by these and does a full export at every level of the hierarchy. It was creating literally millions of S3 objects.
