RDS Export to S3 produces too many parquet files.

I am using the "RDS export snapshot to S3" feature to extract some tables from our Postgres RDS cluster.

The largest table in the export has

  • 12 billion rows
  • 35 columns
  • 3TB in Postgres (not including indexes)

When exported this table shows up in S3 broken up into about 360,000 parquet files, and just under 1TB in storage size.
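To put those numbers in perspective, here is a rough back-of-the-envelope calculation of the average file. It is only a sketch: "just under 1TB" is rounded up to 10^12 bytes, which is an assumption, so the real average is slightly smaller.

```python
# Figures quoted above; "just under 1TB" is rounded up to 10**12 bytes (assumption).
TOTAL_BYTES = 10**12
TOTAL_ROWS = 12_000_000_000
FILE_COUNT = 360_000


def per_file_averages(total_bytes, total_rows, file_count):
    """Return (average bytes per file, average rows per file)."""
    return total_bytes / file_count, total_rows / file_count


avg_bytes, avg_rows = per_file_averages(TOTAL_BYTES, TOTAL_ROWS, FILE_COUNT)
print(f"~{avg_bytes / 1e6:.1f} MB and ~{avg_rows:,.0f} rows per file")
```

So each object averages under 3 MB and roughly 33 thousand rows.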

The fact that the export generates so many files is a huge problem. Beyond the obvious inefficiency and massive overhead of processing so many objects (https://docs.aws.amazon.com/athena/latest/ug/performance-tuning.html#performance-tuning-avoid-having-too-many-files), writing so many new files also sets off alarms on our S3 activity.

To me this is a flaw in the S3 Export implementation, but until the root cause is addressed, is there anything I can do in how I configure the export, or even in the Postgres table itself, to avoid creating so many tiny files?

Many thanks

1 Answer
Accepted Answer

Hello,

I have checked internally and, unfortunately, there is no way to control the file size. There is currently no parameter or setting that can be changed to produce larger files and so avoid creating too many small parquet files.

The reason is that the file size is allocated automatically by the export automation in order to speed up the process, and there is currently no way to customize this. Please be aware, however, that this behavior is currently under review by our internal teams. I would kindly advise that you check our What's New page [1] and Blog [2], where we announce new features and updates as we release them.

To reduce the number of files being exported, you may consider using the "Partial" [3] option to export only the required databases, tables, and so on. By selecting "Partial", you move only the data that is necessary for your analysis rather than the entire database.
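As an illustration, a partial export can also be requested through the API using the `ExportOnly` parameter of `StartExportTask`. The following is a minimal sketch using boto3; every identifier, ARN, bucket name, and table name in it is a hypothetical placeholder, and it only limits which tables are exported. It does not change the size of the files that the export writes.

```python
def build_export_request(task_id, snapshot_arn, bucket, role_arn,
                         kms_key_id, tables, prefix=""):
    """Build keyword arguments for the RDS StartExportTask call.

    `tables` lists the objects to export; for PostgreSQL each entry
    takes the form "database.schema.table".
    """
    request = {
        "ExportTaskIdentifier": task_id,
        "SourceArn": snapshot_arn,
        "S3BucketName": bucket,
        "IamRoleArn": role_arn,
        "KmsKeyId": kms_key_id,
        "ExportOnly": list(tables),
    }
    if prefix:
        request["S3Prefix"] = prefix
    return request


def start_partial_export(request):
    """Submit the export; requires boto3 and AWS credentials."""
    import boto3  # imported lazily so the builder works without AWS access

    return boto3.client("rds").start_export_task(**request)


# All values below are placeholders, not real resources.
example_request = build_export_request(
    task_id="example-partial-export",
    snapshot_arn="arn:aws:rds:us-east-1:123456789012:cluster-snapshot:example",
    bucket="example-export-bucket",
    role_arn="arn:aws:iam::123456789012:role/example-export-role",
    kms_key_id="arn:aws:kms:us-east-1:123456789012:key/example-key-id",
    tables=["exampledb.public.big_table"],
    prefix="exports/",
)
```

Calling `start_partial_export(example_request)` would submit the task once the placeholders are replaced with real resources.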

On behalf of AWS, I would like to apologize for any inconvenience caused by this.

References:

[1] https://aws.amazon.com/new/

[2] https://aws.amazon.com/blogs/aws/

[3] https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/USER_ExportSnapshot.html#USER_ExportSnapshot.Exporting

AWS, answered 7 months ago
Expert reviewed 2 months ago
  • Thank you for the answer. I am in fact using the partial export filter. Exporting the entire database ran into even bigger problems. We have some partitioned tables in our DB and the S3 Export gets very confused by these and does a full export at every level of the hierarchy. It was creating literally millions of S3 objects.
