Redshift UNLOAD parquet file size


My customer has a 2-4 node dc2.8xlarge Redshift cluster, and they want to export data to Parquet at the optimal size (~1 GB) per file using the MAXFILESIZE AS 1GB option. But the engine exported a total of about 500 MB split across 64 files, each ranging from roughly 5 MB to 25 MB.
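
For reference, the UNLOAD statement looked roughly like the sketch below (the table name, S3 prefix, and IAM role are placeholders, not the customer's actual values):

    UNLOAD ('SELECT * FROM my_schema.my_table')   -- placeholder query
    TO 's3://my-export-bucket/my_table/part_'     -- placeholder S3 prefix
    IAM_ROLE 'arn:aws:iam::111122223333:role/MyRedshiftUnloadRole'  -- placeholder role
    FORMAT AS PARQUET
    MAXFILESIZE AS 1 GB;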

My question:

  1. How can we control the size per parquet file?
  2. How does Redshift determine the optimal file size?
Asked 4 years ago · Viewed 1,646 times

1 Answer
Accepted Answer

By default, the UNLOAD command writes a number of files equal to the number of slices in the cluster. A 4-node dc2.8xlarge cluster has 64 slices (4 nodes × 16 slices per node), so you get 64 files; this default keeps all slices working in parallel. When unloading in Parquet format, Redshift writes the data in row groups of about 32 MB, so for smaller data volumes, where each slice's share of the data is well under 32 MB, the files come out smaller (here, roughly 500 MB / 64 slices ≈ 8 MB per file, which matches the 5-25 MB the customer observed). Multiple files are also more efficient than a single file, because producing a single file forces Redshift to first combine the data from the table and then write one output, which makes less effective use of the parallel compute nodes.
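
If you want to confirm how many slices (and therefore how many default output files) your cluster has, one way is to count the rows in the STV_SLICES system table, for example:

    SELECT COUNT(*) AS slice_count FROM stv_slices;  -- one row per slice in the cluster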

One way to generate larger, fixed-size files is to use the UNLOAD options PARALLEL OFF and MAXFILESIZE 1 GB, as sketched below.
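A minimal sketch of that variant, again with placeholder table, bucket, and IAM role values:

    UNLOAD ('SELECT * FROM my_schema.my_table')   -- placeholder query
    TO 's3://my-export-bucket/my_table/part_'     -- placeholder S3 prefix
    IAM_ROLE 'arn:aws:iam::111122223333:role/MyRedshiftUnloadRole'  -- placeholder role
    FORMAT AS PARQUET
    PARALLEL OFF          -- write files serially instead of one file per slice
    MAXFILESIZE AS 1 GB;  -- roll over to a new file at roughly 1 GB

With PARALLEL OFF the data is written serially rather than one file per slice, and MAXFILESIZE starts a new file once the current one reaches about 1 GB, so the output is a small number of files close to the target size, at the cost of giving up the parallelism described above for large tables.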

Answered 4 years ago
