I'm using S3DistCp (s3-dist-cp) to concatenate files in Apache Parquet format with the --groupBy and --targetSize options. The s3-dist-cp job completes without errors, but the generated Parquet files are broken. When I try to read the Parquet files in applications, I get an error message similar to the following:
"Expected n values in column chunk at /path/to/concatenated/parquet/file offset m but got x values instead over y pages ending at file offset z"
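S3DistCp doesn't support concatenation for Parquet files. Each Parquet file ends with its own footer metadata that describes the row groups and column chunk offsets in that file, so appending one file's bytes to another doesn't produce a valid Parquet file. Use PySpark instead.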
You can't specify the target file size in PySpark, but you can specify the number of partitions. Spark saves each partition to a separate output file. To estimate the number of partitions that you need, divide the size of the dataset by the target individual file size.
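The estimate described above can be sketched in a few lines of Python. The function name and the example sizes are illustrative assumptions, not part of any AWS or Spark API. The result is only approximate, because the size of the rewritten Parquet files can differ from the source size depending on compression and encoding.

```python
def estimate_partitions(dataset_size_bytes: int, target_file_size_bytes: int) -> int:
    """Estimate how many partitions give output files of roughly the target size.

    Hypothetical helper: divides the dataset size by the target file size
    and rounds up, returning at least one partition.
    """
    if dataset_size_bytes < 0 or target_file_size_bytes <= 0:
        raise ValueError("sizes must be non-negative, and the target size must be positive")
    # Integer ceiling division avoids floating-point rounding on large byte counts.
    partitions = (dataset_size_bytes + target_file_size_bytes - 1) // target_file_size_bytes
    return max(1, partitions)


# Illustrative values: a 10 GiB dataset and a 128 MiB target file size.
n = estimate_partitions(10 * 1024**3, 128 * 1024**2)
print(n)  # 80
```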
1. Create an Amazon EMR cluster with Apache Spark installed.
2. Start the PySpark shell and specify how many executors you need. The right number depends on cluster capacity and dataset size. For more information, see Best practices for successfully managing memory for Apache Spark applications on Amazon EMR.
$ pyspark --num-executors number_of_executors
3. Load the source Parquet files into a Spark DataFrame. This can be an Amazon Simple Storage Service (Amazon S3) path or an HDFS path. For example:
4. Repartition the DataFrame. In the following example, n is the number of partitions.
5. Save the DataFrame to the destination. This can be an Amazon S3 path or an HDFS path. For example:
6. Verify how many files are now in the destination directory:
$ hadoop fs -ls s3://awsdoc-example-bucket1/destination/ | wc -l
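The total number of files should be the value of n from step 4, plus one. The Parquet output committer writes the extra file, called _SUCCESS. Because hadoop fs -ls also prints a "Found N items" header line for a directory, the count that wc -l reports is one higher than the number of files.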
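Steps 3 through 5 can be sketched as a single PySpark function. This is a minimal sketch, not a definitive implementation: the function name compact_parquet is hypothetical, and it assumes that you run it in the PySpark shell from step 2, where a SparkSession is already available as spark (Spark 2.0 and later).

```python
def compact_parquet(spark, source_path, destination_path, num_partitions):
    """Rewrite the Parquet files under source_path as num_partitions output files.

    Hypothetical helper. `spark` is an existing SparkSession, such as the one
    that the pyspark shell predefines.
    """
    # Step 3: load the source Parquet files into a DataFrame.
    df = spark.read.parquet(source_path)

    # Step 4: repartition so that Spark writes one file per partition.
    # repartition() triggers a full shuffle. coalesce(num_partitions) avoids
    # the shuffle but can only reduce the number of partitions, not increase it.
    df_output = df.repartition(num_partitions)

    # Step 5: write the result. The default save mode raises an error
    # if the destination already exists.
    df_output.write.parquet(destination_path)
    return df_output
```

In the shell, a call might look like compact_parquet(spark, "s3://awsdoc-example-bucket1/source/", "s3://awsdoc-example-bucket1/destination/", n), where the source path is a placeholder and n is the partition count that you estimated earlier.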