EMR Spark output file consolidation in S3

Can you suggest a good way to consolidate partitioned Spark data into a single CSV file? From what I have read online, the simple Spark approach to combining data (repartition or coalesce) performs poorly, and the recommendation is to write the output as partitioned data files and then combine them into one.

s3-dist-cp seems to be the right utility to use here, but it is not clear how to keep a single header row at the top of the combined CSV file.

Is there a simple way to consolidate the output files while stripping the header from all but one of them?

AWS
Tom_L
asked 6 years ago, 801 views
1 Answer
Accepted Answer

It is not that s3-dist-cp is better than Spark at this particular problem. The problem is, by its very definition, a bottleneck: multiple input files can be read by multiple readers, but they have to be merged into one output file by a single writer. s3-dist-cp uses multiple mappers to read the input files and a single reducer to write the output file. Spark does the same: multiple executors can read the input files, but the eventual coalesce consolidates everything into a single partition on a single executor, which writes the final file. Spark should actually be faster, because it keeps the intermediate data in memory, whereas s3-dist-cp saves intermediate files to disk.
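To make the single-writer step concrete, here is a minimal sketch in plain Python of what "merge many part files, keep one header" amounts to when done by hand on local files. It is an illustration only, not how s3-dist-cp or Spark work internally. The function name `merge_csv_parts` and the `part-*.csv` naming are assumptions for the example, and it assumes every non-empty part starts with the same one-line header and ends with a newline.

```python
import shutil
from pathlib import Path
from typing import Iterable


def merge_csv_parts(part_paths: Iterable[Path], out_path: Path) -> None:
    """Concatenate CSV part files into out_path, keeping one header row.

    Hypothetical helper: assumes each non-empty part begins with the same
    single-line header and ends with a newline.
    """
    # Resolve and sort the inputs before the output file is created.
    parts = sorted(part_paths)
    header_written = False
    # newline="" leaves line endings untouched in both directions.
    with open(out_path, "w", newline="") as out:
        for part in parts:
            with open(part, "r", newline="") as src:
                header = src.readline()
                if not header:
                    continue  # empty part file, nothing to copy
                if not header_written:
                    out.write(header)
                    header_written = True
                # The header has been consumed, so only data rows remain.
                shutil.copyfileobj(src, out)
```

However many readers produce the parts, this final loop is sequential, which is the bottleneck described above.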

s3-dist-cp does perform better when the output is saved to S3 and a lot of files are being written. This is because Spark takes a more complicated path: it writes the files to a temporary directory in S3 and then renames them all into the final directory, and those are expensive steps in S3, where a rename is a copy followed by a delete.

Here, Spark with coalesce(1) is the best option, and the output file should be saved to HDFS. Once the final file is in HDFS, use the AWS S3 CLI to copy it to S3, if that is where it needs to be. Spark can recognize headers when reading CSV files and can write a header to its output CSV files as well, so no additional effort is needed to strip or add headers.
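A minimal PySpark sketch of this approach follows. The function name `consolidate_csv` and both paths in the usage comment are placeholders, not names from the question. It assumes the input files each carry a header row and that an existing `SparkSession` is passed in.

```python
def consolidate_csv(spark, src_path, dst_path):
    """Read headered CSV parts and write them back as a single part file.

    `spark` is an existing pyspark.sql.SparkSession. Both paths are
    caller-supplied, e.g. an S3 prefix to read and an HDFS directory to write.
    """
    # header=true: treat the first line of every input file as column names,
    # so no header row ends up in the data.
    df = spark.read.option("header", "true").csv(src_path)

    # coalesce(1) collapses the data to one partition, so one task writes
    # one part file; header=true emits the column names once at its top.
    (
        df.coalesce(1)
        .write.option("header", "true")
        .mode("overwrite")
        .csv(dst_path)
    )


# Example usage (placeholder paths, requires a Spark environment):
#   from pyspark.sql import SparkSession
#   spark = SparkSession.builder.appName("consolidate-csv").getOrCreate()
#   consolidate_csv(spark, "s3://my-bucket/input/", "hdfs:///tmp/consolidated")
```

Note that Spark writes `dst_path` as a directory containing a single `part-*.csv` file (plus a `_SUCCESS` marker), so it is that part file that gets copied to S3 afterwards.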

AWS
answered 6 years ago
