It is not that s3-dist-cp is better than Spark at this particular problem. The problem is by its very definition a bottleneck: multiple input files can be read by multiple readers, but they must be merged into one output file by a single writer. s3-dist-cp uses multiple mappers to read the input files and a single reducer to write the output file. Spark does the same: multiple executors can read the input files, but the eventual coalesce consolidates everything into a single executor/partition to write the final file. Spark should actually be faster, since it keeps the intermediate data in memory, whereas s3-dist-cp writes intermediate files to disk.
s3-dist-cp does perform better when the output is saved in S3 and a large number of files are written out. This is because Spark takes a more complicated path: it writes files to a temporary directory in S3 and then renames them all into the final directory, and renames are expensive operations in S3.
For this problem, Spark with coalesce(1) is the better option, and the output file should be saved to HDFS. Once the final file is in HDFS, use the AWS CLI to copy it to S3, if that is where it needs to be. Spark can recognize headers in input CSV files and insert a header in the output CSV file as well, so there is no additional effort needed to strip or add headers.