s3-dist-cp expected performance


I am trying to evaluate whether the s3-dist-cp tool would be helpful for the following use case: I have several million small text files (tens of KB each) in an S3 bucket that I need to concatenate into bigger files before processing them further with Spark. To test s3-dist-cp, I first tried it on a smaller bucket with ~550,000 files (~6.8 GB total size). I launched an EMR cluster with 15 core nodes (m6g.xlarge instance type, 4 vCPUs / 16 GB RAM) and ran the tool with a command like the following:

s3-dist-cp --src s3://mybucket/data/ --dest hdfs:///my_data --outputCodec=gz --targetSize=128 --groupBy='.*(celery-task-meta).*'

This took 1h 38m to complete. Is this kind of duration expected / normal? Is there anything I could do to speed it up?

Thanks in advance!

nikos64
posted 2 years ago · 245 views
1 Answer
Accepted Answer

s3-dist-cp uses Hadoop MapReduce to do the copy job. When reading from or writing to S3, it uses EMRFS to make GET / PUT / LIST calls against S3. So, to tune your job's runtime:

  • you have to be aware of how Hadoop MapReduce works and how it integrates with YARN;
  • tuning can also be performed on the file systems involved (S3 and HDFS) to improve read / write / listing performance.

Benchmarking needs to be done to really understand whether 1h 38m is a normal time for this cluster size.

You can monitor the MapReduce job using the YARN ResourceManager UI and the MapReduce Job History Server logs to identify where the bulk of the time is being spent.

  • Are any containers (mappers or reducers) stuck in the PENDING state waiting for YARN to assign resources? Are mappers or reducers running into memory issues? In that case you need a bigger cluster, or you need to tune the MapReduce memory settings (see the configuration sketch after this list).

  • Is s3-dist-cp spending too much time listing S3 objects before the mappers / reducers even start? If so, increase the s3-dist-cp client heap space so that it can handle listing the many objects in your source bucket, for example:

export HADOOP_OPTS="-Xmx5000m -verbose:gc -XX:+UseMembar -XX:+PrintGCDetails -Xloggc:/tmp/gc.log" ; s3-dist-cp --src s3://bucket/object/ --dest s3://dest-bucket/object/
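For the memory-issue case in the first bullet above, here is a minimal sketch of what tuning the MapReduce memory settings could look like, using an EMR configuration classification applied at cluster launch. The property names are standard Hadoop settings and "mapred-site" is a standard EMR classification, but the values are illustrative assumptions that you would need to benchmark for your own workload:

[
  {
    "Classification": "mapred-site",
    "Properties": {
      "mapreduce.map.memory.mb": "4096",
      "mapreduce.map.java.opts": "-Xmx3276m",
      "mapreduce.reduce.memory.mb": "4096",
      "mapreduce.reduce.java.opts": "-Xmx3276m"
    }
  }
]

You would save this as a JSON file and pass it when creating the cluster, e.g. aws emr create-cluster ... --configurations file://mapred-tuning.json (the file name here is just an example).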

To improve performance against S3, you can use the "fs.s3.*" parameters, which alter EMRFS behavior. Some parameters you can consider tuning are fs.s3.maxConnections and fs.s3.maxRetries (to deal with throttling from S3). Please note that some EMRFS parameters may not exist or may not be publicly documented, for example how to modify the listing behavior of EMRFS.
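As a rough sketch, those EMRFS properties can be set the same way, through the emrfs-site configuration classification at cluster launch; the property names come from the paragraph above, while the values below are placeholders to tune through benchmarking rather than recommendations:

[
  {
    "Classification": "emrfs-site",
    "Properties": {
      "fs.s3.maxConnections": "200",
      "fs.s3.maxRetries": "20"
    }
  }
]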

Alternatively, you might try using the s3a:// prefix in your s3-dist-cp command, which will invoke the S3A file system (part of open-source Hadoop) instead of EMRFS:

s3-dist-cp --src s3a://mybucket/data/ --dest hdfs:///my_data --outputCodec=gz --targetSize=128 --groupBy='.*(celery-task-meta).*'

The S3A file system parameters are well documented and explained in this article:
https://hadoop.apache.org/docs/stable/hadoop-aws/tools/hadoop-aws/performance.html
This lets you additionally tune file-system-related parameters to speed up your job.
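If you go the S3A route, a hypothetical starting point would be to raise the S3A connection and thread pools through the core-site classification. fs.s3a.connection.maximum and fs.s3a.threads.max are documented in the article above; the values shown are assumptions you would validate against your own job:

[
  {
    "Classification": "core-site",
    "Properties": {
      "fs.s3a.connection.maximum": "200",
      "fs.s3a.threads.max": "64"
    }
  }
]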

Additionally, HDFS write performance tuning can be considered if needed, but we rarely see performance issues with HDFS.

AWS
answered 1 year ago
SUPPORT ENGINEER
verified 11 days ago
