s3-dist-cp expected performance


I am trying to evaluate whether the s3-dist-cp tool would be helpful for the following use case: I have several million small text files (tens of KB each) in an S3 bucket that I need to concatenate into bigger files before processing them further with Spark. To test s3-dist-cp, I first tried it on a smaller bucket with ~550,000 files (~6.8 GB total size). I launched an EMR cluster with 15 core nodes (m6g.xlarge instance type, 4 vCPUs / 16 GB RAM) and ran the tool with a command like the following:

s3-dist-cp --src s3://mybucket/data/ --dest hdfs:///my_data --outputCodec=gz --targetSize=128 --groupBy='.*(celery-task-meta).*'

This took 1 hr 38 min to complete. Is this kind of duration expected/normal? Is there anything I could do to speed it up?

Thanks in advance!

nikos64
asked 2 years ago · 240 views
1 Answer
Accepted Answer

s3-dist-cp uses Hadoop MapReduce to do the copy job. When reading from or writing to S3, it uses EMRFS to make GET / PUT / LIST calls to S3. So, to tune the runtime performance of your job:

  • You have to be aware of how Hadoop MapReduce works and how it integrates with YARN.
  • Tuning can also be performed on the file systems (S3 and HDFS) to improve read / write / listing performance.
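For the YARN/MapReduce side, one common lever is setting memory properties at cluster launch through EMR configuration classifications. A minimal sketch (the release label, instance count, and memory values below are illustrative assumptions, not recommendations for your workload):

# Illustrative only: mapred-site and yarn-site are standard EMR classifications;
# adjust the memory numbers to your instance type and observed container behavior.
aws emr create-cluster \
  --name "s3-dist-cp-test" \
  --release-label emr-6.9.0 \
  --applications Name=Hadoop \
  --instance-type m6g.xlarge --instance-count 16 \
  --use-default-roles \
  --configurations '[
    {"Classification":"mapred-site",
     "Properties":{"mapreduce.map.memory.mb":"3072",
                   "mapreduce.map.java.opts":"-Xmx2458m"}},
    {"Classification":"yarn-site",
     "Properties":{"yarn.nodemanager.resource.memory-mb":"12288"}}
  ]'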

Benchmarking needs to be done to really understand whether 1 hr 38 min is a normal time for this cluster size.

You can monitor the MapReduce job using the YARN ResourceManager UI and the MapReduce Job History Server logs to identify where the bulk of the time is being spent:

  • Are any containers (mappers or reducers) in a PENDING state, waiting for resources to be assigned by YARN? Are mappers/reducers running into memory issues? In that case you need a bigger cluster, or you need to tune the MapReduce memory settings.

  • Is s3-dist-cp spending too much time listing S3 objects before the mappers/reducers even start running? Increase the s3-dist-cp client heap space so that it can handle listing the many S3 objects in your source bucket, for example:

export HADOOP_OPTS="-Xmx5000m -verbose:gc -XX:+UseMembar -XX:+PrintGCDetails -Xloggc:/tmp/gc.log" ; s3-dist-cp --src s3://bucket/object/ --dest s3://dest-bucket/object/
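If you can SSH to the master node while the job runs, the standard YARN/MapReduce command-line tools give a quick view of where the time is going. A sketch, with placeholder application/job IDs:

# List running YARN applications and note the s3-dist-cp application ID
yarn application -list

# Show progress and counters for that MapReduce job (placeholder job ID)
mapred job -status job_1234567890123_0001

# Fetch aggregated container logs once the job has finished (placeholder ID)
yarn logs -applicationId application_1234567890123_0001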

To improve performance to S3, you can use the "fs.s3.*" parameters, which alter EMRFS behavior. Some parameters you can consider tuning: fs.s3.maxConnections and fs.s3.maxRetries (to deal with throttling from S3). Please note that some EMRFS parameters may not exist or may not be publicly documented, for example those that modify the listing behavior of EMRFS.
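For example, those EMRFS settings could be applied at cluster launch through the emrfs-site classification. A sketch with illustrative values (the right numbers depend on your request rate and the throttling you observe):

# emrfs-config.json -- illustrative values only
cat > emrfs-config.json <<'EOF'
[
  {"Classification": "emrfs-site",
   "Properties": {
     "fs.s3.maxConnections": "200",
     "fs.s3.maxRetries": "20"
   }}
]
EOF
# Reference it when launching the cluster:
# aws emr create-cluster ... --configurations file://emrfs-config.json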

So, you might try using the s3a:// prefix in your s3-dist-cp command, which will invoke the S3A file system (part of open-source Hadoop) instead of EMRFS:

s3-dist-cp --src s3a://mybucket/data/ --dest hdfs:///my_data --outputCodec=gz --targetSize=128 --groupBy='.*(celery-task-meta).*'

The S3A file system parameters are well documented and explained in this article:
https://hadoop.apache.org/docs/stable/hadoop-aws/tools/hadoop-aws/performance.html
This allows you to additionally tune file-system-related parameters to speed up your job.
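As a sketch, the S3A connection and thread pools could be enlarged through the core-site classification (the property names come from the Hadoop S3A documentation linked above; the values are assumptions to tune against your own runs):

# s3a-config.json -- illustrative S3A tuning
cat > s3a-config.json <<'EOF'
[
  {"Classification": "core-site",
   "Properties": {
     "fs.s3a.connection.maximum": "200",
     "fs.s3a.threads.max": "64"
   }}
]
EOF
# aws emr create-cluster ... --configurations file://s3a-config.json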

Additionally, HDFS write performance tuning can be considered if needed, but we rarely see performance issues with HDFS.
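If HDFS writes did ever become the bottleneck, one option to consider is lowering the replication factor for this intermediate output via the hdfs-site classification (dfs.replication is a standard HDFS property; the value below is an assumption and trades durability of the intermediate data for write throughput):

# hdfs-config.json -- illustrative; fewer replicas means fewer HDFS writes per block
cat > hdfs-config.json <<'EOF'
[
  {"Classification": "hdfs-site",
   "Properties": {"dfs.replication": "2"}}
]
EOF
# aws emr create-cluster ... --configurations file://hdfs-config.json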

AWS
answered a year ago
AWS SUPPORT ENGINEER
