s3-dist-cp expected performance


I am trying to evaluate whether the s3-dist-cp tool would be helpful for the following use case: I have several million small text files (tens of KB each) in an S3 bucket that I need to concatenate into bigger files before processing them further with Spark. To test s3-dist-cp, I first tried it on a smaller bucket with ~550,000 files (~6.8 GB total size). I launched an EMR cluster with 15 core nodes (m6g.xlarge instance type, 4 vCPUs / 16 GB RAM) and ran the tool with a command like the following:

s3-dist-cp --src s3://mybucket/data/ --dest hdfs:///my_data --outputCodec=gz --targetSize=128 --groupBy='.*(celery-task-meta).*'

This took 1h 38m to complete. Is this kind of duration expected / normal? Is there anything I could do to speed it up?

Thanks in advance!

nikos64
posted 2 years ago · 245 views
1 Answer
Accepted Answer

s3-dist-cp uses Hadoop MapReduce to do the copy job. When reading from or writing to S3, it uses EMRFS to make GET / PUT / LIST calls against S3. So, to tune your job's runtime:

  • you have to be aware of how Hadoop MapReduce works and how it integrates with YARN;
  • tuning can also be performed on the file systems involved (S3 and HDFS) to improve read / write / listing performance.

Benchmarking needs to be done to really understand whether 1h 38m is a normal time for this cluster size.

You can monitor the MapReduce job using the YARN ResourceManager UI and the MapReduce Job History Server logs to identify where the bulk of the time is being spent.

  • Are any containers (mappers or reducers) stuck in the PENDING state waiting for YARN to assign resources? Are mappers or reducers running into memory issues? In that case you need a bigger cluster, or you need to tune the MapReduce memory settings (see the configuration sketch after this list).

  • Is s3-dist-cp spending too much time listing S3 objects before the mappers / reducers even start? If so, increase the s3-dist-cp client heap space so that it can handle listing the many objects in your source bucket, for example:

export HADOOP_OPTS="-Xmx5000m -verbose:gc -XX:+UseMembar -XX:+PrintGCDetails -Xloggc:/tmp/gc.log" ; s3-dist-cp --src s3://bucket/object/ --dest s3://dest-bucket/object/
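For the memory-issue case in the first bullet above, here is a minimal sketch of what tuning the MapReduce memory settings could look like, using an EMR configuration classification applied at cluster launch. The property names are standard Hadoop settings and "mapred-site" is a standard EMR classification, but the values are illustrative assumptions that you would need to benchmark for your own workload:

[
  {
    "Classification": "mapred-site",
    "Properties": {
      "mapreduce.map.memory.mb": "4096",
      "mapreduce.map.java.opts": "-Xmx3276m",
      "mapreduce.reduce.memory.mb": "4096",
      "mapreduce.reduce.java.opts": "-Xmx3276m"
    }
  }
]

You would save this as a JSON file and pass it when creating the cluster, e.g. aws emr create-cluster ... --configurations file://mapred-tuning.json (the file name here is just an example).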

To improve performance against S3, you can use the "fs.s3.*" parameters, which alter EMRFS behavior. Some parameters you can consider tuning are fs.s3.maxConnections and fs.s3.maxRetries (to deal with throttling from S3). Please note that some EMRFS parameters may not exist or may not be publicly documented, for example how to modify the listing behavior of EMRFS.
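As a rough sketch, those EMRFS properties can be set the same way, through the emrfs-site configuration classification at cluster launch; the property names come from the paragraph above, while the values below are placeholders to tune through benchmarking rather than recommendations:

[
  {
    "Classification": "emrfs-site",
    "Properties": {
      "fs.s3.maxConnections": "200",
      "fs.s3.maxRetries": "20"
    }
  }
]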

Alternatively, you might try using the s3a:// prefix in your s3-dist-cp command, which will invoke the S3A file system (part of open-source Hadoop) instead of EMRFS:

s3-dist-cp --src s3a://mybucket/data/ --dest hdfs:///my_data --outputCodec=gz --targetSize=128 --groupBy='.*(celery-task-meta).*'

The S3A file system parameters are well documented and explained in this article:
https://hadoop.apache.org/docs/stable/hadoop-aws/tools/hadoop-aws/performance.html
This lets you additionally tune file-system-related parameters to speed up your job.
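If you go the S3A route, a hypothetical starting point would be to raise the S3A connection and thread pools through the core-site classification. fs.s3a.connection.maximum and fs.s3a.threads.max are documented in the article above; the values shown are assumptions you would validate against your own job:

[
  {
    "Classification": "core-site",
    "Properties": {
      "fs.s3a.connection.maximum": "200",
      "fs.s3a.threads.max": "64"
    }
  }
]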

Additionally, HDFS write performance tuning can be considered if needed, but we rarely see performance issues with HDFS.

AWS
answered 1 year ago
SUPPORT ENGINEER
verified 11 days ago
