s3-dist-cp uses a Hadoop MapReduce job to do the copy. When reading from or writing to S3, it uses EMRFS to make GET/PUT/LIST calls against S3.
So, to tune the runtime of your job:
- You need to be aware of how Hadoop works and how it integrates with YARN.
- Tuning can also be performed at the file-system level (S3 and HDFS) to improve read, write, and listing performance.
Benchmarking needs to be done to really understand whether 1 hour 38 minutes is a normal runtime for this cluster size.
You can monitor the MapReduce job using the YARN ResourceManager UI and the MapReduce Job History Server logs to identify where the bulk of the time is being spent.
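From the EMR master node, the YARN CLI gives a quick view of the same information. These are standard Hadoop/YARN commands; the application ID shown is a placeholder you would replace with the one from your own listing:

```shell
# List running YARN applications to find the s3-dist-cp MapReduce job
yarn application -list -appStates RUNNING

# Fetch the aggregated logs for a finished application
# (replace the application ID with the one reported above)
yarn logs -applicationId application_1234567890123_0001 | less
```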
- Are any containers (mappers or reducers) in PENDING state, waiting for resources to be assigned by YARN? Are mappers or reducers running into memory issues? In either case, you need a bigger cluster or need to tune the MapReduce memory settings.
- Is s3-dist-cp spending too much time listing S3 objects before the mappers/reducers even start? If so, increase the s3-dist-cp client heap space so that it can handle listing a large number of S3 objects in your source bucket:
export HADOOP_OPTS="-Xmx5000m -verbose:gc -XX:+UseMembar -XX:+PrintGCDetails -Xloggc:/tmp/gc.log" ; s3-dist-cp --src s3://bucket/object/ --dest s3://dest-bucket/object/
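If the Job History Server shows map or reduce tasks failing with memory errors, the standard MapReduce memory properties can be raised per invocation. The property names below are the standard Hadoop ones; the values are illustrative, not recommendations, and this assumes your s3-dist-cp wrapper passes generic `-D` options through to the job (otherwise set them in mapred-site.xml):

```shell
# Raise container memory and the corresponding JVM heap for map and reduce tasks
# (heap is typically set to roughly 80% of the container size)
s3-dist-cp \
  -Dmapreduce.map.memory.mb=4096 \
  -Dmapreduce.map.java.opts=-Xmx3276m \
  -Dmapreduce.reduce.memory.mb=8192 \
  -Dmapreduce.reduce.java.opts=-Xmx6553m \
  --src s3://bucket/object/ --dest s3://dest-bucket/object/
```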
To improve performance against S3, you can use the "fs.s3.*" parameters, which alter EMRFS behavior. Some parameters you can consider tuning: fs.s3.maxConnections and fs.s3.maxRetries (to deal with throttling from S3). Please note that some EMRFS parameters may not exist or may not be publicly documented, for example how to modify the listing behavior of EMRFS.
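For example, the connection and retry limits mentioned above can be raised on the command line. The property names are the EMRFS ones named in this answer; the values are illustrative, and this again assumes `-D` options are passed through to the job:

```shell
# Allow more parallel connections to S3 and more retries under S3 throttling
s3-dist-cp \
  -Dfs.s3.maxConnections=200 \
  -Dfs.s3.maxRetries=20 \
  --src s3://bucket/object/ --dest s3://dest-bucket/object/
```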
So, you might try using the s3a:// prefix in your s3-dist-cp command, which invokes the S3A file system (part of open-source Hadoop) instead of EMRFS:
s3-dist-cp --src s3a://mybucket/data/ --dest hdfs:///my_data --outputCodec=gz --targetSize=128 --groupBy='.*(celery-task-meta).*'
The S3A file system parameters are well documented and explained in this article:
https://hadoop.apache.org/docs/stable/hadoop-aws/tools/hadoop-aws/performance.html
This lets you additionally tune file-system-related parameters to speed up your job.
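As a sketch of the kind of S3A tuning the linked page describes (the property names are from the hadoop-aws documentation; the values are illustrative only):

```shell
# Tune S3A client parallelism and enable buffered uploads
s3-dist-cp \
  -Dfs.s3a.connection.maximum=200 \
  -Dfs.s3a.threads.max=64 \
  -Dfs.s3a.fast.upload=true \
  --src s3a://mybucket/data/ --dest hdfs:///my_data
```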
Additionally, HDFS write performance tuning can be considered if needed, but we rarely see performance issues with HDFS.