How to copy a large dataset from an on-premises Hadoop cluster to S3?


A customer has a Hadoop cluster in an engineered IBM box with internal InfiniBand connecting the data nodes to the master node. Only the master node (and the slave node) are on the IP network; the data nodes have no IP addresses assigned and are not reachable from the network. The customer has 50 TB of data (individual files up to 40 GB each, stored in Hive) to be moved to S3. We have Direct Connect in place and are looking at options to move this data. Time is not a constraint, but the use of Snowball devices has been ruled out for now.

Normally we would have used DistCp to copy data from the Hadoop cluster to S3 (the usual invocation is shown after the list below for reference). However, since the data nodes are not reachable, the DistCp utility will not work. What other options could work?

  • WebHDFS?

  • HttpFS?

  • Any other option to transfer 50 TB of data that doesn't involve significant work on the customer side (e.g., networking changes)?
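
For reference, the DistCp invocation that is ruled out here would look roughly like the following; the bucket name and warehouse path are only illustrative.

    # Typical cluster-wide DistCp from HDFS to S3 through the s3a connector
    # (bucket name and path are placeholders). Its map tasks run on the data
    # nodes, which is exactly what does not work in this topology.
    hadoop distcp \
        hdfs:///user/hive/warehouse/sales.db \
        s3a://example-bucket/hive/sales.db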

Asked 5 years ago · 485 views
1 answer
Accepted Answer

So I understand that the nodes have no external connectivity except the master, so you cannot run DistCp even from inside the cluster.
I think the easiest option is a script that runs on the master, pulls files onto the master's local disk, and uploads them with the standard aws s3 command line client (tuning its bandwidth and parallelism settings a bit).
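
A minimal sketch of that approach, assuming the AWS CLI is installed on the master; the paths, bucket name, and tuning values below are placeholders:

    #!/usr/bin/env bash
    # Sketch: stage each file on the master's local disk, upload it to S3 with
    # the AWS CLI, then delete the local copy. Assumes HDFS paths have no spaces.
    set -euo pipefail

    SRC_DIR="/user/hive/warehouse/sales.db"      # HDFS source directory (one table)
    STAGE_DIR="/data/staging"                    # local disk on the master
    BUCKET="s3://example-bucket/hive/sales.db"   # S3 destination prefix

    # One-time CLI tuning: more parallel multipart uploads, larger parts.
    # Optionally cap bandwidth if the Direct Connect link is shared.
    aws configure set default.s3.max_concurrent_requests 20
    aws configure set default.s3.multipart_chunksize 64MB
    # aws configure set default.s3.max_bandwidth 400MB/s

    # Copy one file at a time so the staging disk only ever needs room for the
    # largest single file, not the whole dataset. (Handles a flat directory;
    # loop over table directories or add recursion for a full warehouse.)
    for f in $(hdfs dfs -ls -C "$SRC_DIR"); do
        name=$(basename "$f")
        hdfs dfs -get "$f" "$STAGE_DIR/$name"
        aws s3 cp "$STAGE_DIR/$name" "$BUCKET/$name"
        rm -f "$STAGE_DIR/$name"
    done

Staging and deleting one file at a time keeps the local-disk requirement to roughly the size of the largest file (about 40 GB here) rather than the full 50 TB.
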
The other option, if you don't want the temporary local copy, is to run DistCp in local mode, so it runs only on the master but can still access HDFS and S3 directly.
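
A sketch of that, assuming the s3a connector and its credentials are configured on the master (for example in core-site.xml); bucket name and source path are placeholders:

    # Force the MapReduce local job runner so the whole copy executes on the
    # master and no map tasks are scheduled on the unreachable data nodes.
    hadoop distcp \
        -D mapreduce.framework.name=local \
        hdfs:///user/hive/warehouse/sales.db \
        s3a://example-bucket/hive/sales.db

In local mode the copy is effectively serialized on the master, so throughput is bounded by the master's link and the Direct Connect bandwidth; since time is not a constraint here, that should be acceptable.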

AFAIK, the web solutions you propose for accessing the cluster externally would require the DataNodes to be reachable (the master doesn't actually hold the data).
A workaround would be a proxy service like Apache Knox, but handling all the security around that is far more hassle than simply running a script on the cluster master.

AWS
Expert
Answered 5 years ago
