AWS EMR (HDFS + Spark) - AWS EMR (Spark)


Hi, what is the difference between these two options when creating a data lake?

posix
Asked 2 years ago · 501 views
2 Answers
Accepted Answer

HDFS is a distributed, scalable, and portable file system for Hadoop. An advantage of HDFS is data awareness between the Hadoop cluster nodes managing the clusters and the Hadoop cluster nodes managing the individual steps. For more information, see the Hadoop documentation.

HDFS is used by the master and core nodes. One advantage is that it's fast; a disadvantage is that it's ephemeral storage, which is reclaimed when the cluster terminates. It's best used for caching the results produced by intermediate job flow steps. See https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-plan-file-systems.html
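The caching pattern described here could look roughly like the following PySpark-style sketch: an intermediate result is staged on HDFS, and only the final result is written to S3 so it survives cluster termination. This is an illustration under stated assumptions, not an official EMR recipe. The function names, the `event_type` column, and the rule that every non-S3 URI counts as ephemeral are all hypothetical; `spark` is assumed to be an existing `SparkSession`.

```python
from urllib.parse import urlparse

# Schemes treated as persistent object storage (an assumption of this sketch).
PERSISTENT_SCHEMES = {"s3", "s3a"}


def is_ephemeral(uri: str) -> bool:
    """Return True if the URI points at storage that is lost with the cluster.

    Assumption: anything that is not an S3 URI, including scheme-less paths
    that resolve to the cluster's default file system, is treated as ephemeral.
    """
    return urlparse(uri).scheme.lower() not in PERSISTENT_SCHEMES


def stage_intermediate(spark, raw_uri: str, tmp_uri: str) -> None:
    """Step 1: clean the raw data and cache the result (for example on HDFS)."""
    raw = spark.read.parquet(raw_uri)
    cleaned = raw.dropna()  # drop rows containing nulls
    cleaned.write.mode("overwrite").parquet(tmp_uri)


def publish_final(spark, tmp_uri: str, final_uri: str) -> None:
    """Step 2: aggregate the cached data and persist the result to S3."""
    if is_ephemeral(final_uri):
        raise ValueError(
            f"final output must not be on ephemeral storage: {final_uri}"
        )
    staged = spark.read.parquet(tmp_uri)
    summary = staged.groupBy("event_type").count()  # hypothetical column name
    summary.write.mode("overwrite").parquet(final_uri)
```

On a cluster this might be called as `stage_intermediate(spark, "s3://example-bucket/raw/", "hdfs:///tmp/cleaned/")` followed by `publish_final(spark, "hdfs:///tmp/cleaned/", "s3://example-bucket/curated/")`, where the bucket name and paths are placeholders. The guard in `publish_final` simply refuses to leave the only copy of the final result on storage that disappears with the cluster.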

Answered 2 years ago
Expert
Reviewed 22 days ago
AWS
Support Engineer
Reviewed 2 months ago
  • @lowflyinghawk, since HDFS on an AWS EMR cluster (HDFS + Spark) is ephemeral storage that is reclaimed when the cluster ends, is it a good idea to save the results to S3 after processing the data with Spark?

  • @posix, yes. btw, nice username.

  • @lowflyinghawk, thank you. It's there to motivate me and to remind me to stay positive and determined. If I understand correctly, HDFS is there to provide high-performance access to data, and Spark is there to run distributed computations?


Just to provide a bit more context: when we say Hadoop or big data, we mean a framework that allows distributed processing of large data sets. It primarily comprises a distributed storage layer (HDFS) and a compute/processing layer (MapReduce, Hive, Spark, etc.). Hence most of the frameworks built around Hadoop (Hive, Tez, HBase, Spark, etc.) are designed to read, write, and process data from a distributed storage layer. In the early days that was primarily HDFS; over time, however, these frameworks started supporting, and providing better integrations with, other distributed storage systems, mainly cloud storage such as S3.

So to answer your query: yes, HDFS provides the distributed storage layer, and it offers slightly faster data access than an external storage system because the data sits on local storage and less of it crosses the network. It was mainly preferred on long-running Hadoop clusters. On a transient Hadoop/EMR cluster, however, the HDFS storage layer is lost when the cluster is terminated. Hence it is recommended to use S3 as the persistent storage layer.

AWS
Answered 2 years ago
AWS
Support Engineer
Reviewed 2 months ago
  • And yes, Spark is the compute engine; it supports a wide range of applications, including ETL, machine learning, stream processing, and graph processing.
