I want to use alternative storage options for EMR Serverless.
Resolution
EMR Serverless doesn't support Hadoop Distributed File System (HDFS). The following storage options are available for EMR Serverless:
Local disks for temporary storage
Local disks on EMR Serverless workers serve as temporary storage for data that's shuffled and processed when a job runs. The local disks have a hard limit of 200 GB per worker. If a job spills heavily to disk and the shuffle data exceeds the 200 GB limit, then the job fails because of insufficient space.
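You can increase the disk that's attached to each worker, up to the 200 GB maximum. The following is a minimal sketch that uses the AWS SDK for Python (Boto3) to create an application with larger pre-initialized worker disks. The application name, release label, and capacity values are placeholder assumptions for illustration.

import boto3

# Create an EMR Serverless client (assumes AWS credentials are configured)
emr = boto3.client("emr-serverless")

# Request 200GB of local disk per worker, the documented maximum
response = emr.create_application(
    name="example-application",  # placeholder
    releaseLabel="emr-7.1.0",  # placeholder release label
    type="SPARK",
    initialCapacity={
        "DRIVER": {
            "workerCount": 1,
            "workerConfiguration": {"cpu": "4vCPU", "memory": "16GB", "disk": "200GB"},
        },
        "EXECUTOR": {
            "workerCount": 4,
            "workerConfiguration": {"cpu": "4vCPU", "memory": "16GB", "disk": "200GB"},
        },
    },
)
print(response["applicationId"])

For jobs that run on an existing application, you can also size disks per job with Spark properties such as spark.emr-serverless.executor.disk in your job's sparkSubmitParameters.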
Amazon S3 for data and log storage
Use Amazon Simple Storage Service (Amazon S3) as secondary storage for EMR Serverless jobs that read input data and store processed output data. EMR Serverless can also use Amazon S3 for log storage instead of worker temporary storage, so you can store logs for debugging in an Amazon S3 bucket. When you store logs in Amazon S3, you can set custom log retention policies and custom security policies for application logs. However, this option limits Amazon EMR's troubleshooting capabilities for jobs that you submit.
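For example, the following minimal sketch uses the AWS SDK for Python (Boto3) to submit a job run that delivers application logs to Amazon S3. The application ID, execution role ARN, script path, and bucket names are placeholder assumptions.

import boto3

emr = boto3.client("emr-serverless")

# Submit a job run and deliver driver and executor logs to an S3 bucket
response = emr.start_job_run(
    applicationId="example-application-id",  # placeholder
    executionRoleArn="arn:aws:iam::111122223333:role/example-execution-role",  # placeholder
    jobDriver={
        "sparkSubmit": {
            "entryPoint": "s3://example-bucket-name/scripts/example-script.py"  # placeholder
        }
    },
    configurationOverrides={
        "monitoringConfiguration": {
            "s3MonitoringConfiguration": {
                "logUri": "s3://example-log-bucket/emr-serverless/logs/"  # placeholder
            }
        }
    },
)
print(response["jobRunId"])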
To submit a job run to your EMR Serverless application and view your jobs, see Step 2: Submit a job run or interactive workload.
To read data from Amazon S3, submit the following Python script as a job on EMR Serverless:
Note: Replace example-bucket-name with the name of your Amazon S3 bucket and example-csv-data with the name of your .csv data file.
from pyspark.sql import SparkSession

# Initialize the Spark session
spark = SparkSession.builder.appName("ReadFromS3").getOrCreate()

# Specify the Amazon S3 path to your data
s3_path = "s3://example-bucket-name/emrserverless/example-csv-data"

# Read the data into a DataFrame
df = spark.read.csv(s3_path, header=True, inferSchema=True)

# Show the first few rows of the DataFrame
df.show()
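To run this script on EMR Serverless, upload it to Amazon S3 and reference its S3 URI as the entryPoint value, as in the earlier start_job_run sketch.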
Best practices for data processing on EMR Serverless
The following are best practices for data processing on EMR Serverless:
- To reduce shuffle data, optimize your job configurations and code.
- To stay within the 200 GB limit, use techniques such as partitioning, bucketing, and reducing the amount of intermediate data.
- To save storage space and improve processing efficiency, use compressed columnar formats such as Apache Parquet or ORC. When you compress shuffle data, you significantly reduce the amount of data that's written to disk during a job run. For a minimal example, see the sketch after this list.
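The following PySpark sketch illustrates the partitioning and compression practices above: it repartitions a DataFrame on a column and writes Snappy-compressed Parquet output. The paths and column name are placeholder assumptions.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("WriteCompressedParquet").getOrCreate()

# Read raw CSV data (placeholder path)
df = spark.read.csv(
    "s3://example-bucket-name/emrserverless/example-csv-data",
    header=True,
    inferSchema=True,
)

# Repartition on a commonly filtered column (placeholder name) to reduce skew
# and intermediate data, then write compressed, partitioned Parquet output
(
    df.repartition("example_column")
    .write.mode("overwrite")
    .partitionBy("example_column")
    .option("compression", "snappy")
    .parquet("s3://example-bucket-name/emrserverless/output/")
)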
Related information
Other considerations