The parameters you're using in your AWS Glue job are related to the AWS Glue shuffle manager, which allows Spark to use Amazon S3 for storing shuffle data instead of local disk storage.
Here's what each parameter does:
- --write-shuffle-files-to-s3 with value TRUE: This is the main flag that enables the AWS Glue Spark shuffle manager to use Amazon S3 buckets for writing and reading shuffle data. This helps overcome disk capacity issues that can occur during large shuffling operations.
- --write-shuffle-spills-to-s3 with value TRUE: This optional flag allows you to offload spill files to Amazon S3 buckets, providing additional resiliency to your Spark job. This is particularly useful for large workloads that spill a lot of data to disk.
Regarding your question about not specifying an S3 bucket: If you don't explicitly specify a bucket location, AWS Glue will use a default location. By default, the shuffle files will be written to a shuffle-data folder in the location specified by your job's --TempDir parameter.
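To make the default concrete, here is a minimal Python sketch of where the shuffle files would land when no bucket is specified. The helper name default_shuffle_location and the example bucket are hypothetical, and the exact path layout (a trailing shuffle-data/ prefix directly under --TempDir) is an assumption based on the description above, so check your own bucket to confirm.

```python
def default_shuffle_location(temp_dir: str) -> str:
    """Return the assumed default shuffle prefix under the job's --TempDir.

    Assumption: Glue appends a "shuffle-data" folder directly beneath
    the --TempDir location, as described in the answer above.
    """
    # Normalize a trailing slash so we never emit "//" in the path.
    return temp_dir.rstrip("/") + "/shuffle-data/"


# Hypothetical --TempDir value, for illustration only.
print(default_shuffle_location("s3://my-glue-temp/jobs/"))
```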
If you want to specify a custom S3 bucket for shuffle data, you can add this additional parameter:
--conf spark.shuffle.glue.s3ShuffleBucket=s3://<shuffle-bucket>
Where <shuffle-bucket> is the name of your S3 bucket. In AWS Glue 3.0 and later, you can even specify multiple buckets using a comma delimiter to improve performance:
--conf spark.shuffle.glue.s3ShuffleBucket=s3://shuffle-bucket-1/prefix,s3://shuffle-bucket-2/prefix/
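If you set job parameters programmatically, the flags above can be assembled into a single arguments dictionary. This is only a sketch: the function build_shuffle_arguments and the bucket names are hypothetical, and it assumes the job has no other --conf value that would need to share the same key.

```python
from typing import Dict, List, Optional


def build_shuffle_arguments(
    shuffle_buckets: Optional[List[str]] = None,
    spill_to_s3: bool = True,
) -> Dict[str, str]:
    """Build Glue job arguments that enable the S3 shuffle manager."""
    # Main flag: write and read shuffle data from Amazon S3.
    args = {"--write-shuffle-files-to-s3": "TRUE"}

    # Optional flag: also offload spill files to Amazon S3.
    if spill_to_s3:
        args["--write-shuffle-spills-to-s3"] = "TRUE"

    # Optional custom location(s); multiple buckets are comma-delimited.
    # When omitted, the default location under --TempDir is used.
    if shuffle_buckets:
        args["--conf"] = (
            "spark.shuffle.glue.s3ShuffleBucket=" + ",".join(shuffle_buckets)
        )
    return args


# Hypothetical bucket names, for illustration only.
print(build_shuffle_arguments(
    ["s3://shuffle-bucket-1/prefix", "s3://shuffle-bucket-2/prefix/"]
))
```

A dictionary like this could then be passed as the Arguments of a boto3 start_job_run call or as the DefaultArguments of the job definition.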
If you want encryption at rest for the shuffle data, make sure the job uses a security configuration with the appropriate settings.
Sources
Managing ETL jobs with AWS Glue Studio - AWS Glue
Building a reliable data pipeline - AWS Glue Best Practices: Building a Secure and Reliable Data Pipeline
Storing Spark shuffle data - AWS Glue