1 Answer
Hello,
AWS Glue ETL uses Apache Spark on the backend to process data in memory. Jobs, stages, and tasks are the units of work in Spark's distributed processing engine.
A Spark stage is a physical unit of execution: a set of tasks that can run together without moving data between partitions. Where stage boundaries fall depends on which of the two kinds of transformations is applied:
- Narrow transformations: These do not require a shuffle, so they can be executed within a single stage.
Examples: map() and filter()
- Wide transformations: These require shuffling data across partitions, so Spark creates a new stage at each shuffle boundary to coordinate the exchange between partitions.
Example: reduceByKey()
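As a rough illustration of why reduceByKey() forces a shuffle (and therefore a new stage), here is a plain-Python sketch, not actual Spark code, that mimics what Spark does: the map step stays inside each partition, while the reduce step must hash-partition records by key so that all values for a key land together:

```python
from collections import defaultdict

# Two input "partitions", as a Spark RDD might hold them.
partitions = [["a", "b", "a"], ["c", "b", "a"]]

# Narrow step: map each word to a (word, 1) pair within its own
# partition -- no data moves between partitions, so this can stay
# in a single stage.
mapped = [[(w, 1) for w in part] for part in partitions]

# Wide step: reduceByKey must bring all pairs with the same key
# together. Hash-partitioning records by key across 2 output
# partitions is the "shuffle" that ends the stage.
num_output = 2
shuffled = [defaultdict(int) for _ in range(num_output)]
for part in mapped:
    for key, value in part:
        shuffled[hash(key) % num_output][key] += value

counts = {k: v for bucket in shuffled for k, v in bucket.items()}
print(sorted(counts.items()))  # [('a', 3), ('b', 2), ('c', 1)]
```

In real Spark, the shuffle writes each partition's keyed output to disk and other executors fetch their share over the network, which is why wide transformations are the expensive stage boundaries you see in the Spark UI.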
For a deeper understanding of Spark internals, you can refer to the documentation below or other Spark resources. [1]
answered 22 days ago