What is a "stage" in AWS Glue?

Hi,

I see references made to "stages" in AWS Glue (e.g. https://docs.aws.amazon.com/glue/latest/dg/monitor-profile-debug-straggler.html), but I'm unclear as to what is actually meant by a "stage" and can't find an official definition anywhere.

Is a stage each individual operation against a DataFrame in a PySpark script, for example?

Thanks in advance,

cgddrd
asked 2 years ago · 566 views
1 Answer

Hello,

AWS Glue ETL uses Apache Spark as its backend to process data in memory. Jobs, stages, and tasks are terms from Spark's distributed processing engine.

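A Spark stage is a physical unit of execution: a set of tasks (one per partition) that can run in parallel without moving data between partitions. So a stage is not each individual DataFrame operation; Spark groups several operations into one stage and starts a new stage only where a shuffle is needed. Transformations come in two kinds: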

  1. Narrow transformations: These do not require shuffling, so consecutive narrow transformations can be executed in a single stage.

Example: map() and filter()

  2. Wide transformations: These require shuffling data across partitions, so Spark creates a new stage at each shuffle.

Example: reduceByKey()
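
To make the stage boundary idea concrete, here is a minimal sketch in plain Python. It is not PySpark and not how Spark's scheduler is actually implemented; it only models a linear chain of transformations being cut into stages at each wide transformation. The function name split_into_stages and the WIDE set are hypothetical, chosen for illustration, and the set lists only a few common shuffle-inducing RDD operations.

```python
# Simplified model of stage boundaries, NOT Spark's real scheduler.
# Assumption: a linear chain of transformations, no joins or caching.

# Hypothetical, partial list of wide (shuffle-inducing) transformations.
WIDE = {"reduceByKey", "groupByKey", "distinct", "repartition"}


def split_into_stages(transformations):
    """Group a linear chain of transformation names into stages."""
    # The first stage always exists: it reads the input partitions.
    stages = [[]]
    for name in transformations:
        if name in WIDE:
            # A shuffle closes the current stage; the work after the
            # shuffle is modeled as the start of a new stage.
            stages.append([name])
        else:
            # Narrow transformations are pipelined into the current stage.
            stages[-1].append(name)
    return stages


pipeline = ["map", "filter", "reduceByKey", "map", "groupByKey"]
stages = split_into_stages(pipeline)
for index, ops in enumerate(stages):
    print(f"stage {index}: {ops}")
```

In this model, five operations produce three stages, because only reduceByKey and groupByKey force a shuffle. In a real job, the physical plan printed by explain() on a DataFrame should show an Exchange operator wherever a shuffle occurs, and the Spark UI lists the resulting stages.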

To understand Spark internals in more depth, you can refer to the resource below or other Spark resources. [1]

[1] https://www.educba.com/spark-stages/

AWS
answered 2 years ago
