
How does Glue divide work among workers


I am trying to understand how Glue divides up large inputs among the worker nodes.

Example:

  • Say I have a Glue job that will need to process 1,000,000 input records

  • The job is configured to use 5 workers

Is Glue going to send 200,000 records to each of the 5 workers?

Is Glue going to send 'X' number of records to each worker and then send another 'X' records as the workers finish?

I am not so much interested in how to configure this job; I am most interested in how Glue processes the data, since I will use this information in the design of my project.

2 Answers

Glue runs on Apache Spark under the hood, and several factors determine how data is divided across workers.

Spark jobs are divided into stages, and each stage contains multiple tasks. Each task in a stage does the same piece of work on a different chunk of data.

Each worker (G.1X/G.2X) maps to a Spark executor, where these tasks are executed. Each executor can run multiple tasks.
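To make that concrete for the 1,000,000-record example: a minimal pure-Python sketch (not Spark itself, and the partition and executor counts are illustrative assumptions) of how a driver hands many small tasks to a few executors, rather than sending each worker a fixed 200,000-record share up front:

```python
# Illustrative sketch only: Spark splits the input into many partitions
# (one task each) and executors pick up tasks as they finish, so no
# worker is handed a fixed fifth of the data in advance.
from collections import defaultdict

RECORDS = 1_000_000
PARTITIONS = 40   # assumed number of input splits; far more than workers
EXECUTORS = 5     # one executor per worker in this example

tasks = [RECORDS // PARTITIONS] * PARTITIONS   # records per task
processed = defaultdict(int)                   # records handled per executor

# A real scheduler assigns each task to whichever executor has a free
# slot; round-robin stands in for that here.
for i, task_records in enumerate(tasks):
    executor = i % EXECUTORS
    processed[executor] += task_records

for ex, n in sorted(processed.items()):
    print(f"executor {ex}: {n} records across {PARTITIONS // EXECUTORS} tasks")
```

With evenly sized tasks this works out to 200,000 records per executor, but only because every task took the same effort; with skewed data, faster executors simply pull more tasks.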

When reading the data, the factors listed below determine how the read is parallelized. Beyond that, the intermediate stages and the distribution of data across them depend on the transformations in the ETL script.

For more information on the DAG (the Directed Acyclic Graph the Spark engine builds from the ETL script), the stages in the DAG, and the tasks in each stage, enable the Spark UI and launch a Spark history server to check whether the configured workers are utilized optimally: https://docs.aws.amazon.com/glue/latest/dg/monitor-spark-ui-jobs.html https://docs.aws.amazon.com/glue/latest/dg/monitor-spark-ui-history.html

If the data source is JDBC, the read is parallelized based on the hashpartitions and hashfield/hashexpression parameters: https://docs.aws.amazon.com/glue/latest/dg/run-jdbc-parallel-read-job.html

Depending on the table schema and the configuration parameters above passed to create_dynamic_frame, the data is read in parallel across multiple workers.
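As a rough sketch of what that looks like in a Glue script (the database name "my_db", table name "my_table", and partitioning column "id" are hypothetical placeholders, not values from the question):

```python
# Sketch of a parallel JDBC read in a Glue job; catalog names and the
# partition column are placeholder assumptions.
from awsglue.context import GlueContext
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())

# hashfield + hashpartitions tell Glue how to split the JDBC read:
# here the table is read by 10 parallel tasks, partitioned on "id".
dyf = glue_context.create_dynamic_frame.from_catalog(
    database="my_db",
    table_name="my_table",
    additional_options={
        "hashfield": "id",
        "hashpartitions": "10",
    },
)
```

Each of those 10 partitions becomes a task that any free executor can pick up.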

If the data source is S3, the following factors affect the read operation:

  1. File type - CSV, JSON, Parquet, ORC, Avro, etc.

  2. Compression type - if the files are compressed, whether the compression codec is splittable or not

https://stackoverflow.com/questions/14820450/best-splittable-compression-for-hadoop-input-bz2

  3. Size of the files - if a file is splittable, data is read in chunks of 128 MB.

Data is read across multiple workers, with each task in these workers reading one chunk of the data.
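The chunking arithmetic is simple to illustrate (the 1 GB file size is an assumed example; 128 MB is Spark's default maximum partition size):

```python
# Rough illustration: a splittable 1 GB file read with the default
# 128 MB split size yields 8 input tasks, each of which can be read
# by a different executor in parallel.
import math

SPLIT_SIZE = 128 * 1024 * 1024      # 128 MB default split size
file_size = 1 * 1024 * 1024 * 1024  # assumed 1 GB splittable file

num_tasks = math.ceil(file_size / SPLIT_SIZE)
print(num_tasks)  # 8
```

With only 5 workers, those 8 tasks queue up and run as executor slots free, which is why more, smaller splits generally keep all workers busy.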

For more information - https://spark.apache.org/docs/latest/index.html

answered a month ago

Hello,

It looks like you are looking for a resource on how Glue divides work across workers. The document below explains this in detail.

https://aws.amazon.com/blogs/big-data/best-practices-to-scale-apache-spark-jobs-and-partition-data-with-aws-glue/


Have a nice day !!!

answered a month ago
