How does Glue divide work among workers?


I am trying to understand how Glue divides up large inputs among the worker nodes.

Example:

  • Say I have a Glue job that will need to process 1,000,000 input records

  • The job is configured to use 5 workers

Is Glue going to send 200,000 records to each of the 5 workers?

Is Glue going to send 'X' number of records to each worker and then send another 'X' records as the workers finish?

I am not so much interested in how to configure this job; I am most interested in how Glue processes the data, because I plan to use this information in the design of my project.

Asked 2 years ago · Viewed 2,270 times
2 Answers

Glue uses Apache Spark underneath. There are many factors involved in dividing data across workers.

Spark jobs are divided into stages, and each stage contains multiple tasks. Each task in a stage does the same piece of work on a different chunk of the data.

Each worker (G.1X/G.2X) maps to a Spark Executor where these tasks are executed. Each Executor can run multiple tasks.

While reading the data, the factors listed below determine how the read is parallelized. The intermediate stages and the distribution of the data then depend on the transformations in the ETL script.
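
To make the stage/task picture concrete, here is a minimal PySpark sketch (the database and table names are placeholders) that reads a catalogued table and prints how many partitions the read produced. Each partition becomes one task, and those tasks are spread across the executors, so 5 workers do not simply receive 200,000 records each.

```python
# Minimal sketch for a Glue ETL script; database/table names are placeholders.
from awsglue.context import GlueContext
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())

# Read a table registered in the Glue Data Catalog.
dyf = glue_context.create_dynamic_frame.from_catalog(
    database="my_database",   # placeholder
    table_name="my_table",    # placeholder
)

df = dyf.toDF()
# Each partition is processed by one Spark task; tasks are scheduled across
# the executors (workers) as slots free up, not in fixed per-worker shares.
print("Number of partitions:", df.rdd.getNumPartitions())
```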

For more information on the DAG (the Directed Acyclic Graph created by the Spark engine from the ETL script), the stages in the DAG, and the tasks in each stage, enable the Spark UI and launch a Spark history server to check whether the configured workers are utilized optimally. https://docs.aws.amazon.com/glue/latest/dg/monitor-spark-ui-jobs.html https://docs.aws.amazon.com/glue/latest/dg/monitor-spark-ui-history.html
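
As a rough sketch (the job name and S3 path are placeholders), the Spark UI event logs can be enabled for a run through the --enable-spark-ui and --spark-event-logs-path job parameters, for example with boto3:

```python
# Minimal sketch, assuming a job named "my-glue-job" and an S3 path you own
# (both placeholders). The run writes Spark event logs so the DAG, stages, and
# tasks can be inspected in a Spark history server.
import boto3

glue = boto3.client("glue")

glue.start_job_run(
    JobName="my-glue-job",                                     # placeholder job name
    Arguments={
        "--enable-spark-ui": "true",                           # emit Spark event logs
        "--spark-event-logs-path": "s3://my-bucket/sparkui/",  # placeholder log location
    },
)
```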

If the data source is JDBC, the read is parallelized based on the parameters hashpartitions and hashfield/hashexpression. https://docs.aws.amazon.com/glue/latest/dg/run-jdbc-parallel-read-job.html

Depending on the table schema and the above configuration parameters passed to create_dynamic_frame, the data is read in parallel across multiple workers, as in the sketch below.
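
A hedged example of such a parallel JDBC read (the database, table, and column names are placeholders; hashfield names the column to split on and hashpartitions sets the number of parallel reads):

```python
# Minimal sketch of a parallel JDBC read via the Glue Data Catalog.
from awsglue.context import GlueContext
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())

jdbc_dyf = glue_context.create_dynamic_frame.from_catalog(
    database="my_jdbc_database",      # placeholder
    table_name="my_jdbc_table",       # placeholder
    additional_options={
        "hashfield": "customer_id",   # placeholder column used to split the table
        "hashpartitions": "10",       # number of parallel JDBC reads
    },
)
```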

If the data source is S3, the following factors affect the read operation:

  1. File type - CSV, JSON, Parquet, ORC, Avro, etc.

  2. Compression type - if the files are compressed, whether the compression codec is splittable or not https://stackoverflow.com/questions/14820450/best-splittable-compression-for-hadoop-input-bz2

  3. Size of the files - if the format is splittable, data is read in chunks of 128 MB by default (see the sketch below).

The data is read across multiple workers, with each task in those workers reading one such chunk.
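
As a rough back-of-the-envelope illustration (not a Glue API call), assuming splittable, uncompressed input and the default 128 MB split size:

```python
# Illustrative only: estimates how many read tasks Spark creates for splittable
# S3 input under the default 128 MB split size. The input size is an assumption.
SPLIT_SIZE_MB = 128

def estimated_read_tasks(total_input_mb: int) -> int:
    """Ceiling of input size divided by the split size."""
    return -(-total_input_mb // SPLIT_SIZE_MB)

# e.g. 1,000,000 records stored as ~2 GB of splittable files:
print(estimated_read_tasks(2048))  # ~16 read tasks, shared across the 5 workers' executors
```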

For more information - https://spark.apache.org/docs/latest/index.html

AWS
Answered 2 years ago

Hello,

We can see that you are looking for a resource to understand how Glue divides work across workers. The document below provides a detailed explanation of the same.

https://aws.amazon.com/blogs/big-data/best-practices-to-scale-apache-spark-jobs-and-partition-data-with-aws-glue/

================  

Have a nice day !!!

AWS
Support Engineer
Arun
Answered 2 years ago
