How does Glue divide work among workers


I am trying to understand how Glue divides up large inputs among the worker nodes.

Example:

  • Say I have a Glue job that will need to process 1,000,000 input records

  • The job is configured to use 5 workers

Is Glue going to send 200,000 records to each of the 5 workers?

Is Glue going to send 'X' number of records to each worker and then send another 'X' records as the workers finish?

I am not so much interested in how to configure this job. I am most interested in how Glue actually processes the data, since I am going to use this information in the design of my project.

asked 2 years ago · 2,270 views
2 answers

Glue uses Apache Spark underneath. There are many factors involved in dividing data across workers.

Spark jobs are divided into stages, and each stage contains multiple tasks. Each task in a stage does the same piece of work on a different chunk of data.

Each worker (G.1X/G.2X) maps to a Spark Executor where these tasks are executed. Each Executor can run multiple tasks.
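
As a rough illustration (plain PySpark rather than the Glue-specific APIs, with spark.range standing in for your 1,000,000 input records): the number of tasks in a stage equals the number of partitions, and those tasks are handed to the executor cores of the 5 workers as cores free up, rather than each worker receiving a fixed block of 200,000 records.

```python
# Minimal PySpark sketch: one task per partition, scheduled onto executors.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partition-demo").getOrCreate()

# Stand-in for 1,000,000 input records.
df = spark.range(1_000_000)

# Number of partitions == number of tasks in this stage. Tasks are dispatched
# to executor cores as they become free, not split into fixed per-worker blocks.
print("Partitions (tasks in this stage):", df.rdd.getNumPartitions())
```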

When reading the data, the factors listed below determine how the read is parallelized. Beyond that, the intermediate stages and the distribution of the data depend on the transformations in the ETL script.

For more information on the DAG (the Directed Acyclic Graph the Spark engine builds from the ETL script), the stages in the DAG and the tasks in each stage, enable the Spark UI and launch a Spark history server to see whether the configured workers are utilized optimally. https://docs.aws.amazon.com/glue/latest/dg/monitor-spark-ui-jobs.html https://docs.aws.amazon.com/glue/latest/dg/monitor-spark-ui-history.html
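
As a sketch of how that can be switched on for an existing job with boto3 (the job name and S3 log path below are placeholders; --enable-spark-ui and --spark-event-logs-path are the Glue special job parameters described in the docs above):

```python
# Sketch: enable Spark UI event logging on an existing Glue job so its DAG,
# stages and tasks can be inspected in a Spark history server.
# "my-etl-job" and the S3 path are placeholders.
import boto3

glue = boto3.client("glue")
job = glue.get_job(JobName="my-etl-job")["Job"]

args = dict(job.get("DefaultArguments", {}))
args["--enable-spark-ui"] = "true"
args["--spark-event-logs-path"] = "s3://my-bucket/spark-event-logs/"

glue.update_job(
    JobName="my-etl-job",
    JobUpdate={
        "Role": job["Role"],
        "Command": job["Command"],
        "DefaultArguments": args,
    },
)
```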

If the data source is a JDBC source, the read is parallelized based on the parameters hashpartitions and hashfield/hashexpression. https://docs.aws.amazon.com/glue/latest/dg/run-jdbc-parallel-read-job.html

Depending on the table schema and the configuration parameters above passed to create_dynamic_frame, the data is read in parallel across multiple workers.
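
A hedged sketch of what such a parallel JDBC read can look like in a Glue ETL script (the Data Catalog database, table and column names are placeholders; hashfield and hashpartitions come from the parallel-read documentation linked above):

```python
# Sketch of a parallel JDBC read in a Glue ETL script. hashpartitions controls
# how many parallel queries (tasks) read slices of the source table, bucketed
# by hashing the hashfield column. Names below are placeholders.
from awsglue.context import GlueContext
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())

dyf = glue_context.create_dynamic_frame.from_catalog(
    database="my_database",          # placeholder Data Catalog database
    table_name="my_jdbc_table",      # placeholder Data Catalog table
    additional_options={
        "hashfield": "customer_id",  # placeholder column used to bucket rows
        "hashpartitions": "10",      # number of parallel JDBC reads
    },
)
print("Partitions read in parallel:", dyf.toDF().rdd.getNumPartitions())
```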

If the data source is S3, the following factors affect the read operation:

  1. File type: CSV, JSON, Parquet, ORC, Avro, etc.

  2. Compression type: if the files are compressed, whether the compression is splittable or not https://stackoverflow.com/questions/14820450/best-splittable-compression-for-hadoop-input-bz2

  3. Size of the files: if the file is splittable, data is read in chunks of 128 MB.

The data is read across multiple workers, with each task on these workers reading one chunk of data.
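
As a rough sketch in plain PySpark (the S3 path is a placeholder): with splittable files Spark aims at roughly 128 MB per input partition (the spark.sql.files.maxPartitionBytes setting), so a 1 GB splittable file becomes about 8 read tasks spread over the workers, whereas a 1 GB gzip file, which is not splittable, ends up as a single task on a single worker.

```python
# Sketch: inspect the target input-partition size and how many partitions an
# S3 read produced. The S3 path is a placeholder.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Typically 128 MB (134217728 bytes) per input partition for splittable files.
print(spark.conf.get("spark.sql.files.maxPartitionBytes"))

df = spark.read.parquet("s3://my-bucket/input/")
print("Input partitions (read tasks):", df.rdd.getNumPartitions())
```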

For more information - https://spark.apache.org/docs/latest/index.html

AWS
answered 2 years ago

Hello,

We can see that you are looking for a resource to understand how Glue divides work across workers. The document below provides a detailed explanation of the same.

https://aws.amazon.com/blogs/big-data/best-practices-to-scale-apache-spark-jobs-and-partition-data-with-aws-glue/

================  

Have a nice day !!!

AWS
SUPPORT ENGINEER
Arun
answered 2 years ago
