AWS Glue uses Apache Spark underneath, and several factors determine how data is divided across workers.
Spark jobs are divided into stages, and each stage contains multiple tasks. Each task in a stage performs the same piece of work on a different chunk (partition) of the data.
Each worker (G.1X/G.2X) maps to a Spark executor where these tasks are executed, and each executor can run multiple tasks concurrently.
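As a rough illustration of the partition-to-task mapping (a minimal PySpark sketch, not Glue-specific): the number of partitions determines how many tasks a stage spawns, and those tasks are spread across the available executors.

```python
from pyspark.context import SparkContext

sc = SparkContext.getOrCreate()

# 8 partitions -> the count() action runs a stage with 8 tasks;
# those tasks are distributed across the available executors (workers).
rdd = sc.parallelize(range(1000), 8)
print(rdd.getNumPartitions())  # 8
print(rdd.count())             # action: one task per partition
```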
When reading the data, the factors listed below determine how the read is parallelized. The intermediate stages and the distribution of data across them then depend on the transformations in the ETL script.
For more information on the DAG (Directed Acyclic Graph the Spark engine builds from the ETL script), the stages in the DAG, and the tasks in each stage, enable the Spark UI and launch a Spark history server to check whether the configured workers are being utilized optimally:
https://docs.aws.amazon.com/glue/latest/dg/monitor-spark-ui-jobs.html
https://docs.aws.amazon.com/glue/latest/dg/monitor-spark-ui-history.html
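For example, the Spark UI can be enabled through the job's default arguments. A hedged sketch using boto3 (the job name, role, script location, and log bucket below are placeholders):

```python
import boto3

glue = boto3.client("glue")

# Placeholder values: replace the role, script location and log bucket
# with values from your own account.
glue.create_job(
    Name="my-etl-job",
    Role="arn:aws:iam::123456789012:role/MyGlueRole",
    Command={
        "Name": "glueetl",
        "ScriptLocation": "s3://my-bucket/scripts/etl.py",
    },
    DefaultArguments={
        # Emit Spark event logs so a Spark history server can replay
        # the DAG, stages and tasks of each run.
        "--enable-spark-ui": "true",
        "--spark-event-logs-path": "s3://my-bucket/spark-logs/",
    },
)
```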
If the data source is JDBC, reads are parallelized based on the hashpartitions and hashfield/hashexpression parameters: https://docs.aws.amazon.com/glue/latest/dg/run-jdbc-parallel-read-job.html
Depending on the table schema and the above configuration parameters passed to create_dynamic_frame, data is read in parallel across multiple workers.
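A minimal sketch of such a parallel JDBC read (the database, table, and column names are hypothetical; 'customer_id' is assumed to be a roughly uniformly distributed column suitable for hash partitioning):

```python
from awsglue.context import GlueContext
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())

# Hypothetical catalog database/table registered via a JDBC connection.
dyf = glue_context.create_dynamic_frame.from_catalog(
    database="my_database",
    table_name="my_jdbc_table",
    additional_options={
        "hashfield": "customer_id",  # column used to split the reads
        "hashpartitions": "10",      # 10 parallel JDBC read tasks
    },
)
```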
If the data source is S3, the following factors affect the read operation:
- File type - CSV, JSON, Parquet, ORC, Avro, etc.
- Compression type - if the files are compressed, whether the compression codec is splittable or not: https://stackoverflow.com/questions/14820450/best-splittable-compression-for-hadoop-input-bz2
- Size of the files - if a file is splittable, data is read in chunks of 128 MB (see the sketch after this list).
Data is read across multiple workers, with each task in these workers reading one chunk of data.
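To see how a particular S3 dataset was split, you can inspect the partition count after the read. A minimal sketch (the bucket path and format are placeholders):

```python
from awsglue.context import GlueContext
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())

# Placeholder S3 path. Note: gzip-compressed CSV is NOT splittable, so
# each file becomes a single partition, while plain CSV or Parquet can
# be split into ~128 MB chunks.
dyf = glue_context.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://my-bucket/input/"]},
    format="csv",
)

# One Spark task is created per partition of the underlying RDD.
print(dyf.toDF().rdd.getNumPartitions())
```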
For more information - https://spark.apache.org/docs/latest/index.html
Hello,
We could see that you are looking for a resource to understand how Glue divides work across workers. The explanation and linked documents above provide details on the same.
================
Have a nice day !!!