Glue job keeps running and does not write results
I created a job to migrate my Postgres data to S3 and am implementing the full load first. The table is large (17,496,724 records), so I added 10 workers with the auto scaling option enabled, but I keep getting the error below. The job runs for a long time and does not generate any output. I have tried other worker counts (for example 5 and 10) with the same result. These are the errors from the logs:
ERROR dispatcher-CoarseGrainedScheduler scheduler.TaskSchedulerImpl (Logging.scala:logError(73)): Lost executor 1 on 10.0.4.209: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
ERROR dispatcher-CoarseGrainedScheduler scheduler.TaskSchedulerImpl (Logging.scala:logError(73)): Lost executor 2 on 10.0.4.209: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
About the job: it creates a JDBC connection to the Postgres database and writes the data to S3 as Parquet files.
How do I perform the load? I need to add the incremental logic later.
Hello,
The above error occurs when an executor needs more memory to process the data than is configured. Glue allocates 10 GB per G.1X worker and 20 GB per G.2X worker. By default, Spark reads from a JDBC source over a single connection that pulls the entire table through one executor. To resolve this issue, read the JDBC table in parallel.
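To see why splitting the read helps, here is a minimal plain-Python sketch (not Glue code) of the idea behind hash-based partitioning: each of N parallel readers fetches only the rows whose hashed key falls in its bucket, so no single executor has to hold the whole table.

```python
# Conceptual sketch of hash-based partitioning for a parallel JDBC read.
# Each of NUM_PARTITIONS tasks would issue a query like:
#   SELECT * FROM big_table WHERE mod(hash(id), 4) = <bucket>
NUM_PARTITIONS = 4

def partition_for(key, num_partitions=NUM_PARTITIONS):
    """Assign a row to one of num_partitions buckets by hashing its key."""
    return hash(key) % num_partitions

keys = range(1000)  # stand-in for the table's primary-key values
buckets = {}
for key in keys:
    buckets.setdefault(partition_for(key), []).append(key)

# Every row lands in exactly one bucket, so the partitions together
# cover the full table with no overlap.
assert sum(len(b) for b in buckets.values()) == len(keys)
```

Glue's hashfield/hashpartitions options apply this same split on the database side, opening one JDBC connection per bucket.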
You can use hashfield (or hashexpression) together with hashpartitions to read the data in parallel with a Glue DynamicFrame. Please refer to this article for a detailed explanation:
https://aws.amazon.com/premiumsupport/knowledge-center/glue-lost-nodes-rds-s3-migration/
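Putting that together, here is a hedged sketch of what the Glue job script could look like; the database, table, column, and bucket names are placeholders you would replace with your own, and hashpartitions should roughly match your worker capacity.

```python
# Sketch of a Glue job: parallel JDBC read from Postgres, Parquet write to S3.
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glueContext = GlueContext(sc)
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# hashfield/hashpartitions make Glue open multiple JDBC connections,
# each reading only the rows whose hashed key falls in its bucket.
dyf = glueContext.create_dynamic_frame.from_catalog(
    database="my_database",          # placeholder
    table_name="my_postgres_table",  # placeholder
    additional_options={
        "hashfield": "id",       # a roughly uniform column, e.g. the primary key
        "hashpartitions": "10",  # number of parallel read connections
    },
)

glueContext.write_dynamic_frame.from_options(
    frame=dyf,
    connection_type="s3",
    connection_options={"path": "s3://my-bucket/postgres-export/"},  # placeholder
    format="parquet",
)

job.commit()
```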
Regarding the incremental load, you can use the Glue job bookmark feature to read from JDBC sources. Please refer to this article, as there are some prerequisites for using bookmarks with JDBC sources:
https://docs.aws.amazon.com/glue/latest/dg/monitor-continuations.html
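For the incremental read itself, the sketch below shows the bookmark-related options on the source (column names are placeholders). Bookmarks on JDBC sources need a key that only ever increases, a transformation_ctx on the read, the "Job bookmark" option enabled in the job properties, and a job.commit() at the end of the script.

```python
# Sketch: bookmark-enabled JDBC read (assumes glueContext/job are set up
# as in a standard Glue job script and that bookmarks are enabled on the job).
dyf = glueContext.create_dynamic_frame.from_catalog(
    database="my_database",              # placeholder
    table_name="my_postgres_table",      # placeholder
    transformation_ctx="read_postgres",  # required for bookmarks to track state
    additional_options={
        "jobBookmarkKeys": ["id"],           # strictly increasing key column
        "jobBookmarkKeysSortOrder": "asc",
    },
)
# ... transform and write ...
job.commit()  # persists the bookmark so the next run resumes after "id"
```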
If you still face any issues, please feel free to reach out to AWS Premium Support with the script and job run ID, and we will be happy to help.