Glue job keeps running and does not write results
I have created a Glue job to migrate my Postgres data to S3 and am implementing the full load first. The table contains a large number of records (17,496,724), so I configured 10 workers with the auto scaling option enabled, but I keep getting the error below. The job continues to run for a long time and does not generate any output. I have tried other worker counts too, such as 5 and 10, with the same result. Below are the errors from the logs:
ERROR dispatcher-CoarseGrainedScheduler scheduler.TaskSchedulerImpl (Logging.scala:logError(73)): Lost executor 1 on 10.0.4.209: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
ERROR dispatcher-CoarseGrainedScheduler scheduler.TaskSchedulerImpl (Logging.scala:logError(73)): Lost executor 2 on 10.0.4.209: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
About the job: it creates a JDBC connection to the Postgres database and writes the data into S3 as Parquet files.
How do I perform the full load? I will need to add the incremental logic later.
The above error occurs when an executor requires more memory to process the data than is configured. Glue allocates 10 GB per G.1X worker and 20 GB per G.2X worker. By default, a Spark read from a JDBC source is not parallel: it uses a single connection to read the entire table. To resolve this issue, read the JDBC table in parallel.
You can use the hashexpression or hashfield connection option along with hashpartitions to read the data in parallel using a Glue dynamic frame. Please refer to the article for a detailed explanation:
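As a minimal sketch of the parallel-read options, the snippet below builds the connection options for a partitioned JDBC read. The connection name, table, and column names are placeholders; the Glue API calls are shown as comments because they require the awsglue job runtime.

```python
# Sketch: connection options for a parallel JDBC read in AWS Glue.
# "my-postgres-connection", "public.my_big_table", and "id" are placeholders.
connection_options = {
    "useConnectionProperties": "true",
    "connectionName": "my-postgres-connection",
    "dbtable": "public.my_big_table",
    # Split the read across executors instead of using one JDBC connection:
    "hashfield": "id",        # an evenly distributed column, e.g. the primary key
    "hashpartitions": "10",   # roughly one partition per worker
}

# Inside the Glue job script (requires the awsglue runtime):
# dyf = glueContext.create_dynamic_frame.from_options(
#     connection_type="postgresql",
#     connection_options=connection_options,
# )
# glueContext.write_dynamic_frame.from_options(
#     frame=dyf,
#     connection_type="s3",
#     connection_options={"path": "s3://my-bucket/output/"},
#     format="parquet",
# )
```

With hashfield and hashpartitions set, each executor reads only its own slice of the table, so no single container has to hold the full 17M rows.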
Regarding the incremental load, you can use the Glue job bookmark feature to read data from JDBC sources. Please refer to the article for the same, as there are some prerequisites for using bookmarks with JDBC sources.
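A rough sketch of those prerequisites: bookmarks on a JDBC source need a strictly increasing key column and a transformation_ctx on the read, and the job must call job.init and job.commit. The connection and column names below are placeholders; the runtime calls are commented out because they require the awsglue environment.

```python
# Sketch: connection options for an incremental (bookmarked) JDBC read.
# "my-postgres-connection", "public.my_big_table", and "id" are placeholders.
jdbc_bookmark_read_options = {
    "connectionName": "my-postgres-connection",
    "dbtable": "public.my_big_table",
    "jobBookmarkKeys": ["id"],            # strictly increasing key column
    "jobBookmarkKeysSortOrder": "asc",
}

# Inside the Glue job script (requires the awsglue runtime):
# job.init(args["JOB_NAME"], args)        # enables bookmark tracking
# dyf = glueContext.create_dynamic_frame.from_options(
#     connection_type="postgresql",
#     connection_options=jdbc_bookmark_read_options,
#     transformation_ctx="incremental_read",  # bookmark state is keyed by this
# )
# ...write to S3...
# job.commit()                            # persists the bookmark position
```

On each run the bookmark records the highest key value read, so subsequent runs pick up only rows with larger key values.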
If you still face any issues, please feel free to reach out to AWS Premium Support with your script and job run ID, and we will be happy to help.