How to find root cause of SparkContext shutdown in AWS Glue job


AWS Glue jobs are sometimes failing with the error "An error occurred while calling o84.getDynamicFrame. Job 0 cancelled because SparkContext was shut down caused by Failed to create any executor tasks". At other times, the same jobs succeed. How do I find the root cause of the SparkContext shutdown?

Stacktrace:

py4j.protocol.Py4JJavaError: An error occurred while calling o84.getDynamicFrame.
: org.apache.spark.SparkException: Job 0 cancelled because SparkContext was shut down
    at org.apache.spark.scheduler.DAGScheduler.$anonfun$cleanUpAfterSchedulerStop$1(DAGScheduler.scala:1130)
    at org.apache.spark.scheduler.DAGScheduler.$anonfun$cleanUpAfterSchedulerStop$1$adapted(DAGScheduler.scala:1128)
    at scala.collection.mutable.HashSet.foreach(HashSet.scala:79)
    at org.apache.spark.scheduler.DAGScheduler.cleanUpAfterSchedulerStop(DAGScheduler.scala:1128)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onStop(DAGScheduler.scala:2703)
    at org.apache.spark.util.EventLoop.stop(EventLoop.scala:84)
    at org.apache.spark.scheduler.DAGScheduler.stop(DAGScheduler.scala:2603)
    at org.apache.spark.SparkContext.$anonfun$stop$12(SparkContext.scala:2111)
    at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1419)
    at org.apache.spark.SparkContext.stop(SparkContext.scala:2111)
    at org.apache.spark.SparkContext.$anonfun$new$39(SparkContext.scala:681)
    at org.apache.spark.util.SparkShutdownHook.run(ShutdownHookManager.scala:214)
    at org.apache.spark.util.SparkShutdownHookManager.$anonfun$runAll$2(ShutdownHookManager.scala:188)
    at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
    at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1996)
    at org.apache.spark.util.SparkShutdownHookManager.$anonfun$runAll$1(ShutdownHookManager.scala:188)
    at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
    at scala.util.Try$.apply(Try.scala:213)
    at org.apache.spark.util.SparkShutdownHookManager.runAll(ShutdownHookManager.scala:188)
    at org.apache.spark.util.SparkShutdownHookManager$$anon$2.run(ShutdownHookManager.scala:178)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:750)
    at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:914)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2238)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2259)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2278)
    at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:477)
    at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:430)
    at org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:47)
    at org.apache.spark.sql.Dataset.collectFromPlan(Dataset.scala:3733)
    at org.apache.spark.sql.Dataset.$anonfun$head$1(Dataset.scala:2762)
    at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3724)
    at org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:107)
    at org.apache.spark.sql.execution.SQLExecution$.withTracker(SQLExecution.scala:232)
    at org.apache.spark.sql.execution.SQLExecution$.executeQuery$1(SQLExecution.scala:110)
    at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$6(SQLExecution.scala:135)
    at org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:107)
    at org.apache.spark.sql.execution.SQLExecution$.withTracker(SQLExecution.scala:232)
    at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:135)
    at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:253)
    at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:134)
    at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:772)
    at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:68)
    at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3722)
    at org.apache.spark.sql.Dataset.head(Dataset.scala:2762)
    at org.apache.spark.sql.Dataset.take(Dataset.scala:2969)
    at com.amazonaws.services.glue.JDBCDataSource.getLastRow(DataSource.scala:1089)
    at com.amazonaws.services.glue.JDBCDataSource.getJdbcJobBookmark(DataSource.scala:929)
    at com.amazonaws.services.glue.JDBCDataSource.getDynamicFrame(DataSource.scala:953)
    at com.amazonaws.services.glue.DataSource.getDynamicFrame(DataSource.scala:99)
    at com.amazonaws.services.glue.DataSource.getDynamicFrame$(DataSource.scala:99)
    at com.amazonaws.services.glue.SparkSQLDataSource.getDynamicFrame(DataSource.scala:714)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:282)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:238)
    at java.lang.Thread.run(Thread.java:750)
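For reference, here is how I am pulling the driver logs for a failed run from CloudWatch Logs to look for the first real failure. This is a minimal sketch: the job-run ID is a placeholder, and the log group names assume Glue's standard logging setup.

```python
import boto3

logs = boto3.client("logs")

# Placeholder; copy the real job-run ID (jr_...) from the Glue console.
JOB_RUN_ID = "jr_0123456789abcdef"

# With standard logging, Glue writes driver/executor logs to these groups,
# and the log stream names start with the job-run ID.
for group in ("/aws-glue/jobs/error", "/aws-glue/jobs/output"):
    resp = logs.filter_log_events(
        logGroupName=group,
        logStreamNamePrefix=JOB_RUN_ID,
        filterPattern="?Exception ?ERROR",  # match lines containing either term
    )
    for event in resp.get("events", []):
        print(group, event["message"].rstrip())
```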

asked 2 years ago · 10670 views
2 Answers

Hello,

You get these errors when there aren't enough IP addresses available for the AWS Glue job. Here are two common reasons why these errors might happen:

  1. When you run a job with a connection (which contains your VPC and subnet), AWS Glue sets up elastic network interfaces so that your job can connect securely to other resources in the VPC. Each elastic network interface gets a private IP address from the subnet.

For example: If you're running a job with 20 DPUs, you can calculate the number of IP addresses as follows:

With AWS Glue 2.0/3.0: 20 DPUs = 19 workers (executors) + 1 master (driver) = 20 IP addresses

  2. Multiple AWS services are using the same subnet. These services might be using many of the subnet's available IP addresses.

Make sure that enough IP addresses are available in the subnet when you run a Glue job with a connection. To mitigate this kind of issue, use a subnet with a larger number of free IP addresses. You can check the subnet's free-IP count before a run, as in the sketch below.
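A minimal sketch of that check with boto3, assuming the connection name "my-jdbc-connection" is a placeholder and that the connection has VPC settings (PhysicalConnectionRequirements):

```python
import boto3

glue = boto3.client("glue")
ec2 = boto3.client("ec2")

# Placeholder name; use the connection that is attached to your Glue job.
conn = glue.get_connection(Name="my-jdbc-connection")["Connection"]
subnet_id = conn["PhysicalConnectionRequirements"]["SubnetId"]

subnet = ec2.describe_subnets(SubnetIds=[subnet_id])["Subnets"][0]
free_ips = subnet["AvailableIpAddressCount"]

# Rule of thumb from above: one ENI (one private IP address) per worker.
workers_planned = 20  # e.g. a 20-DPU job on Glue 2.0/3.0
print(f"{subnet_id}: {free_ips} free IP addresses, job needs ~{workers_planned}")
if free_ips < workers_planned:
    print("Not enough free IP addresses: use a larger subnet or fewer workers.")
```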

Reference:

https://aws.amazon.com/premiumsupport/knowledge-center/glue-specified-subnet-free-addresses/

AWS
answered 2 years ago

Reason for failure: connection starvation on Amazon RDS. The RDS instance could not handle the huge demand for connections. Once this issue was fixed on the RDS end (I don't have the details), the Glue jobs ran fine.
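This isn't part of the answer above, but if RDS connection starvation is the suspect, one common mitigation is to cap how many parallel JDBC connections the Glue read opens, using the hashfield/hashpartitions read options. A minimal sketch; the database, table, and column names are placeholders:

```python
# Sketch: capping parallel JDBC connections opened by a Glue catalog read.
from awsglue.context import GlueContext
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())

dyf = glue_context.create_dynamic_frame.from_catalog(
    database="my_database",        # placeholder
    table_name="my_jdbc_table",    # placeholder
    additional_options={
        "hashfield": "id",         # column used to split the read into ranges
        "hashpartitions": "4",     # at most 4 concurrent JDBC connections
    },
)
```

With fewer hash partitions the read is slower, but it opens fewer simultaneous connections against the database.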

answered 2 years ago
