How to find root cause of SparkContext shutdown in AWS Glue job


AWS Glue jobs sometimes fail with the error "An error occurred while calling o84.getDynamicFrame. Job 0 cancelled because SparkContext was shut down caused by Failed to create any executor tasks". At other times, the jobs succeed. How do I find the root cause of the SparkContext shutdown?

Stacktrace:

py4j.protocol.Py4JJavaError: An error occurred while calling o84.getDynamicFrame.
: org.apache.spark.SparkException: Job 0 cancelled because SparkContext was shut down
    at org.apache.spark.scheduler.DAGScheduler.$anonfun$cleanUpAfterSchedulerStop$1(DAGScheduler.scala:1130)
    at org.apache.spark.scheduler.DAGScheduler.$anonfun$cleanUpAfterSchedulerStop$1$adapted(DAGScheduler.scala:1128)
    at scala.collection.mutable.HashSet.foreach(HashSet.scala:79)
    at org.apache.spark.scheduler.DAGScheduler.cleanUpAfterSchedulerStop(DAGScheduler.scala:1128)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onStop(DAGScheduler.scala:2703)
    at org.apache.spark.util.EventLoop.stop(EventLoop.scala:84)
    at org.apache.spark.scheduler.DAGScheduler.stop(DAGScheduler.scala:2603)
    at org.apache.spark.SparkContext.$anonfun$stop$12(SparkContext.scala:2111)
    at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1419)
    at org.apache.spark.SparkContext.stop(SparkContext.scala:2111)
    at org.apache.spark.SparkContext.$anonfun$new$39(SparkContext.scala:681)
    at org.apache.spark.util.SparkShutdownHook.run(ShutdownHookManager.scala:214)
    at org.apache.spark.util.SparkShutdownHookManager.$anonfun$runAll$2(ShutdownHookManager.scala:188)
    at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
    at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1996)
    at org.apache.spark.util.SparkShutdownHookManager.$anonfun$runAll$1(ShutdownHookManager.scala:188)
    at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
    at scala.util.Try$.apply(Try.scala:213)
    at org.apache.spark.util.SparkShutdownHookManager.runAll(ShutdownHookManager.scala:188)
    at org.apache.spark.util.SparkShutdownHookManager$$anon$2.run(ShutdownHookManager.scala:178)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:750)
    at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:914)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2238)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2259)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2278)
    at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:477)
    at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:430)
    at org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:47)
    at org.apache.spark.sql.Dataset.collectFromPlan(Dataset.scala:3733)
    at org.apache.spark.sql.Dataset.$anonfun$head$1(Dataset.scala:2762)
    at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3724)
    at org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:107)
    at org.apache.spark.sql.execution.SQLExecution$.withTracker(SQLExecution.scala:232)
    at org.apache.spark.sql.execution.SQLExecution$.executeQuery$1(SQLExecution.scala:110)
    at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$6(SQLExecution.scala:135)
    at org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:107)
    at org.apache.spark.sql.execution.SQLExecution$.withTracker(SQLExecution.scala:232)
    at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:135)
    at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:253)
    at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:134)
    at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:772)
    at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:68)
    at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3722)
    at org.apache.spark.sql.Dataset.head(Dataset.scala:2762)
    at org.apache.spark.sql.Dataset.take(Dataset.scala:2969)
    at com.amazonaws.services.glue.JDBCDataSource.getLastRow(DataSource.scala:1089)
    at com.amazonaws.services.glue.JDBCDataSource.getJdbcJobBookmark(DataSource.scala:929)
    at com.amazonaws.services.glue.JDBCDataSource.getDynamicFrame(DataSource.scala:953)
    at com.amazonaws.services.glue.DataSource.getDynamicFrame(DataSource.scala:99)
    at com.amazonaws.services.glue.DataSource.getDynamicFrame$(DataSource.scala:99)
    at com.amazonaws.services.glue.SparkSQLDataSource.getDynamicFrame(DataSource.scala:714)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:282)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:238)
    at java.lang.Thread.run(Thread.java:750)

Asked 2 years ago · 10,879 views
2 answers

Hello,

You get these errors when there aren't enough IP addresses available for the AWS Glue job. Here are two common reasons why this can happen:

  1. When you run a job with a connection (which contains your VPC and subnet), AWS Glue sets up elastic network interfaces that allow your job to connect securely to other resources in the VPC. Each elastic network interface gets a private IP address.

For example, if you're running a job with 20 DPUs, you can calculate the number of IP addresses as follows:

With AWS Glue 2.0/3.0: 20 DPUs = 19 workers (executors) + 1 master (driver) = 20 IP addresses

  2. Multiple AWS services are using the same subnet. These services might be consuming many of the subnet's available IP addresses.

Make sure that enough IP addresses are available when you run a Glue job with a connection. To mitigate this kind of issue, use a subnet with a larger number of free IP addresses.
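The arithmetic above can be sketched as a small pre-flight check. This is an illustrative helper, not part of the Glue API: the function names are made up, and the `ec2_client` parameter is any object with a boto3-style `describe_subnets` method.

```python
def required_ip_addresses(dpus: int) -> int:
    """Per the rule of thumb above for AWS Glue 2.0/3.0 standard workers:
    (dpus - 1) executors + 1 driver, so 20 DPUs -> 20 private IPs."""
    return dpus

def subnet_has_capacity(ec2_client, subnet_id: str, dpus: int) -> bool:
    """Compare the job's IP requirement against the subnet's free addresses.

    `ec2_client` can be a real boto3 EC2 client; any object exposing the
    same describe_subnets(SubnetIds=[...]) signature works, which makes
    this easy to test without AWS credentials.
    """
    resp = ec2_client.describe_subnets(SubnetIds=[subnet_id])
    free = resp["Subnets"][0]["AvailableIpAddressCount"]
    return free >= required_ip_addresses(dpus)
```

With real credentials you would pass `boto3.client("ec2")` and your actual subnet ID; `AvailableIpAddressCount` is the field the knowledge-center article below tells you to check.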

Reference:

https://aws.amazon.com/premiumsupport/knowledge-center/glue-specified-subnet-free-addresses/

AWS
answered 2 years ago
EXPERT
verified a month ago

Reason for failure: connection starvation on Amazon RDS. The RDS instance could not handle the large demand for connections. Once this issue was fixed on the RDS side (I don't have the details), the Glue jobs ran fine.
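One way to sanity-check this diagnosis is a back-of-the-envelope estimate: in the worst case each concurrently running task can hold its own JDBC connection, so the peak is roughly executors × cores per executor × concurrently running jobs, compared against the database's `max_connections` setting. The helper below is a rough illustration under that assumption, not an exact model of Spark's or Glue's connection handling:

```python
def peak_jdbc_connections(num_executors: int,
                          cores_per_executor: int,
                          concurrent_jobs: int = 1) -> int:
    """Worst-case estimate: every task slot in every concurrently
    running Glue job holds one JDBC connection to the database."""
    return num_executors * cores_per_executor * concurrent_jobs

# e.g. three 20-DPU jobs (19 executors each, 4 cores per executor)
# running at once could demand up to 19 * 4 * 3 = 228 connections;
# compare that against the RDS instance's max_connections parameter.
```

If the estimate approaches or exceeds `max_connections`, staggering job schedules or reducing read parallelism is a reasonable mitigation to try.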

answered 2 years ago
