How to find root cause of SparkContext shutdown in AWS Glue job


My AWS Glue jobs sometimes fail with the error "An error occurred while calling o84.getDynamicFrame. Job 0 cancelled because SparkContext was shut down caused by Failed to create any executor tasks". At other times, the jobs succeed. How do I find the root cause of the SparkContext failure?

Stacktrace:

py4j.protocol.Py4JJavaError: An error occurred while calling o84.getDynamicFrame.
: org.apache.spark.SparkException: Job 0 cancelled because SparkContext was shut down
    at org.apache.spark.scheduler.DAGScheduler.$anonfun$cleanUpAfterSchedulerStop$1(DAGScheduler.scala:1130)
    at org.apache.spark.scheduler.DAGScheduler.$anonfun$cleanUpAfterSchedulerStop$1$adapted(DAGScheduler.scala:1128)
    at scala.collection.mutable.HashSet.foreach(HashSet.scala:79)
    at org.apache.spark.scheduler.DAGScheduler.cleanUpAfterSchedulerStop(DAGScheduler.scala:1128)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onStop(DAGScheduler.scala:2703)
    at org.apache.spark.util.EventLoop.stop(EventLoop.scala:84)
    at org.apache.spark.scheduler.DAGScheduler.stop(DAGScheduler.scala:2603)
    at org.apache.spark.SparkContext.$anonfun$stop$12(SparkContext.scala:2111)
    at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1419)
    at org.apache.spark.SparkContext.stop(SparkContext.scala:2111)
    at org.apache.spark.SparkContext.$anonfun$new$39(SparkContext.scala:681)
    at org.apache.spark.util.SparkShutdownHook.run(ShutdownHookManager.scala:214)
    at org.apache.spark.util.SparkShutdownHookManager.$anonfun$runAll$2(ShutdownHookManager.scala:188)
    at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
    at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1996)
    at org.apache.spark.util.SparkShutdownHookManager.$anonfun$runAll$1(ShutdownHookManager.scala:188)
    at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
    at scala.util.Try$.apply(Try.scala:213)
    at org.apache.spark.util.SparkShutdownHookManager.runAll(ShutdownHookManager.scala:188)
    at org.apache.spark.util.SparkShutdownHookManager$$anon$2.run(ShutdownHookManager.scala:178)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:750)
    at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:914)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2238)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2259)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2278)
    at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:477)
    at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:430)
    at org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:47)
    at org.apache.spark.sql.Dataset.collectFromPlan(Dataset.scala:3733)
    at org.apache.spark.sql.Dataset.$anonfun$head$1(Dataset.scala:2762)
    at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3724)
    at org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:107)
    at org.apache.spark.sql.execution.SQLExecution$.withTracker(SQLExecution.scala:232)
    at org.apache.spark.sql.execution.SQLExecution$.executeQuery$1(SQLExecution.scala:110)
    at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$6(SQLExecution.scala:135)
    at org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:107)
    at org.apache.spark.sql.execution.SQLExecution$.withTracker(SQLExecution.scala:232)
    at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:135)
    at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:253)
    at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:134)
    at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:772)
    at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:68)
    at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3722)
    at org.apache.spark.sql.Dataset.head(Dataset.scala:2762)
    at org.apache.spark.sql.Dataset.take(Dataset.scala:2969)
    at com.amazonaws.services.glue.JDBCDataSource.getLastRow(DataSource.scala:1089)
    at com.amazonaws.services.glue.JDBCDataSource.getJdbcJobBookmark(DataSource.scala:929)
    at com.amazonaws.services.glue.JDBCDataSource.getDynamicFrame(DataSource.scala:953)
    at com.amazonaws.services.glue.DataSource.getDynamicFrame(DataSource.scala:99)
    at com.amazonaws.services.glue.DataSource.getDynamicFrame$(DataSource.scala:99)
    at com.amazonaws.services.glue.SparkSQLDataSource.getDynamicFrame(DataSource.scala:714)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:282)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:238)
    at java.lang.Thread.run(Thread.java:750)

asked 2 years ago · 10879 views
2 Answers

Hello,

You get these errors when there aren't enough IP addresses available for the AWS Glue job. Here are two common reasons why these errors might happen:

  1. When you run a job with a connection (which contains your VPC and subnet), AWS Glue sets up elastic network interfaces that allow your job to connect securely to other resources in the VPC. Each elastic network interface gets a private IP address.

For example, if you're running a job with 20 DPUs, you can calculate the number of required IP addresses as follows:

With AWS Glue 2.0/3.0: 20 DPUs = 19 workers (executors) + 1 master (driver) = 20 IP addresses

  2. Multiple AWS services are using the same subnet. These services might be using many of the subnet's available IP addresses.

Make sure that enough IP addresses are available when you run a Glue job with a connection. To mitigate this kind of issue, use a subnet with a larger number of free IP addresses. One way to check the free-address count is sketched below.
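To confirm whether the subnet is the bottleneck, here is a minimal sketch using boto3 to read a subnet's available-address count (the region and subnet ID are placeholders, not values from this question):

import boto3

# Placeholders: substitute your own region and the subnet ID from the Glue connection.
ec2 = boto3.client("ec2", region_name="us-east-1")

resp = ec2.describe_subnets(SubnetIds=["subnet-0123456789abcdef0"])
for subnet in resp["Subnets"]:
    # A 20-DPU Glue 2.0/3.0 job needs about 20 free addresses (1 driver + 19 executors).
    print(subnet["SubnetId"], "free IPs:", subnet["AvailableIpAddressCount"])

If the free count hovers near or below the number of workers the job requests, failures with "Failed to create any executor tasks" can be intermittent, which matches the behavior described in the question.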

Reference:

https://aws.amazon.com/premiumsupport/knowledge-center/glue-specified-subnet-free-addresses/

AWS
answered 2 years ago
EXPERT
verified a month ago

Reason for failure in our case: connection starvation on Amazon RDS. RDS could not handle the huge demand for connections. Once this issue was fixed on the RDS end (I don't have the details), the Glue jobs ran fine.
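If you suspect the same pattern, one way to verify it is to look at the RDS DatabaseConnections CloudWatch metric around the time the Glue job ran. A minimal sketch with boto3 (the region and DB instance identifier are placeholders):

import boto3
from datetime import datetime, timedelta, timezone

# Placeholders: substitute your own region and DB instance identifier.
cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

end = datetime.now(timezone.utc)
resp = cloudwatch.get_metric_statistics(
    Namespace="AWS/RDS",
    MetricName="DatabaseConnections",
    Dimensions=[{"Name": "DBInstanceIdentifier", "Value": "my-db-instance"}],
    StartTime=end - timedelta(hours=3),
    EndTime=end,
    Period=300,
    Statistics=["Maximum"],
)

# A sustained plateau near the instance's max_connections limit suggests connection starvation.
for point in sorted(resp["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], point["Maximum"])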

answered 2 years ago
