Glue Error: java.lang.IllegalStateException: Trying to set metadata after initializing


I have a Glue job that transforms data from a Glue table, and it intermittently fails with the following error. It does not occur on every run of the job.

I have looked at some documentation, and the error seems to come from how Spark interacts with the underlying S3 file system. Any insights into why this occurs?
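
For context, the error surfaces on a plain count() of a DataFrame read from the Glue Data Catalog. Roughly, the failing step looks like this (the database and table names below are placeholders, not the real ones):

    # Minimal sketch of the failing step; names are placeholders.
    from pyspark.context import SparkContext
    from awsglue.context import GlueContext

    sc = SparkContext.getOrCreate()
    glue_context = GlueContext(sc)

    # Read the source table from the Glue Data Catalog and convert to a DataFrame.
    dyf = glue_context.create_dynamic_frame.from_catalog(
        database="my_database",        # placeholder
        table_name="my_source_table",  # placeholder
    )
    df = dyf.toDF()

    # The exception below is raised intermittently while Spark plans the file scan
    # for this action.
    row_count = df.count()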

An error occurred while calling o1196.count.
: java.lang.IllegalStateException: Trying to set metadata after initializing.
    at com.google.common.base.Preconditions.checkState(Preconditions.java:149)
    at org.apache.hadoop.fs.FileStatus.initializeMetadata(FileStatus.java:444)
    at org.apache.hadoop.fs.FileStatus.ensureMetadataInitialized(FileStatus.java:429)
    at org.apache.hadoop.fs.FileStatus.getMetadata(FileStatus.java:409)
    at org.apache.parquet.hadoop.util.DefaultETagExtractor.getETag(DefaultETagExtractor.java:27)
    at org.apache.spark.sql.execution.PartitionedFileUtil$.$anonfun$splitFiles$1(PartitionedFileUtil.scala:50)
    at org.apache.spark.sql.execution.PartitionedFileUtil$.$anonfun$splitFiles$1$adapted(PartitionedFileUtil.scala:46)
    at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:233)
    at scala.collection.immutable.NumericRange.foreach(NumericRange.scala:70)
    at scala.collection.TraversableLike.map(TraversableLike.scala:233)
    at scala.collection.TraversableLike.map$(TraversableLike.scala:226)
    at scala.collection.AbstractTraversable.map(Traversable.scala:104)
    at org.apache.spark.sql.execution.PartitionedFileUtil$.splitFiles(PartitionedFileUtil.scala:46)
    at org.apache.spark.sql.execution.FileSourceScanExec.$anonfun$createNonBucketedReadRDD$2(DataSourceScanExec.scala:708)
    at scala.collection.TraversableLike.$anonfun$flatMap$1(TraversableLike.scala:240)
    at scala.collection.immutable.List.foreach(List.scala:388)
    at scala.collection.TraversableLike.flatMap(TraversableLike.scala:240)
    at scala.collection.TraversableLike.flatMap$(TraversableLike.scala:237)
    at scala.collection.immutable.List.flatMap(List.scala:351)
    at org.apache.spark.sql.execution.FileSourceScanExec.$anonfun$createNonBucketedReadRDD$1(DataSourceScanExec.scala:697)
    at scala.collection.TraversableLike.$anonfun$flatMap$1(TraversableLike.scala:240)
    at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:32)
    at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:29)
    at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:194)
    at scala.collection.TraversableLike.flatMap(TraversableLike.scala:240)
    at scala.collection.TraversableLike.flatMap$(TraversableLike.scala:237)
    at scala.collection.mutable.ArrayOps$ofRef.flatMap(ArrayOps.scala:194)
    at org.apache.spark.sql.execution.FileSourceScanExec.createNonBucketedReadRDD(DataSourceScanExec.scala:696)
    at org.apache.spark.sql.execution.FileSourceScanExec.inputRDD$lzycompute(DataSourceScanExec.scala:469)
    at org.apache.spark.sql.execution.FileSourceScanExec.inputRDD(DataSourceScanExec.scala:408)
    at org.apache.spark.sql.execution.FileSourceScanExec.doExecuteColumnar(DataSourceScanExec.scala:600)
    at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeColumnar$1(SparkPlan.scala:212)
    at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:223)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:220)
    at org.apache.spark.sql.execution.SparkPlan.executeColumnar(SparkPlan.scala:208)
    at org.apache.spark.sql.execution.InputAdapter.doExecuteColumnar(WholeStageCodegenExec.scala:519)
    at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeColumnar$1(SparkPlan.scala:212)
    at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:223)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:220)
    at org.apache.spark.sql.execution.SparkPlan.executeColumnar(SparkPlan.scala:208)
    at org.apache.spark.sql.execution.ColumnarToRowExec.inputRDDs(Columnar.scala:202)
    at org.apache.spark.sql.execution.FilterExec.inputRDDs(basicPhysicalOperators.scala:149)
    at org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:746)
    at org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:185)
    at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:223)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:220)
    at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:181)
    at org.apache.spark.sql.execution.SparkPlan.getByteArrayRdd(SparkPlan.scala:326)
    at org.apache.spark.sql.execution.SparkPlan.executeCollectIterator(SparkPlan.scala:402)
    at org.apache.spark.sql.execution.exchange.BroadcastExchangeExec.org$apache$spark$sql$execution$exchange$BroadcastExchangeExec$$doComputeRelation(BroadcastExchangeExec.scala:178)
    at org.apache.spark.sql.execution.exchange.BroadcastExchangeExec$$anon$1.doCompute(BroadcastExchangeExec.scala:171)
    at org.apache.spark.sql.execution.exchange.BroadcastExchangeExec$$anon$1.doCompute(BroadcastExchangeExec.scala:167)
    at org.apache.spark.sql.execution.AsyncDriverOperation.$anonfun$compute$1(AsyncDriverOperation.scala:73)
    at org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:107)
    at org.apache.spark.sql.execution.SQLExecution$.withTracker(SQLExecution.scala:232)
    at org.apache.spark.sql.execution.SQLExecution$.withTracker(SQLExecution.scala:224)
    at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withExecutionId$1(SQLExecution.scala:207)
    at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:253)
    at org.apache.spark.sql.execution.SQLExecution$.withExecutionId(SQLExecution.scala:204)
    at org.apache.spark.sql.execution.AsyncDriverOperation.compute(AsyncDriverOperation.scala:67)
    at org.apache.spark.sql.execution.AsyncDriverOperation.$anonfun$computeFuture$1(AsyncDriverOperation.scala:53)
    at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withThreadLocalCaptured$1(SQLExecution.scala:275)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:750)
    at org.apache.spark.sql.execution.adaptive.AdaptiveExecutor.checkNoFailures(AdaptiveExecutor.scala:147)
    at org.apache.spark.sql.execution.adaptive.AdaptiveExecutor.doRun(AdaptiveExecutor.scala:88)
    at org.apache.spark.sql.execution.adaptive.AdaptiveExecutor.tryRunningAndGetFuture(AdaptiveExecutor.scala:66)
    at org.apache.spark.sql.execution.adaptive.AdaptiveExecutor.execute(AdaptiveExecutor.scala:57)
    at org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.$anonfun$getFinalPhysicalPlan$1(AdaptiveSparkPlanExec.scala:184)
    at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:772)
    at org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.getFinalPhysicalPlan(AdaptiveSparkPlanExec.scala:183)
    at org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.executeCollect(AdaptiveSparkPlanExec.scala:404)
    at org.apache.spark.sql.Dataset.$anonfun$count$1(Dataset.scala:3046)
    at org.apache.spark.sql.Dataset.$anonfun$count$1$adapted(Dataset.scala:3045)
    at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3724)
    at org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:107)
    at org.apache.spark.sql.execution.SQLExecution$.withTracker(SQLExecution.scala:232)
    at org.apache.spark.sql.execution.SQLExecution$.executeQuery$1(SQLExecution.scala:110)
    at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$6(SQLExecution.scala:135)
    at org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:107)
    at org.apache.spark.sql.execution.SQLExecution$.withTracker(SQLExecution.scala:232)
    at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:135)
    at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:253)
    at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:134)
    at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:772)
    at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:68)
    at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3722)
    at org.apache.spark.sql.Dataset.count(Dataset.scala:3045)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:282)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:238)
    at java.lang.Thread.run(Thread.java:750)

YFL
asked 2 months ago
323 views
1 Answer

That looks like a race condition caused by concurrent access to the same FileStatus object, which is not thread-safe.
Assuming you are not doing anything unusual, it seems you have hit an obscure bug, and you will have to accept that the job needs a retry (it should fail early, so a failed attempt does not add much delay).
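
If you want to avoid re-running the whole job by hand, one option is to wrap the failing action in a small retry inside the PySpark script. This is only a sketch of that idea, not a Glue feature; the function name, attempt count, and backoff are arbitrary choices:

    import time

    def count_with_retry(df, attempts=3, backoff_seconds=10):
        # Retry the count because the failure is intermittent and fails early.
        last_error = None
        for attempt in range(1, attempts + 1):
            try:
                return df.count()
            except Exception as err:  # Py4J surfaces the Java IllegalStateException here
                last_error = err
                # Only retry the specific race described above; re-raise anything else.
                if "Trying to set metadata after initializing" not in str(err):
                    raise
                time.sleep(backoff_seconds * attempt)
        raise last_error

    # Usage, assuming df is the DataFrame whose count() was failing:
    # row_count = count_with_retry(df)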

AWS
EXPERT
answered 2 months ago
