Error encountered while try to get user data - java.lang.NullPointerException


I created a Glue job that tries to read a single Parquet file (5.2 GB) into an AWS Glue DynamicFrame:

datasource0 = glueContext.create_dynamic_frame.from_options(  
    connection_type="s3",  
    connection_options={"paths": ["s3://my-bucket-name/path"]},  
    format="parquet"  
)  
  
and then apply some transformations to datasource0 (a minimal sketch of the full script follows).
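
For reference, the rest of the script is essentially the boilerplate the Glue console generates; a minimal sketch (the downstream transformations are elided, and the S3 path is a placeholder):

import sys  
from pyspark.context import SparkContext  
from awsglue.context import GlueContext  
from awsglue.utils import getResolvedOptions  
from awsglue.job import Job  
  
\# Standard Glue job setup  
args = getResolvedOptions(sys.argv, ["JOB_NAME"])  
sc = SparkContext()  
glueContext = GlueContext(sc)  
spark = glueContext.spark_session  
job = Job(glueContext)  
job.init(args["JOB_NAME"], args)  
  
\# The read that fails  
datasource0 = glueContext.create_dynamic_frame.from_options(  
    connection_type="s3",  
    connection_options={"paths": ["s3://my-bucket-name/path"]},  
    format="parquet",  
)  
print(datasource0.count())  # count() just forces the scan; errors below appear during it  
  
job.commit()  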

Job info:

  • Spark 2.4, Python 3, Glue 2.0
  • Worker type G.2X - 8 vCPUs, 32 GB memory

Errors from CloudWatch:

[1] NullPointerException

2020-11-13 00:27:56,873 ERROR [readingParquetFooters-ForkJoinPool-1-worker-13] util.UserData (UserData.java:getUserData(70)): Error encountered while try to get user data  
java.lang.NullPointerException  
	at com.amazon.ws.emr.hadoop.fs.shaded.com.fasterxml.jackson.core.JsonFactory.createParser(JsonFactory.java:871)  
	at com.amazon.ws.emr.hadoop.fs.shaded.com.fasterxml.jackson.databind.ObjectMapper.readValue(ObjectMapper.java:2726)  
	at com.amazon.ws.emr.hadoop.fs.util.UserData.getUserData(UserData.java:66)  
	at com.amazon.ws.emr.hadoop.fs.util.UserData.<init>(UserData.java:39)  
	at com.amazon.ws.emr.hadoop.fs.util.UserData.ofDefaultResourceLocations(UserData.java:52)  
	at com.amazon.ws.emr.hadoop.fs.util.AWSSessionCredentialsProviderFactory.buildSTSClient(AWSSessionCredentialsProviderFactory.java:52)  
	at com.amazon.ws.emr.hadoop.fs.util.AWSSessionCredentialsProviderFactory.<clinit>(AWSSessionCredentialsProviderFactory.java:17)  
	at com.amazon.ws.emr.hadoop.fs.rolemapping.DefaultS3CredentialsResolver.resolve(DefaultS3CredentialsResolver.java:22)  
	at com.amazon.ws.emr.hadoop.fs.guice.CredentialsProviderOverrider.override(CredentialsProviderOverrider.java:25)  
	at com.amazon.ws.emr.hadoop.fs.s3.lite.executor.GlobalS3Executor.executeOverriders(GlobalS3Executor.java:171)  
	at com.amazon.ws.emr.hadoop.fs.s3.lite.executor.GlobalS3Executor.execute(GlobalS3Executor.java:103)  
	at com.amazon.ws.emr.hadoop.fs.s3.lite.AmazonS3LiteClient.invoke(AmazonS3LiteClient.java:189)  
	at com.amazon.ws.emr.hadoop.fs.s3.lite.AmazonS3LiteClient.invoke(AmazonS3LiteClient.java:184)  
	at com.amazon.ws.emr.hadoop.fs.s3.lite.AmazonS3LiteClient.getObjectMetadata(AmazonS3LiteClient.java:96)  
	at com.amazon.ws.emr.hadoop.fs.s3.lite.AbstractAmazonS3Lite.getObjectMetadata(AbstractAmazonS3Lite.java:43)  
	at com.amazon.ws.emr.hadoop.fs.s3n.Jets3tNativeFileSystemStore.retrieveMetadata(Jets3tNativeFileSystemStore.java:220)  
	at com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem.getFileStatus(S3NativeFileSystem.java:860)  
	at com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem.open(S3NativeFileSystem.java:1319)  
	at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:790)  
	at com.amazon.ws.emr.hadoop.fs.EmrFileSystem.open(EmrFileSystem.java:207)  
	at org.apache.parquet.hadoop.util.HadoopInputFile.newStream(HadoopInputFile.java:65)  
	at org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:498)  
	at org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:476)  
	at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$readParquetFootersInParallel$1.apply(ParquetFileFormat.scala:544)  
	at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$readParquetFootersInParallel$1.apply(ParquetFileFormat.scala:538)  
	at org.apache.spark.util.ThreadUtils$$anonfun$3$$anonfun$apply$1.apply(ThreadUtils.scala:287)  
	at scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)  
	at scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)  
	at scala.concurrent.impl.ExecutionContextImpl$AdaptedForkJoinTask.exec(ExecutionContextImpl.scala:121)  
	at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)  
	at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)  
	at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)  
	at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)  
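
Stack trace [1] shows the NPE is raised inside EMRFS's shaded credential-resolution code while the Parquet footers are being read in parallel. To check whether the failure is specific to the DynamicFrame reader, one thing I can try is reading the same path with the plain Spark DataFrame reader; a minimal sketch, using the Glue job's SparkSession:

\# Isolation test (sketch): bypass the DynamicFrame layer and read the  
\# same S3 path with the plain Spark Parquet reader.  
df = spark.read.parquet("s3://my-bucket-name/path")  
df.printSchema()  
print(df.count())  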

[2] IllegalArgumentException

2020-11-13 00:28:07,339 ERROR [Executor task launch worker for task 21] executor.Executor (Logging.scala:logError(91)): Exception in task 20.0 in stage 1.0 (TID 21)  
java.lang.IllegalArgumentException: Illegal Capacity: -168  
	at java.util.ArrayList.<init>(ArrayList.java:157)  
	at org.apache.parquet.hadoop.ParquetFileReader$ConsecutiveChunkList.readAll(ParquetFileReader.java:1163)  
	at org.apache.parquet.hadoop.ParquetFileReader.readNextRowGroup(ParquetFileReader.java:805)  
	at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.checkEndOfRowGroup(VectorizedParquetRecordReader.java:301)  
	at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:256)  
	at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextKeyValue(VectorizedParquetRecordReader.java:159)  
	at org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39)  
	at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:101)  
	at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:181)  
	at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:101)  
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.scan_nextBatch_0$(Unknown Source)  
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)  
	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)  
	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636)  
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)  
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)  
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)  
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)  
	at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:462)  
	at org.apache.spark.util.Utils$.getIteratorSize(Utils.scala:1817)  
	at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1168)  
	at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1168)  
	at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2101)  
	at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2101)  
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)  
	at org.apache.spark.scheduler.Task.run(Task.scala:121)  
	at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)  
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)  
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)  
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)  
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)  
	at java.lang.Thread.run(Thread.java:748)  
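
The IllegalArgumentException comes from ParquetFileReader sizing an ArrayList from chunk metadata in the file footer, so a negative capacity suggests the footer records a negative (corrupt, or possibly overflowed) chunk length. A sketch of how the row-group metadata could be inspected offline, assuming pyarrow is installed and a copy of the file has been downloaded locally (the file name is a placeholder):

\# Hypothetical offline check of the Parquet footer with pyarrow;  
\# "local_copy.parquet" stands in for a downloaded copy of the file.  
import pyarrow.parquet as pq  
  
md = pq.ParquetFile("local_copy.parquet").metadata  
print(md)  # overall file metadata: row groups, schema, created_by  
for i in range(md.num_row_groups):  
    rg = md.row_group(i)  
    print(f"row group {i}: rows={rg.num_rows}, bytes={rg.total_byte_size}")  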

Any insights? Thanks in advance!

1 Answer

Hi, is there any update on this issue?
