
S3 Select JSON Spark fails with mandatory key is not supported by FileSystemDataInputStreamBuilder


Hello, I am trying to read a JSON file on S3 using https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark-s3select.html, but I receive the following exception, which I am unable to troubleshoot:

org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 3) (ip-172-30-3-67.eu-west-1.compute.internal executor 1): java.lang.IllegalArgumentException: mandatory key is not supported by FileSystemDataInputStreamBuilder
        at com.google.common.base.Preconditions.checkArgument(Preconditions.java:92)
        at org.apache.hadoop.fs.FileSystem$FileSystemDataInputStreamBuilder.build(FileSystem.java:4392)
        at com.amazonaws.emr.s3select.spark.json.S3SelectJsonLineRecordReader.initialize(S3SelectJsonLineRecordReader.java:79)
        at com.amazonaws.emr.s3select.spark.json.JsonS3SelectFileLinesReader.<init>(JsonS3SelectFileLinesReader.scala:59)
        at com.amazonaws.emr.s3select.spark.json.JsonS3SelectDataSource.readFile(JsonS3SelectDataSource.scala:61)
        at com.amazonaws.emr.s3select.spark.json.JsonS3SelectFileFormat.$anonfun$buildReader$3(JsonS3SelectFileFormat.scala:114)
        at org.apache.spark.sql.execution.datasources.FileFormat$$anon$1.apply(FileFormat.scala:148)
        at org.apache.spark.sql.execution.datasources.FileFormat$$anon$1.apply(FileFormat.scala:133)
        at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:185)
        at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:240)
        at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:159)

When I use the s3selectCSV format with a CSV file, I don't have any issue. I am running EMR 6.5. Any suggestions? Thanks in advance.

Laurent

asked 3 years ago · 558 views
2 Answers

Hello Laurent,

I used the zipcodes.json file available at https://raw.githubusercontent.com/spark-examples/spark-scala-examples/master/src/main/resources/zipcodes.json (referenced from the documentation above) and the following code snippet on EMR release 6.5.0, but was unable to reproduce the issue.

============ <code snippet> ============

spark
  .read
  .format("s3selectJSON")
  .load("s3://<YourBucketLocation>/zipcodes.json")

To answer your question, we require details that are non-public information, for example a sample of your data, the code snippet you use to read the file, the EMR cluster ID, and the YARN application ID. Could you please open a support case with AWS using this link https://console.aws.amazon.com/support/home#/case/create so we can continue to assist you.

Best regards,

Eshendren_M.

AWS
answered 3 years ago

Hi

I tested the following on an EMR 6.5 cluster, and it works on my cluster:

I SSHed into the master node of the cluster.

Then I uploaded the zipcodes.json file available at https://raw.githubusercontent.com/spark-examples/spark-scala-examples/master/src/main/resources/zipcodes.json to my S3 bucket:

$ wget https://raw.githubusercontent.com/spark-examples/spark-scala-examples/master/src/main/resources/zipcodes.json 

$ file -i zipcodes.json 
zipcodes.json: text/plain; charset=us-ascii

$ aws s3 cp zipcodes.json s3://<path>/
upload: ./zipcodes.json to s3://<path>/zipcodes.json

$ spark-shell
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/lib/spark/jars/slf4j-log4j12-1.7.30.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/share/aws/emr/emrfs/lib/slf4j-log4j12-1.7.12.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/share/aws/redshift/jdbc/redshift-jdbc42-1.2.37.1061.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
22/12/13 10:56:51 WARN Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.
22/12/13 10:57:14 WARN YarnSchedulerBackend$YarnSchedulerEndpoint: Attempted to request executors before the AM has registered!
Spark context Web UI available at http://ip-172-31-23-27.ap-southeast-2.compute.internal:4040
Spark context available as 'sc' (master = yarn, app id = application_1670910839374_0002).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.1.2-amzn-1
      /_/
         
Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 1.8.0_352)
Type in expressions to have them evaluated.
Type :help for more information.

scala> spark.read.format("s3selectJSON").load("s3://<bucketpath>/zipcodes.json");
res0: org.apache.spark.sql.DataFrame = [City: string, Country: string ... 18 more fields]

scala> res0.select("City").count();
res1: Long = 21

Please check the JSON file for any parsing errors, and also verify that you are able to parse the file at all outside of S3 Select.
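One quick way to run that check locally: the Spark S3 Select JSON source reads newline-delimited JSON, so every line of the file should be valid JSON on its own. A minimal sketch (the sample file name and its contents are made up for illustration; `python3` on the PATH is assumed as the JSON validator):

```shell
# Create a small sample file with one deliberately broken line
# (missing comma) to show how offending lines are reported.
printf '%s\n' \
  '{"Zipcode":704,"City":"PARC PARQUE"}' \
  '{"Zipcode":704 "City":"missing comma"}' \
  '{"Zipcode":709,"City":"BDA SAN LUIS"}' > sample.json

# Validate each line independently; python3 -m json.tool exits
# non-zero on invalid JSON, so bad lines get flagged with their number.
n=0
while IFS= read -r line; do
  n=$((n + 1))
  printf '%s' "$line" | python3 -m json.tool > /dev/null 2>&1 \
    || echo "line $n is not valid JSON"
done < sample.json
# prints: line 2 is not valid JSON
```

Running the same loop over your real file (replace `sample.json`) should surface any line that S3 Select cannot parse.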

answered 3 years ago
