Hello Laurent,
I used the zipcodes.json file available here https://raw.githubusercontent.com/spark-examples/spark-scala-examples/master/src/main/resources/zipcodes.json (referenced above) with the following code snippet on EMR release 6.5.0, but was unable to reproduce the issue.
============ <code snippet> ============
spark
  .read
  .format("s3selectJSON")
  .load("s3://<YourBucketLocation>/zipcodes.json")
========================================
To answer your question, we require details that are non-public information: for example, a sample of your data, the code snippet you use to read the file, the EMR cluster ID, and the YARN application ID. Could you please open a support case with AWS using this link https://console.aws.amazon.com/support/home#/case/create so we can continue to assist you?
Best regards,
Eshendren_M.
Hi,
I tested the following on an EMR 6.5 cluster, and it works on my cluster:
I SSHed into the master instance of the cluster.
I uploaded the zipcodes.json file available here https://raw.githubusercontent.com/spark-examples/spark-scala-examples/master/src/main/resources/zipcodes.json to my S3 bucket:
$ wget https://raw.githubusercontent.com/spark-examples/spark-scala-examples/master/src/main/resources/zipcodes.json
$ file -i zipcodes.json
zipcodes.json: text/plain; charset=us-ascii
$ aws s3 cp zipcodes.json s3://<path>/
upload: ./zipcodes.json to s3://<path>/zipcodes.json
$ spark-shell
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/lib/spark/jars/slf4j-log4j12-1.7.30.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/share/aws/emr/emrfs/lib/slf4j-log4j12-1.7.12.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/share/aws/redshift/jdbc/redshift-jdbc42-1.2.37.1061.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
22/12/13 10:56:51 WARN Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.
22/12/13 10:57:14 WARN YarnSchedulerBackend$YarnSchedulerEndpoint: Attempted to request executors before the AM has registered!
Spark context Web UI available at http://ip-172-31-23-27.ap-southeast-2.compute.internal:4040
Spark context available as 'sc' (master = yarn, app id = application_1670910839374_0002).
Spark session available as 'spark'.
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/___/ .__/\_,_/_/ /_/\_\ version 3.1.2-amzn-1
/_/
Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 1.8.0_352)
Type in expressions to have them evaluated.
Type :help for more information.
scala> spark.read.format("s3selectJSON").load("s3://<bucketpath>/zipcodes.json");
res0: org.apache.spark.sql.DataFrame = [City: string, Country: string ... 18 more fields]
scala> res0.select("City").count();
res1: Long = 21
Please check the JSON file for any parsing errors, and verify that you are able to parse the file in general, outside of S3 Select.
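As a quick sanity check outside of Spark, you can verify that every line of the file is valid JSON on its own, since both S3 Select and Spark's default JSON source expect one JSON object per line. A minimal sketch in Python (the file path is a placeholder, not from the thread above):

```python
import json

def validate_json_lines(path):
    """Check that every non-empty line in the file parses as JSON.

    Returns a list of (line_number, error_message) tuples for lines
    that fail to parse; an empty list means the file looks valid.
    """
    errors = []
    with open(path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue  # skip blank lines
            try:
                json.loads(line)
            except json.JSONDecodeError as exc:
                errors.append((lineno, str(exc)))
    return errors

# Example (hypothetical local copy of the file):
# errors = validate_json_lines("zipcodes.json")
# if errors:
#     for lineno, msg in errors:
#         print(f"line {lineno}: {msg}")
```

If this reports errors on your own data but not on the sample zipcodes.json, the problem is likely the file's format rather than the EMR configuration.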