SparkSession and S3


Hello,

I'm facing an error while loading data from S3 using Spark. First, here is my code:

# Load packages and configuration options

import os
from pyspark import SparkContext
from pyspark.sql import SparkSession
# Packages and conf flags must come before the final 'pyspark-shell' token, one --conf per setting
os.environ['PYSPARK_SUBMIT_ARGS'] = (
    '--packages com.amazonaws:aws-java-sdk-pom:1.10.34,org.apache.hadoop:hadoop-aws:2.7.3,databricks:spark-deep-learning:1.5.0-spark2.4-s_2.11 '
    '--conf spark.driver.extraJavaOptions="-Dio.netty.tryReflectionSetAccessible=true" '
    '--conf spark.hadoop.fs.s3a.endpoint=s3.eu-west-1.amazonaws.com '
    'pyspark-shell'
)
# Creating SparkSession
sc = SparkContext.getOrCreate()
spark = SparkSession.builder \
    .appName('Image_P8') \
    .config('spark.driver.extraJavaOptions', '-Dio.netty.tryReflectionSetAccessible=true') \
    .config('spark.hadoop.fs.s3a.endpoint', 's3.eu-west-1.amazonaws.com') \
    .getOrCreate()
path = "s3a://ocr-fruits/Test/*"
image_df = spark.read.format("binaryFile") \
  .option("pathGlobFilter", "*.jpg") \
  .option("recursiveFileLookup", "true") \
  .load(path)

image_df.show(2)
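
In case it is useful, here is a small check I can run right after creating the session to see the effective S3A settings. It goes through the JVM Hadoop configuration via the internal _jsc handle, so it is not a public API, but it shows the values the S3A filesystem will actually parse:

# Sanity check (uses the internal _jsc handle, not a public API): print the
# effective S3A settings the session resolves, including the multipart size.
hadoop_conf = spark.sparkContext._jsc.hadoopConfiguration()
for key in ("fs.s3a.endpoint", "fs.s3a.multipart.size", "fs.s3a.impl"):
    print(key, "=", hadoop_conf.get(key))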

However, when I try to load image files using the spark.read.format("binaryFile") method, a NumberFormatException is thrown with the message "For input string: '64M'". Here is the corresponding error:

NumberFormatException                     Traceback (most recent call last)
Cell In[65], line 4
      1 image_df = spark.read.format("binaryFile") \
      2   .option("pathGlobFilter", "*.jpg") \
      3   .option("recursiveFileLookup", "true") \
----> 4   .load(path)
      6 image_df.show(2)

File C:\apps\opt\spark-3.4.0-bin-hadoop3\python\pyspark\sql\readwriter.py:300, in DataFrameReader.load(self, path, format, schema, **options)
    298 self.options(**options)
    299 if isinstance(path, str):
--> 300     return self._df(self._jreader.load(path))
    301 elif path is not None:
    302     if type(path) != list:

File C:\apps\opt\spark-3.4.0-bin-hadoop3\python\lib\py4j-0.10.9.7-src.zip\py4j\java_gateway.py:1322, in JavaMember.__call__(self, *args)
   1316 command = proto.CALL_COMMAND_NAME +\
   1317     self.command_header +\
   1318     args_command +\
   1319     proto.END_COMMAND_PART
   1321 answer = self.gateway_client.send_command(command)
-> 1322 return_value = get_return_value(
   1323     answer, self.gateway_client, self.target_id, self.name)
   1325 for temp_arg in temp_args:
   1326     if hasattr(temp_arg, "_detach"):

File C:\apps\opt\spark-3.4.0-bin-hadoop3\python\pyspark\errors\exceptions\captured.py:175, in capture_sql_exception.<locals>.deco(*a, **kw)
    171 converted = convert_exception(e.java_exception)
    172 if not isinstance(converted, UnknownException):
    173     # Hide where the exception came from that shows a non-Pythonic
    174     # JVM exception message.
--> 175     raise converted from None
    176 else:
    177     raise

NumberFormatException: For input string: "64M"
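
For what it is worth, my suspicion is that "64M" is the default value of fs.s3a.multipart.size that ships with the Hadoop 3 build bundled in spark-3.4.0-bin-hadoop3, and that the hadoop-aws 2.7.3 jar pulled in through --packages cannot parse a size given with a unit suffix. Below is a minimal sketch of the same session with that property overridden by a plain byte count; this is untested on my side and may well not be the right fix:

# Possible workaround sketch (untested): set the multipart size in plain bytes
# (here 100 MB), since older hadoop-aws releases expect a numeric value
# rather than a string like "64M".
spark = SparkSession.builder \
    .appName('Image_P8') \
    .config('spark.hadoop.fs.s3a.endpoint', 's3.eu-west-1.amazonaws.com') \
    .config('spark.hadoop.fs.s3a.multipart.size', '104857600') \
    .getOrCreate()

Alternatively, I wonder whether the cleaner route is to pull in a hadoop-aws version that matches the Hadoop 3.3.x libraries bundled with Spark 3.4.0 rather than 2.7.3, but I have not confirmed which version that would be.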

I would greatly appreciate it if someone could help me understand the cause of this exception and find a way to load image files from S3 using Spark.

Thank you in advance for your help!
