Hello,
I'm getting an error while loading data from S3 with Spark. First, here is my code:
# Load packages and set configuration options
import os
from pyspark import SparkContext
from pyspark.sql import SparkSession
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.amazonaws:aws-java-sdk-pom:1.10.34,org.apache.hadoop:hadoop-aws:2.7.3,databricks:spark-deep-learning:1.5.0-spark2.4-s_2.11 pyspark-shell --conf spark.driver.extraJavaOptions="-Dio.netty.tryReflectionSetAccessible=true",spark.hadoop.fs.s3a.endpoint=s3.eu-west-1.amazonaws.com'
# Creating SparkSession
sc = SparkContext.getOrCreate()
spark = SparkSession.builder \
.appName('Image_P8') \
.config('spark.driver.extraJavaOptions', '-Dio.netty.tryReflectionSetAccessible=true') \
.config('spark.hadoop.fs.s3a.endpoint', 's3.eu-west-1.amazonaws.com') \
.getOrCreate()
path = "s3a://ocr-fruits/Test/*"
image_df = spark.read.format("binaryFile") \
.option("pathGlobFilter", "*.jpg") \
.option("recursiveFileLookup", "true") \
.load(path)
image_df.show(2)
However, when I try to load the image files with spark.read.format("binaryFile"), a NumberFormatException is thrown with the message: For input string: "64M".
Here is the corresponding error:
NumberFormatException Traceback (most recent call last)
Cell In[65], line 4
1 image_df = spark.read.format("binaryFile") \
2 .option("pathGlobFilter", "*.jpg") \
3 .option("recursiveFileLookup", "true") \
----> 4 .load(path)
6 image_df.show(2)
File C:\apps\opt\spark-3.4.0-bin-hadoop3\python\pyspark\sql\readwriter.py:300, in DataFrameReader.load(self, path, format, schema, **options)
298 self.options(**options)
299 if isinstance(path, str):
--> 300 return self._df(self._jreader.load(path))
301 elif path is not None:
302 if type(path) != list:
File C:\apps\opt\spark-3.4.0-bin-hadoop3\python\lib\py4j-0.10.9.7-src.zip\py4j\java_gateway.py:1322, in JavaMember.__call__(self, *args)
1316 command = proto.CALL_COMMAND_NAME +\
1317 self.command_header +\
1318 args_command +\
1319 proto.END_COMMAND_PART
1321 answer = self.gateway_client.send_command(command)
-> 1322 return_value = get_return_value(
1323 answer, self.gateway_client, self.target_id, self.name)
1325 for temp_arg in temp_args:
1326 if hasattr(temp_arg, "_detach"):
File C:\apps\opt\spark-3.4.0-bin-hadoop3\python\pyspark\errors\exceptions\captured.py:175, in capture_sql_exception.<locals>.deco(*a, **kw)
171 converted = convert_exception(e.java_exception)
172 if not isinstance(converted, UnknownException):
173 # Hide where the exception came from that shows a non-Pythonic
174 # JVM exception message.
--> 175 raise converted from None
176 else:
177 raise
NumberFormatException: For input string: "64M"
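One thing I noticed while putting this together, which may or may not be the cause: the paths in the traceback show a Spark build named spark-3.4.0-bin-hadoop3, while my --packages line requests hadoop-aws:2.7.3. My unconfirmed guess is that a Hadoop 3 configuration default written with a size suffix such as "64M" is being parsed as a plain number by the older hadoop-aws code. The sketch below only compares the two version strings taken from this post (the submit arguments are shortened); it does not touch Spark or S3, and the helper names are made up for illustration.

```python
import re

# Both strings come from the post above (submit args shortened).
# The helper names below are hypothetical, for illustration only.
SUBMIT_ARGS = (
    "--packages com.amazonaws:aws-java-sdk-pom:1.10.34,"
    "org.apache.hadoop:hadoop-aws:2.7.3 pyspark-shell"
)
SPARK_HOME = r"C:\apps\opt\spark-3.4.0-bin-hadoop3"


def hadoop_aws_major(submit_args):
    """Return the major version of hadoop-aws requested via --packages, or None."""
    match = re.search(r"org\.apache\.hadoop:hadoop-aws:(\d+)\.", submit_args)
    return int(match.group(1)) if match else None


def bundled_hadoop_major(spark_home):
    """Return the Hadoop major version encoded in the Spark directory name, or None."""
    match = re.search(r"bin-hadoop(\d+)", spark_home)
    return int(match.group(1)) if match else None


requested = hadoop_aws_major(SUBMIT_ARGS)
bundled = bundled_hadoop_major(SPARK_HOME)

# A mismatch is only a hint that the connector and the bundled Hadoop may disagree.
mismatch = requested is not None and bundled is not None and requested != bundled
print(f"hadoop-aws major: {requested}, bundled Hadoop major: {bundled}, mismatch: {mismatch}")
```

If this mismatch is indeed the problem, I assume the fix would be to request a hadoop-aws version that matches the Hadoop libraries shipped with my Spark build, but I have not verified this.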
I would greatly appreciate it if someone could help me understand the cause of this exception and find a way to load image files from S3 with Spark.
Thank you in advance for your help!