SparkSession and S3


Hello,

I'm facing an error while loading data from S3 using Spark. First, here is my code:

# Load packages and set configuration options

import os
from pyspark.sql import SparkSession

# Each --conf takes a single key=value pair, and pyspark-shell must be the last token
os.environ['PYSPARK_SUBMIT_ARGS'] = (
    '--packages com.amazonaws:aws-java-sdk-pom:1.10.34,'
    'org.apache.hadoop:hadoop-aws:2.7.3,'
    'databricks:spark-deep-learning:1.5.0-spark2.4-s_2.11 '
    '--conf spark.driver.extraJavaOptions=-Dio.netty.tryReflectionSetAccessible=true '
    '--conf spark.hadoop.fs.s3a.endpoint=s3.eu-west-1.amazonaws.com '
    'pyspark-shell'
)
# Create the SparkSession (getOrCreate starts the context itself)
spark = SparkSession.builder \
    .appName('Image_P8') \
    .config('spark.driver.extraJavaOptions', '-Dio.netty.tryReflectionSetAccessible=true') \
    .config('spark.hadoop.fs.s3a.endpoint', 's3.eu-west-1.amazonaws.com') \
    .getOrCreate()
path = "s3a://ocr-fruits/Test/*"
image_df = spark.read.format("binaryFile") \
  .option("pathGlobFilter", "*.jpg") \
  .option("recursiveFileLookup", "true") \
  .load(path)

image_df.show(2)
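
For context, here is how I checked which Hadoop version this Spark build actually runs (a quick sanity probe; note that _jvm is a PySpark-internal gateway, so this is just a throwaway check):

# Sanity check: which Hadoop version is this Spark build running?
print(spark.sparkContext._jvm.org.apache.hadoop.util.VersionInfo.getVersion())

I expect this to report the Hadoop 3.3.x bundled with spark-3.4.0-bin-hadoop3, even though my --packages line pulls in hadoop-aws 2.7.3 — perhaps that mismatch matters?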

However, when I try to load the image files with spark.read.format("binaryFile"), a NumberFormatException is thrown with the message "For input string: '64M'". Here is the corresponding traceback:

NumberFormatException                     Traceback (most recent call last)
Cell In[65], line 4
      1 image_df = spark.read.format("binaryFile") \
      2   .option("pathGlobFilter", "*.jpg") \
      3   .option("recursiveFileLookup", "true") \
----> 4   .load(path)
      6 image_df.show(2)

File C:\apps\opt\spark-3.4.0-bin-hadoop3\python\pyspark\sql\readwriter.py:300, in DataFrameReader.load(self, path, format, schema, **options)
    298 self.options(**options)
    299 if isinstance(path, str):
--> 300     return self._df(self._jreader.load(path))
    301 elif path is not None:
    302     if type(path) != list:

File C:\apps\opt\spark-3.4.0-bin-hadoop3\python\lib\py4j-0.10.9.7-src.zip\py4j\java_gateway.py:1322, in JavaMember.__call__(self, *args)
   1316 command = proto.CALL_COMMAND_NAME +\
   1317     self.command_header +\
   1318     args_command +\
   1319     proto.END_COMMAND_PART
   1321 answer = self.gateway_client.send_command(command)
-> 1322 return_value = get_return_value(
   1323     answer, self.gateway_client, self.target_id, self.name)
   1325 for temp_arg in temp_args:
   1326     if hasattr(temp_arg, "_detach"):

File C:\apps\opt\spark-3.4.0-bin-hadoop3\python\pyspark\errors\exceptions\captured.py:175, in capture_sql_exception.<locals>.deco(*a, **kw)
    171 converted = convert_exception(e.java_exception)
    172 if not isinstance(converted, UnknownException):
    173     # Hide where the exception came from that shows a non-Pythonic
    174     # JVM exception message.
--> 175     raise converted from None
    176 else:
    177     raise

NumberFormatException: For input string: "64M"
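
For what it's worth, the failing string looks like the default value of fs.s3a.multipart.size: on Hadoop 3.x it is the human-readable "64M", which I am guessing the older hadoop-aws 2.7.3 code cannot parse (my assumption, not confirmed). This is how I looked it up:

# Print the effective S3A setting whose default matches the failing "64M"
hconf = spark.sparkContext._jsc.hadoopConfiguration()
print(hconf.get("fs.s3a.multipart.size"))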

I would greatly appreciate it if someone could help me understand the cause of this exception and find a way to load image files from S3 with Spark.
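
In case it helps, here is a workaround sketch I am considering, based on the guess above: spell the size-style S3A options as plain byte counts so the parser never sees a suffixed value (the exact set of keys that needs this is my assumption):

# Workaround sketch: plain byte counts instead of "64M"-style values,
# in case the older hadoop-aws parser chokes on the suffix (assumption).
# Run this in a fresh session; getOrCreate() will not reconfigure a live one.
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName('Image_P8') \
    .config('spark.hadoop.fs.s3a.endpoint', 's3.eu-west-1.amazonaws.com') \
    .config('spark.hadoop.fs.s3a.multipart.size', '67108864') \
    .config('spark.hadoop.fs.s3a.multipart.threshold', '134217728') \
    .getOrCreate()

Or would aligning hadoop-aws with the Hadoop 3.3.x that ships in Spark 3.4.0, instead of 2.7.3, be the cleaner fix?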

Thank you in advance for your help!
