
com.amazonaws.services.s3.model.AmazonS3Exception: max-keys must be between 1 and 1000


I'm getting this issue on Amazon EMR during a PySpark job execution.

df = spark.read.parquet("s3a://test/raw-billing-cor-data/cur2/123456789/cid-cur2/data/BILLING_PERIOD=2025-08/")

py4j.protocol.Py4JJavaError: An error occurred while calling o93.parquet.
: org.apache.hadoop.fs.s3a.AWSBadRequestException: listStatus on s3a://test/raw-billing-cor-data/cur2/123456789/cid-cur2/data/BILLING_PERIOD=2025-08/: com.amazonaws.services.s3.model.AmazonS3Exception: max-keys must be between 1 and 1000 (Service: Amazon S3; Status Code: 400; Error Code: InvalidArgument; Request ID: null; S3 Extended Request ID: null; Proxy: null), S3 Extended Request ID: null:InvalidArgument: max-keys must be between 1 and 1000 (Service: Amazon S3; Status Code: 400; Error Code: InvalidArgument; Request ID: null; S3 Extended Request ID: null; Proxy: null)
        at org.apache.hadoop.fs.s3a.S3AUtils.translateException(S3AUtils.java:249)

Spark version: 3.5.0. I tried setting the configs below:

    spark.conf.set("spark.hadoop.fs.s3a.max-keys", "1000")
    spark.conf.set("spark.hadoop.fs.s3a.list.version.max-keys", "1000")
    spark.conf.set("spark.hadoop.fs.s3a.paging.maximum", "1000")
    spark.conf.set("spark.hadoop.fs.s3a.listing.max-keys", "1000")  # This one might be missing
    spark.conf.set("spark.hadoop.fs.s3a.multipart.size", "67108864")  # 64MB chunks
    spark.conf.set("spark.hadoop.fs.s3a.block.size", "134217728")     # 128MB blocks
    # Print ALL S3A related configurations
    hadoop_conf = spark.sparkContext._jsc.hadoopConfiguration()
    # hadoop_conf.set("fs.s3a.max-keys", 1000)
    # hadoop_conf.set("fs.s3a.list.version.max-keys", "1000") 
    # hadoop_conf.set("fs.s3a.paging.maximum", "1000") 
    # hadoop_conf.set("fs.s3a.listing.max-keys", "1000")
    hadoop_conf.set("fs.s3a.list.version", "1") 
    hadoop_conf.unset("fs.s3a.paging.maximum")  # Completely remove this setting

Nothing seems to be working.

Asked 2 months ago, 36 views
1 Answer

This error indicates a problem with the "max-keys" parameter of the S3 list request issued when your PySpark job lists objects under the bucket prefix. S3 only accepts values between 1 and 1000 for this parameter, so the message suggests your Spark/S3A configuration is sending a value outside that range.
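One quick diagnostic (a sketch only, run on your existing session) is to print the S3A properties that feed into that list request: fs.s3a.paging.maximum is, to my knowledge, the property S3A maps to max-keys, and a non-default endpoint or proxy can enforce the 1-1000 limit strictly.

# Diagnostic sketch: inspect the effective S3A settings on the running session.
# Assumes "spark" is your existing SparkSession on the EMR cluster.
hadoop_conf = spark.sparkContext._jsc.hadoopConfiguration()
print("fs.s3a.paging.maximum =", hadoop_conf.get("fs.s3a.paging.maximum"))  # None means the Hadoop default is in use
print("fs.s3a.endpoint       =", hadoop_conf.get("fs.s3a.endpoint"))        # a custom endpoint/proxy may reject max-keys > 1000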

Based on your attempts, I notice a few potential issues:

  1. Configuration timing: You're setting the configurations after the Spark session has already been created. These configurations should be set when initializing the SparkSession.

  2. Configuration method: You're using a mix of spark.conf.set() and hadoop_conf.set() methods, which might be causing inconsistency.

  3. Configuration naming: The exact property name might be different depending on your Hadoop/S3A version.

Try the following approach:

# Set configurations when creating the SparkSession
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("YourApp") \
    .config("spark.hadoop.fs.s3a.connection.maximum", "100") \
    .config("spark.hadoop.fs.s3a.max.total.tasks", "100") \
    .config("spark.hadoop.fs.s3a.list.version", "1") \
    .config("spark.hadoop.fs.s3a.max-keys", "1000") \
    .config("spark.hadoop.fs.s3a.list.version.max-keys", "1000") \
    .config("spark.hadoop.fs.s3a.listing.max-keys", "1000") \
    .getOrCreate()

Additionally, check the following:

  1. Verify that the S3 bucket "test" exists and that your EMR cluster has the proper IAM permissions to access it.

  2. Ensure that the path structure is correct. The error can also occur if the specified path doesn't exist (a quick way to verify this outside Spark is sketched after this list).

  3. Check if there are any bucket policies or access control settings that might be interfering with your access.

  4. If you're using a VPC endpoint for S3, verify that it's correctly configured.
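To rule out the path and permission issues above (the check referenced in point 2), here is a minimal sketch, assuming boto3 is available on the EMR primary node and using the bucket and prefix from your example, that lists the same prefix directly with MaxKeys=1000:

# Minimal check outside Spark: list the same prefix with a plain S3 call.
# Assumes boto3 is installed and the instance profile allows s3:ListBucket on the bucket.
import boto3

s3 = boto3.client("s3")
resp = s3.list_objects_v2(
    Bucket="test",
    Prefix="raw-billing-cor-data/cur2/123456789/cid-cur2/data/BILLING_PERIOD=2025-08/",
    MaxKeys=1000,  # the same upper bound the error message refers to
)
print("KeyCount:", resp.get("KeyCount", 0))

If this call succeeds, the bucket, prefix, and IAM permissions are fine and the problem lies on the S3A configuration side.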

If the issue persists, you might want to try using the s3:// protocol instead of s3a:// as they use different implementations for accessing S3.
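For example, a sketch of the same read with the EMRFS scheme (only the URI scheme changes; EMRFS is EMR's own connector and handles listing differently from S3A):

# Same read via EMRFS (s3://) instead of S3A (s3a://) on EMR.
df = spark.read.parquet("s3://test/raw-billing-cor-data/cur2/123456789/cid-cur2/data/BILLING_PERIOD=2025-08/")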
Sources
Glue Error: error occurred while calling o228.pyWriteDynamicFrame. com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.model.AmazonS3Exception: The specified bucket does not exist | AWS re:Post
5.2 - Spark troubleshooting and performance tuning | AWS Open Data Analytics
Troubleshoot Amazon S3 errors from AWS SDK exceptions | AWS re:Post

Answered 2 months ago
  • Tried this; the configs are not working!

