
com.amazonaws.services.s3.model.AmazonS3Exception: max-keys must be between 1 and 1000


I'm getting this error on Amazon EMR during a PySpark job execution.

df = spark.read.parquet("s3a://test/raw-billing-cor-data/cur2/123456789/cid-cur2/data/BILLING_PERIOD=2025-08/")

py4j.protocol.Py4JJavaError: An error occurred while calling o93.parquet.
: org.apache.hadoop.fs.s3a.AWSBadRequestException: listStatus on s3a://test/raw-billing-cor-data/cur2/123456789/cid-cur2/data/BILLING_PERIOD=2025-08/: com.amazonaws.services.s3.model.AmazonS3Exception: max-keys must be between 1 and 1000 (Service: Amazon S3; Status Code: 400; Error Code: InvalidArgument; Request ID: null; S3 Extended Request ID: null; Proxy: null), S3 Extended Request ID: null:InvalidArgument: max-keys must be between 1 and 1000 (Service: Amazon S3; Status Code: 400; Error Code: InvalidArgument; Request ID: null; S3 Extended Request ID: null; Proxy: null)
        at org.apache.hadoop.fs.s3a.S3AUtils.translateException(S3AUtils.java:249)

Spark version: 3.5.0. I have tried setting the configs below:

    spark.conf.set("spark.hadoop.fs.s3a.max-keys", "1000")
    spark.conf.set("spark.hadoop.fs.s3a.list.version.max-keys", "1000")
    spark.conf.set("spark.hadoop.fs.s3a.paging.maximum", "1000")
    spark.conf.set("spark.hadoop.fs.s3a.listing.max-keys", "1000")  # This one might be missing
    spark.conf.set("spark.hadoop.fs.s3a.multipart.size", "67108864")  # 64MB chunks
    spark.conf.set("spark.hadoop.fs.s3a.block.size", "134217728")     # 128MB blocks
    # Print ALL S3A related configurations
    hadoop_conf = spark.sparkContext._jsc.hadoopConfiguration()
    # hadoop_conf.set("fs.s3a.max-keys", 1000)
    # hadoop_conf.set("fs.s3a.list.version.max-keys", "1000") 
    # hadoop_conf.set("fs.s3a.paging.maximum", "1000") 
    # hadoop_conf.set("fs.s3a.listing.max-keys", "1000")
    hadoop_conf.set("fs.s3a.list.version", "1") 
    hadoop_conf.unset("fs.s3a.paging.maximum")  # Completely remove this setting

Nothing seems to be working.

Posted 2 months ago · 43 views
1 Answer

This error indicates a problem with the max-keys parameter of the S3 list request issued when your PySpark job lists objects under that prefix. S3 only accepts max-keys values between 1 and 1000, so the listing call is being sent a value outside that range. In other words, it is a configuration problem in how your Spark application talks to S3 rather than a problem with the data itself.

Based on your attempts, I notice a few potential issues:

  1. Configuration timing: you're setting the properties after the Spark session has already been created. Hadoop properties set through spark.conf.set() at that point generally don't reach the S3A filesystem client, which is created and cached on first use. Set them when building the SparkSession.

  2. Configuration method: you're mixing spark.conf.set() and hadoop_conf.set(), which can leave the two configuration views inconsistent.

  3. Configuration naming: most of the property names you tried (fs.s3a.max-keys, fs.s3a.list.version.max-keys, fs.s3a.listing.max-keys) are not recognized S3A properties and are silently ignored. The listing page size, which becomes the max-keys value on the S3 list request, is controlled by fs.s3a.paging.maximum.

Try the following approach:

# Set S3A configurations when creating the SparkSession,
# before any s3a:// path is touched
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("YourApp")
    # fs.s3a.paging.maximum is the property that becomes max-keys on list requests;
    # keep it at or below 1000
    .config("spark.hadoop.fs.s3a.paging.maximum", "1000")
    .config("spark.hadoop.fs.s3a.connection.maximum", "100")
    .config("spark.hadoop.fs.s3a.max.total.tasks", "100")
    .config("spark.hadoop.fs.s3a.list.version", "1")
    .getOrCreate()
)
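
If the session has already been created (for example, in an EMR notebook), here is a minimal sketch of the alternative, assuming no s3a:// path for this bucket has been read yet so the S3A client has not been cached:

# Set the property on the live Hadoop configuration *before* the first s3a:// read;
# the S3A filesystem client is created lazily and then cached per bucket.
hadoop_conf = spark.sparkContext._jsc.hadoopConfiguration()
hadoop_conf.set("fs.s3a.paging.maximum", "1000")

df = spark.read.parquet(
    "s3a://test/raw-billing-cor-data/cur2/123456789/cid-cur2/data/BILLING_PERIOD=2025-08/"
)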

Additionally, check the following:

  1. Verify that the S3 bucket "test" exists and that your EMR cluster has the proper IAM permissions to access it.

  2. Ensure that the path structure is correct; the error might also occur if the specified path doesn't exist. The boto3 sketch after this list is one way to check both the permissions and the path outside Spark.

  3. Check if there are any bucket policies or access control settings that might be interfering with your access.

  4. If you're using a VPC endpoint for S3, verify that it's correctly configured.
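
To rule out items 1 and 2 independently of Spark and the S3A connector, a quick diagnostic sketch using boto3 (bucket name and prefix taken from the path in your question; adjust as needed):

import boto3

# List the same prefix directly against S3; if this fails, the problem is
# permissions, bucket name, or path rather than Spark/S3A configuration.
s3 = boto3.client("s3")
resp = s3.list_objects_v2(
    Bucket="test",
    Prefix="raw-billing-cor-data/cur2/123456789/cid-cur2/data/BILLING_PERIOD=2025-08/",
    MaxKeys=1000,  # the same 1-1000 bound the error message refers to
)
for obj in resp.get("Contents", []):
    print(obj["Key"], obj["Size"])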

If the issue persists, you might want to try the s3:// scheme instead of s3a://. On EMR the two use different implementations for accessing S3 (EMRFS for s3://, the open-source S3A connector for s3a://), so EMRFS is not affected by the fs.s3a.* settings.
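
For example, a minimal sketch reusing the path from your question:

# On EMR, s3:// is handled by EMRFS, which issues its own listing requests
# and ignores the fs.s3a.* properties entirely.
df = spark.read.parquet(
    "s3://test/raw-billing-cor-data/cur2/123456789/cid-cur2/data/BILLING_PERIOD=2025-08/"
)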
Sources
Glue Error: error occurred while calling o228.pyWriteDynamicFrame. com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.model.AmazonS3Exception: The specified bucket does not exist | AWS re:Post
5.2 - Spark troubleshooting and performance tuning | AWS Open Data Analytics
Troubleshoot Amazon S3 errors from AWS SDK exceptions | AWS re:Post

Answered 2 months ago
