Glue ETL job for Parquet data located on s3 bucket enabled with requester pays

I have an S3 bucket in one AWS account that contains data in Apache Parquet format. I need to give read access to many AWS accounts without hitting IAM policy size limits (10,240-character maximum), and I also need the consuming accounts to pay for the S3 usage.

I created an S3 access point and set the bucket policy to delegate access control to the access point. I then granted the other accounts access on the access point (GetObject, GetObjectVersion, and ListBucket, since the requirement is read-only access).

From a consuming account, I can create a Glue catalog and query the data from the Athena SQL console, using a workgroup with requester pays enabled. However, when I run a Glue ETL job (to do some more advanced processing), I get Access Denied on the S3 files that Glue tries to read through the access point. I followed the AWS Glue approach documented in the re:Post article https://repost.aws/knowledge-center/requester-pays-buckets-glue-emr-athena, but it does not work in my scenario of an access-point-based S3 bucket with requester pays enabled. The error that I get is:

An error occurred while calling o131.parquet. com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.model.AmazonS3Exception: Access Denied (Service: Amazon S3; Status Code: 403; Error Code: AccessDenied;

I do not have a cross-region scenario; both accounts are in the same region.

I also tried the spark.sql.catalog.{catalog_name}.s3.access-points.{bucketName}=arn:aws:s3:{region}:{account}:accesspoint/{s3accessPointName} setting, and it did not help. With that access point configuration, I get the following error instead:

An error occurred while calling o135.getDynamicFrame. No such file or directory.

Any idea how to make this work? Is this supported in Glue?

asked 3 months ago · 97 views
4 Answers
Accepted Answer

You’re absolutely correct that when a bucket’s policy delegates access control to an S3 Access Point, you generally shouldn’t need to attach explicit bucket-level permissions for the consuming accounts. In theory, the access point policy should act as the single source of truth. However, in practice, AWS Glue’s runtime behavior for requester pays access via access points isn’t fully consistent with that model yet.

When Glue launches an ETL job, it internally initializes multiple Spark executors that interact directly with S3 using the Hadoop S3A client. The challenge is that Glue’s underlying S3 client doesn’t always resolve the delegated access point policy correctly during the requester pays handshake. Instead, it still attempts to verify permissions on the underlying bucket resource ARN, even when the bucket policy explicitly delegates control to the access point. This is why you’re seeing the 403 AccessDenied response despite correct access point configuration.

Here are a few approaches that have worked in similar cross-account setups:

Add explicit bucket-level read permissions to the Glue job role even if delegation is configured. This doesn’t violate the access point model but compensates for the current behavior of the Glue runtime’s S3 client. Use a scoped statement limited to the specific bucket ARN and restrict to GetObject, GetObjectVersion, and ListBucket.
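As a sketch, such a scoped statement attached to the Glue job role might look like the following (the bucket name is a placeholder):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "GlueRoleBucketRead",
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:GetObjectVersion",
        "s3:ListBucket"
      ],
      "Resource": [
        "arn:aws:s3:::example-data-bucket",
        "arn:aws:s3:::example-data-bucket/*"
      ]
    }
  ]
}
```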

Confirm the Glue job role has permission to call s3:GetAccessPoint and s3:GetAccessPointPolicy. These permissions are sometimes overlooked but are required when Glue needs to resolve access point aliases during initialization.
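A minimal statement granting those two actions could look like this (region, account ID, and access point name are placeholders):

```json
{
  "Effect": "Allow",
  "Action": [
    "s3:GetAccessPoint",
    "s3:GetAccessPointPolicy"
  ],
  "Resource": "arn:aws:s3:us-east-1:111122223333:accesspoint/example-access-point"
}
```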

Verify the URI structure. The Hadoop client expects the access point alias or full ARN in the format:

s3://<access-point-name>-<account-id>.s3-accesspoint.<region>.amazonaws.com/<prefix>/

Using only s3://accesspoint/[accessPointName]/ can sometimes fail resolution inside Glue depending on the SDK version.

Ensure the requester pays header is applied globally. Set both:

spark._jsc.hadoopConfiguration().set("fs.s3.useRequesterPaysHeader", "true")
spark._jsc.hadoopConfiguration().set("fs.s3a.requester.pays", "true")

The second property is used by the newer S3A client, and both may be needed depending on the Glue version.
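Applied together in a Glue PySpark script, the two settings above can be sketched like this. The SparkSession itself is assumed to come from the usual Glue boilerplate; the property names are the ones discussed above:

```python
# Both requester-pays properties, collected so they can be applied in one loop:
# one for the legacy EMRFS-style client, one for the newer Hadoop S3A client.
REQUESTER_PAYS_PROPS = {
    "fs.s3.useRequesterPaysHeader": "true",  # legacy EMRFS-style S3 client
    "fs.s3a.requester.pays": "true",         # newer Hadoop S3A client
}

def apply_requester_pays(spark):
    """Apply both requester-pays properties to the session's Hadoop configuration."""
    conf = spark._jsc.hadoopConfiguration()
    for key, value in REQUESTER_PAYS_PROPS.items():
        conf.set(key, value)
```

Call apply_requester_pays(spark) before the first read against the requester pays bucket so that every executor request carries the x-amz-request-payer header.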

If you’re running Glue 4.0 or earlier, the internal SDK for S3 access points doesn’t fully handle cross-account requester pays buckets. Upgrading to Glue 5.0, which uses a newer AWS SDK for Java, improves consistency with access points and requester pays headers.

In the long term, AWS is expected to harmonize Glue’s S3 client behavior with the same model used in Athena and EMR, where the access point delegation is respected transparently. For now, the safest path is to explicitly grant read access on the bucket to the Glue job role, even if that seems redundant.

For reference, check:

Using S3 access points with AWS Glue : https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-connect-s3.html

Requester Pays buckets in Amazon S3 : https://docs.aws.amazon.com/AmazonS3/latest/userguide/RequesterPaysBuckets.html

AWS Glue job properties for Spark and Hadoop : https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-glue-arguments.html

It’s a nuanced setup, and your observation about access point delegation is absolutely right. This looks more like a client behavior gap than a permissions misconfiguration.

answered 3 months ago
AWS
EXPERT
reviewed 3 months ago

To enable a Glue ETL job to access Parquet data in a requester pays S3 bucket through an access point, you need to configure specific Spark properties in your Glue job.

For requester pays buckets, you need to add the following Spark property to your Glue job configuration:

"spark.hadoop.fs.s3.useRequesterPaysHeader": "true"

This property instructs the Glue job to include the necessary x-amz-request-payer header in S3 requests, which is required when accessing requester pays buckets.

When using S3 access points, you also need to ensure your IAM role has the proper permissions to access both the S3 bucket and the access point. The role used by your Glue job should have permissions for GetObject, GetObjectVersion, and ListBucket operations on both the bucket and the access point.

For accessing data through an access point, you should use the access point ARN or alias in your Glue job configuration. The format for accessing data through an access point would be:

s3://accesspoint/[accesspointname]/[optional-prefix]

If you're still experiencing issues, verify that:

  1. The IAM role used by your Glue job has the necessary permissions
  2. The requester pays header is properly configured
  3. The access point resource policy correctly allows the consuming account's Glue service role
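For the third point, an access point resource policy granting the consuming account's Glue role read access could be sketched as follows (all account IDs, role name, region, and access point name are placeholders; note that object-level actions use the access point's /object/* resource path):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowConsumerGlueRoleRead",
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::444455556666:role/consumer-glue-job-role"
      },
      "Action": [
        "s3:GetObject",
        "s3:GetObjectVersion",
        "s3:ListBucket"
      ],
      "Resource": [
        "arn:aws:s3:us-east-1:111122223333:accesspoint/example-access-point",
        "arn:aws:s3:us-east-1:111122223333:accesspoint/example-access-point/object/*"
      ]
    }
  ]
}
```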

In Athena for Spark, requester pays buckets are enabled per session, not per workgroup, which is different from how regular Athena SQL queries work. This might explain why your Athena queries work but Glue ETL jobs don't.
Sources
Configure DMS S3 Target Endpoint for Requester Pays Bucket | AWS re:Post
Enable requester pays Amazon S3 buckets in Athena for Spark - Amazon Athena

answered 3 months ago

The automated answer says this:

"When using S3 access points, you also need to ensure your IAM role has the proper permissions to access both the S3 bucket and the access point. The role used by your Glue job should have permissions for GetObject, GetObjectVersion, and ListBucket operations on both the bucket and the access point."

If the underlying S3 bucket delegates access control to the access point, which is my case, that should be sufficient, and I shouldn't have to grant the other accounts access directly on the S3 bucket. Could someone help clarify this? I have already used spark._jsc.hadoopConfiguration().set("fs.s3.useRequesterPaysHeader", "true") in the Glue job, as described in this re:Post article: https://repost.aws/knowledge-center/requester-pays-buckets-glue-emr-athena

answered 3 months ago

Thank you very much Hawke for the fast and thorough explanation. I now understand that the Glue client has to evolve to support access points comprehensively. A quick test of granting access on the underlying bucket succeeded. I have accepted your answer. Thanks.

answered 3 months ago
