How can I access Amazon S3 Requester Pays buckets from AWS Glue, Amazon EMR, or Amazon Athena?

I want to access an Amazon Simple Storage Service (Amazon S3) Requester Pays bucket from AWS Glue, Amazon EMR, or Amazon Athena.

Short description

To access S3 buckets that have Requester Pays turned on, all requests to the bucket must include the Requester Pays header (x-amz-request-payer: requester).
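For reference, the AWS SDKs expose this header as a RequestPayer parameter on S3 calls. The following is a minimal boto3 sketch of a direct S3 request (the bucket name and object key are placeholders); it shows the same thing the services below do for you once they're configured:

import boto3

s3 = boto3.client("s3")

# Requests to a Requester Pays bucket fail with AccessDenied (403) unless
# RequestPayer="requester" is set; it adds the x-amz-request-payer header.
response = s3.get_object(
    Bucket="awsdoc-example-bucket",          # placeholder bucket name
    Key="path-to-source-location/data.csv",  # placeholder object key
    RequestPayer="requester",
)
print(response["ContentLength"])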

Resolution

AWS Glue

AWS Glue requests to Amazon S3 don't include the Requester Pays header by default. Without this header, an API call to a Requester Pays bucket fails with an AccessDenied exception. To add the header in an ETL script, use hadoopConfiguration().set() to set fs.s3.useRequesterPaysHeader to true on either the GlueContext variable or the Apache Spark session variable.

GlueContext:

glueContext._jsc.hadoopConfiguration().set("fs.s3.useRequesterPaysHeader","true")

Spark session:

spark._jsc.hadoopConfiguration().set("fs.s3.useRequesterPaysHeader","true")

The following is an example of how to use the header in an ETL script. Replace the following values:

your_database_name: the name of your database
your_table_name: the name of your table
s3://awsdoc-example-bucket/path-to-source-location/: the path to the source bucket
s3://awsdoc-example-bucket/path-to-target-location/: the path to the destination bucket

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.dynamicframe import DynamicFrame

## @params: [JOB_NAME]
args = getResolvedOptions(sys.argv, ['JOB_NAME'])

sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

# Add the Requester Pays header to every Amazon S3 request that this job makes
spark._jsc.hadoopConfiguration().set("fs.s3.useRequesterPaysHeader","true")
# Or set the property on the GlueContext instead:
# glueContext._jsc.hadoopConfiguration().set("fs.s3.useRequesterPaysHeader","true")

## AWS Glue DynamicFrame read and write
datasource0 = glueContext.create_dynamic_frame.from_catalog(database = "your_database_name", table_name = "your_table_name", transformation_ctx = "datasource0")
datasource0.show()
datasink = glueContext.write_dynamic_frame.from_options(frame = datasource0, connection_type = "s3", connection_options = {"path":"s3://awsdoc-example-bucket/path-to-target-location/"}, format = "csv")

## Spark DataFrame read and write
df = spark.read.csv("s3://awsdoc-example-bucket/path-to-source-location/")
df.show()
df.write.csv("s3://awsdoc-example-bucket/path-to-target-location/")

job.commit()

Amazon EMR

Set the following property in /usr/share/aws/emr/emrfs/conf/emrfs-site.xml:

<property>
   <name>fs.s3.useRequesterPaysHeader</name>
   <value>true</value>
</property>
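You can also set the property at cluster launch through the emrfs-site configuration classification instead of editing the file on a running cluster. Here's a minimal boto3 sketch, assuming the rest of the cluster request is configured elsewhere:

import boto3

emr = boto3.client("emr")

# emrfs-site classification that adds the Requester Pays header to EMRFS requests
configurations = [
    {
        "Classification": "emrfs-site",
        "Properties": {"fs.s3.useRequesterPaysHeader": "true"},
    }
]

# Pass the classification when creating the cluster. Other required
# parameters, such as ReleaseLabel and Instances, are omitted here:
# emr.run_job_flow(Name="my-cluster", Configurations=configurations, ...)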

Athena

To allow workgroup members to query Requester Pays buckets, choose Enable queries on Requester Pays buckets in Amazon S3 when you create the workgroup. For more information, see Create a workgroup.
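If you manage workgroups programmatically, the console setting corresponds to the RequesterPaysEnabled flag in the workgroup configuration. The following is a minimal boto3 sketch (the workgroup names are placeholders):

import boto3

athena = boto3.client("athena")

# Create a workgroup whose members can query Requester Pays buckets
athena.create_work_group(
    Name="requester-pays-wg",  # placeholder workgroup name
    Configuration={"RequesterPaysEnabled": True},
)

# Or turn the setting on for an existing workgroup
athena.update_work_group(
    WorkGroup="existing-wg",  # placeholder workgroup name
    ConfigurationUpdates={"RequesterPaysEnabled": True},
)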


Related information

Downloading objects in Requester Pays buckets

How do I troubleshoot 403 Access Denied errors from Amazon S3?
