Glue ETL: cross-region write to Iceberg failing


I have a Glue ETL job that runs in one region (eu-central-1) and successfully reads source data from an S3 bucket in a different region (us-east-1). I would like to write the output of my transformations to an S3 location in the same region as the source (us-east-1).

This works for the following formats:

  • Parquet
  • Hudi
  • Delta

It does not work for Iceberg, which is the format I require.

Here is my code (created through the visual editor):

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

# Script generated for node AWS Glue Data Catalog
AWSGlueDataCatalog_node1711553291421 = glueContext.create_dynamic_frame.from_catalog(database="raw", table_name="us_source", transformation_ctx="AWSGlueDataCatalog_node1711553291421")

# Script generated for node Amazon S3
additional_options = {"write.parquet.compression-codec": "gzip"}
tables_collection = spark.catalog.listTables("bronze")
table_names_in_db = [table.name for table in tables_collection]
table_exists = "sandbox_us_iceberg" in table_names_in_db
if table_exists:
    AmazonS3_node1711553556115_df = AWSGlueDataCatalog_node1711553291421.toDF()
    AmazonS3_node1711553556115_df \
        .writeTo("glue_catalog.bronze.sandbox_us_iceberg") \
        .tableProperty("format-version", "2") \
        .tableProperty("location", "s3://datalake-bronze/sensitive_data/sandbox_us/bronze/sandbox_us_iceberg") \
        .options(**additional_options) \
        .append()
else:
    AmazonS3_node1711553556115_df = AWSGlueDataCatalog_node1711553291421.toDF()
    AmazonS3_node1711553556115_df \
        .writeTo("glue_catalog.bronze.sandbox_us_iceberg") \
        .tableProperty("format-version", "2") \
        .tableProperty("location", "s3://datalake-bronze/sensitive_data/sandbox_us/bronze/sandbox_us_iceberg") \
        .options(**additional_options) \
        .create()

job.commit()

The error message I receive is: "Error Category: UNCLASSIFIED_ERROR; Failed Line Number: 36; An error occurred while calling o143.create. Writing job aborted. Note: This run was executed with Flex execution. Check the logs if run failed due to executor termination." The failed line is the one that creates the Iceberg table.

The bottom of the stack trace provided looks like this:

	Suppressed: software.amazon.awssdk.services.s3.model.S3Exception: The bucket you are attempting to access must be addressed using the specified endpoint. Please send all future requests to this endpoint. (Service: S3, Status Code: 301, Request ID: HT0GMV7SHAZS65SA, Extended Request ID: pwBirv0jZiIDZCakLOufsj4M/dySVQHm8Vbl5zHu4R136Kw+c0EAvGYfiUSKBlDIbeWXq2SatSA=)
		at software.amazon.awssdk.protocols.xml.internal.unmarshall.AwsXmlPredicatedResponseHandler.handleErrorResponse(AwsXmlPredicatedResponseHandler.java:156)
		at software.amazon.awssdk.protocols.xml.internal.unmarshall.AwsXmlPredicatedResponseHandler.handleResponse(AwsXmlPredicatedResponseHandler.java:108)
		at software.amazon.awssdk.protocols.xml.internal.unmarshall.AwsXmlPredicatedResponseHandler.handle(AwsXmlPredicatedResponseHandler.java:85)
		at software.amazon.awssdk.protocols.xml.internal.unmarshall.AwsXmlPredicatedResponseHandler.handle(AwsXmlPredicatedResponseHandler.java:43)
		at software.amazon.awssdk.awscore.client.handler.AwsSyncClientHandler$Crc32ValidationResponseHandler.handle(AwsSyncClientHandler.java:95)
		at software.amazon.awssdk.core.internal.handler.BaseClientHandler.lambda$successTransformationResponseHandler$7(BaseClientHandler.java:245)
		at software.amazon.awssdk.core.internal.http.pipeline.stages.HandleResponseStage.execute(HandleResponseStage.java:40)
		at software.amazon.awssdk.core.internal.http.pipeline.stages.HandleResponseStage.execute(HandleResponseStage.java:30)
		at software.amazon.awssdk.core.internal.http.pipeline.RequestPipelineBuilder$ComposingRequestPipelineStage.execute(RequestPipelineBuilder.java:206)
		at software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallAttemptTimeoutTrackingStage.execute(ApiCallAttemptTimeoutTrackingStage.java:73)
		at software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallAttemptTimeoutTrackingStage.execute(ApiCallAttemptTimeoutTrackingStage.java:42)
		at software.amazon.awssdk.core.internal.http.pipeline.stages.TimeoutExceptionHandlingStage.execute(TimeoutExceptionHandlingStage.java:78)
		at software.amazon.awssdk.core.internal.http.pipeline.stages.TimeoutExceptionHandlingStage.execute(TimeoutExceptionHandlingStage.java:40)
		at software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallAttemptMetricCollectionStage.execute(ApiCallAttemptMetricCollectionStage.java:50)
		at software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallAttemptMetricCollectionStage.execute(ApiCallAttemptMetricCollectionStage.java:36)
		at software.amazon.awssdk.core.internal.http.pipeline.stages.RetryableStage.execute(RetryableStage.java:81)
		at software.amazon.awssdk.core.internal.http.pipeline.stages.RetryableStage.execute(RetryableStage.java:36)
		at software.amazon.awssdk.core.internal.http.pipeline.RequestPipelineBuilder$ComposingRequestPipelineStage.execute(RequestPipelineBuilder.java:206)
		at software.amazon.awssdk.core.internal.http.StreamManagingStage.execute(StreamManagingStage.java:56)
		at software.amazon.awssdk.core.internal.http.StreamManagingStage.execute(StreamManagingStage.java:36)
		at software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallTimeoutTrackingStage.executeWithTimer(ApiCallTimeoutTrackingStage.java:80)
		at software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallTimeoutTrackingStage.execute(ApiCallTimeoutTrackingStage.java:60)
		at software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallTimeoutTrackingStage.execute(ApiCallTimeoutTrackingStage.java:42)
		at software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallMetricCollectionStage.execute(ApiCallMetricCollectionStage.java:48)
		at software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallMetricCollectionStage.execute(ApiCallMetricCollectionStage.java:31)
		at software.amazon.awssdk.core.internal.http.pipeline.RequestPipelineBuilder$ComposingRequestPipelineStage.execute(RequestPipelineBuilder.java:206)
		at software.amazon.awssdk.core.internal.http.pipeline.RequestPipelineBuilder$ComposingRequestPipelineStage.execute(RequestPipelineBuilder.java:206)
		at software.amazon.awssdk.core.internal.http.pipeline.stages.ExecutionFailureExceptionReportingStage.execute(ExecutionFailureExceptionReportingStage.java:37)
		at software.amazon.awssdk.core.internal.http.pipeline.stages.ExecutionFailureExceptionReportingStage.execute(ExecutionFailureExceptionReportingStage.java:26)
		at software.amazon.awssdk.core.internal.http.AmazonSyncHttpClient$RequestExecutionBuilderImpl.execute(AmazonSyncHttpClient.java:193)
		at software.amazon.awssdk.core.internal.handler.BaseSyncClientHandler.invoke(BaseSyncClientHandler.java:103)
		at software.amazon.awssdk.core.internal.handler.BaseSyncClientHandler.doExecute(BaseSyncClientHandler.java:167)
		at software.amazon.awssdk.core.internal.handler.BaseSyncClientHandler.lambda$execute$1(BaseSyncClientHandler.java:82)
		at software.amazon.awssdk.core.internal.handler.BaseSyncClientHandler.measureApiCallSuccess(BaseSyncClientHandler.java:175)
		at software.amazon.awssdk.core.internal.handler.BaseSyncClientHandler.execute(BaseSyncClientHandler.java:76)
		at software.amazon.awssdk.core.client.handler.SdkSyncClientHandler.execute(SdkSyncClientHandler.java:45)
		at software.amazon.awssdk.awscore.client.handler.AwsSyncClientHandler.execute(AwsSyncClientHandler.java:56)
		at software.amazon.awssdk.services.s3.DefaultS3Client.deleteObject(DefaultS3Client.java:2532)
		at org.apache.iceberg.aws.s3.S3FileIO.deleteFile(S3FileIO.java:153)
		at org.apache.iceberg.spark.source.SparkWrite.lambda$deleteFiles$2(SparkWrite.java:681)
		at org.apache.iceberg.util.Tasks$Builder.runTaskWithRetry(Tasks.java:402)
		at org.apache.iceberg.util.Tasks$Builder.access$300(Tasks.java:68)
		at org.apache.iceberg.util.Tasks$Builder$1.run(Tasks.java:308)
		at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
		at java.util.concurrent.FutureTask.run(FutureTask.java:266)
		... 3 more
Caused by: java.io.IOException: Unable to read response payload, because service returned response code 301 to an Expect: 100-continue request. Using another HTTP client implementation (e.g. Apache) removes this limitation.
	... 64 more
Caused by: java.io.UncheckedIOException: java.net.ProtocolException: Server rejected operation
	at software.amazon.awssdk.utils.FunctionalUtils.asRuntimeException(FunctionalUtils.java:180)
	at software.amazon.awssdk.utils.FunctionalUtils.lambda$safeSupplier$4(FunctionalUtils.java:110)
	at software.amazon.awssdk.utils.FunctionalUtils.invokeSafely(FunctionalUtils.java:136)
	at software.amazon.awssdk.http.urlconnection.UrlConnectionHttpClient$RequestCallable.lambda$tryGetInputStream$1(UrlConnectionHttpClient.java:315)
	at software.amazon.awssdk.http.urlconnection.UrlConnectionHttpClient$RequestCallable.getAndHandle100Bug(UrlConnectionHttpClient.java:345)
	... 63 more
Caused by: java.net.ProtocolException: Server rejected operation
	at sun.net.www.protocol.http.HttpURLConnection.expect100Continue(HttpURLConnection.java:1277)
	at sun.net.www.protocol.http.HttpURLConnection.getOutputStream0(HttpURLConnection.java:1356)
	at sun.net.www.protocol.http.HttpURLConnection.getOutputStream(HttpURLConnection.java:1317)
	at sun.net.www.protocol.http.HttpURLConnection.getInputStream0(HttpURLConnection.java:1526)
	at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1500)
	at sun.net.www.protocol.https.HttpsURLConnectionImpl.getInputStream(HttpsURLConnectionImpl.java:268)
	at software.amazon.awssdk.utils.FunctionalUtils.lambda$safeSupplier$4(FunctionalUtils.java:108)
	... 66 more

So it looks like some sort of permission error, but I am unsure how to fix it because, as I mentioned, other file formats write fine to this S3 location. Could it be an issue with the library used to write the Iceberg table, since the trace mentions that using another HTTP client implementation (e.g. Apache) removes this limitation?

Any help solving this would be appreciated. We have a hard requirement to use Iceberg as the format so we can update records through Athena SQL, and we also have a goal of running all our ETL jobs in eu-central-1.

Vince
Asked 1 month ago · 139 views

1 Answer

Accepted Answer

That is because Iceberg uses its own S3 client (S3FileIO, visible in your stack trace) directly for some operations, instead of going through Hadoop like the other formats do.
Try setting client.assume-role.region (even though you are not assuming a role, according to the docs it should still work); see: https://iceberg.apache.org/docs/latest/aws/#cross-account-and-cross-region-access
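
For anyone landing here later, a minimal sketch of how that catalog property could be wired in before the SparkSession is created (in a Glue job the same keys can also be passed through the --conf job parameter). The catalog name glue_catalog matches the question; the warehouse path is a placeholder to replace with your own.

from pyspark import SparkConf
from pyspark.context import SparkContext
from awsglue.context import GlueContext

conf = SparkConf()
# Standard Iceberg-on-Glue catalog wiring
conf.set("spark.sql.catalog.glue_catalog", "org.apache.iceberg.spark.SparkCatalog")
conf.set("spark.sql.catalog.glue_catalog.catalog-impl", "org.apache.iceberg.aws.glue.GlueCatalog")
conf.set("spark.sql.catalog.glue_catalog.io-impl", "org.apache.iceberg.aws.s3.S3FileIO")
conf.set("spark.sql.catalog.glue_catalog.warehouse", "s3://datalake-bronze/warehouse/")  # placeholder path
conf.set("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
# The fix: pin Iceberg's S3 client to the bucket's region (us-east-1),
# even though the job itself runs in eu-central-1. Per the Iceberg AWS
# docs, this property is honored even when no role is actually assumed.
conf.set("spark.sql.catalog.glue_catalog.client.assume-role.region", "us-east-1")

sc = SparkContext(conf=conf)
glueContext = GlueContext(sc)
spark = glueContext.spark_session

As for the hint about another HTTP client: the "Expect: 100-continue" limitation comes from the SDK's UrlConnection client, and the same Iceberg AWS docs page describes an http-client.type property for switching to the Apache client, but with the region pinned correctly the 301 redirect (and with it this error) should not occur in the first place.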

AWS Expert · Answered 1 month ago
Reviewed by an AWS Support Engineer · 1 month ago
  • You are a wizard! Thank you!
