Glue ETL: cross-region write to Iceberg failing


I have a Glue ETL job that runs in one region (eu-central-1) and successfully reads source data from an S3 bucket in a different region (us-east-1). I would like to write the output of my transformations to an S3 location in the same region as the source (us-east-1).

This works for the following formats:

  • Parquet
  • Hudi
  • Delta

It does not work for Iceberg, which is the format I require.

Here is my code (created through the visual editor):

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

# Script generated for node AWS Glue Data Catalog
AWSGlueDataCatalog_node1711553291421 = glueContext.create_dynamic_frame.from_catalog(database="raw", table_name="us_source", transformation_ctx="AWSGlueDataCatalog_node1711553291421")

# Script generated for node Amazon S3
additional_options = {"write.parquet.compression-codec": "gzip"}
tables_collection = spark.catalog.listTables("bronze")
table_names_in_db = [table.name for table in tables_collection]
table_exists = "sandbox_us_iceberg" in table_names_in_db
if table_exists:
    AmazonS3_node1711553556115_df = AWSGlueDataCatalog_node1711553291421.toDF()
    AmazonS3_node1711553556115_df \
        .writeTo("glue_catalog.bronze.sandbox_us_iceberg") \
        .tableProperty("format-version", "2") \
        .tableProperty("location", "s3://datalake-bronze/sensitive_data/sandbox_us/bronze/sandbox_us_iceberg") \
        .options(**additional_options) \
        .append()
else:
    AmazonS3_node1711553556115_df = AWSGlueDataCatalog_node1711553291421.toDF()
    AmazonS3_node1711553556115_df \
        .writeTo("glue_catalog.bronze.sandbox_us_iceberg") \
        .tableProperty("format-version", "2") \
        .tableProperty("location", "s3://datalake-bronze/sensitive_data/sandbox_us/bronze/sandbox_us_iceberg") \
        .options(**additional_options) \
        .create()

job.commit()

The error message I receive is: "Error Category: UNCLASSIFIED_ERROR; Failed Line Number: 36; An error occurred while calling o143.create. Writing job aborted. Note: This run was executed with Flex execution. Check the logs if run failed due to executor termination." The failed line is the one that creates the Iceberg table.

The bottom of the stack trace provided looks like this:

	Suppressed: software.amazon.awssdk.services.s3.model.S3Exception: The bucket you are attempting to access must be addressed using the specified endpoint. Please send all future requests to this endpoint. (Service: S3, Status Code: 301, Request ID: HT0GMV7SHAZS65SA, Extended Request ID: pwBirv0jZiIDZCakLOufsj4M/dySVQHm8Vbl5zHu4R136Kw+c0EAvGYfiUSKBlDIbeWXq2SatSA=)
		at software.amazon.awssdk.protocols.xml.internal.unmarshall.AwsXmlPredicatedResponseHandler.handleErrorResponse(AwsXmlPredicatedResponseHandler.java:156)
		at software.amazon.awssdk.protocols.xml.internal.unmarshall.AwsXmlPredicatedResponseHandler.handleResponse(AwsXmlPredicatedResponseHandler.java:108)
		at software.amazon.awssdk.protocols.xml.internal.unmarshall.AwsXmlPredicatedResponseHandler.handle(AwsXmlPredicatedResponseHandler.java:85)
		at software.amazon.awssdk.protocols.xml.internal.unmarshall.AwsXmlPredicatedResponseHandler.handle(AwsXmlPredicatedResponseHandler.java:43)
		at software.amazon.awssdk.awscore.client.handler.AwsSyncClientHandler$Crc32ValidationResponseHandler.handle(AwsSyncClientHandler.java:95)
		at software.amazon.awssdk.core.internal.handler.BaseClientHandler.lambda$successTransformationResponseHandler$7(BaseClientHandler.java:245)
		at software.amazon.awssdk.core.internal.http.pipeline.stages.HandleResponseStage.execute(HandleResponseStage.java:40)
		at software.amazon.awssdk.core.internal.http.pipeline.stages.HandleResponseStage.execute(HandleResponseStage.java:30)
		at software.amazon.awssdk.core.internal.http.pipeline.RequestPipelineBuilder$ComposingRequestPipelineStage.execute(RequestPipelineBuilder.java:206)
		at software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallAttemptTimeoutTrackingStage.execute(ApiCallAttemptTimeoutTrackingStage.java:73)
		at software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallAttemptTimeoutTrackingStage.execute(ApiCallAttemptTimeoutTrackingStage.java:42)
		at software.amazon.awssdk.core.internal.http.pipeline.stages.TimeoutExceptionHandlingStage.execute(TimeoutExceptionHandlingStage.java:78)
		at software.amazon.awssdk.core.internal.http.pipeline.stages.TimeoutExceptionHandlingStage.execute(TimeoutExceptionHandlingStage.java:40)
		at software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallAttemptMetricCollectionStage.execute(ApiCallAttemptMetricCollectionStage.java:50)
		at software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallAttemptMetricCollectionStage.execute(ApiCallAttemptMetricCollectionStage.java:36)
		at software.amazon.awssdk.core.internal.http.pipeline.stages.RetryableStage.execute(RetryableStage.java:81)
		at software.amazon.awssdk.core.internal.http.pipeline.stages.RetryableStage.execute(RetryableStage.java:36)
		at software.amazon.awssdk.core.internal.http.pipeline.RequestPipelineBuilder$ComposingRequestPipelineStage.execute(RequestPipelineBuilder.java:206)
		at software.amazon.awssdk.core.internal.http.StreamManagingStage.execute(StreamManagingStage.java:56)
		at software.amazon.awssdk.core.internal.http.StreamManagingStage.execute(StreamManagingStage.java:36)
		at software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallTimeoutTrackingStage.executeWithTimer(ApiCallTimeoutTrackingStage.java:80)
		at software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallTimeoutTrackingStage.execute(ApiCallTimeoutTrackingStage.java:60)
		at software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallTimeoutTrackingStage.execute(ApiCallTimeoutTrackingStage.java:42)
		at software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallMetricCollectionStage.execute(ApiCallMetricCollectionStage.java:48)
		at software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallMetricCollectionStage.execute(ApiCallMetricCollectionStage.java:31)
		at software.amazon.awssdk.core.internal.http.pipeline.RequestPipelineBuilder$ComposingRequestPipelineStage.execute(RequestPipelineBuilder.java:206)
		at software.amazon.awssdk.core.internal.http.pipeline.RequestPipelineBuilder$ComposingRequestPipelineStage.execute(RequestPipelineBuilder.java:206)
		at software.amazon.awssdk.core.internal.http.pipeline.stages.ExecutionFailureExceptionReportingStage.execute(ExecutionFailureExceptionReportingStage.java:37)
		at software.amazon.awssdk.core.internal.http.pipeline.stages.ExecutionFailureExceptionReportingStage.execute(ExecutionFailureExceptionReportingStage.java:26)
		at software.amazon.awssdk.core.internal.http.AmazonSyncHttpClient$RequestExecutionBuilderImpl.execute(AmazonSyncHttpClient.java:193)
		at software.amazon.awssdk.core.internal.handler.BaseSyncClientHandler.invoke(BaseSyncClientHandler.java:103)
		at software.amazon.awssdk.core.internal.handler.BaseSyncClientHandler.doExecute(BaseSyncClientHandler.java:167)
		at software.amazon.awssdk.core.internal.handler.BaseSyncClientHandler.lambda$execute$1(BaseSyncClientHandler.java:82)
		at software.amazon.awssdk.core.internal.handler.BaseSyncClientHandler.measureApiCallSuccess(BaseSyncClientHandler.java:175)
		at software.amazon.awssdk.core.internal.handler.BaseSyncClientHandler.execute(BaseSyncClientHandler.java:76)
		at software.amazon.awssdk.core.client.handler.SdkSyncClientHandler.execute(SdkSyncClientHandler.java:45)
		at software.amazon.awssdk.awscore.client.handler.AwsSyncClientHandler.execute(AwsSyncClientHandler.java:56)
		at software.amazon.awssdk.services.s3.DefaultS3Client.deleteObject(DefaultS3Client.java:2532)
		at org.apache.iceberg.aws.s3.S3FileIO.deleteFile(S3FileIO.java:153)
		at org.apache.iceberg.spark.source.SparkWrite.lambda$deleteFiles$2(SparkWrite.java:681)
		at org.apache.iceberg.util.Tasks$Builder.runTaskWithRetry(Tasks.java:402)
		at org.apache.iceberg.util.Tasks$Builder.access$300(Tasks.java:68)
		at org.apache.iceberg.util.Tasks$Builder$1.run(Tasks.java:308)
		at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
		at java.util.concurrent.FutureTask.run(FutureTask.java:266)
		... 3 more
Caused by: java.io.IOException: Unable to read response payload, because service returned response code 301 to an Expect: 100-continue request. Using another HTTP client implementation (e.g. Apache) removes this limitation.
	... 64 more
Caused by: java.io.UncheckedIOException: java.net.ProtocolException: Server rejected operation
	at software.amazon.awssdk.utils.FunctionalUtils.asRuntimeException(FunctionalUtils.java:180)
	at software.amazon.awssdk.utils.FunctionalUtils.lambda$safeSupplier$4(FunctionalUtils.java:110)
	at software.amazon.awssdk.utils.FunctionalUtils.invokeSafely(FunctionalUtils.java:136)
	at software.amazon.awssdk.http.urlconnection.UrlConnectionHttpClient$RequestCallable.lambda$tryGetInputStream$1(UrlConnectionHttpClient.java:315)
	at software.amazon.awssdk.http.urlconnection.UrlConnectionHttpClient$RequestCallable.getAndHandle100Bug(UrlConnectionHttpClient.java:345)
	... 63 more
Caused by: java.net.ProtocolException: Server rejected operation
	at sun.net.www.protocol.http.HttpURLConnection.expect100Continue(HttpURLConnection.java:1277)
	at sun.net.www.protocol.http.HttpURLConnection.getOutputStream0(HttpURLConnection.java:1356)
	at sun.net.www.protocol.http.HttpURLConnection.getOutputStream(HttpURLConnection.java:1317)
	at sun.net.www.protocol.http.HttpURLConnection.getInputStream0(HttpURLConnection.java:1526)
	at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1500)
	at sun.net.www.protocol.https.HttpsURLConnectionImpl.getInputStream(HttpsURLConnectionImpl.java:268)
	at software.amazon.awssdk.utils.FunctionalUtils.lambda$safeSupplier$4(FunctionalUtils.java:108)
	... 66 more

So it looks like some sort of permission error, but I am unsure how to fix it because, as I mentioned, other file formats write fine to this S3 location. Could it be an issue with the library used to write the Iceberg table, since the trace mentions that using another HTTP client implementation (e.g. Apache) removes this limitation?

Any help solving this would be appreciated. We have a hard requirement to use Iceberg as the format so we can update records through Athena SQL, and we also have a goal of running all our ETL jobs in eu-central-1.

Vince
Asked 1 month ago · 139 views

1 Answer

Accepted Answer

That is because Iceberg uses its own S3 client (S3FileIO, visible in your stack trace) directly for some operations, instead of going through Hadoop like the other formats do.
Try setting client.assume-role.region (even though you are not assuming a role, according to the docs it should still work); see: https://iceberg.apache.org/docs/latest/aws/#cross-account-and-cross-region-access
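
For anyone landing here later, a minimal sketch of how that catalog property could be wired in before the SparkSession is created (in a Glue job the same keys can also be passed through the --conf job parameter). The catalog name glue_catalog matches the question; the warehouse path is a placeholder to replace with your own.

from pyspark import SparkConf
from pyspark.context import SparkContext
from awsglue.context import GlueContext

conf = SparkConf()
# Standard Iceberg-on-Glue catalog wiring
conf.set("spark.sql.catalog.glue_catalog", "org.apache.iceberg.spark.SparkCatalog")
conf.set("spark.sql.catalog.glue_catalog.catalog-impl", "org.apache.iceberg.aws.glue.GlueCatalog")
conf.set("spark.sql.catalog.glue_catalog.io-impl", "org.apache.iceberg.aws.s3.S3FileIO")
conf.set("spark.sql.catalog.glue_catalog.warehouse", "s3://datalake-bronze/warehouse/")  # placeholder path
conf.set("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
# The fix: pin Iceberg's S3 client to the bucket's region (us-east-1),
# even though the job itself runs in eu-central-1. Per the Iceberg AWS
# docs, this property is honored even when no role is actually assumed.
conf.set("spark.sql.catalog.glue_catalog.client.assume-role.region", "us-east-1")

sc = SparkContext(conf=conf)
glueContext = GlueContext(sc)
spark = glueContext.spark_session

As for the hint about another HTTP client: the "Expect: 100-continue" limitation comes from the SDK's UrlConnection client, and the same Iceberg AWS docs page describes an http-client.type property for switching to the Apache client, but with the region pinned correctly the 301 redirect (and with it this error) should not occur in the first place.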

AWS Expert · Answered 1 month ago
Reviewed by an AWS Support Engineer · 1 month ago
  • You are a wizard! Thank you!
