Glue ETL: cross-region write to Iceberg failing

I have a Glue ETL job that runs in one region (eu-central-1) and successfully reads source data from an S3 bucket in a different region (us-east-1). I would like to write the output of my transformations to an S3 location in the same region as the source (us-east-1).

This works for the following formats:

  • Parquet
  • Hudi
  • Delta

It does not work for Iceberg, which is the format I require.

Here is my code (created through the visual editor):

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

# Script generated for node AWS Glue Data Catalog
AWSGlueDataCatalog_node1711553291421 = glueContext.create_dynamic_frame.from_catalog(database="raw", table_name="us_source", transformation_ctx="AWSGlueDataCatalog_node1711553291421")

# Script generated for node Amazon S3
additional_options = {"write.parquet.compression-codec": "gzip"}
tables_collection = spark.catalog.listTables("bronze")
table_names_in_db = [table.name for table in tables_collection]
table_exists = "sandbox_us_iceberg" in table_names_in_db
if table_exists:
    AmazonS3_node1711553556115_df = AWSGlueDataCatalog_node1711553291421.toDF()
    AmazonS3_node1711553556115_df.writeTo("glue_catalog.bronze.sandbox_us_iceberg") \
        .tableProperty("format-version", "2") \
        .tableProperty("location", "s3://datalake-bronze/sensitive_data/sandbox_us/bronze/sandbox_us_iceberg") \
        .options(**additional_options) \
        .append()
else:
    AmazonS3_node1711553556115_df = AWSGlueDataCatalog_node1711553291421.toDF()
    AmazonS3_node1711553556115_df.writeTo("glue_catalog.bronze.sandbox_us_iceberg") \
        .tableProperty("format-version", "2") \
        .tableProperty("location", "s3://datalake-bronze/sensitive_data/sandbox_us/bronze/sandbox_us_iceberg") \
        .options(**additional_options) \
        .create()

job.commit()

The error message I receive is: Error Category: UNCLASSIFIED_ERROR; Failed Line Number: 36; An error occurred while calling o143.create. Writing job aborted. Note: This run was executed with Flex execution. Check the logs if run failed due to executor termination. The failed line is the one that creates the Iceberg table.

The bottom of the stack trace provided looks like this:

	Suppressed: software.amazon.awssdk.services.s3.model.S3Exception: The bucket you are attempting to access must be addressed using the specified endpoint. Please send all future requests to this endpoint. (Service: S3, Status Code: 301, Request ID: HT0GMV7SHAZS65SA, Extended Request ID: pwBirv0jZiIDZCakLOufsj4M/dySVQHm8Vbl5zHu4R136Kw+c0EAvGYfiUSKBlDIbeWXq2SatSA=)
		at software.amazon.awssdk.protocols.xml.internal.unmarshall.AwsXmlPredicatedResponseHandler.handleErrorResponse(AwsXmlPredicatedResponseHandler.java:156)
		at software.amazon.awssdk.protocols.xml.internal.unmarshall.AwsXmlPredicatedResponseHandler.handleResponse(AwsXmlPredicatedResponseHandler.java:108)
		at software.amazon.awssdk.protocols.xml.internal.unmarshall.AwsXmlPredicatedResponseHandler.handle(AwsXmlPredicatedResponseHandler.java:85)
		at software.amazon.awssdk.protocols.xml.internal.unmarshall.AwsXmlPredicatedResponseHandler.handle(AwsXmlPredicatedResponseHandler.java:43)
		at software.amazon.awssdk.awscore.client.handler.AwsSyncClientHandler$Crc32ValidationResponseHandler.handle(AwsSyncClientHandler.java:95)
		at software.amazon.awssdk.core.internal.handler.BaseClientHandler.lambda$successTransformationResponseHandler$7(BaseClientHandler.java:245)
		at software.amazon.awssdk.core.internal.http.pipeline.stages.HandleResponseStage.execute(HandleResponseStage.java:40)
		at software.amazon.awssdk.core.internal.http.pipeline.stages.HandleResponseStage.execute(HandleResponseStage.java:30)
		at software.amazon.awssdk.core.internal.http.pipeline.RequestPipelineBuilder$ComposingRequestPipelineStage.execute(RequestPipelineBuilder.java:206)
		at software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallAttemptTimeoutTrackingStage.execute(ApiCallAttemptTimeoutTrackingStage.java:73)
		at software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallAttemptTimeoutTrackingStage.execute(ApiCallAttemptTimeoutTrackingStage.java:42)
		at software.amazon.awssdk.core.internal.http.pipeline.stages.TimeoutExceptionHandlingStage.execute(TimeoutExceptionHandlingStage.java:78)
		at software.amazon.awssdk.core.internal.http.pipeline.stages.TimeoutExceptionHandlingStage.execute(TimeoutExceptionHandlingStage.java:40)
		at software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallAttemptMetricCollectionStage.execute(ApiCallAttemptMetricCollectionStage.java:50)
		at software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallAttemptMetricCollectionStage.execute(ApiCallAttemptMetricCollectionStage.java:36)
		at software.amazon.awssdk.core.internal.http.pipeline.stages.RetryableStage.execute(RetryableStage.java:81)
		at software.amazon.awssdk.core.internal.http.pipeline.stages.RetryableStage.execute(RetryableStage.java:36)
		at software.amazon.awssdk.core.internal.http.pipeline.RequestPipelineBuilder$ComposingRequestPipelineStage.execute(RequestPipelineBuilder.java:206)
		at software.amazon.awssdk.core.internal.http.StreamManagingStage.execute(StreamManagingStage.java:56)
		at software.amazon.awssdk.core.internal.http.StreamManagingStage.execute(StreamManagingStage.java:36)
		at software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallTimeoutTrackingStage.executeWithTimer(ApiCallTimeoutTrackingStage.java:80)
		at software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallTimeoutTrackingStage.execute(ApiCallTimeoutTrackingStage.java:60)
		at software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallTimeoutTrackingStage.execute(ApiCallTimeoutTrackingStage.java:42)
		at software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallMetricCollectionStage.execute(ApiCallMetricCollectionStage.java:48)
		at software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallMetricCollectionStage.execute(ApiCallMetricCollectionStage.java:31)
		at software.amazon.awssdk.core.internal.http.pipeline.RequestPipelineBuilder$ComposingRequestPipelineStage.execute(RequestPipelineBuilder.java:206)
		at software.amazon.awssdk.core.internal.http.pipeline.RequestPipelineBuilder$ComposingRequestPipelineStage.execute(RequestPipelineBuilder.java:206)
		at software.amazon.awssdk.core.internal.http.pipeline.stages.ExecutionFailureExceptionReportingStage.execute(ExecutionFailureExceptionReportingStage.java:37)
		at software.amazon.awssdk.core.internal.http.pipeline.stages.ExecutionFailureExceptionReportingStage.execute(ExecutionFailureExceptionReportingStage.java:26)
		at software.amazon.awssdk.core.internal.http.AmazonSyncHttpClient$RequestExecutionBuilderImpl.execute(AmazonSyncHttpClient.java:193)
		at software.amazon.awssdk.core.internal.handler.BaseSyncClientHandler.invoke(BaseSyncClientHandler.java:103)
		at software.amazon.awssdk.core.internal.handler.BaseSyncClientHandler.doExecute(BaseSyncClientHandler.java:167)
		at software.amazon.awssdk.core.internal.handler.BaseSyncClientHandler.lambda$execute$1(BaseSyncClientHandler.java:82)
		at software.amazon.awssdk.core.internal.handler.BaseSyncClientHandler.measureApiCallSuccess(BaseSyncClientHandler.java:175)
		at software.amazon.awssdk.core.internal.handler.BaseSyncClientHandler.execute(BaseSyncClientHandler.java:76)
		at software.amazon.awssdk.core.client.handler.SdkSyncClientHandler.execute(SdkSyncClientHandler.java:45)
		at software.amazon.awssdk.awscore.client.handler.AwsSyncClientHandler.execute(AwsSyncClientHandler.java:56)
		at software.amazon.awssdk.services.s3.DefaultS3Client.deleteObject(DefaultS3Client.java:2532)
		at org.apache.iceberg.aws.s3.S3FileIO.deleteFile(S3FileIO.java:153)
		at org.apache.iceberg.spark.source.SparkWrite.lambda$deleteFiles$2(SparkWrite.java:681)
		at org.apache.iceberg.util.Tasks$Builder.runTaskWithRetry(Tasks.java:402)
		at org.apache.iceberg.util.Tasks$Builder.access$300(Tasks.java:68)
		at org.apache.iceberg.util.Tasks$Builder$1.run(Tasks.java:308)
		at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
		at java.util.concurrent.FutureTask.run(FutureTask.java:266)
		... 3 more
Caused by: java.io.IOException: Unable to read response payload, because service returned response code 301 to an Expect: 100-continue request. Using another HTTP client implementation (e.g. Apache) removes this limitation.
	... 64 more
Caused by: java.io.UncheckedIOException: java.net.ProtocolException: Server rejected operation
	at software.amazon.awssdk.utils.FunctionalUtils.asRuntimeException(FunctionalUtils.java:180)
	at software.amazon.awssdk.utils.FunctionalUtils.lambda$safeSupplier$4(FunctionalUtils.java:110)
	at software.amazon.awssdk.utils.FunctionalUtils.invokeSafely(FunctionalUtils.java:136)
	at software.amazon.awssdk.http.urlconnection.UrlConnectionHttpClient$RequestCallable.lambda$tryGetInputStream$1(UrlConnectionHttpClient.java:315)
	at software.amazon.awssdk.http.urlconnection.UrlConnectionHttpClient$RequestCallable.getAndHandle100Bug(UrlConnectionHttpClient.java:345)
	... 63 more
Caused by: java.net.ProtocolException: Server rejected operation
	at sun.net.www.protocol.http.HttpURLConnection.expect100Continue(HttpURLConnection.java:1277)
	at sun.net.www.protocol.http.HttpURLConnection.getOutputStream0(HttpURLConnection.java:1356)
	at sun.net.www.protocol.http.HttpURLConnection.getOutputStream(HttpURLConnection.java:1317)
	at sun.net.www.protocol.http.HttpURLConnection.getInputStream0(HttpURLConnection.java:1526)
	at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1500)
	at sun.net.www.protocol.https.HttpsURLConnectionImpl.getInputStream(HttpsURLConnectionImpl.java:268)
	at software.amazon.awssdk.utils.FunctionalUtils.lambda$safeSupplier$4(FunctionalUtils.java:108)
	... 66 more

So it looks like some sort of permissions error, but I am unsure how to fix it because, as I mentioned, the other file formats write fine to this S3 location. Could it be an issue with the library used to write the Iceberg table, since the trace mentions that using a different HTTP client implementation (e.g. Apache) removes this limitation?

Any help solving this issue would be appreciated. We have a hard requirement to use Iceberg as the format so we can update records through Athena SQL, and a goal of running all our ETL jobs in eu-central-1.

Vince
asked a month ago · 139 views
1 answer

Accepted Answer

That is because Iceberg uses S3 directly for some operations (via its own S3FileIO and the AWS SDK, as the stack trace shows), instead of going through Hadoop like the other formats do.
Try setting client.assume-role.region (even though you are not assuming a role; according to the docs, it should still work). See: https://iceberg.apache.org/docs/latest/aws/#cross-account-and-cross-region-access
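
For reference, here is a minimal sketch of how that catalog property could be set in the script itself, before the SparkContext is created. The catalog name glue_catalog matches the table identifier in your code; the property keys follow the Iceberg AWS docs linked above, and the warehouse path is a placeholder, so adjust both to however your job actually registers the Iceberg catalog (for example through --conf job parameters instead):

from pyspark import SparkConf
from pyspark.context import SparkContext
from awsglue.context import GlueContext

# Register the Iceberg catalog and point its AWS clients at us-east-1,
# the region of the destination bucket. The warehouse path is a placeholder.
conf = SparkConf()
conf.set("spark.sql.catalog.glue_catalog", "org.apache.iceberg.spark.SparkCatalog")
conf.set("spark.sql.catalog.glue_catalog.catalog-impl", "org.apache.iceberg.aws.glue.GlueCatalog")
conf.set("spark.sql.catalog.glue_catalog.io-impl", "org.apache.iceberg.aws.s3.S3FileIO")
conf.set("spark.sql.catalog.glue_catalog.warehouse", "s3://datalake-bronze/sensitive_data/")
conf.set("spark.sql.catalog.glue_catalog.client.assume-role.region", "us-east-1")

sc = SparkContext(conf=conf)
glueContext = GlueContext(sc)
spark = glueContext.spark_session

If the catalog is instead configured through the job's --conf parameter, the same spark.sql.catalog.glue_catalog.client.assume-role.region=us-east-1 key can be appended there.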

AWS
EXPERT
answered a month ago
AWS
TECHNICAL SUPPORT ENGINEER
verified a month ago
  • You are a wizard! Thank you!
