How do I replicate an Iceberg table used with Athena SQL and Athena PySpark?


I need to replicate an Iceberg data lake stored in S3 from one bucket to another. However, Multi-Region Access Points don't work with Athena tables, and I don't see any PySpark procedure that could help me clone all data and metadata to another bucket with another location.

I need to keep the differences between the two tables as small as possible (e.g., replicating every XX minutes). There will be only one active writer and one passive read-only table at a time, so replication only needs to run in one direction.

What would be the best solution on AWS? (I've been using Iceberg and Athena SQL for a long time, but I'm a beginner with Athena PySpark / Spark.)

1 Answer

The easiest solution is to replicate the table with S3 Replication, but the replica can be missing files for a few seconds and fail reads (if it's for DR, or if readers are aware and retry, that may be acceptable). Alternatively, recreate the destination table each time, like a snapshot.
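The "recreate the destination table each time" option can be done entirely from Athena SQL. As a minimal sketch, the code below only builds the two statements (a DROP followed by an Iceberg CTAS, per Athena's `CREATE TABLE AS` syntax for Iceberg); the table names and S3 location are hypothetical examples, and actually running the statements (e.g. via `boto3` `start_query_execution`) is left out:

```python
def snapshot_copy_sql(src_table: str, dst_table: str, dst_location: str) -> list[str]:
    """Build Athena SQL statements that rebuild the destination table
    as a point-in-time copy of the source. All names are examples."""
    return [
        # Athena CTAS cannot overwrite an existing table, so drop the stale copy first.
        f"DROP TABLE IF EXISTS {dst_table}",
        # CTAS into the destination bucket as a fresh Iceberg table.
        (
            f"CREATE TABLE {dst_table} "
            f"WITH (table_type = 'ICEBERG', location = '{dst_location}', "
            f"is_external = false) "
            f"AS SELECT * FROM {src_table}"
        ),
    ]

# Hypothetical source/destination names for illustration only.
for stmt in snapshot_copy_sql(
    "analytics.events",
    "analytics_dr.events",
    "s3://my-dr-bucket/warehouse/events/",
):
    print(stmt)
```

The trade-off is a full rewrite on every refresh, which is simple and consistent but expensive for large tables compared to incremental replication.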
The other way I would consider is streaming from the Iceberg table, so it picks up the increments and moves them into the other table (assuming you don't do deletions).
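Whatever engine does the incremental copy, it needs to track which source snapshots have already been applied to the replica. A plain-Python sketch of that bookkeeping (the snapshot history would come from the table's metadata, e.g. its `snapshots` metadata table; nothing here calls Iceberg itself):

```python
def snapshots_to_replay(source_snapshots, last_replicated_id):
    """Given the source table's snapshot history (oldest first, as
    (snapshot_id, committed_at) pairs) and the last snapshot id already
    applied to the replica, return the ids still to replay, in commit
    order. Returns everything when the replica is empty."""
    ids = [snapshot_id for snapshot_id, _ in source_snapshots]
    if last_replicated_id is None:
        return ids
    if last_replicated_id not in ids:
        # The replicated snapshot was expired/rewritten on the source:
        # incremental replay is no longer safe, fall back to a full copy.
        raise ValueError("replica has diverged; fall back to a full copy")
    return ids[ids.index(last_replicated_id) + 1:]

history = [
    (101, "2024-05-01T10:00"),
    (102, "2024-05-01T10:15"),
    (103, "2024-05-01T10:30"),
]
print(snapshots_to_replay(history, 101))  # → [102, 103]
```

Each pending snapshot's appended rows could then be read incrementally (Spark's Iceberg integration supports streaming appends) and inserted into the replica, matching the "append-only, no deletions" assumption above.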

AWS
answered 2 months ago
  • I tried the streaming solution, but it's not compatible with Athena PySpark (I got an explicit error saying Athena doesn't support it). So I guess it would require an EMR cluster running without interruption... (and I also do some deletes through Athena SQL).

    The S3 Replication solution is more interesting in my situation. However, my requirement is high availability in another Region (in case of a Region outage), so I would need a way to simply query a "disaster Region Iceberg table". If I replicate the data, I would need to: 1. invoke a Lambda every time a .json metadata file is deposited, and if it comes from replication (verified via userIdentity.principalId), patch the S3 bucket name inside it and rewrite it; 2. have that Lambda also update the "disaster" Iceberg table's Glue entry so its "metadata_location" points to this current metadata JSON (and ideally patch the "previous_metadata_location" with the previous value), plus the table schema in case of evolution (new columns, ...).

    I think this might work... But is there anything else I should be aware of (replication delay, ...)? Or maybe there's an easier solution?

    If there is latency before the disaster table becomes queryable, that may not be a problem for me, and not being able to write to the disaster table is acceptable too (eventually I could reverse the replication direction and apply the same logic in the other Region if needed).
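The bucket-patching step described in the comment above could be sketched as pure JSON rewriting. The field names (`location`, `metadata-log`, snapshot `manifest-list`) follow the Iceberg table-metadata format, but treat this as a sketch rather than a complete rewriter: manifest lists and manifest files also embed absolute paths, and those are Avro rather than JSON, so they would need the same treatment separately. Bucket names here are hypothetical:

```python
import json

def patch_metadata_bucket(metadata_json: str, src_bucket: str, dst_bucket: str) -> str:
    """Rewrite every s3://<src_bucket>/... reference inside an Iceberg
    metadata.json so it points at the replica bucket, preserving the
    rest of the document unchanged."""
    old, new = f"s3://{src_bucket}/", f"s3://{dst_bucket}/"
    meta = json.loads(metadata_json)

    def fix(value):
        # Walk the whole document: paths appear in several places
        # (location, metadata-log, snapshot manifest-list, ...).
        if isinstance(value, str):
            return value.replace(old, new)
        if isinstance(value, list):
            return [fix(v) for v in value]
        if isinstance(value, dict):
            return {k: fix(v) for k, v in value.items()}
        return value

    return json.dumps(fix(meta))

# Tiny illustrative metadata fragment, not a full metadata.json.
src = json.dumps({
    "location": "s3://prod-bucket/warehouse/events",
    "metadata-log": [
        {"metadata-file": "s3://prod-bucket/warehouse/events/metadata/v1.json"}
    ],
})
print(patch_metadata_bucket(src, "prod-bucket", "dr-bucket"))
```

The Lambda would write the patched document back and then update the Glue table's metadata_location, as described above.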

  • Because Iceberg doesn't update files in place, you can just delay the table update a bit so the data stays consistent. The problem of a disaster happening mid-replication will always exist, unless you write to both tables at the same time and use Nessie to coordinate them. IMO, it's not worth it, and I would reconsider the design.
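One way to implement the "delay the table update" advice: only flip the replica's Glue metadata_location once every file the new metadata references has actually arrived in the destination bucket. A self-contained sketch, with the existence check injected as a function (in practice it would be an S3 HeadObject call; all paths below are hypothetical):

```python
def safe_to_publish(metadata_paths, data_paths, object_exists):
    """Return True only when every replicated object a new metadata
    version references is already present in the destination bucket.
    `object_exists` is injected so the sketch needs no AWS client."""
    return all(object_exists(path) for path in [*metadata_paths, *data_paths])

# Simulate the destination bucket's contents with a set.
arrived = {
    "s3://dr-bucket/metadata/v2.json",
    "s3://dr-bucket/data/part-0001.parquet",
}
print(safe_to_publish(
    ["s3://dr-bucket/metadata/v2.json"],
    ["s3://dr-bucket/data/part-0001.parquet",
     "s3://dr-bucket/data/part-0002.parquet"],  # not replicated yet
    arrived.__contains__,
))  # → False: keep the old metadata pointer until replication catches up
```

Until the check passes, readers keep seeing the previous (complete) snapshot, so the replica never exposes a half-replicated table version.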
