Athena Delta Support doesn't seem to work for absolute paths in delta manifest

Hello,

First of all, thank you for adding first class support for Delta tables in Athena!

I'm trying to query a Delta table from Athena engine version 3. The table was created with a SHALLOW CLONE operation in Spark SQL, which writes a new Delta transaction log where the data file paths are absolute rather than relative.

The original table's log has an add entry with a relative path like:

{
  "add": {
    "path": "part-00156-c812f51c-c290-499c-b3b5-f33642e8b428.c000.snappy.parquet",
  ...
}

The cloned table might live at s3://bucket/cloned_table/ and have a manifest entry where the paths are absolute like this:

{
  "add": {
    "path": "s3://bucket/original_table/part-00156-c812f51c-c290-499c-b3b5-f33642e8b428.c000.snappy.parquet",
  ...
}

To be clear, these are add entries in the delta transaction manifest like _delta_log/xxxxxx.json, not symlink_file_manifest files.
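To check which add entries in a commit file are affected, something like the following might help. It is only a sketch: it assumes each line of a _delta_log/xxxxxx.json file is one JSON action object, and the function name absolute_add_paths is made up for illustration. Reading the file from S3 is left out, so the function just takes the lines as strings.

```python
import json
from urllib.parse import urlparse


def absolute_add_paths(commit_lines):
    """Return the paths of add actions that carry a URI scheme (e.g. s3://).

    commit_lines: iterable of strings, one JSON action per line
    (assumed layout of a _delta_log commit file).
    """
    found = []
    for line in commit_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        action = json.loads(line)
        add = action.get("add")
        # A non-empty scheme means the path is absolute, not relative
        if add and urlparse(add["path"]).scheme:
            found.append(add["path"])
    return found
```

Passing in the lines of the cloned table's commit file should list the s3://bucket/original_table/... entries, while the original table's log should produce an empty list.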

When I run an Athena query against the cloned table I get an error like:

HIVE_CANNOT_OPEN_SPLIT: Error opening Hive split "s3://bucket/cloned_table/s3://bucket/original_table/part-00156-c812f51c-c290-499c-b3b5-f33642e8b428.c000.snappy.parquet (offset=0, length=67108864): io.trino.plugin.hive.s3.TrinoS3FileSystem$UnrecoverableS3OperationException: com.amazonaws.services.s3.model.AmazonS3Exception: The specified key does not exist. (Service: Amazon S3; Status Code: 404; Error Code: NoSuchKey; Request ID: XXX; S3 Extended Request ID: XXX; Proxy: null), S3 Extended Request ID: XXX (Path: s3://bucket/cloned_table/s3://bucket/original_table/part-00156-c812f51c-c290-499c-b3b5-f33642e8b428.c000.snappy.parquet ...)

I'm assuming that there's a limitation or bug in Athena's handling of the Delta transaction log: it should recognize an absolute path and not append it to the table base location. Please let me know if I'm mistaken or if there's a workaround.
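To illustrate the behavior I would expect, here is a minimal sketch of the path resolution. This is not Athena's actual code; the function name resolve_add_path is hypothetical, and it assumes that a path with a URI scheme (or a leading slash) is absolute and that the path needs URI-decoding, as I read the protocol description.

```python
from urllib.parse import unquote, urlparse


def resolve_add_path(table_root: str, add_path: str) -> str:
    """Resolve the "path" of an add action to a full file location.

    Assumption: a path with a URI scheme (s3://...) or a leading "/"
    is absolute and must not be joined onto the table root.
    """
    decoded = unquote(add_path)  # the path is URI-encoded in the log
    if urlparse(add_path).scheme or add_path.startswith("/"):
        return decoded  # absolute: use as-is
    # relative: resolve against the table's base location
    return table_root.rstrip("/") + "/" + decoded
```

With this logic, the absolute entry from the cloned table would resolve to s3://bucket/original_table/part-... rather than to the concatenated path shown in the error above.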

Note that the delta protocol specification does allow for absolute paths as documented here: https://github.com/delta-io/delta/blob/master/PROTOCOL.md#add-file-and-remove-file

Thanks!

asked a year ago · 292 views
1 Answer

I love this tool as well. It looks like you are running into an issue with the Athena query engine when querying a Delta Lake table that was created with a SHALLOW CLONE operation in Spark SQL. The Delta transaction log for the cloned table contains absolute paths for the data files rather than relative paths, and Athena appears unable to handle them: as your error shows, it appends the absolute path to the table's base location. This looks like a limitation or bug in the Athena query engine.

I'm not aware of a built-in workaround. One possible approach is to rewrite the cloned table's transaction log to use relative paths instead of absolute paths, for example with a custom Spark SQL script. Note that relative paths resolve against the table's own location, so the data files would also need to be reachable under s3://bucket/cloned_table/. You could also check whether your Delta Lake version provides a delta.util.convertToRelativePaths function for this; I have not verified that it exists. @seekrsi

SeanSi
answered a year ago
