Amazon Athena HIVE_CANNOT_OPEN_SPLIT Error

We are using Tableau, and Tableau runs a scheduled query against Athena.

It worked fine until yesterday, but today I got the issue below.

HIVE_CANNOT_OPEN_SPLIT: Error opening Hive split s3://{our_bucket}/{folder1}/{table_name}/20240319/20240319.parquet (offset=0, length=15190461): start index (-4) must not be negative

Our Parquet files are stored in S3 and the table uses partition projection, so the '20240319' directory corresponds to a yyyyMMdd date partition.
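For reference, here is a minimal sketch of what such a table definition looks like, submitted through boto3. The bucket, table, and column names are placeholders, not our real ones:

    import boto3

    # Sketch of an Athena table that reads yyyyMMdd-partitioned Parquet from S3
    # using partition projection. All names below are placeholders.
    ddl = """
    CREATE EXTERNAL TABLE IF NOT EXISTS example_table (
        id string,
        value double
    )
    PARTITIONED BY (`date` string)
    STORED AS PARQUET
    LOCATION 's3://example-bucket/folder1/example_table/'
    TBLPROPERTIES (
        'projection.enabled' = 'true',
        'projection.date.type' = 'date',
        'projection.date.format' = 'yyyyMMdd',
        'projection.date.range' = '20210101,NOW',
        'storage.location.template' = 's3://example-bucket/folder1/example_table/${date}/'
    )
    """

    athena = boto3.client("athena")
    athena.start_query_execution(
        QueryString=ddl,
        ResultConfiguration={"OutputLocation": "s3://example-bucket/athena-results/"},
    )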

The date condition of the query is "date >= '20210101'". If I use "date = '20240319'" or "date >= '20240101'" in the same query instead, it works fine.

Does anyone have any idea about this kind of error?

I've never seen this error before, and I couldn't find it anywhere in the community.

Please help if you've had a similar experience.

Thank you.

1 Answer
Accepted Answer

The HIVE_CANNOT_OPEN_SPLIT error is typically seen when there is an issue reading data from a partition and its associated S3 prefix. There are several things to check here:

  • How many files are in the date prefix where you’re seeing the error?
    • S3 supports roughly 5,500 GET/HEAD requests per second per prefix. If you have many (e.g., thousands of) smaller files in the partition you’re trying to query, it is possible to hit this limit from Athena. I recommend merging those smaller files into larger files and/or distributing your data over another partition field to avoid this issue when querying in the future (see the CTAS sketch after this list).
    • I would also check the S3 request metrics in CloudWatch to see whether requests to your bucket and the specific prefix are being throttled (see the CloudWatch sketch after this list).
  • Are other applications also reading from the same S3 prefix?
    • Their requests count against the same per-prefix limit, so again I recommend merging smaller files into larger files within the prefix and/or distributing your data over another partition field.
  • Is versioning enabled on your S3 bucket?
    • If a high number of files have been deleted, you might also see this error. When versioning is enabled, S3 places a delete marker instead of deleting the object, and Athena has to check the different versions to determine which objects to include in the query result. I recommend creating an S3 lifecycle configuration to automatically clean up expired delete markers and noncurrent object versions (see the lifecycle sketch after this list).
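On the file-merging suggestion, here is a minimal sketch of one way to compact a partition with an Athena CTAS statement via boto3. The table, bucket, and column names are placeholders, and bucketing into a single bucket is just one way to force fewer, larger output files:

    import boto3

    # Sketch: rewrite one date partition's many small Parquet files into a
    # single larger file using CTAS. All names below are placeholders.
    merge_sql = """
    CREATE TABLE example_table_merged
    WITH (
        format = 'PARQUET',
        external_location = 's3://example-bucket/folder1/example_table_merged/20240319/',
        bucketed_by = ARRAY['id'],
        bucket_count = 1
    ) AS
    SELECT * FROM example_table
    WHERE date = '20240319'
    """

    athena = boto3.client("athena")
    athena.start_query_execution(
        QueryString=merge_sql,
        ResultConfiguration={"OutputLocation": "s3://example-bucket/athena-results/"},
    )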
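For the throttling check, here is a minimal sketch of pulling S3 request metrics with boto3. It assumes request metrics are already enabled on the bucket with a metrics filter (the filter name 'EntireBucket' below is a placeholder); throttled requests show up as 5xx (503 Slow Down) errors:

    import boto3
    from datetime import datetime, timedelta

    # Sketch: sum 5xx errors (throttling appears as 503 Slow Down) on the
    # bucket over the last 24 hours. Requires an S3 request-metrics filter;
    # the bucket and filter names below are placeholders.
    cloudwatch = boto3.client("cloudwatch")
    resp = cloudwatch.get_metric_statistics(
        Namespace="AWS/S3",
        MetricName="5xxErrors",
        Dimensions=[
            {"Name": "BucketName", "Value": "example-bucket"},
            {"Name": "FilterId", "Value": "EntireBucket"},
        ],
        StartTime=datetime.utcnow() - timedelta(days=1),
        EndTime=datetime.utcnow(),
        Period=300,
        Statistics=["Sum"],
    )
    for point in sorted(resp["Datapoints"], key=lambda p: p["Timestamp"]):
        print(point["Timestamp"], point["Sum"])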
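On the versioning point, here is a minimal sketch of a lifecycle configuration that cleans up expired delete markers and noncurrent versions. The bucket name, prefix, and 7-day retention window are placeholders, and note that this call replaces any lifecycle configuration already on the bucket:

    import boto3

    # Sketch: expire noncurrent object versions after 7 days and remove
    # delete markers that no longer hide any noncurrent versions.
    # The bucket, prefix, and retention window below are placeholders.
    s3 = boto3.client("s3")
    s3.put_bucket_lifecycle_configuration(
        Bucket="example-bucket",
        LifecycleConfiguration={
            "Rules": [
                {
                    "ID": "clean-up-old-versions",
                    "Status": "Enabled",
                    "Filter": {"Prefix": "folder1/example_table/"},
                    "NoncurrentVersionExpiration": {"NoncurrentDays": 7},
                    "Expiration": {"ExpiredObjectDeleteMarker": True},
                }
            ]
        },
    )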

Hopefully this helps resolve your issue.

AWS EXPERT
answered 12 days ago
  • Thank you for sharing possible reasons for this issue.

    I've checked all the things you mentioned, but the cause is still not clear.

    This bucket is not using versioning, and the number of files in the prefix is less than 1,500, while I heard the limit is about 5,500 GET requests per second per prefix.

    Fortunately, although I haven't done anything about this issue yet, the schedule has been working well since the issue occurred...

    Thank you very much anyway; I'll keep monitoring some other metrics.
