Hi,
In the Athena console, I am getting a HIVE_CURSOR_ERROR when running the following query (queries, file paths etc. are anonymized):
SELECT DISTINCT file_path
FROM (
SELECT xid,
"$path" as file_path
FROM "xxxx"."yyyyyyyyyy"
WHERE partcountry = 'XY'
AND partyear in ('2023')
AND partmonth in ('2')
AND xid IN ('638odhjdk20dk')
)
-
Note: it has to open every single file and check whether '638odhjdk20dk' appears in the column xid. The column xid is of type String.
-
The error reads: HIVE_CURSOR_ERROR: Failed to read Parquet file: s3://xxxxxx/subfolderid=xyz/partyear=2023/partmonth=2/partday=3/part-00704-bd419050-hec9-42df-8093-7342djsl80d5b33.c000.snappy.parquet
-
This happened multiple times, always with different Parquet files, but only for this country XY, which has a lot of data.
-
It did not happen with other countries, which have much less data (always scanning less than 3TB).
-
In country XY, it sometimes happens after scanning 5TB, sometimes after 10TB.
-
It only takes about 3 to 7 minutes for the error to appear.
-
The workgroup I am using has no data scan limit.
-
I have worked through this article to try to resolve it: https://repost.aws/knowledge-center/athena-hive-cursor-error
-
I have downloaded the file locally and opened it with pandas in Python. It loaded without any problem, so the file is not corrupted.
-
My hypothesis is that this has to do with Athena racing through a very large number of files and crashing somewhere, perhaps because one worker holds a lock on a file while another worker tries to open it.
-
Support for this hypothesis comes from the following query, which focuses on just one day (partday=3) and one subfolderid (subfolderid=xyz) within partmonth=2; this day and subfolderid are where the "unreadable" file is located (see above).
SELECT DISTINCT file_path
FROM (
SELECT xid,
"$path" as file_path
FROM "xxxx"."yyyyyyyyyy"
WHERE partcountry = 'XY'
AND partyear in ('2023')
AND partmonth in ('2')
AND partday in ('3')
AND subfolderid='xyz'
AND xid IN ('638odhjdk20dk')
)
- This query scans only about 306GB and completes without any problems.
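If the hypothesis holds, a workaround I am considering is splitting the large scan into per-day queries so each run touches far fewer files. A sketch only (the query-builder function is mine; database, table, and partition names are the anonymized placeholders from above):

```python
def build_day_query(day: int) -> str:
    """Build one Athena query per partday so each run scans fewer files.
    Names are the anonymized placeholders used in this post."""
    return (
        'SELECT DISTINCT "$path" AS file_path '
        'FROM "xxxx"."yyyyyyyyyy" '
        "WHERE partcountry = 'XY' "
        "AND partyear IN ('2023') "
        "AND partmonth IN ('2') "
        f"AND partday IN ('{day}') "
        "AND xid IN ('638odhjdk20dk')"
    )

# Each query string could then be submitted separately, e.g. via boto3's
# Athena client: start_query_execution(QueryString=..., WorkGroup=...)
```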
Can anyone help here? We need Athena to be reliable with such large queries as well.