HIVE_CURSOR_ERROR: Failed to read Parquet file - when using a WHERE clause

10

Hello, we have a simple table with a few string columns and a few timestamp columns.

Recently we've noticed that any query on this table that has a WHERE clause, from the simplest to the most complex, fails with an error: HIVE_CURSOR_ERROR: Failed to read Parquet file: s3://path/to/file/entity=my_table/123.parquet.

Every time, the parquet file ID mentioned in the error seems to be different.

The table was initially generated programmatically, using a golang library. This used to work fine.

Can somebody offer some ideas on how to investigate and fix this? Thanks

asked 6 months ago, 667 views
4 Answers
4

Hello,

Essentially, this issue occurs when the table DDL does not match the underlying data you are trying to access. To resolve it, verify that your source files have a schema that matches your table. You can use parquet-tool to inspect the schema of a file and check that it is the same as the one defined for the Athena table, since the schema of the source files may have changed without the change being reflected in the table DDL.
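
On the Athena side, the definition to compare against can be printed from the query editor. This is a minimal sketch, assuming a hypothetical database called my_database and the table name my_table taken from the path in the question:

```sql
-- Print the DDL that Athena currently holds for the table.
-- my_database is a hypothetical name; substitute your own.
SHOW CREATE TABLE my_database.my_table;

-- List only the column names and types.
DESCRIBE my_database.my_table;
```

The column names and types in this output are what the schema reported by parquet-tool for an individual file should line up with.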

Besides, you can also try running the query on a different Athena engine version to see whether that works. As you may be aware, Athena V2 uses Presto and V3 uses Trino, so version-level changes might be behind this issue.

AWS
SUPPORT ENGINEER
answered 6 months ago
3
Accepted Answer

Hello, for others looking for a solution to this: if downgrading to Athena V2 is not possible, you need to add the statement below to the SQL query that creates your problematic table. The root cause is an incompatibility in serialization/deserialization between Athena V2 and V3. This is apparently a known issue; nevertheless, Amazon automatically upgraded our database to V3, which caused our functionality to break.

WITH SERDEPROPERTIES ('parquet.ignore.statistics'='true')
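
For context, here is a sketch of where that clause might sit in a complete table definition. The database name, column names, and S3 location are hypothetical placeholders, not taken from the original table, and any partitioning is left out for brevity:

```sql
-- Hypothetical table with a few string and timestamp columns.
-- Replace the names and the LOCATION with your own values.
CREATE EXTERNAL TABLE IF NOT EXISTS my_database.my_table (
  name        string,
  status      string,
  created_at  timestamp,
  updated_at  timestamp
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
-- Tells the Parquet reader to ignore the statistics stored in the files.
WITH SERDEPROPERTIES ('parquet.ignore.statistics' = 'true')
STORED AS PARQUET
LOCATION 's3://example-bucket/example-prefix/';
```

The SERDEPROPERTIES clause follows the ROW FORMAT SERDE line and comes before STORED AS.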
answered 6 months ago
EXPERT
reviewed 2 months ago
3

Hello Yokesh, thank you for the answer. The pointers you gave are great. We checked the table schema and also the data in the parquet files; both look correct and match each other. So I would say that this lead is probably not our root cause.

We also noticed that our Athena version has been automatically upgraded to version 3, and we suspect this might be related to the problem, although it is not clear how or why.

Further, I have two more questions:

  • How can we downgrade back to version 2? This does not seem to be possible. Do you know what the steps are?
  • What else can we try, if downgrading to version 2 is not possible?

Many thanks

answered 6 months ago
  • Hi Alexandru, did you find a solution to the above problem? I'm facing the same issue. Downgrading to Athena engine V2 is not possible, as AWS has removed that option from the selection.

  • Hello @Goutam, we did find a solution which wasn't provided here. The solution was to add this statement in the table creation SQL query.

    WITH SERDEPROPERTIES ('parquet.ignore.statistics'='true')

  • Thanks a lot Alexandru :) It is working fine now.

    I didn't expect AWS to upgrade the Athena engine and cause a problem like this. This solution is not documented anywhere either. At least they should have updated the doc where they talk about the breaking changes for Athena engine V3: https://docs.aws.amazon.com/athena/latest/ug/engine-versions-reference-0003.html#engine-versions-reference-0003-breaking-changes

    Now that you have posted it as an answer, people can refer to it. Much appreciated!
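
For a table that already exists, recreating it may not be necessary. The sketch below assumes that Athena also accepts the same property as a table property through ALTER TABLE; that is an assumption worth checking against the current Parquet SerDe documentation. The database name my_database is hypothetical:

```sql
-- Assumption: the property is honored when set as a table property.
-- my_database is a hypothetical name; substitute your own.
ALTER TABLE my_database.my_table
SET TBLPROPERTIES ('parquet.ignore.statistics' = 'true');
```

If the error persists afterwards, dropping and recreating the table with the WITH SERDEPROPERTIES clause is the route confirmed in this thread.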

3

Hello Alexandru,

Presto is the basis for Athena v2's engine, and it is less strict about data types. Trino, which Athena v3 uses, is far more explicit. Since you verified the schema and data types using parquet-tool and still face the same problem, I would suggest changing the engine version and trying again. In some cases, Athena V2 provides more error information, which might help debug the issue. You can follow the steps from the documentation, listed below.

To manually choose an engine version:

  1. Open the Athena console at https://console.aws.amazon.com/athena/

  2. In the console navigation pane, choose Workgroups.

  3. In the list of workgroups, choose the link for the workgroup that you want to configure. Choose Edit.

  4. In the Query engine version section, for Update query engine, choose Manual to manually choose an engine version.

  5. Use the Query engine version option to choose the engine version that you want the workgroup to use. If a different engine version is unavailable, a different engine version cannot be specified. Choose Save changes.

  6. In the list of workgroups, the Query engine update status for the workgroup shows Manual.

If changing the engine version does not resolve the issue, the schema probably differs between the parquet files themselves, and the schema will need to be enforced across all the files that the select query reads. If the issue still persists after trying the above, I recommend contacting us through AWS Support, mentioning the query ID and Region, so that we can provide more specific assistance.

AWS
SUPPORT ENGINEER
answered 6 months ago
  • Hello Yokesh, I'm facing the same issue too. The schema and data types look fine. We can't change to Athena engine V2, since it has been removed from the selection. Now only Athena engine V3 is available.
