Athena Query Engine 3 Error

0

I am currently getting the following error in Athena query engine 3, this does not occur in Athena query engine 2.

HIVE_CURSOR_ERROR: Failed to read Parquet file: s3://s3-path/my_file.parquet

the query I run is:

SELECT parent_event_id, document_id, return_code, return_date_time, return_trace_number, effective_date_time FROM "db"."my_table" WHERE dt > current_date - interval '45' day

if return_date_time is commented out it works in v3, return_date_time contains dates in the format:

1754-08-30 22:43:41.129

2023-06-23 07:00:00.000

If the query is run over a period with no 1754 date values it works fine.

Colin
asked 10 months ago389 views
1 Answer
0

Generally, this error usually occurs when there is a mismatch between metadata schema and your actual data schema.

You can use parquet tool [1] to detect the data schema of the parquet file ( as coming in error) and get the same from table properties. Then compare it with your DDL statement. Parquet have their own data types and metadata. Even though we define the data type while creating the table but those differ from the file datatypes and they could matched with the data type in your table.

Also as you have mentioned that the query is running fine with v2 and facing with v3. But when ‘return_date_time’ is removed from the query it works fine. You have mentioned the values you are using for ‘return_date_time’ are as follows: 1754-08-30 22:43:41.129 2023-06-23 07:00:00.000

Further, if you are removing the dates starting with 1754, then it works fine in v3 also.

I have tried to replicate it at my end. I have created a sample parquet file and included a date time column with the above mentioned values. Then I have created crawler to crawl the file and generate the table. Upon querying that table in Athena I was able to get the results without any error.

Following are the outputs

=================================================

I have used the parquet-tools and parquet-cli to check the parquet file.

command used:

$ parquet-tools show my_parquet_file.parquet

Output:

+———+———+——---+----------+

| registration_dttm | id | first_name | last_name |

|------------------+------+-----------+---------+-------+

| 1754-08-30 22:43:41.129000 | 48 | Frances | Willis |

+-------+------+--------------+----------+-------------+

To check the schema of the parquet file, I have used following command:

command used:

$ parq my_parquet_file.parquet -s

Output:

Schema

<pyarrow._parquet.ParquetSchema object at 0x7f1a9c72cf50>

<pyarrow._parquet.ParquetSchema object at 0x7f1a9c72cf50>

required group field_id=-1 glue_schema {

optional int96 field_id=-1 registration_dttm;

optional int32 field_id=-1 id;

optional binary field_id=-1 first_name (String);

optional binary field_id=-1 last_name (String);

}

As you can see the registration_dttm has datatype of INT96 which is the datatype support in parquet files to store the datetime values. This value is then mapped by crawler, to timestamp type in table.

Table schema generated by crawler. Checked it from the Glue table properties:

Column name Data type

1 registration_dttm timestamp

2 id int

3 first_name string

4 last_name string

And on querying this table in Athena with both version v2 and v3, got the same results as below.

registration_dttm id first_name last_name

1754-08-30 22:43:41.129 48 Frances Willis

========================================================

Thus, might be the issue is causing due to the conditional statements or the other column data types. Hence you can try querying ‘return_date_time’ specifically by running the following command to get the more information.

SELECT return_date_time FROM "db"."my_table”;

Further, we can troubleshoot your table and data but we would require your file and execution query ID to check the internal logs. Thus to troubleshoot the issue further, you can create a support ticket with us in Athena queue. Please share the sample data file and the execution query ID in that ticket.

References

[1] https://stackoverflow.com/questions/33883640/how-do-i-get-schema-column-names-from-parquet-file

[2] https://medium.com/analytics-vidhya/parquet-timestamp-and-athena-query-85dba6ccc9c

AWS
SUPPORT ENGINEER
answered 10 months ago

You are not logged in. Log in to post an answer.

A good answer clearly answers the question and provides constructive feedback and encourages professional growth in the question asker.

Guidelines for Answering Questions