BUG: Athena Engine Version 3 cannot handle timestamps before epoch time

Background

We were getting a "HIVE_CURSOR_ERROR: Failed to read Parquet file" when running an Athena query using SELECT * FROM mydb. The underlying data we were querying was stored as Parquet files on S3. However, when we queried specific columns, we did not see the error. We also tested reading the Parquet file that the HIVE_CURSOR_ERROR complained about locally using pandas and parquet-tools. Reading the file locally showed nothing out of the ordinary, and our schema matched the database schema in Glue.
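
For reference, here is a minimal sketch of the kind of local inspection we did, using pyarrow (the file name is illustrative, not our actual path):

import pyarrow.parquet as pq

# Point this at the file named in the HIVE_CURSOR_ERROR
pf = pq.ParquetFile("epoch_test.parquet")
print(pf.schema_arrow)           # logical schema, to compare against Glue
print(pf.metadata.row_group(0))  # physical column types; INT96 timestamps show up here
print(pf.read().to_pandas())     # values read fine locally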

Reproducing

This error can be reproduced as follows. First, create a Parquet file that is queryable through Athena using the following script:

import awswrangler as wr
import pandas as pd

# Two copies of the minimum pandas timestamp (1677-09-21), far before the Unix epoch
df = pd.DataFrame(
    {"test_time": [pd.Timestamp.min, pd.Timestamp.min]}
)

# Write the DataFrame to S3 as Parquet and register it as a Glue table
wr.s3.to_parquet(
    df,
    path="s3://my-bucket/athena_data/epoch_test",
    database="test-database",
    table="epoch",
    dataset=True,
)

I'm using the minimum pandas timestamp here, but it also fails with timestamps up to 1969. After creating the dataset, you can query it with SELECT * FROM "test-database"."epoch". Querying with Athena Engine Version 2 produces no error, but querying with Athena Engine Version 3 fails with the HIVE_CURSOR_ERROR above. Ideally we'd like to be able to run these queries using Athena Engine Version 3. If there is a workaround for this, I'd love to know what it is.
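
For completeness, a sketch of running the query from Python with awswrangler (the engine version comes from the Athena workgroup, which is an assumption here):

import awswrangler as wr

# Against a workgroup on Engine Version 3 this raises the HIVE_CURSOR_ERROR;
# against Engine Version 2 it returns both rows.
df = wr.athena.read_sql_query(
    'SELECT * FROM "epoch"',
    database="test-database",
)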

Thank you, Michael French

1 Answer

Hello Michael,

I quickly created the data file and the Athena table using the Python script you provided, and I hit the same error as you with Athena v3.

On further research, I checked the Athena documentation, which identifies the cause as a timestamp overflow for the int96 Parquet format. See the Athena engine version 3 breaking changes page [1]; you can find the relevant entry by searching for "timeOfDayNanos".

-The suggested workaround from the docs is to identify the specific files that have the issue, then regenerate the data with an up-to-date, well-known Parquet library, or use Athena CTAS [2]. A sketch of the regeneration approach follows.
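
-For example, here is a minimal sketch of regenerating a file with a recent pyarrow (file names are illustrative); modern pyarrow writes timestamps with the int64 TIMESTAMP logical type rather than the deprecated int96 encoding:

import pyarrow.parquet as pq

# Read the affected file and rewrite it. use_deprecated_int96_timestamps=False
# is already the pyarrow default; it is spelled out here for emphasis.
table = pq.read_table("epoch_test.parquet")
pq.write_table(
    table,
    "epoch_test_fixed.parquet",
    use_deprecated_int96_timestamps=False,
)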

Checking further through various platforms, I found the following information regarding this issue:

-Int96 timestamps are encoded as 12-byte arrays in Parquet. The first 8 bytes encode the nanoseconds within the day, and the remaining 4 bytes encode the Julian day number (a running count of days, from which the offset from the Unix epoch is derived). Here's the Trino code for reference:

public static DecodedTimestamp decodeInt96Timestamp(Binary timestampBinary)
{
    if (timestampBinary.length() != 12) {
        throw new TrinoException(NOT_SUPPORTED, "Parquet timestamp must be 12 bytes, actual " + timestampBinary.length());
    }
    byte[] bytes = timestampBinary.getBytes();

    // little endian encoding - need to invert byte order
    long timeOfDayNanos = Longs.fromBytes(bytes[7], bytes[6], bytes[5], bytes[4], bytes[3], bytes[2], bytes[1], bytes[0]);
    int julianDay = Ints.fromBytes(bytes[11], bytes[10], bytes[9], bytes[8]);

    return decodeInt96Timestamp(timeOfDayNanos, julianDay);
}
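
-For illustration, the same byte layout can be unpacked in a few lines of Python (a sketch, not Trino's actual code):

import struct

def decode_int96(raw: bytes):
    # First 8 bytes: nanoseconds within the day (little-endian int64);
    # last 4 bytes: Julian day number (little-endian int32)
    time_of_day_nanos, julian_day = struct.unpack("<qi", raw)
    return julian_day, time_of_day_nanos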

-This then calls the two-argument decodeInt96Timestamp overload, which performs the following validation:

public static DecodedTimestamp decodeInt96Timestamp(long timeOfDayNanos, int julianDay)
{
    verify(timeOfDayNanos >= 0 && timeOfDayNanos < NANOSECONDS_PER_DAY, "Invalid timeOfDayNanos: %s", timeOfDayNanos);

    long epochSeconds = (julianDay - JULIAN_EPOCH_OFFSET_DAYS) * SECONDS_PER_DAY + timeOfDayNanos / NANOSECONDS_PER_SECOND;
    return new DecodedTimestamp(epochSeconds, (int) (timeOfDayNanos % NANOSECONDS_PER_SECOND));
}

-That verify statement in the above snippet is what fails for your data, because the timeOfDayNanos value is negative. I suspect this is the issue; a sketch of how such a value can arise follows.
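
-To make that concrete, here is a hypothetical sketch of one way a writer that splits a pre-epoch timestamp using truncating (Java-style) division ends up with a negative timeOfDayNanos. The instant itself is still recoverable, which is presumably why engine v2 read these files without complaint:

NANOS_PER_DAY = 86_400 * 1_000_000_000
JULIAN_EPOCH_OFFSET_DAYS = 2_440_588  # Julian day number of 1970-01-01

def encode_int96_truncating(n: int):
    # Java integer division truncates toward zero; emulate that exactly.
    # For a pre-epoch instant the time-of-day remainder comes out negative.
    days = -(-n // NANOS_PER_DAY) if n < 0 else n // NANOS_PER_DAY
    time_of_day = n - days * NANOS_PER_DAY
    return JULIAN_EPOCH_OFFSET_DAYS + days, time_of_day

# One nanosecond before 1970-01-01:
julian_day, time_of_day_nanos = encode_int96_truncating(-1)
print(julian_day, time_of_day_nanos)  # 2440588 -1

# Decoding per the Java above still lands 1ns before the epoch, but
# Trino's verify() in engine v3 rejects timeOfDayNanos < 0 outright.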

-If your actual data is going to have the same kind of timestamp columns, I suggest you check for the same issue on your end.

Resources:
[1] Athena engine version 3 breaking changes - https://docs.aws.amazon.com/athena/latest/ug/engine-versions-reference-0003.html#engine-versions-reference-0003-breaking-changes
[2] Creating a table from query results (CTAS) - https://docs.aws.amazon.com/athena/latest/ug/ctas.html

AWS
Aravind
answered 7 months ago
