Athena cannot query unsigned bytes (uint8) from parquet files

0

Hi,

I have a S3 bucket that will act as a data lake with athena as query engine. A lot of the data is unsigned bytes.

I know that athena represents all integers as signed which would not be a problem if Athena assumed a number to be negative. However, whenever I try to query data that is if type unsigned and the value is above 128 the following error is received:

GENERIC_INTERNAL_ERROR: Value 174 exceeds MAX_BYTE

I have tried changing schema to smallint or even int but still the same error persists.

changing the type in the parquet file to a uint16 is not an option as the data will be in the size of terabytes eventually.

Is there any possible solution to this? Or should we just consider another database?

asked 17 days ago32 views
1 Answer
1
Accepted Answer

Unfortunately, Athena does not natively support unsigned byte (uint8) data types from Parquet files. This is a known limitation of Athena's data type handling.

The error you're encountering ("GENERIC_INTERNAL_ERROR: Value 174 exceeds MAX_BYTE") occurs because Athena is interpreting the unsigned byte values as signed bytes, which have a maximum value of 127. When it encounters values above this, like 174, it throws an error.

Changing the schema to smallint or int in Athena won't resolve the issue because the underlying data in the Parquet file is still stored as unsigned bytes. Athena is failing to properly interpret this data type.

Given that changing the data type in the Parquet files is not an option due to the large data size, you have a few potential workarounds to consider:

  1. Use a custom SerDe: You could potentially create a custom SerDe (Serializer/Deserializer) that correctly interprets the unsigned byte values. However, this would require significant development effort.

  2. Pre-process the data: You could set up a data pipeline that converts the unsigned byte values to a supported data type before querying with Athena. This could be done using AWS Glue or another ETL tool.

  3. Use a different query engine: Consider using a different query engine that supports unsigned bytes, such as Presto or Trino, which can be run on Amazon EMR.

  4. Use a different database: As you suggested, using a different database system that natively supports unsigned bytes might be the most straightforward solution if Athena's limitations are causing significant issues for your use case.

If querying this data is critical to your workflow and you need to stick with Athena, you may want to reach out to AWS support for further guidance or to request feature support for unsigned bytes in Athena.
Sources
Athena engine version 3 - Amazon Athena
Troubleshoot issues in Athena - Amazon Athena

profile picture
answered 17 days ago
profile picture
EXPERT
reviewed 17 days ago

You are not logged in. Log in to post an answer.

A good answer clearly answers the question and provides constructive feedback and encourages professional growth in the question asker.

Guidelines for Answering Questions