Handling null array types in Athena


I am using an Athena table sourced from Parquet files in S3. The table consists mostly of columns of type array<double>. To create the files, I convert pandas DataFrames to Parquet. When I query the table, I receive a Hive error that states:

"HIVE_CANNOT_OPEN_SPLIT: Error opening Hive split s3://<bucket>/<key>.parquet (offset=0, length=1700612): org.apache.parquet.io.PrimitiveColumnIO cannot be cast to org.apache.parquet.io.GroupColumnIO"

I have narrowed the issue down to one of the array<double> columns that includes Python None values. I believe Athena is trying to convert those empty values to an array, and that is where the issue arises. I can resolve this by using an empty list [] instead of None in the pandas column. However, we would prefer these values to show up as null in Athena rather than as an empty list. Is there any way to make the empty values work with Athena as null rather than as an empty list?

Thank you

asked 2 years ago, 2114 views
1 Answer

By default, LazySimpleSerDe interprets only '\N' as NULL, but you can configure it to treat other strings as NULL with the serialization.null.format SerDe property. For example, use NULL DEFINED AS '' (an empty string).

AWS
answered 2 years ago
