Handling null array types in Athena

I am using an Athena table sourced from Parquet files in S3. The table consists mostly of columns of type array<double>. I create the files by converting pandas DataFrames to Parquet. When I query the table, I receive a Hive error that states:

"HIVE_CANNOT_OPEN_SPLIT: Error opening Hive split s3://<bucket>/<key>.parquet (offset=0, length=1700612): org.apache.parquet.io.PrimitiveColumnIO cannot be cast to org.apache.parquet.io.GroupColumnIO"

I have narrowed the issue down to one of the array<double> columns, which includes Python None values. I believe Athena is trying to convert those empty values to an array, and that is where the issue arises. I can resolve this by using an empty list [] instead of None in the pandas column. However, we would prefer these values to show up as null in Athena rather than as an empty list. Is there any way to have the empty values work with Athena as null rather than as an empty list?

Thank you

asked 2 years ago, 2188 views
1 Answer
LazySimpleSerDe interprets only '\N' as NULL by default, but you can configure it to treat other strings as NULL with the serialization.null.format SerDe property. For example, use NULL DEFINED AS '' to treat an empty string as NULL.
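As a rough illustration of where that property goes, the sketch below builds a CREATE EXTERNAL TABLE statement that sets serialization.null.format to an empty string. It assumes a delimited-text table, which is the kind of data LazySimpleSerDe reads. The helper `build_text_table_ddl`, the table name, the two columns, and the S3 location are all hypothetical placeholders, not taken from the question.

```python
def build_text_table_ddl(table: str, location: str, null_format: str = "") -> str:
    """Return a CREATE EXTERNAL TABLE statement for a delimited-text table.

    null_format is the string the SerDe should read as NULL; the default
    here is the empty string, as suggested in the answer.
    """
    if "'" in null_format:
        # Keep the sketch simple: no quote escaping is attempted.
        raise ValueError("null_format must not contain a single quote")
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {table} (\n"
        "  id string,\n"      # hypothetical column
        "  value double\n"    # hypothetical column
        ")\n"
        "ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'\n"
        "WITH SERDEPROPERTIES (\n"
        "  'field.delim' = ',',\n"
        f"  'serialization.null.format' = '{null_format}'\n"
        ")\n"
        f"LOCATION '{location}'"
    )


# Placeholder table name and bucket, for illustration only.
ddl = build_text_table_ddl("example_table", "s3://example-bucket/example-prefix/")
print(ddl)
```

The generated statement can be pasted into the Athena query editor after replacing the placeholders.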

AWS
answered 2 years ago
