Handling null array types in Athena

I am using an Athena table sourced from Parquet files in S3. The table consists mostly of columns of type array<double>. I create the files by converting pandas DataFrames to Parquet. When I query the table, I receive a Hive error that states:

"HIVE_CANNOT_OPEN_SPLIT: Error opening Hive split s3://<bucket>/<key>.parquet (offset=0, length=1700612): org.apache.parquet.io.PrimitiveColumnIO cannot be cast to org.apache.parquet.io.GroupColumnIO"

I have narrowed the issue down to one of the array<double> columns, which contains Python None values. I believe Athena is trying to convert those empty values to an array, and that is where the issue arises. I can work around it by using an empty list [] instead of None in the pandas column. However, we would prefer these values to show up as null in Athena rather than as an empty list. Is there any way to make the empty values work with Athena as null rather than as an empty list?

Thank you

1 Answer

By default, LazySimpleSerDe interprets only '\N' as NULL, but you can configure it to treat other strings as NULL with the serialization.null.format SerDe property. For example, use NULL DEFINED AS '' (an empty string).
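
As a rough illustration of what that property controls, here is a minimal Python sketch for a delimited text table. It does not cover the Parquet files from the question. The helper names render_row and null_format_clause are hypothetical, made up for this example, and the sketch assumes simple scalar columns with a comma delimiter.

```python
# Hypothetical helpers, not part of any AWS or Hive library.
DEFAULT_NULL = r"\N"  # the token LazySimpleSerDe reads as NULL by default


def render_row(values, null_token=DEFAULT_NULL, delimiter=","):
    """Join values into one delimited line, writing None as null_token."""
    return delimiter.join(null_token if v is None else str(v) for v in values)


def null_format_clause(null_token):
    """Build a SERDEPROPERTIES fragment naming a custom NULL token."""
    return f"WITH SERDEPROPERTIES ('serialization.null.format' = '{null_token}')"


row = [1.5, None, "abc"]
default_line = render_row(row)               # None is written as \N
empty_line = render_row(row, null_token="")  # None is written as an empty field
clause = null_format_clause("")              # fragment for a CREATE TABLE statement
```

With the default token the missing value is written as \N, and with an empty null token it becomes an empty field, which is the case the empty-string setting is meant to read back as NULL.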

AWS
answered 2 years ago
