Handling null array types in Athena


I am using an Athena table sourced from Parquet files in S3. The table consists mostly of columns of type array<double>. I create the files by converting pandas DataFrames to Parquet. When I query the table, I receive a Hive error that states:

"HIVE_CANNOT_OPEN_SPLIT: Error opening Hive split s3://<bucket>/<key>.parquet (offset=0, length=1700612): org.apache.parquet.io.PrimitiveColumnIO cannot be cast to org.apache.parquet.io.GroupColumnIO"

I have narrowed the issue down to one of the array<double> columns, which contains Python None values. I believe Athena is trying to convert those empty values to an array, and that is where the issue arises. I can work around this by using an empty list [] instead of None in the pandas column. However, we would prefer these values to show up as null in Athena rather than as an empty list. Is there any way to make the empty values work with Athena as null rather than as an empty list?

Thank you

asked 2 years ago, 2185 views
1 Answer

By default, LazySimpleSerDe interprets only '\N' as NULL, but you can configure it to treat other strings as NULL with the serialization.null.format SerDe property. For example, use NULL DEFINED AS '' to treat an empty string as NULL.

AWS
answered 2 years ago
