Handling null array types in Athena


I am using an Athena table backed by Parquet files in S3. The table consists mostly of columns of type array<double>. I create the files by converting pandas DataFrames to Parquet. When I query the table, I receive a Hive error that states:

"HIVE_CANNOT_OPEN_SPLIT: Error opening Hive split s3://<bucket>/<key>.parquet (offset=0, length=1700612): org.apache.parquet.io.PrimitiveColumnIO cannot be cast to org.apache.parquet.io.GroupColumnIO"

I have narrowed the issue down to one of the array<double> columns that contains Python None values. I believe Athena is trying to convert those empty values to an array, and that is where the error arises. I can work around it by using an empty list [] instead of None in the pandas column. However, we would prefer these values to show up as null in Athena rather than as an empty list. Is there any way to make the empty values work with Athena as null rather than as an empty list?

Thank you

asked 2 years ago · 2186 views
1 Answer

LazySimpleSerDe interprets only '\N' as NULL by default, but you can configure it to treat other strings as NULL with the serialization.null.format SerDe property. For example, use NULL DEFINED AS '' (an empty string).
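
To make the property concrete, here is a minimal sketch that only builds the DDL text for a table using it. Note that LazySimpleSerDe reads delimited text files, so this assumes a text-format table rather than a Parquet one. The helper name build_text_table_ddl, the table name, the column names, and the S3 location are all hypothetical placeholders, not taken from the question.

```python
def build_text_table_ddl(table, columns, location, null_format=""):
    """Return a CREATE EXTERNAL TABLE statement for a delimited text table.

    The table uses LazySimpleSerDe and sets serialization.null.format so
    that fields equal to `null_format` are read as NULL.
    All arguments are illustrative; nothing is sent to Athena here.
    """
    # One "`name` type" entry per column, in insertion order.
    column_sql = ",\n  ".join(
        f"`{name}` {dtype}" for name, dtype in columns.items()
    )
    return (
        f"CREATE EXTERNAL TABLE {table} (\n  {column_sql}\n)\n"
        "ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'\n"
        f"WITH SERDEPROPERTIES ('serialization.null.format' = '{null_format}')\n"
        "STORED AS TEXTFILE\n"
        f"LOCATION '{location}'"
    )


# Hypothetical table, columns, and bucket, used only for illustration.
ddl = build_text_table_ddl(
    table="sensor_readings",
    columns={"sensor_id": "string", "readings": "array<double>"},
    location="s3://example-bucket/example-prefix/",
)
print(ddl)
```

The statement would still need to be run in Athena, and whether it helps depends on the data being stored as text; it does not change how existing Parquet files are read.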

AWS
answered 2 years ago
