Redshift Spectrum and tables with nested columns in Hive Metastore

0

Hello,

I managed to connect Redshift to my Hive Metastore using an external schema. But it seems I can only query the tables that does not contain nested columns otherwise Redshift Spectrum doesn't recognize the columns.

Is there a way to fix this please? Or can I somehow create an external table in Redshift Spectrum and point it to the partitions from the Hive Metastore so that it will be in sync?

The table in Hive is like that:

CREATE TABLE `test`.`product_created` (`data` STRUCT<`id`: STRING, `type`: STRING>, `meta` STRUCT<`created_at`: BIGINT>, `year` INT, `month` INT, `day` INT, `hour` INT)
USING json
OPTIONS (
  `serialization.format` '1',
  path 's3a://my_bucket/product_created'
)
PARTITIONED BY (year, month, day, hour)

When I check in SVV_EXTERNAL_COLUMNS I see those columns:
col: array<string>
year: int
month: int
day: int
hour: int
And it gives an error if I try to query 'col' column.

I have the same issue with data in Parquet.
But it works for tables without nested columns like that:

CREATE TABLE `test`.`users` (`id` INT, `object_id` STRING, `username` STRING, `user_type` INT)
USING PARQUET
OPTIONS (
  `serialization.format` '1',
  path 's3a://my_bucket/users'
)

Thanks

Edited by: mpierre on Mar 13, 2020 6:44 AM

mpierre
preguntada hace 4 años733 visualizaciones
1 Respuesta
0

I managed to fix the issues.

The first one with JSON is because I was creating the table with Spark and it's using a different SerDe by default.
I tried using this one instead and it works:

CREATE EXTERNAL TABLE `test`.`product_created` (`data` STRUCT<`id`: STRING, `type`: STRING>, `meta` STRUCT<`created_at`: BIGINT>)
ROW FORMAT SERDE &#39;org.openx.data.jsonserde.JsonSerDe&#39; LOCATION &#39;s3a://my_bucket/product_created&#39;
PARTITIONED BY (`year` INT, `month` INT, `day` INT, `hour` INT)

More info: https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-glue-data-catalog-hive.html

For the second error with Parquet files, it's strange, I just recreated the table in Hive with the same query and gave it a different name and it works well now.

mpierre
respondido hace 4 años

No has iniciado sesión. Iniciar sesión para publicar una respuesta.

Una buena respuesta responde claramente a la pregunta, proporciona comentarios constructivos y fomenta el crecimiento profesional en la persona que hace la pregunta.

Pautas para responder preguntas