Redshift Spectrum and tables with nested columns in Hive Metastore

0

Hello,

I managed to connect Redshift to my Hive Metastore using an external schema. But it seems I can only query the tables that do not contain nested columns; otherwise Redshift Spectrum doesn't recognize the columns.

Is there a way to fix this please? Or can I somehow create an external table in Redshift Spectrum and point it to the partitions from the Hive Metastore so that it will be in sync?

The table in Hive is like that:

CREATE TABLE `test`.`product_created` (
  `data` STRUCT<`id`: STRING, `type`: STRING>,
  `meta` STRUCT<`created_at`: BIGINT>,
  `year` INT, `month` INT, `day` INT, `hour` INT)
USING json
OPTIONS (
  `serialization.format` '1',
  path 's3a://my_bucket/product_created'
)
PARTITIONED BY (year, month, day, hour)

When I check in SVV_EXTERNAL_COLUMNS, I see these columns:

col: array<string>
year: int
month: int
day: int
hour: int

And it gives an error if I try to query the 'col' column.
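For reference, this is roughly the query I run to inspect how Spectrum sees the table's columns (the schema name `hive_schema` is a placeholder for whatever external schema name you used):

```sql
-- List the column names and types that Redshift Spectrum
-- resolved for the external table (placeholder schema name).
SELECT columnname, external_type
FROM svv_external_columns
WHERE schemaname = 'hive_schema'
  AND tablename = 'product_created'
ORDER BY columnnum;
```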

I have the same issue with data in Parquet, but it works for tables without nested columns, like this one:

CREATE TABLE `test`.`users` (`id` INT, `object_id` STRING, `username` STRING, `user_type` INT)
USING PARQUET
OPTIONS (
  `serialization.format` '1',
  path 's3a://my_bucket/users'
)

Thanks

Edited by: mpierre on Mar 13, 2020 6:44 AM

mpierre
asked 4 years ago · 720 views
1 Answer
0

I managed to fix both issues.

The first one, with JSON, happened because I was creating the table with Spark, which uses a different SerDe by default.
I recreated the table with this SerDe instead, and it works:

CREATE EXTERNAL TABLE `test`.`product_created` (
  `data` STRUCT<`id`: STRING, `type`: STRING>,
  `meta` STRUCT<`created_at`: BIGINT>)
PARTITIONED BY (`year` INT, `month` INT, `day` INT, `hour` INT)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3a://my_bucket/product_created'

More info: https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-glue-data-catalog-hive.html
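Once the table is recreated with that SerDe and its partitions are registered in the metastore, the nested struct fields can be queried from Redshift Spectrum with dot notation. A minimal sketch, assuming an external schema named `hive_schema` (a placeholder) pointing at the Hive Metastore:

```sql
-- In Hive, register existing S3 partitions for the recreated table
-- (alternatively, add them one by one with ALTER TABLE ... ADD PARTITION):
MSCK REPAIR TABLE test.product_created;

-- In Redshift, access struct fields through a table alias and dot notation.
-- "type" is quoted because it is a reserved word.
SELECT p.data.id,
       p.data."type",
       p.meta.created_at
FROM hive_schema.product_created p
WHERE p.year = 2020 AND p.month = 3;
```

Note that Redshift requires the table alias (`p` here) when navigating into nested columns.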

For the second error, with Parquet files, it's strange: I just recreated the table in Hive with the same query under a different name, and now it works fine.

mpierre
answered 4 years ago
