Redshift Spectrum and tables with nested columns in Hive Metastore


Hello,

I managed to connect Redshift to my Hive Metastore using an external schema. But it seems I can only query tables that do not contain nested columns; otherwise, Redshift Spectrum doesn't recognize the columns.

Is there a way to fix this, please? Or can I somehow create an external table in Redshift Spectrum and point it to the partitions from the Hive Metastore so that it stays in sync?

The table in Hive looks like this:

CREATE TABLE `test`.`product_created` (`data` STRUCT<`id`: STRING, `type`: STRING>, `meta` STRUCT<`created_at`: BIGINT>, `year` INT, `month` INT, `day` INT, `hour` INT)
USING json
OPTIONS (
  `serialization.format` '1',
  path 's3a://my_bucket/product_created'
)
PARTITIONED BY (year, month, day, hour)

When I check SVV_EXTERNAL_COLUMNS, I see these columns:
col: array<string>
year: int
month: int
day: int
hour: int
And it gives an error if I try to query the 'col' column.
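For reference, a query like the following shows what Spectrum has registered for the table (a sketch; 'hive_schema' stands in for whatever the external schema is actually named):

```sql
-- Inspect the column types Redshift Spectrum sees for the external table
SELECT columnname, external_type, part_key
FROM svv_external_columns
WHERE schemaname = 'hive_schema'       -- hypothetical external schema name
  AND tablename  = 'product_created'
ORDER BY columnnum;
```

If the nested struct shows up collapsed as a single array<string> column here, Spectrum did not understand the table's SerDe/type definition.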

I have the same issue with data in Parquet.
But it works for tables without nested columns, like this one:

CREATE TABLE `test`.`users` (`id` INT, `object_id` STRING, `username` STRING, `user_type` INT)
USING PARQUET
OPTIONS (
  `serialization.format` '1',
  path 's3a://my_bucket/users'
)

Thanks

Edited by: mpierre on Mar 13, 2020 6:44 AM

mpierre
Asked 4 years ago · 733 views
1 Answer

I managed to fix the issues.

The first one, with JSON, is because I was creating the table with Spark, which uses a different SerDe by default.
I tried this one instead and it works:

CREATE EXTERNAL TABLE `test`.`product_created` (`data` STRUCT<`id`: STRING, `type`: STRING>, `meta` STRUCT<`created_at`: BIGINT>)
PARTITIONED BY (`year` INT, `month` INT, `day` INT, `hour` INT)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3a://my_bucket/product_created'

More info: https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-glue-data-catalog-hive.html
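Once the table is registered with proper struct columns, the nested fields can be addressed with dot notation through a table alias in Redshift Spectrum. A sketch, assuming the external schema is named 'hive_schema' ('type' is quoted because it is a reserved word):

```sql
-- Query nested struct fields via dot notation (Redshift Spectrum)
SELECT p.data.id,
       p.data."type",
       p.meta.created_at
FROM hive_schema.product_created p
WHERE p.year = 2020 AND p.month = 3;
```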

For the second error, with Parquet files, it's strange: I just recreated the table in Hive with the same query under a different name, and it works fine now.
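Regarding keeping partitions in sync: since Spectrum reads the table definition from the Hive Metastore, registering new partition directories on the Hive side makes them visible to Spectrum as well. A sketch using standard Hive statements:

```sql
-- Register all partition directories found under the table's S3 path (Hive)
MSCK REPAIR TABLE test.product_created;

-- Or add a single partition explicitly
ALTER TABLE test.product_created
  ADD IF NOT EXISTS PARTITION (year=2020, month=3, day=13, hour=6);
```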

mpierre
Answered 4 years ago
