Redshift unnested json string

0

Hi guy,

I have a json string like:

{
    "lastUpdatedBy": "cewu@abc.com",
    "clientName": {
        "Values": [
            {
                "Value": "Ce Wu",
                "Confidence": 100.0,
                "Source": "automation"
            }
        ],
        "Overall": {
            "Value": "Ce Wu",
            "Confidence": 100.0,
            "Source": "automation"
        }
    },
    "employerNumber": {
        "Values": [
            {
                "Value": "abc company",
                "Confidence": 90.0,
                "Source": "tm1@abc.com",
            },
            {
                "Value": "ABC company llc",
                "Confidence": 90.0,
                "Source": "tm2@abc.com"
            }
        ],
        "Overall": {
            "Value": "ABC company llc",
            "Confidence": 90.0,
            "Source": "tm2@abc.com"
        }
    },
    "employeeID":{
        ...
    },
    "employeeAddress":{
        ...
    },
    ...
   there could be 100+ sub columns nested in the json string
}

Is there a way in Redshift I could unnested Values data from the json string to:

|-------------------+--------------------+-------------------+------------------|
|lastUpdatedBy      |field               |value              |source            |
|-------------------+--------------------+-------------------+------------------|
|cewu@abc.com       |clientName          |Ce Wu              |automation.       |
|cewu@abc.com       |employerName        |abc company        |tm1@abc.com       |
|cewu@abc.com       |employerName        |ABC company llc    |tm2@abc.com       |
|cewu@abc.com       |employeeID          |...                |...               |
|cewu@abc.com       |employeeAddress     |...                |...               |
|cewu@abc.com       |...                 |...                |...               |
|-------------------+--------------------+-------------------+------------------|

I am able to parse the json string to get each sub nested fields. And try

select
j.*
from (
select
json_parse(json_string) as j
from tbl
)

But it return error that cannot use wildcard in super column. So I stocked at how to parse all nested json sub fields and cannot unpivot the table to [field, value] format.

Any idea how to sole this issue?

Thanks,

Ce Wu

Ce Wu
質問済み 1年前1074ビュー
1回答
0

Hello,

I understand that you would like to parse the nested JSON objects in your Redshift cluster. You have defined column as super datatype and would like to parse the objects. I would like to inform you that when you store the JSON as SUPER data and try to extract nested columns it is a little difficult. I tried using the following exercise.

SET enable_case_sensitive_identifier TO true; create table testparse(super_json super); insert into testparse values(JSON_PARSE('{ "lastUpdatedBy": "cewu@abc.com", "clientName": { "Values": [{ "Value": "Ce Wu", "Confidence": 100.0, "Source": "automation" }], "Overall": { "Value": "Ce Wu", "Confidence": 100.0, "Source": "automation" } }, "employerNumber": { "Values": [{ "Value": "abc company", "Confidence": 90.0, "Source": "tm1@abc.com" }, { "Value": "ABC company llc", "Confidence": 90.0, "Source": "tm2@abc.com" } ], "Overall": { "Value": "ABC company llc", "Confidence": 90.0, "Source": "tm2@abc.com" } } }'));

select super_json."lastUpdatedBy", super_json."clientName"."Overall"."Source" from testparse;

Output:


lastUpdatedBy  | Source
---------------------------
"cewu@abc.com" | "automation"	

Please find below the limitation of using SUPER datatype.

[+] https://docs.aws.amazon.com/redshift/latest/dg/limitations-super.html

However, a better way to parse nested columns in JSON is with Redshift spectrum. Redshift spectrum supports creating columns of STRUCT, ARRAY datatype which can further be used to flatten out the data. Please find below an example.

>> Upload the JSON data in S3
>> Create external table with the JSON data stored in S3

CREATE EXTERNAL TABLE myspectrum_schema.nest (
    lastUpdatedBy VARCHAR(255),
    clientName STRUCT<
        Values: ARRAY<STRUCT<
            Value: VARCHAR(255),
            Confidence: FLOAT,
            Source: VARCHAR(255)
        >>,
        Overall: STRUCT<
            Value: VARCHAR(255),
            Confidence: FLOAT,
            Source: VARCHAR(255)
        >
    >,
    employerNumber STRUCT<
        Values: ARRAY<STRUCT<
            Value: VARCHAR(255),
            Confidence: FLOAT,
            Source: VARCHAR(255)
        >>,
        Overall: STRUCT<
            Value: VARCHAR(255),
            Confidence: FLOAT,
            Source: VARCHAR(255)
        >
    >
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES (
   'serialization.format' = '1'
) 
LOCATION 's3://<bucket-name>/<folder-name>/';

>> create manifest file in S3 using AWS CLI 

aws s3 ls s3://<bucket-name>/<folder-name>
aws s3 ls s3://<bucket-name>/<folder-name> >manifest.txt

>> select the data using the below query

 SELECT
  nest.lastUpdatedBy, 
  nest.clientName.Overall.Value,
  nest.employerNumber.Overall.Value AS employerName
FROM myspectrum_schema.nest
;

Output:

lastUpdatedBy | Value | employerName
-------------------------------------------
cewu@abc.com  |	Ce Wu | ABC company llc	


Please refer the following tutorials to implement Redshift spectrum and nested data quering.

[+] https://docs.aws.amazon.com/redshift/latest/dg/tutorial-query-nested-data.html 
[+] https://aws.amazon.com/blogs/big-data/working-with-nested-data-types-using-amazon-redshift-spectrum/
[+] https://docs.aws.amazon.com/redshift/latest/dg/c-getting-started-using-spectrum.html 

That being said, if you would like resource based troubleshooting, please raise a support case with AWS for further information and we will get back to you with on the support case.

Thank you!
AWS
回答済み 1年前

ログインしていません。 ログイン 回答を投稿する。

優れた回答とは、質問に明確に答え、建設的なフィードバックを提供し、質問者の専門分野におけるスキルの向上を促すものです。

質問に答えるためのガイドライン