Redshift unnested json string

0

Hi guy,

I have a json string like:

{
    "lastUpdatedBy": "cewu@abc.com",
    "clientName": {
        "Values": [
            {
                "Value": "Ce Wu",
                "Confidence": 100.0,
                "Source": "automation"
            }
        ],
        "Overall": {
            "Value": "Ce Wu",
            "Confidence": 100.0,
            "Source": "automation"
        }
    },
    "employerNumber": {
        "Values": [
            {
                "Value": "abc company",
                "Confidence": 90.0,
                "Source": "tm1@abc.com",
            },
            {
                "Value": "ABC company llc",
                "Confidence": 90.0,
                "Source": "tm2@abc.com"
            }
        ],
        "Overall": {
            "Value": "ABC company llc",
            "Confidence": 90.0,
            "Source": "tm2@abc.com"
        }
    },
    "employeeID":{
        ...
    },
    "employeeAddress":{
        ...
    },
    ...
   there could be 100+ sub columns nested in the json string
}

Is there a way in Redshift I could unnested Values data from the json string to:

|-------------------+--------------------+-------------------+------------------|
|lastUpdatedBy      |field               |value              |source            |
|-------------------+--------------------+-------------------+------------------|
|cewu@abc.com       |clientName          |Ce Wu              |automation.       |
|cewu@abc.com       |employerName        |abc company        |tm1@abc.com       |
|cewu@abc.com       |employerName        |ABC company llc    |tm2@abc.com       |
|cewu@abc.com       |employeeID          |...                |...               |
|cewu@abc.com       |employeeAddress     |...                |...               |
|cewu@abc.com       |...                 |...                |...               |
|-------------------+--------------------+-------------------+------------------|

I am able to parse the json string to get each sub nested fields. And try

select
j.*
from (
select
json_parse(json_string) as j
from tbl
)

But it return error that cannot use wildcard in super column. So I stocked at how to parse all nested json sub fields and cannot unpivot the table to [field, value] format.

Any idea how to sole this issue?

Thanks,

Ce Wu

Ce Wu
질문됨 일 년 전1071회 조회
1개 답변
0

Hello,

I understand that you would like to parse the nested JSON objects in your Redshift cluster. You have defined column as super datatype and would like to parse the objects. I would like to inform you that when you store the JSON as SUPER data and try to extract nested columns it is a little difficult. I tried using the following exercise.

SET enable_case_sensitive_identifier TO true; create table testparse(super_json super); insert into testparse values(JSON_PARSE('{ "lastUpdatedBy": "cewu@abc.com", "clientName": { "Values": [{ "Value": "Ce Wu", "Confidence": 100.0, "Source": "automation" }], "Overall": { "Value": "Ce Wu", "Confidence": 100.0, "Source": "automation" } }, "employerNumber": { "Values": [{ "Value": "abc company", "Confidence": 90.0, "Source": "tm1@abc.com" }, { "Value": "ABC company llc", "Confidence": 90.0, "Source": "tm2@abc.com" } ], "Overall": { "Value": "ABC company llc", "Confidence": 90.0, "Source": "tm2@abc.com" } } }'));

select super_json."lastUpdatedBy", super_json."clientName"."Overall"."Source" from testparse;

Output:


lastUpdatedBy  | Source
---------------------------
"cewu@abc.com" | "automation"	

Please find below the limitation of using SUPER datatype.

[+] https://docs.aws.amazon.com/redshift/latest/dg/limitations-super.html

However, a better way to parse nested columns in JSON is with Redshift spectrum. Redshift spectrum supports creating columns of STRUCT, ARRAY datatype which can further be used to flatten out the data. Please find below an example.

>> Upload the JSON data in S3
>> Create external table with the JSON data stored in S3

CREATE EXTERNAL TABLE myspectrum_schema.nest (
    lastUpdatedBy VARCHAR(255),
    clientName STRUCT<
        Values: ARRAY<STRUCT<
            Value: VARCHAR(255),
            Confidence: FLOAT,
            Source: VARCHAR(255)
        >>,
        Overall: STRUCT<
            Value: VARCHAR(255),
            Confidence: FLOAT,
            Source: VARCHAR(255)
        >
    >,
    employerNumber STRUCT<
        Values: ARRAY<STRUCT<
            Value: VARCHAR(255),
            Confidence: FLOAT,
            Source: VARCHAR(255)
        >>,
        Overall: STRUCT<
            Value: VARCHAR(255),
            Confidence: FLOAT,
            Source: VARCHAR(255)
        >
    >
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES (
   'serialization.format' = '1'
) 
LOCATION 's3://<bucket-name>/<folder-name>/';

>> create manifest file in S3 using AWS CLI 

aws s3 ls s3://<bucket-name>/<folder-name>
aws s3 ls s3://<bucket-name>/<folder-name> >manifest.txt

>> select the data using the below query

 SELECT
  nest.lastUpdatedBy, 
  nest.clientName.Overall.Value,
  nest.employerNumber.Overall.Value AS employerName
FROM myspectrum_schema.nest
;

Output:

lastUpdatedBy | Value | employerName
-------------------------------------------
cewu@abc.com  |	Ce Wu | ABC company llc	


Please refer the following tutorials to implement Redshift spectrum and nested data quering.

[+] https://docs.aws.amazon.com/redshift/latest/dg/tutorial-query-nested-data.html 
[+] https://aws.amazon.com/blogs/big-data/working-with-nested-data-types-using-amazon-redshift-spectrum/
[+] https://docs.aws.amazon.com/redshift/latest/dg/c-getting-started-using-spectrum.html 

That being said, if you would like resource based troubleshooting, please raise a support case with AWS for further information and we will get back to you with on the support case.

Thank you!
AWS
답변함 일 년 전

로그인하지 않았습니다. 로그인해야 답변을 게시할 수 있습니다.

좋은 답변은 질문에 명확하게 답하고 건설적인 피드백을 제공하며 질문자의 전문적인 성장을 장려합니다.

질문 답변하기에 대한 가이드라인