Redshift unnested json string

0

Hi guy,

I have a json string like:

{
    "lastUpdatedBy": "cewu@abc.com",
    "clientName": {
        "Values": [
            {
                "Value": "Ce Wu",
                "Confidence": 100.0,
                "Source": "automation"
            }
        ],
        "Overall": {
            "Value": "Ce Wu",
            "Confidence": 100.0,
            "Source": "automation"
        }
    },
    "employerNumber": {
        "Values": [
            {
                "Value": "abc company",
                "Confidence": 90.0,
                "Source": "tm1@abc.com",
            },
            {
                "Value": "ABC company llc",
                "Confidence": 90.0,
                "Source": "tm2@abc.com"
            }
        ],
        "Overall": {
            "Value": "ABC company llc",
            "Confidence": 90.0,
            "Source": "tm2@abc.com"
        }
    },
    "employeeID":{
        ...
    },
    "employeeAddress":{
        ...
    },
    ...
   there could be 100+ sub columns nested in the json string
}

Is there a way in Redshift I could unnested Values data from the json string to:

|-------------------+--------------------+-------------------+------------------|
|lastUpdatedBy      |field               |value              |source            |
|-------------------+--------------------+-------------------+------------------|
|cewu@abc.com       |clientName          |Ce Wu              |automation.       |
|cewu@abc.com       |employerName        |abc company        |tm1@abc.com       |
|cewu@abc.com       |employerName        |ABC company llc    |tm2@abc.com       |
|cewu@abc.com       |employeeID          |...                |...               |
|cewu@abc.com       |employeeAddress     |...                |...               |
|cewu@abc.com       |...                 |...                |...               |
|-------------------+--------------------+-------------------+------------------|

I am able to parse the json string to get each sub nested fields. And try

select
j.*
from (
select
json_parse(json_string) as j
from tbl
)

But it return error that cannot use wildcard in super column. So I stocked at how to parse all nested json sub fields and cannot unpivot the table to [field, value] format.

Any idea how to sole this issue?

Thanks,

Ce Wu

Ce Wu
asked a year ago1005 views
1 Answer
0

Hello,

I understand that you would like to parse the nested JSON objects in your Redshift cluster. You have defined column as super datatype and would like to parse the objects. I would like to inform you that when you store the JSON as SUPER data and try to extract nested columns it is a little difficult. I tried using the following exercise.

SET enable_case_sensitive_identifier TO true; create table testparse(super_json super); insert into testparse values(JSON_PARSE('{ "lastUpdatedBy": "cewu@abc.com", "clientName": { "Values": [{ "Value": "Ce Wu", "Confidence": 100.0, "Source": "automation" }], "Overall": { "Value": "Ce Wu", "Confidence": 100.0, "Source": "automation" } }, "employerNumber": { "Values": [{ "Value": "abc company", "Confidence": 90.0, "Source": "tm1@abc.com" }, { "Value": "ABC company llc", "Confidence": 90.0, "Source": "tm2@abc.com" } ], "Overall": { "Value": "ABC company llc", "Confidence": 90.0, "Source": "tm2@abc.com" } } }'));

select super_json."lastUpdatedBy", super_json."clientName"."Overall"."Source" from testparse;

Output:


lastUpdatedBy  | Source
---------------------------
"cewu@abc.com" | "automation"	

Please find below the limitation of using SUPER datatype.

[+] https://docs.aws.amazon.com/redshift/latest/dg/limitations-super.html

However, a better way to parse nested columns in JSON is with Redshift spectrum. Redshift spectrum supports creating columns of STRUCT, ARRAY datatype which can further be used to flatten out the data. Please find below an example.

>> Upload the JSON data in S3
>> Create external table with the JSON data stored in S3

CREATE EXTERNAL TABLE myspectrum_schema.nest (
    lastUpdatedBy VARCHAR(255),
    clientName STRUCT<
        Values: ARRAY<STRUCT<
            Value: VARCHAR(255),
            Confidence: FLOAT,
            Source: VARCHAR(255)
        >>,
        Overall: STRUCT<
            Value: VARCHAR(255),
            Confidence: FLOAT,
            Source: VARCHAR(255)
        >
    >,
    employerNumber STRUCT<
        Values: ARRAY<STRUCT<
            Value: VARCHAR(255),
            Confidence: FLOAT,
            Source: VARCHAR(255)
        >>,
        Overall: STRUCT<
            Value: VARCHAR(255),
            Confidence: FLOAT,
            Source: VARCHAR(255)
        >
    >
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES (
   'serialization.format' = '1'
) 
LOCATION 's3://<bucket-name>/<folder-name>/';

>> create manifest file in S3 using AWS CLI 

aws s3 ls s3://<bucket-name>/<folder-name>
aws s3 ls s3://<bucket-name>/<folder-name> >manifest.txt

>> select the data using the below query

 SELECT
  nest.lastUpdatedBy, 
  nest.clientName.Overall.Value,
  nest.employerNumber.Overall.Value AS employerName
FROM myspectrum_schema.nest
;

Output:

lastUpdatedBy | Value | employerName
-------------------------------------------
cewu@abc.com  |	Ce Wu | ABC company llc	


Please refer the following tutorials to implement Redshift spectrum and nested data quering.

[+] https://docs.aws.amazon.com/redshift/latest/dg/tutorial-query-nested-data.html 
[+] https://aws.amazon.com/blogs/big-data/working-with-nested-data-types-using-amazon-redshift-spectrum/
[+] https://docs.aws.amazon.com/redshift/latest/dg/c-getting-started-using-spectrum.html 

That being said, if you would like resource based troubleshooting, please raise a support case with AWS for further information and we will get back to you with on the support case.

Thank you!
AWS
answered a year ago

You are not logged in. Log in to post an answer.

A good answer clearly answers the question and provides constructive feedback and encourages professional growth in the question asker.

Guidelines for Answering Questions