- Newest
- Most votes
- Most comments
Hello,
I understand that you would like to parse the nested JSON objects in your Redshift cluster. You have defined column as super datatype and would like to parse the objects. I would like to inform you that when you store the JSON as SUPER data and try to extract nested columns it is a little difficult. I tried using the following exercise.
SET enable_case_sensitive_identifier TO true; create table testparse(super_json super); insert into testparse values(JSON_PARSE('{ "lastUpdatedBy": "cewu@abc.com", "clientName": { "Values": [{ "Value": "Ce Wu", "Confidence": 100.0, "Source": "automation" }], "Overall": { "Value": "Ce Wu", "Confidence": 100.0, "Source": "automation" } }, "employerNumber": { "Values": [{ "Value": "abc company", "Confidence": 90.0, "Source": "tm1@abc.com" }, { "Value": "ABC company llc", "Confidence": 90.0, "Source": "tm2@abc.com" } ], "Overall": { "Value": "ABC company llc", "Confidence": 90.0, "Source": "tm2@abc.com" } } }'));
select super_json."lastUpdatedBy", super_json."clientName"."Overall"."Source" from testparse;
Output:
lastUpdatedBy | Source
---------------------------
"cewu@abc.com" | "automation"
Please find below the limitation of using SUPER datatype.
[+] https://docs.aws.amazon.com/redshift/latest/dg/limitations-super.html
However, a better way to parse nested columns in JSON is with Redshift spectrum. Redshift spectrum supports creating columns of STRUCT, ARRAY datatype which can further be used to flatten out the data. Please find below an example.
>> Upload the JSON data in S3
>> Create external table with the JSON data stored in S3
CREATE EXTERNAL TABLE myspectrum_schema.nest (
lastUpdatedBy VARCHAR(255),
clientName STRUCT<
Values: ARRAY<STRUCT<
Value: VARCHAR(255),
Confidence: FLOAT,
Source: VARCHAR(255)
>>,
Overall: STRUCT<
Value: VARCHAR(255),
Confidence: FLOAT,
Source: VARCHAR(255)
>
>,
employerNumber STRUCT<
Values: ARRAY<STRUCT<
Value: VARCHAR(255),
Confidence: FLOAT,
Source: VARCHAR(255)
>>,
Overall: STRUCT<
Value: VARCHAR(255),
Confidence: FLOAT,
Source: VARCHAR(255)
>
>
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES (
'serialization.format' = '1'
)
LOCATION 's3://<bucket-name>/<folder-name>/';
>> create manifest file in S3 using AWS CLI
aws s3 ls s3://<bucket-name>/<folder-name>
aws s3 ls s3://<bucket-name>/<folder-name> >manifest.txt
>> select the data using the below query
SELECT
nest.lastUpdatedBy,
nest.clientName.Overall.Value,
nest.employerNumber.Overall.Value AS employerName
FROM myspectrum_schema.nest
;
Output:
lastUpdatedBy | Value | employerName
-------------------------------------------
cewu@abc.com | Ce Wu | ABC company llc
Please refer the following tutorials to implement Redshift spectrum and nested data quering.
[+] https://docs.aws.amazon.com/redshift/latest/dg/tutorial-query-nested-data.html
[+] https://aws.amazon.com/blogs/big-data/working-with-nested-data-types-using-amazon-redshift-spectrum/
[+] https://docs.aws.amazon.com/redshift/latest/dg/c-getting-started-using-spectrum.html
That being said, if you would like resource based troubleshooting, please raise a support case with AWS for further information and we will get back to you with on the support case.
Thank you!
Relevant content
- asked 3 years ago
- asked 2 years ago
- asked 8 months ago