
Loading SUPER data into Redshift with COPY escapes JSON fields


Hi! I'm running an UNLOAD from Athena and then a COPY into Redshift, and I'm having an issue with how the SUPER fields are loaded into Redshift.

UNLOAD (
    SELECT
          curpVerification AS curpVerification
          ,idOcrConfidence AS idOcrConfidence
          ,"$path" as s3_file_name
          ,cast(dt as date) dt
          ,localtimestamp AS last_synced_time
    FROM table
    WHERE dt = '2024-06-26'
    AND   "$path" like 's3://xxxxx'
) TO 's3://xxxxxx/test/'
WITH (format = 'json')

Querying the file on S3 as JSON, I see this:

{
  "curpverification": "{\"probation_document_data\":{}}",
  "idocrconfidence": "{\"overallconfidence\":{\"value\":\"100.0\",\"status\":\"OK\"}}",
  "s3_file_name": "s3://xxxxx",
  "dt": "2024-06-26",
  "last_synced_time": "2024-06-26 14:49:49.567"
}

As CSV I see this:

{"probation_document_data":{}},"{""overallconfidence"":{""value"":""100.0"",""status"":""OK""}}",xxxxxxxx,2024-06-26,2024-06-26 14:49:49.567

The issue is that when I COPY this data into a table with SUPER fields (for the JSON fields), it loads the JSON escaped, as: "{\"overallconfidence\":{\"value\":\"100.0\",\"status\":\"OK\"}}"

copy schema.table ("curpVerification","idOcrConfidence","s3_file_name","dt","last_synced_time")
from 's3://xxxxxx'
iam_role 'xxxxx'
gzip
format as json 'auto ignorecase'

What am I doing wrong? The SUPER field should load as parsed JSON, not as an escaped string. Thanks!

  • Is your Athena accessing data from S3? Why unload from it before you copy into Redshift? You could just query the Athena table directly using Spectrum, no?

asked 2 years ago · 1.3K views
1 Answer

The issue you're facing is that the JSON data is escaped during the UNLOAD from Athena. With the format = 'json' option, Athena serializes each column value; if a column is a string that happens to contain JSON, that string is JSON-encoded a second time, which escapes the quotes inside it.

When the file is then loaded into Redshift with COPY, the value is treated as a scalar string, not as a JSON object. That's why you're seeing the escaped characters in your SUPER column.
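You can see the difference directly in Redshift. This is a sketch assuming the column names from the question; an escaped-string SUPER value navigates to NULL, while a re-parsed one navigates correctly:

```sql
-- The SUPER column currently holds a string, so PartiQL navigation
-- into it returns NULL:
SELECT idOcrConfidence.overallconfidence.value
FROM schema.table;

-- Re-parsing the string (cast to varchar first, since the column is
-- already SUPER) yields a real nested object that navigates correctly:
SELECT parsed.overallconfidence.value
FROM (SELECT JSON_PARSE(idOcrConfidence::varchar) AS parsed
      FROM schema.table);
```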

To resolve this, you can modify your UNLOAD command to output the data as CSV, and then use a JSON parsing function in Redshift to convert the string to a JSON object after it's been loaded.

Here's how you can modify your UNLOAD command:

UNLOAD (
    SELECT
          CAST(curpVerification AS varchar) AS curpVerification
          ,CAST(idOcrConfidence AS varchar) AS idOcrConfidence
          ,"$path" as s3_file_name
          ,cast(dt as date) dt
          ,localtimestamp AS last_synced_time
    FROM table
    WHERE dt = '2024-06-26'
    AND   "$path" like 's3://xxxxx'
) TO 's3://xxxxxx/test/'
WITH (format = 'csv')

And then, after loading the data into Redshift, you can use the JSON_PARSE function (Redshift's function for converting a JSON string into a SUPER value) to convert the string; cast to varchar first if the column is already SUPER:

UPDATE schema.table
SET curpVerification = JSON_PARSE(curpVerification::varchar),
    idOcrConfidence = JSON_PARSE(idOcrConfidence::varchar)

This should give you the desired result in your SUPER columns. Please replace schema.table with your actual schema and table name. Hope this helps! Let me know if you have any other questions.
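If rewriting rows in place with UPDATE is undesirable, another option is to COPY into a staging table with VARCHAR columns and parse on insert. This is a sketch; the staging table name and column sizes are illustrative:

```sql
-- Staging table receives the escaped JSON as plain strings.
CREATE TABLE schema.table_staging (
    curpVerification VARCHAR(65535),
    idOcrConfidence  VARCHAR(65535),
    s3_file_name     VARCHAR(1024),
    dt               DATE,
    last_synced_time TIMESTAMP
);

COPY schema.table_staging
FROM 's3://xxxxxx'
IAM_ROLE 'xxxxx'
GZIP
FORMAT AS JSON 'auto ignorecase';

-- Parse the strings into SUPER on the way into the target table.
INSERT INTO schema.table
SELECT JSON_PARSE(curpVerification),
       JSON_PARSE(idOcrConfidence),
       s3_file_name,
       dt,
       last_synced_time
FROM schema.table_staging;
```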

EXPERT
answered 2 years ago
  • Hi! That solution could be OK for small tables, but running it on big tables doesn't sound optimal. Which format/parameters could I use to UNLOAD from Athena and load unescaped fields into Redshift? Thanks!
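One hedged alternative that avoids post-processing entirely: if the Athena source columns are strings containing JSON, casting them to Athena's JSON type with json_parse inside the UNLOAD should make format = 'json' emit nested objects instead of escaped strings, so COPY can load them into SUPER directly. An untested sketch, reusing the question's query:

```sql
UNLOAD (
    SELECT
          json_parse(curpVerification) AS curpVerification
          ,json_parse(idOcrConfidence) AS idOcrConfidence
          ,"$path" AS s3_file_name
          ,CAST(dt AS date) dt
          ,localtimestamp AS last_synced_time
    FROM table
    WHERE dt = '2024-06-26'
    AND   "$path" LIKE 's3://xxxxx'
) TO 's3://xxxxxx/test/'
WITH (format = 'json')
```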
