When migrating data from Amazon Redshift to DynamoDB using AWS Glue (Spark), SUPER data types require careful handling. By default, Spark reads Redshift's SUPER values (including structs, maps, and arrays) as plain strings during the read operation. This leads to significant data integrity issues downstream when the data is written to DynamoDB, because the complex JSON structures are stored as string values rather than in their original hierarchical format.
The approaches below work around the issue by transforming the SUPER columns with the following code snippets (assuming the columns with SUPER data types in Redshift are already known) before writing the data to the DynamoDB table. The code works for both fixed and dynamic JSON schemas in the columns.
Approach 1: Using custom code to infer and apply the JSON schema of the Redshift SUPER columns (c_orders1, c_orders2, and c_orders3)
#Sample data in Redshift:
CREATE TABLE public.jsontest
(c_name varchar,
age bigint,
c_orders1 super,
c_orders2 super,
c_orders3 super
);
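To make the walkthrough concrete, a row can be inserted with Redshift's JSON_PARSE function, which converts JSON text into SUPER values; the values below are purely illustrative:
INSERT INTO public.jsontest
VALUES (
  'customer1',
  30,
  JSON_PARSE('{"order_id": 1, "status": "shipped"}'),
  JSON_PARSE('{"order_id": 2, "items": {"sku": "A100", "qty": 2}}'),
  JSON_PARSE('{"order_id": 3, "tags": ["gift", "priority"]}')
);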
#Read Redshift data into a Glue DynamicFrame
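A minimal sketch of the read step, assuming a Glue connection named my-redshift-connection and an S3 temporary directory (both names are placeholders):
from awsglue.context import GlueContext
from pyspark.context import SparkContext

sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session

# Read the Redshift table into a DynamicFrame. The SUPER columns
# (c_orders1, c_orders2, c_orders3) arrive as JSON strings.
redshift_dyf = glueContext.create_dynamic_frame.from_options(
    connection_type="redshift",
    connection_options={
        "useConnectionProperties": "true",
        "connectionName": "my-redshift-connection",        # placeholder connection name
        "dbtable": "public.jsontest",
        "redshiftTmpDir": "s3://my-temp-bucket/redshift/",  # placeholder temp directory
    },
)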
#Infer schema of each JSON column and create new columns with inferred struct schema
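A sketch of the inference step using Spark's built-in JSON reader; it assumes the SUPER values are JSON objects and reuses the redshift_dyf and spark objects from the read step above:
from pyspark.sql.functions import col, from_json

df = redshift_dyf.toDF()
super_columns = ["c_orders1", "c_orders2", "c_orders3"]

for c in super_columns:
    # Infer a struct schema from the JSON strings in this column; because the
    # schema is derived from the data itself, this handles both fixed and
    # dynamic JSON structures across rows.
    inferred_schema = spark.read.json(df.select(col(c)).rdd.map(lambda r: r[0])).schema
    # Replace the JSON string column with a parsed struct column.
    df = df.withColumn(c, from_json(col(c), inferred_schema))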
#Write the dataframe to the DynamoDB table
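A sketch of the write step, assuming a target DynamoDB table named jsontest-ddb (placeholder name):
from awsglue.dynamicframe import DynamicFrame

# Convert back to a DynamicFrame so the Glue DynamoDB connector can be used.
output_dyf = DynamicFrame.fromDF(df, glueContext, "output_dyf")

glueContext.write_dynamic_frame.from_options(
    frame=output_dyf,
    connection_type="dynamodb",
    connection_options={
        "dynamodb.output.tableName": "jsontest-ddb",   # placeholder table name
        "dynamodb.throughput.write.percent": "1.0",
    },
)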
The DynamoDB table output after writing the above dataframe shows the SUPER columns stored as nested maps and lists rather than as JSON strings.
Approach 2: Using the built-in Glue DynamicFrame transform ‘gs_parse_json(colName)’ to convert the JSON string columns to the corresponding struct type, without converting to a DataFrame.
When the "Parse JSON Column" transform is used in a visual ETL Glue job, the following Python library paths are added under the Advanced properties of the job: "s3://aws-glue-studio-transforms-510798373988-prod-us-east-1/gs_common.py,s3://aws-glue-studio-transforms-510798373988-prod-us-east-1/gs_parse_json.py"
These are the internal buckets in which the packages exist. If you instead create a Spark ETL (script) job or a Glue notebook, you have to add the above Python library paths explicitly to use this transform. Refer to the steps below:
To add them in a Glue job without the visual editor:
Go to Job details > Libraries > Python library path and add the above S3 paths.
[Setting the Python library path is the same as adding --extra-py-files in the job parameters.]
To add them in a Glue notebook, use:
%extra_py_files <Comma separated list of additional Python files from Amazon S3>
Example:
%extra_py_files s3://aws-glue-studio-transforms-510798373988-prod-us-east-1/gs_common.py,s3://aws-glue-studio-transforms-510798373988-prod-us-east-1/gs_parse_json.py
With the library paths added, the gs_parse_json transform can be applied to each SUPER column, and the resulting DynamicFrame can be written to the DynamoDB table with the original JSON schema intact, as sketched below.
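A sketch of Approach 2; it reuses the glueContext and redshift_dyf objects from the read step in Approach 1, and the exact registration and call shape of gs_parse_json should be treated as an assumption based on the code Glue Studio generates:
# Depending on the environment, the module may need to be imported explicitly
# so that the transform registers itself on DynamicFrame (assumption):
# import gs_parse_json

parsed_dyf = redshift_dyf
for column_name in ["c_orders1", "c_orders2", "c_orders3"]:
    # Convert the JSON string column to its corresponding struct type.
    parsed_dyf = parsed_dyf.gs_parse_json(colName=column_name)

# Write the parsed DynamicFrame to DynamoDB, keeping the nested structure.
glueContext.write_dynamic_frame.from_options(
    frame=parsed_dyf,
    connection_type="dynamodb",
    connection_options={
        "dynamodb.output.tableName": "jsontest-ddb",   # placeholder table name
        "dynamodb.throughput.write.percent": "1.0",
    },
)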