JSON data is being treated as a string while loading data from Postgres to a JSON file with AWS Glue


I want to migrate Postgres data to Redshift. I have a lot of jsonb data in Postgres, so I used the SUPER data type in Redshift, but the problem is that while loading the data into Redshift it is adding escape characters like this:

"{\"logo\": \"https://s3.amazonaws.com/images.sequr.io/logo/logo_1_1480712118.png\", \"color\": \"#145a8b\", \"suppress_fc_for_26_bits_keycard\": true}"

which is very problematic when using the Redshift JSON functions. I am loading the Postgres data into JSON files using PySpark DataFrames and then a COPY command to load the data from S3 into Redshift, but the JSON files themselves contain the escape characters, which is why the data loads like this. I know this is happening because the PySpark DataFrame treats the jsonb data as a string and adds backslashes to escape the double quotes (") in the JSON object.
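
For reference, this is roughly how I am loading the data today (a minimal sketch; the connection details, table name and bucket path are placeholders):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("pg_to_s3_json").getOrCreate()

# Read the Postgres table over JDBC; the jsonb column arrives as a plain string.
df = (spark.read.format("jdbc")
      .option("url", "jdbc:postgresql://<host>:5432/<db>")
      .option("dbtable", "public.<table>")
      .option("user", "<user>")
      .option("password", "<password>")
      .load())

df.printSchema()  # the jsonb column shows up as string here

# Writing it out re-serializes that string, so the inner quotes get escaped.
df.write.mode("overwrite").json("s3://<bucket>/<prefix>/")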

Can anyone help me avoid this problem so that I can load the Postgres data into JSON in the correct format? I have tried everything with no luck. The only solution I can think of is to loop through each row and write it into the JSON file one by one, which is not feasible as I have about 150 million rows.

If there is any method other than PySpark I can use, please let me know. I have been stuck here for almost two weeks.

Thanks in Advance

1 Answer

It sounds like the JSON files you write to S3 for COPY with the JSON format are not proper JSON objects. You probably have to do something like this, assuming data is the jsonb column (you could also infer the schema):

from pyspark.sql.functions import from_json, col

df.withColumn('data', from_json(col('data'), schema)).select('data.*').write.json("s3://.....")
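
If you prefer to infer the schema instead of defining it yourself, a rough sketch (assuming the jsonb column is called data):

# Infer a schema from the JSON strings themselves, then parse the column with it.
inferred_schema = spark.read.json(df.rdd.map(lambda r: r['data'])).schema
df.withColumn('data', from_json(col('data'), inferred_schema)).select('data.*').write.json("s3://.....")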

The second option is to load the column as a string, as you are doing, and in Redshift use JSON_PARSE to convert it into the SUPER column.
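
A rough sketch of that second option, assuming the redshift_connector Python driver and placeholder connection details, table and column names (COPY the raw JSON text into a varchar staging column first, then convert it with JSON_PARSE):

import redshift_connector

# Placeholder connection details.
conn = redshift_connector.connect(
    host="<cluster-endpoint>",
    database="<db>",
    user="<user>",
    password="<password>",
)
cur = conn.cursor()

# Assumes staging_table.data_raw is a VARCHAR column already loaded with COPY,
# and target_table.data is a SUPER column.
cur.execute("""
    INSERT INTO target_table (id, data)
    SELECT id, JSON_PARSE(data_raw)
    FROM staging_table
""")
conn.commit()

The same INSERT ... SELECT could of course be run from any SQL client instead of Python.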

AWS
EXPERT
answered 10 months ago
  • Actually the problem here is that the schema is not fixed; it has more than 20 fields which keep changing, and sometimes there are blank values too, so I cannot define a schema here. This is the JSON data in Postgres in case you need it: {"logo": "https://s3.amazonaws.com/images.sequr.io/logo/logo_2_1480710891.png", "color": "#919036", "suppress_fc_for_26_bits_keycard": true}

    So I guess it is in the correct format.

  • You could ask Spark to read the column and infer the schema (something like this: spark.read.json(yourdf.rdd.map(lambda r: r[colName])).schema). But what is the point of trying to store the data in a structured format like a RS table when it doesn't have a known schema? Maybe just store it as a JSON string.

  • Yes, we are storing it as a JSON string in the SUPER data type, but with the SUPER data type it adds escape characters, whereas with a varchar type it is inserted in the correct format. For a few of the mandatory keys we are flattening the records so that we can use them effectively. Currently we have all this data in Elasticsearch and we are thinking of storing it in Redshift. Given the kind of data we have, it is probably better to keep it in the Elasticsearch data warehouse because we do not have a fixed schema, especially for the MongoDB data, where we have so many nested objects, and storing that kind of data in Redshift is very challenging.
