Data transformation not taken into account in AWS Glue

I have an S3 bucket with folders that contain files. I want to build a database so that I can query these documents on a few keys through a Lambda-based API, but first I need to normalize the data. For example, I need to transform all the files in the folder /jomalone/, which look like the following:

{
  "data": {
    "products": {
      "items": [
        {
          "default_category": {
            "id": "25956",
            "value": "Bath & Body"
          },
          "description": "London's Covent Garden early morning market.  Succulent nectarine, peach and cassis and delicate spring flowers melt into the note of acacia honey.  Sweet and delightfully playful.  Our luxuriously rich Body Crème with its conditioning oils of jojoba seed, cocoa seed and sweet almond, help to hydrate, nourish and protect the skin, while delicious signature fragrances leave your body scented all over.",
          "display_name": "Nectarine Blossom & Honey Body Crème",
          "is_hazmat": false,
          "meta": {
            "description": "The Jo Malone™ Nectarine Blossom & Honey Body Crème leaves skin beautifully scented with fruity notes of nectarine and peach sweetened with acacia honey."
          },
                  ...
                  {
                    "currency": "EUR",
                    "is_discounted": false,
                    "include_tax": {
                      "price": 68,
                      "original_price": 68,
                      "price_per_unit": 38.86,
                      "price_formatted": "€68.00",
                      "original_price_formatted": "€68.00",
                      "price_per_unit_formatted": "€38.86 / 100ML"
                    }
                  }
                ],
                "sizes": [
                  {
                    "value": "175ML",
                    "key": 1
                  }
                ],
                "shades": [
                  {
                    "name": "",
                    "description": "",
                    "hex_val": ""
                  }
                ],
                "sku_id": "L4P801",
                "sku_badge": null,
                "unit_size_formatted": "100ML",
                "upc": "690251040254",
                "is_engravable": null,
                "perlgem": {
                  "SKU_BASE_ID": 63584
                },
                "media": {
                  "large": [
                    {
                      "src": "/media/export/cms/products/1000x1000/jo_sku_L4P801_1000x1000_0.png",
                      "alt": "Nectarine Blossom & Honey Body Crème",
                      "height": 1000,
                      "width": 1000
                    },
                    {
                      "src": "/media/export/cms/products/1000x1000/jo_sku_L4P801_1000x1000_1.png",
                      "alt": "Nectarine Blossom & Honey Body Crème",
                      "height": 1000,
                      "width": 1000
                    }
                  ],
                  "medium": [
                    {
                      "src": "/media/export/cms/products/670x670/jo_sku_L4P801_670x670_0.png",
                      "alt": "Nectarine Blossom & Honey Body Crème",
                      "height": 670,
                      "width": 670
                    }
                  ],
                  "small": [
                    {
                      "src": "/media/export/cms/products/100x100/jo_sku_L4P801_100x100_0.png",
                      "alt": "Nectarine Blossom & Honey Body Crème",
                      "height": 100,
                      "width": 100
                    }
                  ]
                },
                "collection": null,
                "recipient": [
                  {
                    "key": "mom-recipient",
                    "value": "mom_recipient"
                  },
                  {
                    "key": "bride-recipient",
                    "value": "bride_recipient"
                  },
                  {
                    "key": "host-recipient",
                    "value": "host_recipient"
                  },
                  {
                    "key": "me-recipient",
                    "value": "me_recipient"
                  },
                  {
                    "key": "her-recipient",
                    "value": "her_recipient"
                  }
                ],
                "occasion": [
                  {
                    "key": "thankyou-occasion",
                    "value": "thankyou_occasion"
                  },
                  {
                    "key": "birthday-occasion",
                    "value": "birthday_occasion"
                  },
                  {
                    "key": "treat-occasion",
                    "value": "treat_occasion"
                  }
                ],
                "location": [
                  {
                    "key": "bathroom-location",
                    "value": "bathroom_location"
                  }
                ]
              }
            ]
          }
        }
      ]
    }
  }
}

into JSON with the following schema:

brandName     String
productName   String 
productLink   String
productType   ?
maleFemale    Male/Female
price         float
unitPrice     String
size          float
ingredients   String
notes         String
numReviews    Int
userIDs       float
locations     float
dates         Date
ages          int
sexes         M/F
ratings       Int
reviews       Array of String
sources       String
characteristics  String
specificRatings  String

So I tried AWS Glue, but I don't know how to get rid of the nesting, that is, the wrapper keys at the beginning:

  "data": {
    "products": {
      "items": [
          ...
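
Outside of Glue, what I mean by getting rid of these keys can be sketched in plain Python as below. This is only an illustration of the intended result, not a Glue solution; the function name extract_items is hypothetical and the input is a shortened stand-in for one of my files:

```python
import json


def extract_items(document: dict) -> list:
    """Return the list under data.products.items, or [] if any level is missing."""
    return document.get("data", {}).get("products", {}).get("items", [])


raw = '{"data": {"products": {"items": [{"display_name": "Nectarine Blossom & Honey Body Crème"}]}}}'
items = extract_items(json.loads(raw))
for item in items:
    print(item["display_name"])
```

Each element of the returned list is one product, with the three wrapper levels removed.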

To test this, I tried modifying the field names:

[screenshot]

But it doesn't seem to have had any of the effects I was looking for, judging by the Preview tab:

[screenshot]

I deleted the first and last underlined fields and modified the others, but none of this seems to have been taken into account in the Preview.

Indeed, there doesn't seem to be anything like the corresponding mapping in the generated script:

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# Script generated for node S3 bucket
S3bucket_node1 = glueContext.create_dynamic_frame.from_options(
    format_options={"multiline": False},
    connection_type="s3",
    format="json",
    connection_options={"paths": ["s3://datahubpredicity/JoMalone/"], "recurse": True},
    transformation_ctx="S3bucket_node1",
)

# Script generated for node ApplyMapping
ApplyMapping_node2 = ApplyMapping.apply(
    frame=S3bucket_node1,
    mappings=[("data.products.items", "array", "data.products.items", "array")],
    transformation_ctx="ApplyMapping_node2",
)

# Script generated for node S3 bucket
S3bucket_node3 = glueContext.write_dynamic_frame.from_options(
    frame=ApplyMapping_node2,
    connection_type="s3",
    format="json",
    connection_options={"path": "s3://datahubpredicity/merged/", "partitionKeys": []},
    transformation_ctx="S3bucket_node3",
)

job.commit()
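
One thing visible in this script is that the only ApplyMapping entry uses the same path, data.products.items, as both source and target. The plain-Python sketch below does not call Glue at all. It only imitates a dotted-path mapping, under the assumption that a dotted target path rebuilds the nesting, to illustrate why such an entry would leave the structure unchanged while a non-dotted target would not. The helper names get_path, set_path and apply_mappings are hypothetical:

```python
from typing import Any


def get_path(record: dict, dotted: str) -> Any:
    """Follow a dotted path such as 'data.products.items' into nested dicts."""
    value: Any = record
    for key in dotted.split("."):
        value = value[key]
    return value


def set_path(record: dict, dotted: str, value: Any) -> None:
    """Write a value at a dotted path, creating intermediate dicts as needed."""
    keys = dotted.split(".")
    for key in keys[:-1]:
        record = record.setdefault(key, {})
    record[keys[-1]] = value


def apply_mappings(record: dict, mappings: list) -> dict:
    """Build a new record from (source_path, target_path) pairs."""
    out: dict = {}
    for source, target in mappings:
        set_path(out, target, get_path(record, source))
    return out


doc = {"data": {"products": {"items": [{"display_name": "x"}]}}}
same = apply_mappings(doc, [("data.products.items", "data.products.items")])
flat = apply_mappings(doc, [("data.products.items", "items")])
print(same == doc)
print(flat)
```

In this imitation, the first mapping reproduces the input exactly, which matches what I see in the Preview, and only the second one drops the wrapper keys.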
asked 2 years ago, 78 views