Hi Raghav,
The approach you are using is correct. AWS Glue DynamicFrames are a perfect fit for this kind of data. Transforming the data to a simpler layout will certainly help you simplify your queries.
I was able to get the schema populated for the sample data above. Here is the code snippet I used:
```python
piece1 = glueContext.create_dynamic_frame.from_options(
    format_options={"multiline": True, "jsonPath": "$[*]"},
    connection_type="s3",
    format="json",
    connection_options={"paths": ["s3://BUCKET_NAME/PREFIX_KEY/repost_sample.json"]},
    transformation_ctx="S3bucket_node1",
)
piece1.printSchema()
```
Notice that I omitted the `recurse` parameter: since you are reading a single file rather than a nested directory, it is not required. I also added `jsonPath` to `format_options` to specify the location of the records within the JSON.
I was able to get the right schema, with the `choice` datatype created per Glue's default behaviour, which allows us to use `ResolveChoice` to resolve datatype conflicts (refer to the example in the docs):

```
root
|-- bucket_name: string
|-- bucket_creation_date: string
|-- additional_data: struct
| |-- bucket_acl: array
| | |-- element: struct
| | | |-- Grantee: struct
| | | | |-- DisplayName: string
| | | | |-- ID: string
| | | | |-- Type: string
| | | |-- Permission: string
| |-- bucket_policy: struct
| | |-- Version: string
| | |-- Id: string
| | |-- Statement: array
| | | |-- element: struct
| | | | |-- Sid: string
| | | | |-- Effect: string
| | | | |-- Principal: choice
| | | | | |-- string
| | | | | |-- struct
| | | | | | |-- Service: string
| | | | |-- Action: choice
| | | | | |-- array
| | | | | | |-- element: string
| | | | | |-- string
| | | | |-- Resource: choice
| | | | | |-- array
| | | | | | |-- element: string
| | | | | |-- string
| | | | |-- Condition: struct
| | | | | |-- Bool: struct
| | | | | | |-- aws_SecureTransport: string
| |-- public_access_block_configuration: struct
| | |-- BlockPublicAcls: boolean
| | |-- IgnorePublicAcls: boolean
| | |-- BlockPublicPolicy: boolean
| | |-- RestrictPublicBuckets: boolean
| |-- website_hosting: struct
| |-- bucket_tags: array
| | |-- element: struct
| | | |-- Key: string
| | | |-- Value: string
|-- processed_data: struct
```
In your case, you can use the `make_struct` or `make_cols` specification (refer to the docs for more info on these specs) to resolve the type conflict, and then you can easily write a query that checks the two columns (nested columns, if you use `make_struct`) for values.
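For illustration, this is roughly what the `ResolveChoice` step could look like. A sketch only: the `transformation_ctx` name is arbitrary, and you may prefer per-column `specs` over the frame-wide `choice` option.

```python
# Hedged sketch, reusing `piece1` from the snippet above. Passing
# choice="make_struct" resolves every ChoiceType in the frame; to target
# individual columns, pass specs=[(path, "make_struct"), ...] instead.
resolved = piece1.resolveChoice(
    choice="make_struct",
    transformation_ctx="resolvechoice_node",
)
# Each former choice column is now struct<array:..., string:...>, so a
# query can check both nested fields for values.
resolved.printSchema()
```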
If you wish to continue using Spark DataFrames, I'd say `string` is actually the right type for the `Action` column, since it has to accommodate both `string` and `array` values. You can use JSON functions within your query engine to extract the values you need.
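As an illustration of that last point, the sketch below normalizes the mixed column. It assumes `stmts` is a DataFrame with one row per policy statement and `Action` read as a string (so array values arrive as their JSON text); the names are illustrative, not from your job.

```python
from pyspark.sql import functions as F

# Normalize Action to array<string>, whether the record stored a plain
# string ("s3:GetObject") or a JSON array ('["s3:GetObject","s3:PutObject"]').
normalized = stmts.withColumn(
    "actions",
    F.when(
        F.trim(F.col("Action")).startswith("["),
        F.from_json(F.col("Action"), "array<string>"),
    ).otherwise(F.array(F.col("Action"))),
)
```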
If you know the schema ahead of time, could you define it explicitly and try that? For the cases where a field can be a string or an array, Spark defaults it to a string; you could programmatically check the string and define your logic based on how it looks. I don't know of a way to define the field as a list for some records and a string for others.
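As a rough sketch of that idea (field names taken from the sample schema above; only a fragment, for illustration): declaring the ambiguous field as a string lets both shapes parse, and the raw text can then be branched on.

```python
import json
from pyspark.sql.types import StructType, StructField, StringType

# Fragment of an explicit schema: the ambiguous Action field is declared
# StringType, so an array value arrives as its raw JSON text.
statement_schema = StructType([
    StructField("Sid", StringType()),
    StructField("Effect", StringType()),
    StructField("Action", StringType()),  # "s3:GetObject" or '["s3:GetObject", ...]'
])

def parse_action(raw):
    """Return Action as a list, whether it was stored as a string or an array."""
    return json.loads(raw) if raw.lstrip().startswith("[") else [raw]
```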