Unnesting a series of nested JSON objects from a JSON array in Athena

0

Hello,

I am currently using Service Now table dumps exported as JSON array files to S3.

The S3 folder path structure is similar to the following:

aws s3 ls s3://bucket/data/
       PRE data1-2022-09-22/
       PRE data2-2022-09-22/
       PRE data3-2022-09-22/

Each folder contains a series of JSON files in their own schema.

These files each contain an array of nested JSON objects, in the following structure:

[
    {
        "obj1": {
            "display_value": "Change",
            "value": "Change"
        },
        "obj2": {
            "display_value": "Cancel all future Tasks",
            "value": "cancel"
        },
        "obj3": {
            "display_value": "2021-01-21 15:00:01",
            "value": "2021-01-21 15:00:01"
        }
    },
    {
        "obj1": {
            "display_value": "Change",
            "value": "Change"
        },
        "obj2": {
            "display_value": "Cancel all future Tasks",
            "value": "cancel"
        },
        "obj3": {
            "display_value": "2021-01-07 20:36:34",
            "value": "2021-01-07 20:36:34"
        }
    },
    {
        "obj1": {
            "display_value": null,
            "value": ""
        },
        "obj2": {
            "display_value": "Cancel all future Tasks",
            "value": "cancel"
        },
        "obj3": {
            "display_value": "2021-02-11 19:40:40",
            "value": "2021-02-11 19:40:40"
        }
    }
]

As shown in the sample data provided, some of the values may be empty or null, but the overall the structure is the same.

When using a custom Glue JSON classifier, I am able to split the nested JSON objects into individual column names, where each column's data type is inferred as a struct, i.e.:

{
  "obj3": {
    "display_value": "string",
    "value": "string"
  }
}

Without using a custom Glue JSON classifier, the schema is inferred as a single column (named array), whose data type is an array containing all of the nested objects with elements and their types.

I note that using a custom classifier in Glue in this approach may not actually be the best way, and instead it may be preferrable to use no custom classifier, and then UNNEST the data from the array structure using an Athena query [1], using a CTAS to load it to S3.

Currently, I am seeking a way to unnest the JSON objects such that (for example above), the output would could show all of the nested values in individual rows (additional columns and values from sample data not shown):

obj1, obj2, obj3
{display_value="Change", value="Change"}, {display_value="Cancel all future Tasks", value="cancel"}, { display_value="2021-01-21 15:00:01", value="2021-01-21 15:00:01"} 

{display_value="Change", value="Change"}, {display_value="Cancel all future Tasks", value="cancel"}, {display_value="2021-01-07 20:36:34", value="2021-01-07 20:36:34"}

{display_value=null, value=""}, {display_value="Cancel all future Tasks", value="cancel"},  {display_value="2021-02-11 19:40:40", value="2021-02-11 19:40:40"}

May I have some guidance in constructing such an athena query to give this kind of output for the sample data, using either with the custom glue classifier or without?

Sources

[1] https://docs.aws.amazon.com/athena/latest/ug/flattening-arrays.html

AWS
asked a year ago1063 views
1 Answer
0

The blog post titled 'Analyze and visualize nested JSON data with Amazon Athena and Amazon QuickSight' was updated in 2022 (original 2019) . It describes methods to extract JSON data as tables using SQL queries in Amazon Athena. https://aws.amazon.com/blogs/big-data/analyze-and-visualize-nested-json-data-with-amazon-athena-and-amazon-quicksight/

profile picture
answered a year ago

You are not logged in. Log in to post an answer.

A good answer clearly answers the question and provides constructive feedback and encourages professional growth in the question asker.

Guidelines for Answering Questions