Unnesting a series of nested JSON objects from a JSON array in Athena

0

Hello,

I am currently using Service Now table dumps exported as JSON array files to S3.

The S3 folder path structure is similar to the following:

aws s3 ls s3://bucket/data/
       PRE data1-2022-09-22/
       PRE data2-2022-09-22/
       PRE data3-2022-09-22/

Each folder contains a series of JSON files in their own schema.

These files each contain an array of nested JSON objects, in the following structure:

[
    {
        "obj1": {
            "display_value": "Change",
            "value": "Change"
        },
        "obj2": {
            "display_value": "Cancel all future Tasks",
            "value": "cancel"
        },
        "obj3": {
            "display_value": "2021-01-21 15:00:01",
            "value": "2021-01-21 15:00:01"
        }
    },
    {
        "obj1": {
            "display_value": "Change",
            "value": "Change"
        },
        "obj2": {
            "display_value": "Cancel all future Tasks",
            "value": "cancel"
        },
        "obj3": {
            "display_value": "2021-01-07 20:36:34",
            "value": "2021-01-07 20:36:34"
        }
    },
    {
        "obj1": {
            "display_value": null,
            "value": ""
        },
        "obj2": {
            "display_value": "Cancel all future Tasks",
            "value": "cancel"
        },
        "obj3": {
            "display_value": "2021-02-11 19:40:40",
            "value": "2021-02-11 19:40:40"
        }
    }
]

As shown in the sample data provided, some of the values may be empty or null, but the overall the structure is the same.

When using a custom Glue JSON classifier, I am able to split the nested JSON objects into individual column names, where each column's data type is inferred as a struct, i.e.:

{
  "obj3": {
    "display_value": "string",
    "value": "string"
  }
}

Without using a custom Glue JSON classifier, the schema is inferred as a single column (named array), whose data type is an array containing all of the nested objects with elements and their types.

I note that using a custom classifier in Glue in this approach may not actually be the best way, and instead it may be preferrable to use no custom classifier, and then UNNEST the data from the array structure using an Athena query [1], using a CTAS to load it to S3.

Currently, I am seeking a way to unnest the JSON objects such that (for example above), the output would could show all of the nested values in individual rows (additional columns and values from sample data not shown):

obj1, obj2, obj3
{display_value="Change", value="Change"}, {display_value="Cancel all future Tasks", value="cancel"}, { display_value="2021-01-21 15:00:01", value="2021-01-21 15:00:01"} 

{display_value="Change", value="Change"}, {display_value="Cancel all future Tasks", value="cancel"}, {display_value="2021-01-07 20:36:34", value="2021-01-07 20:36:34"}

{display_value=null, value=""}, {display_value="Cancel all future Tasks", value="cancel"},  {display_value="2021-02-11 19:40:40", value="2021-02-11 19:40:40"}

May I have some guidance in constructing such an athena query to give this kind of output for the sample data, using either with the custom glue classifier or without?

Sources

[1] https://docs.aws.amazon.com/athena/latest/ug/flattening-arrays.html

AWS
已提問 1 年前檢視次數 1115 次
1 個回答
0

The blog post titled 'Analyze and visualize nested JSON data with Amazon Athena and Amazon QuickSight' was updated in 2022 (original 2019) . It describes methods to extract JSON data as tables using SQL queries in Amazon Athena. https://aws.amazon.com/blogs/big-data/analyze-and-visualize-nested-json-data-with-amazon-athena-and-amazon-quicksight/

profile picture
已回答 1 年前

您尚未登入。 登入 去張貼答案。

一個好的回答可以清楚地回答問題並提供建設性的意見回饋,同時有助於提問者的專業成長。

回答問題指南