Unnesting a series of nested JSON objects from a JSON array in Athena

0

Hello,

I am currently using Service Now table dumps exported as JSON array files to S3.

The S3 folder path structure is similar to the following:

aws s3 ls s3://bucket/data/
       PRE data1-2022-09-22/
       PRE data2-2022-09-22/
       PRE data3-2022-09-22/

Each folder contains a series of JSON files in their own schema.

These files each contain an array of nested JSON objects, in the following structure:

[
    {
        "obj1": {
            "display_value": "Change",
            "value": "Change"
        },
        "obj2": {
            "display_value": "Cancel all future Tasks",
            "value": "cancel"
        },
        "obj3": {
            "display_value": "2021-01-21 15:00:01",
            "value": "2021-01-21 15:00:01"
        }
    },
    {
        "obj1": {
            "display_value": "Change",
            "value": "Change"
        },
        "obj2": {
            "display_value": "Cancel all future Tasks",
            "value": "cancel"
        },
        "obj3": {
            "display_value": "2021-01-07 20:36:34",
            "value": "2021-01-07 20:36:34"
        }
    },
    {
        "obj1": {
            "display_value": null,
            "value": ""
        },
        "obj2": {
            "display_value": "Cancel all future Tasks",
            "value": "cancel"
        },
        "obj3": {
            "display_value": "2021-02-11 19:40:40",
            "value": "2021-02-11 19:40:40"
        }
    }
]

As shown in the sample data provided, some of the values may be empty or null, but the overall the structure is the same.

When using a custom Glue JSON classifier, I am able to split the nested JSON objects into individual column names, where each column's data type is inferred as a struct, i.e.:

{
  "obj3": {
    "display_value": "string",
    "value": "string"
  }
}

Without using a custom Glue JSON classifier, the schema is inferred as a single column (named array), whose data type is an array containing all of the nested objects with elements and their types.

I note that using a custom classifier in Glue in this approach may not actually be the best way, and instead it may be preferrable to use no custom classifier, and then UNNEST the data from the array structure using an Athena query [1], using a CTAS to load it to S3.

Currently, I am seeking a way to unnest the JSON objects such that (for example above), the output would could show all of the nested values in individual rows (additional columns and values from sample data not shown):

obj1, obj2, obj3
{display_value="Change", value="Change"}, {display_value="Cancel all future Tasks", value="cancel"}, { display_value="2021-01-21 15:00:01", value="2021-01-21 15:00:01"} 

{display_value="Change", value="Change"}, {display_value="Cancel all future Tasks", value="cancel"}, {display_value="2021-01-07 20:36:34", value="2021-01-07 20:36:34"}

{display_value=null, value=""}, {display_value="Cancel all future Tasks", value="cancel"},  {display_value="2021-02-11 19:40:40", value="2021-02-11 19:40:40"}

May I have some guidance in constructing such an athena query to give this kind of output for the sample data, using either with the custom glue classifier or without?

Sources

[1] https://docs.aws.amazon.com/athena/latest/ug/flattening-arrays.html

AWS
質問済み 1年前1115ビュー
1回答
0

The blog post titled 'Analyze and visualize nested JSON data with Amazon Athena and Amazon QuickSight' was updated in 2022 (original 2019) . It describes methods to extract JSON data as tables using SQL queries in Amazon Athena. https://aws.amazon.com/blogs/big-data/analyze-and-visualize-nested-json-data-with-amazon-athena-and-amazon-quicksight/

profile picture
回答済み 1年前

ログインしていません。 ログイン 回答を投稿する。

優れた回答とは、質問に明確に答え、建設的なフィードバックを提供し、質問者の専門分野におけるスキルの向上を促すものです。

質問に答えるためのガイドライン

関連するコンテンツ