Unnesting a series of nested JSON objects from a JSON array in Athena

0

Hello,

I am currently using Service Now table dumps exported as JSON array files to S3.

The S3 folder path structure is similar to the following:

aws s3 ls s3://bucket/data/
       PRE data1-2022-09-22/
       PRE data2-2022-09-22/
       PRE data3-2022-09-22/

Each folder contains a series of JSON files in their own schema.

These files each contain an array of nested JSON objects, in the following structure:

[
    {
        "obj1": {
            "display_value": "Change",
            "value": "Change"
        },
        "obj2": {
            "display_value": "Cancel all future Tasks",
            "value": "cancel"
        },
        "obj3": {
            "display_value": "2021-01-21 15:00:01",
            "value": "2021-01-21 15:00:01"
        }
    },
    {
        "obj1": {
            "display_value": "Change",
            "value": "Change"
        },
        "obj2": {
            "display_value": "Cancel all future Tasks",
            "value": "cancel"
        },
        "obj3": {
            "display_value": "2021-01-07 20:36:34",
            "value": "2021-01-07 20:36:34"
        }
    },
    {
        "obj1": {
            "display_value": null,
            "value": ""
        },
        "obj2": {
            "display_value": "Cancel all future Tasks",
            "value": "cancel"
        },
        "obj3": {
            "display_value": "2021-02-11 19:40:40",
            "value": "2021-02-11 19:40:40"
        }
    }
]

As shown in the sample data provided, some of the values may be empty or null, but the overall the structure is the same.

When using a custom Glue JSON classifier, I am able to split the nested JSON objects into individual column names, where each column's data type is inferred as a struct, i.e.:

{
  "obj3": {
    "display_value": "string",
    "value": "string"
  }
}

Without using a custom Glue JSON classifier, the schema is inferred as a single column (named array), whose data type is an array containing all of the nested objects with elements and their types.

I note that using a custom classifier in Glue in this approach may not actually be the best way, and instead it may be preferrable to use no custom classifier, and then UNNEST the data from the array structure using an Athena query [1], using a CTAS to load it to S3.

Currently, I am seeking a way to unnest the JSON objects such that (for example above), the output would could show all of the nested values in individual rows (additional columns and values from sample data not shown):

obj1, obj2, obj3
{display_value="Change", value="Change"}, {display_value="Cancel all future Tasks", value="cancel"}, { display_value="2021-01-21 15:00:01", value="2021-01-21 15:00:01"} 

{display_value="Change", value="Change"}, {display_value="Cancel all future Tasks", value="cancel"}, {display_value="2021-01-07 20:36:34", value="2021-01-07 20:36:34"}

{display_value=null, value=""}, {display_value="Cancel all future Tasks", value="cancel"},  {display_value="2021-02-11 19:40:40", value="2021-02-11 19:40:40"}

May I have some guidance in constructing such an athena query to give this kind of output for the sample data, using either with the custom glue classifier or without?

Sources

[1] https://docs.aws.amazon.com/athena/latest/ug/flattening-arrays.html

AWS
질문됨 일 년 전1115회 조회
1개 답변
0

The blog post titled 'Analyze and visualize nested JSON data with Amazon Athena and Amazon QuickSight' was updated in 2022 (original 2019) . It describes methods to extract JSON data as tables using SQL queries in Amazon Athena. https://aws.amazon.com/blogs/big-data/analyze-and-visualize-nested-json-data-with-amazon-athena-and-amazon-quicksight/

profile picture
답변함 일 년 전

로그인하지 않았습니다. 로그인해야 답변을 게시할 수 있습니다.

좋은 답변은 질문에 명확하게 답하고 건설적인 피드백을 제공하며 질문자의 전문적인 성장을 장려합니다.

질문 답변하기에 대한 가이드라인

관련 콘텐츠