Athena - Parse JSON values into a table



The environment is, there are multiple JSON files in a S3 bucket. I would like to add all of them to the Athena Table with the filtered values. I used the ChatGPT for the Athena query to create the table. But for some reason, the top level values are captured, not anything nested under. I have attached the screenshot for reference, please check it. How can I fix it?

Please suggest the sample code, if you know the answer. Thanks!

Enter image description here

2 Answers

You were not able to see the whole value because when working with nested arrays, you often need to expand nested array elements into a single array, or expand the array into multiple rows. Amazon Athena involves using the ‘UNNEST’ function to break down arrays or structs into separate rows and columns. You need to unnest nested JSON data to transform it into a tabular format for easier querying and analysis. By unnesting, you convert the nested data into a format that Athena can work with, it will allow you to perform SQL queries and aggregations more easily and it also creates compatibility with many BI tools.

To flatten an array into multiple rows, use CROSS JOIN in conjunction with the UNNEST operator, as in this example:

         WITH dataset AS (
         'engineering' as department,
         ARRAY['Sharon', 'John', 'Bob', 'Sally'] as users
        SELECT department, names FROM dataset
        CROSS JOIN UNNEST(users) as t(names)

So the steps to flatten nested Jason data in Athena:

  1. Create a Table: Ensure that you have a table in Athena that represents your JSON data.

  2. Use ‘UNEST’ for Arrays. You can use the ‘UNNEST’ function to expand them into separate rows.
    Example: SELECT id, value FROM my_table CROSS JOIN UNNEST(array_column) AS t(value)

  3. Use ‘UNEST’ for Structs: if you have nested structs, you can use the ‘UNEST’ function to flatten them into separate columns. Example: SELECT id, struct_column.field1, struct_column.field2 FROM my_table CROSS JOIN UNNEST(struct_column) AS t(field1, field2)

  4. Filter Data: If needed, you can apply filters to select specific data or remove rows with null values.

For more information please refer the links below:

Hope it helps.

profile pictureAWS
answered 7 months ago

Here's how I achieved the solution for the above. Using the Glue crawler, created the Athena database. From there, created the table first

Created the Table CREATE EXTERNAL TABLE IF NOT EXISTS s3_Qualtrics_logs_db.Shiva_Q9 ( Channel VARCHAR(50), ConnectedToSystemTimestamp TIMESTAMP, ContactId VARCHAR(60), Attributes ARRAY< struct< AgentName:VARCHAR(50), Q1:VARCHAR(10), Q2:VARCHAR(10), Q3:VARCHAR(10), Q4:VARCHAR(10), Q5:VARCHAR(10), Q6:VARCHAR(255), CountryCode:VARCHAR(25)>> )

ROW FORMAT SERDE '' WITH SERDEPROPERTIES ( 'serialization.format' = '1' ) LOCATION 's3://connect/QualtricsSample/';

Then able to query the table

SELECT channel, contactid, attributes[1].agentname AgentName, attributes[1].q1 Q1, attributes[1].q2 Q2, attributes[1].q3 Q3, attributes[1].q4 Q4, attributes[1].q5 Q5, attributes[1].q6 Q6, attributes[1].countrycode CountryCode FROM s3_qualtrics_logs_db.shiva_q9 limit 20;

profile picture
answered 3 months ago

You are not logged in. Log in to post an answer.

A good answer clearly answers the question and provides constructive feedback and encourages professional growth in the question asker.

Guidelines for Answering Questions