in Athena how to declare a piece of generic json inside of known-schema json?

0

I have json-per-line files in S3 that look like this:

{ "eventTimestamp": "...", "eventType": "...", "eventPayload": { complex object of unknown schema } }

How can I define an Athena table that globs the entire eventPayload as a field, without having to define the schema for it?

The goal is to have a schema that can be used for further data manipulation, CTAS, etc, without losing data.

AlexR
已提问 3 个月前451 查看次数
1 回答
0

Without accounting for performance or optimization, the simplest way to do this would be to run a Glue crawler on the S3 bucket where you're files are stored. Glue crawler will automatically infer the schema of these files and store the column type of "eventPayload" as a struct. You can choose to schedule subsequent crawler runs to continue inferring the schema as new events are created and new fields in your payload will be added to the struct. I have created a sample event and tested this in my AWS account.

I used the following data from two different source files in my S3 bucket with different fields in my event payload: data

The Glue crawler inferred the following schema for my table: Glue table schema

I'm able to extract single fields from event payload with the following query in Athena:

SELECT json_extract(CAST(eventpayload AS JSON), '$.type') AS type
FROM [TABLE_NAME];

Events that have a payload without the field (e.g. type) you query in their payload will return "null".

The table that the Glue crawler populates can further be used to manipulate or transform the data with ETL jobs or directly through Athena.

For more information about which Athena JSON functions are available, see the Athena documentation here. The Glue crawler documentation can be found here.

profile pictureAWS
专家
已回答 3 个月前

您未登录。 登录 发布回答。

一个好的回答可以清楚地解答问题和提供建设性反馈,并能促进提问者的职业发展。

回答问题的准则