How to handle schema changes and versions in the Glue Data Catalog?

0

Given a dataset with a temporally changing schema, how do you:

  1. Specify a schema version when running a job/query? i.e. I want to specify version X of a table because I know that the data I am querying against was created when version X was current.
  2. Query data across versions? i.e. Data in January has slightly different schema version than data in February and I want to run a query across both.

For scenario 2, I've run some tests with an evolving schema and adding or dropping columns creates no issues. If the column type changes though (i.e a TIMESTAMP changes to a STRING), then queries against data that don't match the catalog fail. I've tested this behavior in both Athena and Glue (DynamicFrame can handle some data type inconsistencies).

My thought would be to do separate queries and then UNION the results, but that requires the answer to scenario 1...

已提问 5 年前4309 查看次数
1 回答
1
已接受的回答

Version by itself is not available as a column to Athena. In Athena/Presto, you can use the typeOf function to determine the type of a column and choose how you want to process that. A Case statement would also work. e.g.

SELECT  x, typeof(x)  
from (select timestamp '2012-10-31 01:00 UTC' as x)
where typeof(x) like 'timestamp%'
UNION
SELECT date_parse(x,'%Y-%m-%d %h:%i:%s'), typeof(x)
from (select '2012-10-31 01:00:00' as x)
where typeof(x) like 'varchar%'

But ideally even if you have a varying schema in your raw data tier, you should be resolving that schema in the etl layer so that in the analytics tier you have a single schema with perhaps 2 columns - one for the original column, one for the changed column. Then at the serving layer it is easy to build a view to coalesce the columns if needed e.g.: if a string column is being changed to timestamp

SELECT coalesce(new_column, date_parse(old_column,'%m/%d/%Y %h:%i:%s %p')) as timestamp_column
AWS
已回答 5 年前
profile picture
专家
已审核 1 个月前

您未登录。 登录 发布回答。

一个好的回答可以清楚地解答问题和提供建设性反馈,并能促进提问者的职业发展。

回答问题的准则