How to handle schema changes and versions in the Glue Data Catalog?

0

Given a dataset with a temporally changing schema, how do you:

  1. Specify a schema version when running a job/query? i.e. I want to specify version X of a table because I know that the data I am querying against was created when version X was current.
  2. Query data across versions? i.e. Data in January has slightly different schema version than data in February and I want to run a query across both.

For scenario 2, I've run some tests with an evolving schema and adding or dropping columns creates no issues. If the column type changes though (i.e a TIMESTAMP changes to a STRING), then queries against data that don't match the catalog fail. I've tested this behavior in both Athena and Glue (DynamicFrame can handle some data type inconsistencies).

My thought would be to do separate queries and then UNION the results, but that requires the answer to scenario 1...

已提問 5 年前檢視次數 4309 次
1 個回答
1
已接受的答案

Version by itself is not available as a column to Athena. In Athena/Presto, you can use the typeOf function to determine the type of a column and choose how you want to process that. A Case statement would also work. e.g.

SELECT  x, typeof(x)  
from (select timestamp '2012-10-31 01:00 UTC' as x)
where typeof(x) like 'timestamp%'
UNION
SELECT date_parse(x,'%Y-%m-%d %h:%i:%s'), typeof(x)
from (select '2012-10-31 01:00:00' as x)
where typeof(x) like 'varchar%'

But ideally even if you have a varying schema in your raw data tier, you should be resolving that schema in the etl layer so that in the analytics tier you have a single schema with perhaps 2 columns - one for the original column, one for the changed column. Then at the serving layer it is easy to build a view to coalesce the columns if needed e.g.: if a string column is being changed to timestamp

SELECT coalesce(new_column, date_parse(old_column,'%m/%d/%Y %h:%i:%s %p')) as timestamp_column
AWS
已回答 5 年前
profile picture
專家
已審閱 1 個月前

您尚未登入。 登入 去張貼答案。

一個好的回答可以清楚地回答問題並提供建設性的意見回饋,同時有助於提問者的專業成長。

回答問題指南