How to handle schema changes and versions in the Glue Data Catalog?

0

Given a dataset with a temporally changing schema, how do you:

  1. Specify a schema version when running a job/query? i.e. I want to specify version X of a table because I know that the data I am querying against was created when version X was current.
  2. Query data across versions? i.e. Data in January has slightly different schema version than data in February and I want to run a query across both.

For scenario 2, I've run some tests with an evolving schema and adding or dropping columns creates no issues. If the column type changes though (i.e a TIMESTAMP changes to a STRING), then queries against data that don't match the catalog fail. I've tested this behavior in both Athena and Glue (DynamicFrame can handle some data type inconsistencies).

My thought would be to do separate queries and then UNION the results, but that requires the answer to scenario 1...

asked 5 years ago4264 views
1 Answer
1
Accepted Answer

Version by itself is not available as a column to Athena. In Athena/Presto, you can use the typeOf function to determine the type of a column and choose how you want to process that. A Case statement would also work. e.g.

SELECT  x, typeof(x)  
from (select timestamp '2012-10-31 01:00 UTC' as x)
where typeof(x) like 'timestamp%'
UNION
SELECT date_parse(x,'%Y-%m-%d %h:%i:%s'), typeof(x)
from (select '2012-10-31 01:00:00' as x)
where typeof(x) like 'varchar%'

But ideally even if you have a varying schema in your raw data tier, you should be resolving that schema in the etl layer so that in the analytics tier you have a single schema with perhaps 2 columns - one for the original column, one for the changed column. Then at the serving layer it is easy to build a view to coalesce the columns if needed e.g.: if a string column is being changed to timestamp

SELECT coalesce(new_column, date_parse(old_column,'%m/%d/%Y %h:%i:%s %p')) as timestamp_column
AWS
answered 5 years ago
profile picture
EXPERT
reviewed 16 days ago

You are not logged in. Log in to post an answer.

A good answer clearly answers the question and provides constructive feedback and encourages professional growth in the question asker.

Guidelines for Answering Questions