I need to keep schemas consistent across environments, and I am after a best-practice recommendation.
I understand Timestream is schemaless: the schema is created dynamically from the metrics we write to the engine. This is all good, until it isn't.
As I evolve my application, sometimes data arrives in my non-production environments first. The behaviour of the application should be the same across environments: if production has no data, the queries should simply return no rows and code execution ends. Instead, what I get in production are errors of this type:
failed to run timestream query: operation error Timestream Query:
Query, https response error StatusCode: 400, RequestID: U3NQ24QTXMOUYPQGDWHJXPS7QU,
ValidationException: line 26:11: Column 'new_column_with_no_value_in_prod' does not exist
I am thinking of adding a step to my pipeline that writes a dummy record, to ensure I have the same schema across environments. There are plenty of ways to achieve this, but is there a recommended approach? Am I missing something?
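For reference, this is the kind of seeding step I have in mind: a small Python helper (using boto3) that writes one dummy record per new measure so the column exists in every environment. The database, table, and dimension names below are placeholders for illustration, not my real ones.

```python
import time


# Placeholder names -- substitute your own database/table.
DATABASE = "my_db"
TABLE = "my_table"


def build_seed_record(measure_name: str, environment: str) -> dict:
    """Build a single dummy record whose only purpose is to register the
    measure (and hence the column) in Timestream's dynamic schema."""
    return {
        "Dimensions": [{"Name": "environment", "Value": environment}],
        "MeasureName": measure_name,
        "MeasureValue": "0",
        "MeasureValueType": "DOUBLE",
        "Time": str(int(time.time() * 1000)),
        "TimeUnit": "MILLISECONDS",
    }


def seed_measure(measure_name: str, environment: str = "prod") -> None:
    """Write one dummy record so queries referencing this measure's column
    no longer fail with ValidationException in the target environment."""
    import boto3  # requires credentials with timestream:WriteRecords

    client = boto3.client("timestream-write")
    client.write_records(
        DatabaseName=DATABASE,
        TableName=TABLE,
        Records=[build_seed_record(measure_name, environment)],
    )
```

The deployment pipeline would call `seed_measure()` once for each measure introduced by the release.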
Thank you, Didier.
Just to be clear I understand your recommendation here:
Based on what I see in that post, the strategy is: write data to ensure the schema exists. They have a Python script that is run first to populate some data (run.sh). In other words, if I want to handle this in my pipeline, I would also add a step to my production deployment, either before or after the deployment itself, that triggers a dummy insertion of data to ensure the schema has been created?
My problem is that not every change I introduce should take effect in production immediately, because not all the measures we collect are readily available there. However, as I explained above, the application should tolerate the absence of data: an absent measure is just an empty result set, and my application handles that scenario.
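An alternative I could live with, instead of seeding dummy data, would be to guard each query with Timestream's `SHOW MEASURES` statement and skip the real query when the measure is not in the schema yet, so it behaves like an empty result set. A sketch of that guard follows; the database and table names are again placeholders.

```python
# Placeholder names -- substitute your own database/table.
DATABASE = "my_db"
TABLE = "my_table"


def measure_exists_query(measure_name: str) -> str:
    """SQL that returns a row only if the measure already exists.

    SHOW MEASURES is part of Timestream's query language; the LIKE
    clause filters by measure name."""
    return f'SHOW MEASURES FROM "{DATABASE}"."{TABLE}" LIKE \'{measure_name}\''


def measure_exists(measure_name: str) -> bool:
    """Run the guard query; an empty result set means the column does
    not exist yet, so the caller can skip the real query rather than
    hit a ValidationException."""
    import boto3  # requires credentials with timestream:Select

    client = boto3.client("timestream-query")
    result = client.query(QueryString=measure_exists_query(measure_name))
    return len(result["Rows"]) > 0
```

With that in place, the application could return its normal "no data" path when `measure_exists()` is false, which matches how it already treats empty result sets.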
So, is the recommendation here:
Is that correct?