Basically i have a s3 bucket that i use to store some events, and then i use glue crawlers and athena to query the data. I was trying to use hive partitions like this: year=yyyy/month=MM/day=dd and use a compatible schema for all partitions (i saw this on the options when creating a crawler). I assume this is the intended way to use it, since querying 5 tables to get the same data because of a missing fields seems unpractical/undoable to me. The crawler creates only one table with the correct schema (i use a mock file with all possible values), but when i do this query:
SELECT * FROM "TableA" WHERE DATE(CONCAT(year, '-', LPAD(month, 2, '0'), '-', LPAD(day, 2, '0'))) BETWEEN DATE('2023/08/08') AND DATE('2023/08/08') AND "TableA"."timestamp" BETWEEN 1691500345 AND 1691501245 ORDER BY "timestamp" DESC
I get this error: "HIVE_PARTITION_SCHEMA_MISMATCH: There is a mismatch between the table and partition schemas. The types are incompatible and cannot be coerced. The column 'cloud_event_management' in table 'default.TableA' is declared as type 'boolean', but partition 'year=2023/month=08/day=03' declared column 'rule_metadata_updated_at' as type 'string'."
i have checked the table columns and they seem ok, the only thing i noticed is that in the "year=2023/month=08/day=03" partition there was not a single file containing the "cloud_event_management" property. Could this be the error?
Is there any cost effective way to remediate this?
If it helps, the s3 files are files that contains multi-line json, something like this:
"
{"agent_id":"SomeAgent", "data_id":"mydataa"...}
{"agent_id":"SomeAgent2", "data_id":"mydataa"...}
"