
Data Catalog schema table getting modified when I run my Glue ETL job


I created a Data Catalog with a table that I manually defined. I run my ETL job and all works well.

I added partitions to both the table in the Data Catalog and the ETL job. It creates the partitions, and I see the folders being created in S3 as well. But my table's data types change. I originally had:

column     data type
vid        string
altid      string
vtype      string
time       timestamp
timegmt    timestamp
value      float
filename   string
year       int
month      int
day        int

But now after the ETL job with partitions, my table ends up like so:

column     data type
vid        string
altid      string
vtype      string
time       bigint
timegmt    bigint
value      float
filename   string
year       bigint
month      bigint
day        bigint

Before this change of data types, I could run queries in Athena, including one like this:

SELECT * FROM "gp550-load-database"."gp550-load-table-beta" WHERE vid IN ('F_NORTH', 'F_EAST', 'F_WEST', 'F_SOUTH', 'F_SEAST') AND vtype='LOAD' AND time BETWEEN TIMESTAMP '2021-05-13 06:00:00' and TIMESTAMP '2022-05-13 06:00:00'

But now, with the data types changed, I get an error when trying to run a query like the one above:

"SYNTAX_ERROR: line 1:154: Cannot check if bigint is BETWEEN timestamp and timestamp
This query ran against the "gp550-load-database" database, unless qualified by the query. Please post the error message on our forum  or contact customer support with Query Id: 2a5287bc-7ac2-43a8-b617-bf01c63b00d5"

So then if I go into the table and change the data type back to "timestamp", I run the query and get a different error:

"HIVE_PARTITION_SCHEMA_MISMATCH: There is a mismatch between the table and partition schemas. The types are incompatible and cannot be coerced. The column 'time' in table 'gp550-load-database.gp550-load-table-beta' is declared as type 'timestamp', but partition 'year=2022/month=2/day=2' declared column 'time' as type 'bigint'.
This query ran against the "gp550-load-database" database, unless qualified by the query. Please post the error message on our forum  or contact customer support with Query Id: f788ea2b-e274-43fe-a3d9-22d80a2bbbab"

With Glue Crawlers, you can influence how partitions are created. You can have the Crawler modify the Data Catalog table, or leave the table schema unchanged apart from adding new partitions:

{
    "Version": 1.0,
    "CrawlerOutput": {
       "Partitions": { "AddOrUpdateBehavior": "InheritFromTable" },
       "Tables": {"AddOrUpdateBehavior": "MergeNewColumns" }
    }
}

Basically, this InheritFromTable behavior is what I am looking for in ETL jobs.

Does anyone know what is happening?

  • Another thing I have seen: apparently the option additionalOptions = {"enableUpdateCatalog": True}

    causes the Data Catalog table schema to be updated. I can tell the table not to be updated by using additionalOptions = {"enableUpdateCatalog": True, "updateBehavior": "LOG"}

    in which case the partitions are created in the S3 data store, but not in the Data Catalog metadata. But then my queries fail, since the metadata has to be aware of the partitions. I want the partitions to be added to the Data Catalog, but I don't want it to change the data types of the table schema. I just want it to add partitions.
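The options described above can be sketched as they would be passed to a Glue catalog sink. This is a minimal sketch, assuming it runs inside a Glue job where `glueContext` and a DynamicFrame `dyf` already exist; the database and table names are taken from the question, and the bucket layout is an assumption:

```python
# Options controlling whether the Glue sink updates the Data Catalog.
additional_options = {
    "enableUpdateCatalog": True,
    # "LOG" records schema changes without applying them to the catalog;
    # "UPDATE_IN_DATABASE" adds partitions AND rewrites column types.
    "updateBehavior": "LOG",
    "partitionKeys": ["year", "month", "day"],
}

# Inside the job, these options go to the catalog write (commented here
# because awsglue is only available in the Glue runtime):
# glueContext.write_dynamic_frame.from_catalog(
#     frame=dyf,
#     database="gp550-load-database",
#     table_name="gp550-load-table-beta",
#     additional_options=additional_options,
# )
```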

  • You mentioned you are manually defining your catalog. Is this a requirement? In what format is the source data? (CSV, JSON, Parquet, ORC, AVRO...)

    Are your S3 folders Hive partitioned in the initial crawl? In your case, are you defining the partitions as "/year=2022/month=02/day=02/"?

    When the data is crawled in this format, the partitions are automatically detected by Glue, and you can correct the data types in the Glue editor. Subsequent data added in that format will require that you run this statement in Athena to add the data to the target table(s):

    MSCK REPAIR TABLE {your_table_name};

    In your case this would be: MSCK REPAIR TABLE `gp550-load-database`.`gp550-load-table-beta` (Athena quotes identifiers with backticks, not single quotes).

    Once that has been successfully executed, you should be able to view the added data in your table in Athena and the data types or schema definition you customized will be maintained.

    Any time you add partitions, you must run the statement above for those partitions to be recognized and the data to be visible in Athena. Any time you add new data under the same partition schema, Glue and Athena consider these new partitions, and the statement must be executed again to add the partitioned data to your Catalog.

    I hope this helps.

    Jay
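As a lighter-weight alternative to a full MSCK REPAIR TABLE scan, a single partition can be registered with a targeted ALTER TABLE statement. A hypothetical helper that builds that DDL for the year/month/day layout from the question (the bucket name and S3 prefix are assumptions):

```python
def add_partition_ddl(table: str, year: int, month: int, day: int, bucket: str) -> str:
    """Build an Athena DDL statement that registers one day's partition.

    ADD IF NOT EXISTS makes the statement safe to re-run from an hourly job.
    """
    location = f"s3://{bucket}/year={year}/month={month}/day={day}/"
    return (
        f"ALTER TABLE `{table}` ADD IF NOT EXISTS "
        f"PARTITION (year={year}, month={month}, day={day}) "
        f"LOCATION '{location}'"
    )
```

Because the statement only touches one partition, it avoids rescanning the whole table location the way MSCK REPAIR TABLE does.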

  • The table schema is fixed, so no, it's not a "requirement" that I manually define it; it just seemed to make sense. The data source is CSV.

    The initial time the ETL job runs (this is a Glue ETL job, not a crawler), the Hive partitions are created properly, with no issues.

    I realize I can run MSCK REPAIR TABLE, but this ETL job runs hourly and my partitions are year/month/day, so I would need to run it every day, which seems like a messy way to do things.

    The main issue is that it's changing the field "time" from data type "timestamp" to "bigint", which breaks all my queries. I wish there were a way to have it keep the timestamp data type instead of changing it.

  • The functionality exists in Glue Crawlers to have them add partitions to the metastore and NOT update the table schema data types. I figured there must be a way to do this in Glue ETL as well. Perhaps I will need to run a Crawler to populate the metadata first (maybe at 1am) and then run the ETL job. I hope that's not the case, but I am at a loss for what to do. I wish Glue ETL had a way to tell its partition handler to "InheritFromTable".
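One way to keep the catalog types stable, sketched below, is to cast the frame to the declared table schema before writing, so that even an updating sink sees timestamp columns rather than the bigint the CSV reader inferred. ApplyMapping is a standard Glue transform; `dyf` and the exact source types are assumptions based on the tables in the question:

```python
# Source type -> target type mappings, following the table schema above.
# Each tuple is (source_name, source_type, target_name, target_type).
mappings = [
    ("vid", "string", "vid", "string"),
    ("altid", "string", "altid", "string"),
    ("vtype", "string", "vtype", "string"),
    ("time", "bigint", "time", "timestamp"),        # cast back to timestamp
    ("timegmt", "bigint", "timegmt", "timestamp"),  # cast back to timestamp
    ("value", "float", "value", "float"),
    ("filename", "string", "filename", "string"),
    ("year", "bigint", "year", "int"),
    ("month", "bigint", "month", "int"),
    ("day", "bigint", "day", "int"),
]

# Applied inside the job (awsglue only exists in the Glue runtime):
# from awsglue.transforms import ApplyMapping
# typed_dyf = ApplyMapping.apply(frame=dyf, mappings=mappings)
```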

  • It would seem one fix would be to set the update behavior to LOG: additionalOptions = {"enableUpdateCatalog": True, "updateBehavior": "LOG"}

    Then, at the end of the Glue ETL job, execute the DDL MSCK REPAIR TABLE `gp550-load-database`.`gp550-load-table-beta`

    Is there a straightforward way to execute Athena DDL in a Glue job? I have read posts like this: https://stackoverflow.com/questions/64775753/how-to-run-arbitrary-ddl-sql-statements-or-stored-procedures-using-aws-glue
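A common pattern for this is calling Athena through boto3 (which is bundled with the Glue job runtime). A hedged sketch, where the query-results output location is a placeholder assumption, and note that start_query_execution is asynchronous, so a production job would also poll get_query_execution for completion:

```python
def repair_table_query(table: str) -> str:
    """Build the DDL; Athena quotes identifiers with backticks, not single quotes."""
    return f"MSCK REPAIR TABLE `{table}`"

def run_repair(database: str, table: str, output_location: str) -> str:
    """Submit the repair statement to Athena and return the query execution id."""
    import boto3  # available in the Glue job environment
    athena = boto3.client("athena")
    resp = athena.start_query_execution(
        QueryString=repair_table_query(table),
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )
    return resp["QueryExecutionId"]

# Example call from the end of the job (bucket is a placeholder):
# run_repair("gp550-load-database", "gp550-load-table-beta",
#            "s3://my-athena-results-bucket/glue-ddl/")
```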

No Answers
