field data type INT32 is incompatible with type varchar defined in table schema, when selecting random samples

I'm trying to query 20 random rows from the table (using ORDER BY RANDOM() LIMIT 20), with some conditions, but I received this error message:

HIVE_BAD_DATA: Field <fieldname removed here>'s type INT32 in parquet file s3://datasets-stage-beta/<data directory removed here>.snappy.parquet is incompatible with type varchar defined in table schema
This query ran against the "<database name removed here>" database, unless qualified by the query. Please post the error message on our forum or contact customer support with Query Id

Just wondering how to resolve this issue?

1 Answer

This error indicates a mismatch between the data type in your Parquet file and the schema defined in your table. Specifically, the Parquet file contains an INT32 data type for a field that is defined as VARCHAR in your table schema.

To resolve this issue, you have a few options:

  1. Update the table schema: Modify your table definition to match the data type in the Parquet file. In this case, change the column type from VARCHAR to INTEGER. Athena doesn't support changing a column's type in place with ALTER TABLE ... CHANGE COLUMN; instead, restate the full column list with ALTER TABLE ... REPLACE COLUMNS, or edit the schema in the AWS Glue Data Catalog (see the sketch after this list).

  2. Modify the data conversion process: If the data should indeed be a VARCHAR, adjust your ETL process or data pipeline so that the integer data is converted to a string before it's written to the Parquet file (see the write-time sketch after this list).

  3. Use CAST in your query: As a workaround, you can use a CAST function in your SELECT statement to convert the value to a VARCHAR when querying. Note that the cast runs after the file is read, so it only helps once the table's column type matches the file (i.e., in combination with option 1). For example:

    SELECT CAST(your_column AS VARCHAR) AS your_column, ...

  4. Recreate the table: If you have the ability to recreate the table, you can use a CREATE TABLE AS SELECT (CTAS) query to create a new table with the correct schema while converting the data types as needed (see the CTAS sketch after this list).

  5. Verify data consistency: It's possible that some files in your dataset have different schemas; a single file written with the column as INT32 while the table expects strings is enough to trigger this error. Check whether all your Parquet files have consistent schemas and data types (see the "$path" sketch after this list).
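
For option 1, here's a minimal sketch of the DDL, assuming a hypothetical table my_table whose only columns are id (the mismatched field) and name, and assuming your table's format supports REPLACE COLUMNS:

    -- Hypothetical names. REPLACE COLUMNS rewrites the entire column
    -- list, so restate every column, not just the one being changed.
    ALTER TABLE my_table REPLACE COLUMNS (
      id int,      -- was varchar; now matches the INT32 in the Parquet files
      name string
    );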
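
For option 2, the right fix depends on what your pipeline is built with. If it happens to write the data with SQL (for example an Athena or Hive INSERT INTO job), here's a hedged sketch of casting at write time, with hypothetical table and column names:

    -- Hypothetical names. Casting before the write makes the Parquet
    -- column a string instead of an INT32.
    INSERT INTO target_table
    SELECT
      CAST(source_column AS varchar) AS source_column,
      other_column
    FROM source_table;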
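
For option 4, a sketch of a CTAS that writes a corrected copy. The table names and S3 location are placeholders, and the SELECT can only read the column once the table's type matches the files, so this typically follows option 1:

    -- Hypothetical names and output location.
    CREATE TABLE my_table_fixed
    WITH (
      format = 'PARQUET',
      external_location = 's3://your-bucket/fixed/'
    ) AS
    SELECT
      CAST(id AS varchar) AS id,  -- normalize back to a string
      name
    FROM my_table;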
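
For option 5, Athena exposes a "$path" pseudo-column that reports which file each row comes from, which can help you isolate any file written with the wrong type; a sketch against the same hypothetical table:

    -- List the distinct Parquet files backing the table, then inspect
    -- each file's schema (e.g. with parquet-tools) to find the outlier.
    SELECT DISTINCT "$path"
    FROM my_table;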

Remember, it's important to ensure that your table schema accurately reflects the data types in your Parquet files to avoid such errors and to maintain data integrity. If you're frequently dealing with changing schemas, you might want to consider using a schema evolution strategy that's compatible with your data storage and query engine.
