Hive Bad Data error


Objective: convert a JSON file to Parquet via a Lambda function whenever a file lands in an S3 bucket. I first tested my Lambda function with a single file and created a table schema, which I later changed. Now there are 10 JSON files, each converted into Parquet format. When I run a query against the table, I get the error below:

TYPE_MISMATCH: Unable to read parquet data. This is most likely caused by a mismatch between the parquet and metastore schema. This query ran against the "de-yt-clean" database, unless qualified by the query.

I understand the error, and I have already changed the schema of the table, but I still keep getting it.

Note: "id" column data type earlier is captured as string but needs to be in bigint

asked 2 months ago · 154 views
2 Answers

  1. Verify Schema Changes: Double-check that the schema changes you made are correctly reflected in the Parquet files themselves. When converting JSON to Parquet, ensure that the conversion process adheres to the new schema.

  2. Schema Evolution: If you've changed the schema, ensure that you're handling schema evolution properly. Parquet supports schema evolution, but your conversion process needs to handle it correctly. If the changes are backward-incompatible, you may need to handle schema migration or adjust your query accordingly.

  3. Metastore Sync: Make sure that your metastore schema is in sync with the actual schema of the Parquet files. If the metastore schema is outdated or incorrect, it can cause issues when querying the Parquet files. You may need to update the metastore schema to match the current schema of the Parquet files.

  4. Query Configuration: Ensure that your query is correctly configured to read the Parquet files with the updated schema. If you're explicitly specifying the schema in your query, make sure it matches the actual schema of the Parquet files.

  5. Testing: Test your entire pipeline with a single file again to ensure that the schema changes are properly applied and that querying works as expected. Once you confirm that it works with a single file, you can proceed with processing multiple files.

  6. Error Logging: Implement error logging in your Lambda function to capture any errors that occur during the conversion process. This can help identify issues during schema conversion or Parquet file generation; a minimal sketch follows this list.
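
As an illustration of step 6, here is a minimal sketch of such logging, assuming an S3-triggered Lambda handler (the handler structure is standard for S3 events; the elided conversion step is a placeholder, not your actual code):

import logging

logger = logging.getLogger()
logger.setLevel(logging.INFO)

def lambda_handler(event, context):
    for record in event["Records"]:
        key = record["s3"]["object"]["key"]
        try:
            # ... convert the JSON object to Parquet here ...
            logger.info("Converted %s successfully", key)
        except Exception:
            # logger.exception writes the full stack trace to CloudWatch Logs.
            logger.exception("Failed to convert %s", key)
            raise

Failed conversions then show up in the function's CloudWatch log group, which makes schema problems much easier to pin down.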

By following these steps and ensuring that both your Parquet files and metastore schema are correctly updated to reflect the schema changes, you should be able to resolve the TYPE_MISMATCH error when querying the Parquet files.

answered 2 months ago

If changing the table schema doesn’t resolve the error, you may need to update your existing data files as well. You can do this with the following steps:

  1. Create a new table and insert the data from the old table into it, casting the columns whose data types changed so they match the new schema.
CREATE TABLE new_table_name
WITH (
    format = 'PARQUET'
)
SELECT
    kind,
    etag,
    snippet_channelid,
    snippet_title,
    snippet_assignable,
    CAST(id AS BIGINT) AS id
FROM old_table_name;
  2. After you’ve verified that you can query the new table, drop the old table and rename the new table to the previous name.
DROP TABLE old_table_name;
ALTER TABLE new_table_name RENAME TO old_table_name;

Make sure the data coming in from the source or the Lambda function also matches the new schema; otherwise, you may see the error again. There are various libraries you can use to convert values to the right data type. If you’re using Python, pyarrow is one library that lets you specify a schema and write your data as Parquet files in S3.
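
For example, here is a minimal sketch with pyarrow, assuming newline-delimited JSON input and the column names from the query above (the function name and paths are illustrative):

import json

import pyarrow as pa
import pyarrow.parquet as pq
from pyarrow import fs

# Schema matching the new table definition; "id" is now int64 (bigint).
schema = pa.schema([
    ("kind", pa.string()),
    ("etag", pa.string()),
    ("snippet_channelid", pa.string()),
    ("snippet_title", pa.string()),
    ("snippet_assignable", pa.bool_()),
    ("id", pa.int64()),
])

def json_to_parquet(json_path, s3_path):
    # Read newline-delimited JSON, one record per line.
    with open(json_path) as f:
        records = [json.loads(line) for line in f]
    # Cast "id" explicitly so it matches the int64 column in the schema.
    for record in records:
        record["id"] = int(record["id"])
    table = pa.Table.from_pylist(records, schema=schema)
    pq.write_table(table, s3_path, filesystem=fs.S3FileSystem())

# e.g. json_to_parquet("/tmp/input.json", "my-bucket/clean/part-0001.parquet")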


answered 10 days ago
