HIVE_CANNOT_OPEN_SPLIT: can not read class org.apache.parquet.format.ColumnIndex

Hi, I have a Parquet file which I am trying to read in Athena, but it gives the error below:

SELECT * FROM "xxx_db"."bot_info" where bot_alias='test_bot';

HIVE_CANNOT_OPEN_SPLIT: Error opening Hive split s3://xxxxxx/xxxxxx/test_bot.parquet (offset=0, length=1198): java.io.IOException: can not read class org.apache.parquet.format.ColumnIndex: Required field 'null_pages' was not present! Struct: ColumnIndex(null_pages:null, min_values:[74 65 73 74 5F 62 6F 74], max_values:[74 65 73 74 5F 62 6F 74], boundary_order:null)

But when I run the query SELECT * FROM "xxx_db"."bot_info" limit 10, it works totally fine.

Rishabh
Asked 1 year ago, 349 views
1 Answer
Greetings from AWS! The error message indicates that the Parquet file s3://xxxxxx/xxxxxx/test_bot.parquet contains records that do not match the table schema, because "Required field 'null_pages' was not present". The query SELECT * FROM "xxx_db"."bot_info" limit 10 probably works because, with LIMIT 10, Athena does not scan the whole dataset and returns only the first 10 records it reads (in no guaranteed order). In contrast, SELECT * FROM "xxx_db"."bot_info" where bot_alias='test_bot'; has to scan the whole table (or partition), so it fails when it reaches the problematic records that do not match the table schema.
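To make the error text easier to read, the byte arrays shown in min_values and max_values can be decoded as text. A minimal sketch in Python, assuming the values are UTF-8 strings (the function name decode_stat is only illustrative):

```python
def decode_stat(hex_bytes: str) -> str:
    """Decode a space-separated hex dump, as printed in the error, into text."""
    # bytes.fromhex skips the spaces between the byte pairs.
    return bytes.fromhex(hex_bytes).decode("utf-8")


# Byte values copied from the min_values / max_values fields of the error.
print(decode_stat("74 65 73 74 5F 62 6F 74"))  # prints: test_bot
```

Both fields decode to the same value that the WHERE clause filters on.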

To fix this issue, I suggest downloading the Parquet file s3://xxxxxx/xxxxxx/test_bot.parquet and checking its schema with "parquet-tools", an open source command-line tool provided by Apache, then making sure that the table schema and the file match each other. I hope this information helps!
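If you prefer a script over parquet-tools, the same check can be sketched in Python. This swaps in the third-party pyarrow library (assumed to be installed) instead of parquet-tools, and it assumes the file has already been downloaded locally. The function names, the local path, and the table column list are hypothetical; only bot_alias is known from the question.

```python
def read_parquet_columns(path):
    """Return {column name: Arrow type name} read from a local Parquet file."""
    # Third-party dependency, imported here so the helper below works without it.
    import pyarrow.parquet as pq

    schema = pq.read_schema(path)  # reads the file schema, not the data
    return dict(zip(schema.names, (str(t) for t in schema.types)))


def diff_column_names(file_columns, table_columns):
    """Compare column names only, ignoring case.

    Returns (names missing from the file, names missing from the table).
    Types are not compared because Arrow and Athena name them differently.
    """
    file_names = {name.lower() for name in file_columns}
    table_names = {name.lower() for name in table_columns}
    return sorted(table_names - file_names), sorted(file_names - table_names)


# Example usage (hypothetical path and table definition):
# file_columns = read_parquet_columns("test_bot.parquet")
# table_columns = {"bot_alias": "string"}
# print(diff_column_names(file_columns, table_columns))
```

An empty pair of lists means the column names agree; the types would still need to be compared by hand against the table definition.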

AWS
Ethan_H
Answered 1 year ago
