Error message while querying partition with parquet format

Query Id: c4e3e87f-45b1-4ad1-a0fa-fadf179d6cbd

HIVE_BAD_DATA: Not valid Parquet file: s3://sherlock-inventory-usamazon/spear-prod-usamazon-inventory/Refund/dt=2022-12-28/sea-events-batched-archival_2022-12-28_2022-12-29_c7dc1157-9b94-468e-be84-d171a7404981.parquet expected magic number: PAR1 got: ��u�

Asked 1 year ago, 2717 views
1 Answer

Please let us know how this Parquet file was generated. These errors usually come from the different ways Parquet files can be created, some of which are not compatible with Athena. Athena uses the Hive Parquet SerDe (org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe), which expects all columns to be present in the source Parquet file. Some packages write Parquet files that exclude a column if that column is blank in the data. For example, if the data has no value for the "x" column, then the "x" column is omitted from the Parquet file itself.

When you read this file through Athena, it attempts to read the metadata first and then the actual data. Here are a few suggestions to troubleshoot:

  • Try changing the Athena engine version (under Amazon Athena > Workgroups > Manual > V3 Engine).
  • Use S3 Select in the S3 console to see if the data is formatted correctly.
  • Download this file to a Linux/Mac machine and use parquet-tools to confirm that it is a valid Parquet file.
  • Check the SerDe defined in the table DDL and make sure you are using the right SerDe.
  • Format your data in AWS Glue (ETL Programming), then write it to a Parquet file or directly into a Catalog table defined as Parquet.
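
As a quick local check alongside parquet-tools, the magic-number test named in the error message can be sketched in Python. A Parquet file begins and ends with the 4-byte marker PAR1, so comparing those bytes shows whether the file is even shaped like Parquet. The function name `check_parquet_magic` and the local path are assumptions for illustration only; passing this check does not prove the file is fully valid, it only rules out an obviously different format.

```python
import os

PARQUET_MAGIC = b"PAR1"
# Smallest possible layout: leading magic + 4-byte footer length + trailing magic.
MIN_SIZE = 12


def check_parquet_magic(path):
    """Return (ok, head, tail) for a local file.

    ok is True only when the file is at least MIN_SIZE bytes long and both
    its first and last 4 bytes equal PAR1. head and tail are the raw bytes
    found, so they can be inspected when the check fails.
    """
    size = os.path.getsize(path)
    with open(path, "rb") as f:
        head = f.read(4)
        # Jump to the last 4 bytes (or the start, for very small files).
        f.seek(max(size - 4, 0))
        tail = f.read(4)
    ok = size >= MIN_SIZE and head == PARQUET_MAGIC and tail == PARQUET_MAGIC
    return ok, head, tail
```

Download the object from S3 first and pass its local path, then print `head.hex()` when the check fails. If the leading bytes are `1f8b`, for example, the file is probably gzip-compressed rather than Parquet, which would point at the writer rather than at Athena.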

If this helped, please accept the answer or upvote it for everyone's benefit.

AWS
Answered 1 year ago
