I wrote an Iceberg table using Athena, and stored it into an S3 bucket. Data files were written using Parquet file format. After downloading it and trying to select the data I wrote into it using pyarrow, it fails. It seems that Athena writes an invalid encoding of data
>>> import pyarrow.parquet as pq
>>> table = pq.read_table('1ef2a2f6-87f2-4ab9-845e-c7e85d68866c.snappy.parquet')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/ec2-user/.local/lib/python3.7/site-packages/pyarrow/parquet/__init__.py", line 2828, in read_table
use_pandas_metadata=use_pandas_metadata)
File "/home/ec2-user/.local/lib/python3.7/site-packages/pyarrow/parquet/__init__.py", line 2475, in read
use_threads=use_threads
File "pyarrow/_dataset.pyx", line 331, in pyarrow._dataset.Dataset.to_table
File "pyarrow/_dataset.pyx", line 2577, in pyarrow._dataset.Scanner.to_table
File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
File "pyarrow/error.pxi", line 115, in pyarrow.lib.check_status
OSError: Malformed levels. min: 24 max: 24 out of range. Max Level: 1
You can reproduce the issue by doing the above with a parquet file written by the statements
CREATE TABLE athena_table (x int) LOCATION 's3://<your-bucket>/<dir>/'
TBLPROPERTIES (
'table_type' = 'ICEBERG',
'format' = 'parquet',
'write_compression' = 'snappy'
);
insert into athena_table values(43),(43),(43),(43),(43),(43),(43),(43);