Recurring Issue with AWS Athena When Running Queries on an Iceberg Table
Description
I am trying to run queries on an Iceberg table using AWS Athena. The data is stored in S3, and I am using EMR 6.12.0, Iceberg 1.3.0-amzn-0, and Spark 3.4.0. The ingestion process runs on EMR: it consumes data from a Kafka topic and writes it to my Iceberg table in S3. The failure is intermittent: sometimes the query runs successfully, but other times I get the following error in Athena:
ICEBERG_CANNOT_OPEN_SPLIT: Error opening Iceberg split my_s3_path/data/id_pk_bucket=2/created_at_month=2023-08/my_parquet.parquet (offset=4, length=16038): Incorrect file size (16042) for file (end of stream not reached): my_s3_path/data/id_pk_bucket=2/created_at_month=2023-08/my_parquet.parquet
The error occurs only in Athena; querying the same table with Spark works fine.
Steps to Reproduce
Configured EMR with version 6.12.0 and Spark 3.4.0.
Set up an ingestion process on EMR to consume data from a Kafka topic and insert it into an Iceberg table on S3.
Created an Iceberg table on S3 using Iceberg version 1.3.0-amzn-0 and the following properties:
OPTIONS (
    'format-version'='2',
    'write.target-file-size-bytes'='124217728',
    'history.expire.max-snapshot-age-ms'='172800000'
)
PARTITIONED BY (bucket(10, my_pk), months(created_at))
Data write process executed in Spark:
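# Configure the streaming writer (this is a DataStreamWriter, not yet a running query).
query = (
    df.writeStream.format("iceberg")
    .outputMode("append")
    .trigger(once=True)  # process the available data once, then stop
    .option("path", iceberg_table)
    .option("fanout-enabled", "true")
    .option(
        "checkpointLocation",
        checkpoint_location,
    )
)
# toTable() starts the stream and returns the StreamingQuery we wait on.
query.toTable(iceberg_table).awaitTermination()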
Ran a query in AWS Athena:
SELECT * FROM "db"."table" LIMIT 10;
Expected Result
I expected the query in AWS Athena to run without any issues.
Actual Result
I am receiving a recurring error, ICEBERG_CANNOT_OPEN_SPLIT, which appears to indicate a mismatch in the file size Athena expects or a problem reading the file from S3.
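For what it is worth, here is a quick sanity check on the numbers in the error message, as a minimal sketch. The helper name parse_split_error and the regular expression are my own and purely illustrative; they are not part of Athena or Iceberg. The script only pulls the offset, length, and reported file size out of the message and checks whether the split ends exactly at the reported size.

```python
import re

# Illustrative pattern (an assumption, not an Athena API): captures the
# offset, length, and reported file size from the error text.
SPLIT_ERROR = re.compile(
    r"\(offset=(?P<offset>\d+), length=(?P<length>\d+)\): "
    r"Incorrect file size \((?P<size>\d+)\)"
)


def parse_split_error(message):
    """Return offset, length, and reported size as integers, or None if absent."""
    match = SPLIT_ERROR.search(message)
    if match is None:
        return None
    return {name: int(value) for name, value in match.groupdict().items()}


# The error message exactly as reported above.
message = (
    "ICEBERG_CANNOT_OPEN_SPLIT: Error opening Iceberg split "
    "my_s3_path/data/id_pk_bucket=2/created_at_month=2023-08/my_parquet.parquet "
    "(offset=4, length=16038): Incorrect file size (16042) for file "
    "(end of stream not reached): "
    "my_s3_path/data/id_pk_bucket=2/created_at_month=2023-08/my_parquet.parquet"
)

info = parse_split_error(message)
# Does the split end exactly at the file size Athena was given?
split_covers_whole_file = info["offset"] + info["length"] == info["size"]
print(info, split_covers_whole_file)
```

Here offset + length equals the reported size (4 + 16038 = 16042), so the split runs to the end of the file as Athena understands it. A possible next check, which I have not confirmed as the cause, would be to compare that reported size with the actual object size in S3 (for example the ContentLength returned by boto3's head_object) for the same file.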
Additional Information
EMR Version: 6.12.0
Iceberg Version: 1.3.0-amzn-0
Spark Version: 3.4.0
Catalog: AWS Glue
I am open to providing more information as needed. Thank you!
Did you find the answer? Could you please let me know?