Why does Redshift Spectrum do full table scans for Iceberg tables which have a nested schema?

1

I've been trying to test out Iceberg tables with Amazon Redshift Spectrum and have come across a major issue.

Here is my setup:

  1. I create an iceberg table via spark (emr 7.0) and insert data across multiple partitions
  2. The table uses the Glue catalog
  3. I add the glue database to Spectrum

Here is an example table DDL:

CREATE TABLE iceberg_test (
  id string,
  year string,
  last_name string,
  first_name string,
  age double,
  nested struct<nested_field: string>)
PARTITIONED BY (`year`, `last_name`)
LOCATION 's3://data/iceberg/main.db/iceberg_test'
TBLPROPERTIES (
  'table_type'='iceberg',
  'write_compression'='zstd'
);

With this table, i have two partition columns, year and last_name. I insert data across multiple partitions. For example: year = 2022, last_name = smith year = 2021, last_name = bob

Whenever i issue a query against this table via spectrum, Spectrum will do a full table scan rather than read just the partitions specified in the query. An example query would be

select * from main.iceberg_test where year = 2022 and last_name = 'smith';

I've confirmed the scan behaviour by looking at my S3 access logs for the table.

The issue appears to be caused by the presence of the nested field in the table. Any table with a nested struct field will cause Spectrum to ignore the partitions and do a full table scan.

If i remove this field from the table, I can verify that spectrum only accesses the folders for the partitions in the query.

Is there some work around for this or some setting that i'm missing? I don't see any indication on the Redshift Iceberg docs about this significant limitation.

ukayani
asked 14 days ago235 views
No Answers

You are not logged in. Log in to post an answer.

A good answer clearly answers the question and provides constructive feedback and encourages professional growth in the question asker.

Guidelines for Answering Questions