[Feature request] Increase AWS Glue Catalog 400 KB schema size limit

We used an AWS Glue Crawler to register a Delta table in S3 into the AWS Glue Catalog. However, the crawl failed with the error

ERROR : Internal Service Exception

After reading https://repost.aws/knowledge-center/glue-crawler-internal-service-exception, I think the issue is caused by the 400 KB schema size limit of the AWS Glue Catalog. The article says:

Your crawler might fail with an "Unable to create table in Catalog" or "Payload size of request exceeded limit" error message. When this happens, monitor the size of the table schema definition. There's no limitation on the number of columns that a table in the Data Catalog can have. But, there is a 400 KB limit on the total size of the schema. A large number of columns contributes to the total schema size exceeding the 400 KB limit. Potential workarounds include breaking the schema into multiple tables and removing the unnecessary columns. You can also consider decreasing the size of metadata by reducing column names.

We have a lot of time-series sensor data in this Delta table, which has nearly 9,000 columns.
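
For reference, here is a rough way to check how large the serialized schema actually is. This is only an approximation of what the crawler sends to the Data Catalog; it assumes the table can be read with PySpark with Delta Lake support, and the `spark` session and S3 path below are placeholders:

```python
# Rough check of the serialized schema size against the 400 KB Data Catalog limit.
# Assumes PySpark with Delta Lake support is configured; the path is a placeholder.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.read.format("delta").load("s3://my-bucket/data/delta-tables/my_table/")
schema_bytes = len(df.schema.json().encode("utf-8"))
print(f"{len(df.columns)} columns, ~{schema_bytes / 1024:.1f} KB of schema JSON")
```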

I tried using the first 10 characters of the SHA-512 hash of each original column name as the new column name and wrote a new Delta table. For example, if the original column name is very_very_very_very_looooong_column_name, the new column name is the first 10 characters of its SHA-512 hash, 23140d3feb. This should significantly decrease the final schema size.
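
A minimal sketch of that renaming step, assuming PySpark with Delta Lake support (the `spark` session and the S3 paths are placeholders):

```python
# Sketch: rename every column to the first 10 hex characters of its SHA-512 hash,
# then rewrite the data as a new Delta table. Paths are placeholders.
import hashlib

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

def short_name(column: str, length: int = 10) -> str:
    # First `length` hex characters of the column name's SHA-512 digest.
    return hashlib.sha512(column.encode("utf-8")).hexdigest()[:length]

df = spark.read.format("delta").load("s3://my-bucket/data/delta-tables/my_table/")
renamed = df.toDF(*[short_name(c) for c in df.columns])
renamed.write.format("delta").mode("overwrite").save(
    "s3://my-bucket/data/delta-tables/my_table_hashed/"
)
```

(A mapping from original column names to hashed names of course has to be kept on the side so the data remains interpretable.)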

However, when I try to use the AWS Glue Crawler to register this new Delta table with the new column names in S3 into the AWS Glue Catalog, I still get the same error.

I have another very similar table with only 2,000 columns, and the AWS Glue Crawler successfully registers that Delta table in S3 into the AWS Glue Catalog. So this makes me strongly suspect that the AWS Glue Catalog schema size limit is what caused the issue.

Feature Request 1

I am wondering if the AWS team could help increase the 400 KB limit? For time-series sensor data, the number of columns can easily grow very large. Thanks!


UPDATE 8/30/2023

I made the column names even shorter, using only the first 7 characters of the SHA-512 hash (e.g. very_very_very_very_looooong_column_name -> 23140d3), and wrote a new Delta table to S3. Now when I run the AWS Glue crawler to scan this new Delta table, the error becomes

2023-08-30T19:26:16.049-07:00	[25d18b71-6405-4464-b76a-a079413c643d] BENCHMARK : Running Start Crawl for Crawler my-delta-lake-crawler--data
2023-08-30T19:26:37.842-07:00	[25d18b71-6405-4464-b76a-a079413c643d] BENCHMARK : Classification complete, writing results to database my_delta_db
2023-08-30T19:26:37.870-07:00	[25d18b71-6405-4464-b76a-a079413c643d] INFO : Crawler configured with Configuration 
{
    "Version": 1,
    "CrawlerOutput": {
        "Partitions": {
            "AddOrUpdateBehavior": "InheritFromTable"
        }
    },
    "CreatePartitionIndex": true
}
 and SchemaChangePolicy 
{
    "UpdateBehavior": "UPDATE_IN_DATABASE",
    "DeleteBehavior": "DELETE_FROM_DATABASE"
}
. Note that values in the Configuration override values in the SchemaChangePolicy for S3 Targets.

2023-08-30T19:26:38.692-07:00	[25d18b71-6405-4464-b76a-a079413c643d] ERROR : Payload size of request exceeded limit (Service: AWSGlue; Status Code: 400; Error Code: InvalidInputException; Request ID: 076b2886-63bf-4a8f-81d6-6a372d3b8693; Proxy: null)
2023-08-30T19:27:47.341-07:00	[25d18b71-6405-4464-b76a-a079413c643d] BENCHMARK : Crawler has finished running and is in state READY

Feature Request 2

Could the AWS team also resolve the payload size issue (just some random thoughts, for example automatically splitting the schema across multiple requests)? Thank you so much!


UPDATE 9/1/2023

I made the column names even shorter, only 3 characters each (e.g. very_very_very_very_looooong_column_name -> a12). These names are not derived from hashing; instead, I generate them in a way that guarantees uniqueness:

  • The first character is a lowercase letter
  • The remaining two characters are either a lowercase letter or a digit

This scheme can represent up to 26 * 36 * 36 = 33,696 fields (a short sketch of the generation is below), and I cannot make the names any shorter, because 26 * 36 = 936 < our ~9,000 fields.
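
A minimal sketch of how such names can be generated (the list of original column names below is a placeholder):

```python
# Sketch: generate unique 3-character names where the first character is a
# lowercase letter and the other two are lowercase letters or digits,
# giving 26 * 36 * 36 = 33,696 possible names.
import itertools
import string

FIRST = string.ascii_lowercase                 # 26 choices for the first character
REST = string.ascii_lowercase + string.digits  # 36 choices for the other two

def short_names():
    # Yields 'aaa', 'aab', ..., up to 33,696 unique names.
    for a, b, c in itertools.product(FIRST, REST, REST):
        yield a + b + c

# Placeholder for the real ~9,000 original column names.
original_columns = [f"sensor_{i}" for i in range(9000)]
mapping = dict(zip(original_columns, short_names()))
```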

Then I wrote a new Delta table to S3. When I run the AWS Glue crawler to scan this new Delta table, the error is still

ERROR : Payload size of request exceeded limit


UPDATE 9/6/2023

Using Amazon Athena to create the Delta table

```sql
create external table awsdatacatalog.my_delta_db.my_table
location 's3://integrated-test-data-archer/data/delta-tables/my_table/'
tblproperties (
    'table_type' = 'delta'
);
```

also gives an error:

```shell
Delta Table DDL failed with an unexpected error.
This query ran against the "itg_delta_db" database, unless qualified by the query. Please post the error message on our forum or contact customer support with Query Id: 8e542a72-d557-423c-a463-a570acb88d72
```

We did succeed in splitting the Delta table by column groups. Unfortunately, simply filtering within each group and then joining all of these separate tables back together already makes our SQL queries extremely long, which is not very user friendly.
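
For reference, a sketch of this kind of column split, assuming PySpark with Delta Lake support (the join key name, group size, and paths are placeholders):

```python
# Sketch: split one wide Delta table into several narrower ones that share a
# join key, so each part stays under the Data Catalog schema size limit.
# The key column, group size, and paths are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

KEY = "event_time"   # shared join key kept in every split table
GROUP_SIZE = 2000    # number of value columns per split table

df = spark.read.format("delta").load("s3://my-bucket/data/delta-tables/my_table/")
value_cols = [c for c in df.columns if c != KEY]

for i in range(0, len(value_cols), GROUP_SIZE):
    part = df.select([KEY] + value_cols[i:i + GROUP_SIZE])
    part.write.format("delta").mode("overwrite").save(
        f"s3://my-bucket/data/delta-tables/my_table_part_{i // GROUP_SIZE}/"
    )
```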
