AWS crawler does not detect new schema / parquet columns

We have an S3 bucket partitioned as:

S3://ID=123/date=2025-06-03/data.parquet

Each date partition contains exactly one data.parquet file. There are a few hundred different IDs, and for each ID a new parquet file arrives about every day.

Sometimes columns are added to new incoming data (old parquet files are unchanged). To detect new partitions (dates and possibly new IDs) we use an AWS Glue Data Catalog crawler. It is set up to run once a day and update both the schema and the partitions.

Recently, about 10 columns were added to all new incoming parquet files, but the crawler does not seem to detect this, so the schema is not updated. The partitions are updated, so we can use Athena to query the new parquet files, but not the added columns.

We have tried deleting the table and rerunning the crawler, as well as pointing the crawler's data source at only one of the new files, but it still does not detect the new columns.

How can we make the crawler detect the new schema, preferably automatically? If that is not possible, what is the recommended approach to create a table/schema that can be used by Athena?

Best Regards, Andreas

asked a year ago, 355 views
1 Answer
When AWS Glue crawlers don't detect schema changes in your Parquet files, there are several potential solutions to address this issue.

First, check your crawler's configuration for schema change handling. In the AWS Glue console, go to your crawler's settings under "Set output and scheduling" and review the "Advanced options" section. Make sure it is not set to "Ignore the change and don't update the table in the data catalog," which leaves the table definition untouched. "Update the table definition in the data catalog" is the setting to aim for; "Add new columns only" should also pick up added columns, but it keeps the existing column definitions as they are.
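The same setting can also be applied through the Glue API instead of the console. The following is a minimal sketch, assuming boto3 is installed and credentials are configured; the crawler name `daily-parquet-crawler` in the usage note is hypothetical, and the delete behavior shown is only an illustrative default that you should set to whatever your crawler already uses.

```python
def schema_change_settings(delete_behavior: str = "DEPRECATE_IN_DATABASE") -> dict:
    """Build the keyword arguments that tell a crawler to update table schemas.

    UPDATE_IN_DATABASE is the API counterpart of the console option
    "Update the table definition in the data catalog".
    """
    allowed = {"LOG", "DELETE_FROM_DATABASE", "DEPRECATE_IN_DATABASE"}
    if delete_behavior not in allowed:
        raise ValueError(f"delete_behavior must be one of {sorted(allowed)}")
    return {
        "SchemaChangePolicy": {
            "UpdateBehavior": "UPDATE_IN_DATABASE",
            "DeleteBehavior": delete_behavior,
        }
    }


def apply_to_crawler(crawler_name: str) -> None:
    """Apply the settings to an existing crawler (the crawler must not be running)."""
    import boto3  # imported lazily so the helper above works without boto3

    glue = boto3.client("glue")
    glue.update_crawler(Name=crawler_name, **schema_change_settings())
```

Calling `apply_to_crawler("daily-parquet-crawler")` would update the crawler in place; the next scheduled run then uses the new policy.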

If your crawler is already configured correctly but still not detecting the schema changes, try these approaches:

  1. Delete the table from the Data Catalog and run the crawler again so that the crawler itself recreates it. If you instead recreate the table using DDL in Athena, the crawler might not recognize it as the same table it previously created, which could cause issues with schema updates.

  2. Try pointing the crawler directly to a specific partition with the new schema (e.g., a specific date folder with the updated Parquet file) rather than the entire bucket.

  3. Check if the new columns are actually present in the Parquet files by using a tool like AWS Glue Studio or a custom Glue job to inspect the files directly.

  4. Consider creating a custom AWS Glue ETL job that reads the Parquet files and explicitly updates the schema in the Data Catalog based on what it finds.
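For step 3, a quick local check can be simpler than a Glue job. This is a rough sketch, assuming one of the new data.parquet files has been downloaded locally and that pyarrow is installed; the column names in the usage note are made up. Names are compared case-insensitively because the crawler typically stores column names in lowercase.

```python
def missing_from_catalog(file_columns: list[str], catalog_columns: list[str]) -> list[str]:
    """Return the file's column names that the catalog table does not list yet."""
    known = {name.lower() for name in catalog_columns}
    return [name for name in file_columns if name.lower() not in known]


def parquet_column_names(local_path: str) -> list[str]:
    """Read only the Parquet footer and return the column names."""
    import pyarrow.parquet as pq  # imported lazily; only needed for real files

    return pq.read_schema(local_path).names
```

For example, `missing_from_catalog(["id", "Value", "new_col"], ["id", "value"])` returns `["new_col"]`. If the list comes back empty for a new file, the columns are not in the data and the crawler is not the problem.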

If the crawler continues to have issues, you can manually update the table schema in the AWS Glue Data Catalog to include the new columns, or create a new table definition with the complete schema using Athena DDL statements.
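If you go the Athena DDL route, the statement to add columns to an existing table is `ALTER TABLE ... ADD COLUMNS`. Here is a small sketch that builds it from a list of names and types; the table and column names in the usage note are hypothetical, and the types must match what is actually stored in the Parquet files.

```python
def add_columns_ddl(table: str, columns: list[tuple[str, str]]) -> str:
    """Build an Athena ALTER TABLE ADD COLUMNS statement.

    `columns` is a list of (column_name, athena_type) pairs.
    """
    if not columns:
        raise ValueError("at least one column is required")
    # Backticks quote identifiers in Athena DDL statements.
    column_list = ", ".join(f"`{name}` {dtype}" for name, dtype in columns)
    return f"ALTER TABLE `{table}` ADD COLUMNS ({column_list})"
```

`add_columns_ddl("events", [("score", "double"), ("label", "string")])` produces ``ALTER TABLE `events` ADD COLUMNS (`score` double, `label` string)``, which can be run in the Athena query editor with the right database selected. Older partitions that lack these columns should simply return NULL for them.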

For a more automated approach in the future, you might want to implement a solution using AWS Glue ETL jobs that automatically detect schema changes and update the Data Catalog accordingly, rather than relying solely on the crawler functionality.
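One possible shape for that automation, whether it runs in a Glue job or a scheduled script, is to merge newly discovered columns into the existing table definition through the Glue API. The sketch below assumes boto3 and valid credentials; how the `discovered` list is produced (for instance by reading the newest file's schema) is left out, and the function names are illustrative rather than any AWS-provided helper. `get_table` returns read-only fields that `update_table` rejects, so only a subset of keys is copied.

```python
# Keys from a get_table response that are also accepted in a TableInput.
ALLOWED_TABLE_INPUT_KEYS = (
    "Name", "Description", "Owner", "Retention",
    "StorageDescriptor", "PartitionKeys", "TableType", "Parameters",
)


def merge_columns(existing: list[dict], discovered: list[tuple[str, str]]) -> list[dict]:
    """Return the existing columns followed by discovered ones not present yet."""
    known = {col["Name"].lower() for col in existing}
    merged = list(existing)
    for name, dtype in discovered:
        if name.lower() not in known:
            merged.append({"Name": name.lower(), "Type": dtype})
            known.add(name.lower())
    return merged


def table_input_with_columns(table: dict, discovered: list[tuple[str, str]]) -> dict:
    """Turn a get_table 'Table' dict into a TableInput with the merged columns."""
    table_input = {k: table[k] for k in ALLOWED_TABLE_INPUT_KEYS if k in table}
    storage = dict(table_input["StorageDescriptor"])
    storage["Columns"] = merge_columns(storage.get("Columns", []), discovered)
    table_input["StorageDescriptor"] = storage
    return table_input


def update_catalog_table(database: str, table_name: str,
                         discovered: list[tuple[str, str]]) -> None:
    """Fetch the table, add any missing columns, and write it back."""
    import boto3  # imported lazily so the pure helpers work without boto3

    glue = boto3.client("glue")
    table = glue.get_table(DatabaseName=database, Name=table_name)["Table"]
    glue.update_table(DatabaseName=database,
                      TableInput=table_input_with_columns(table, discovered))
```

Existing columns keep their order and types, so repeated runs are harmless. Keep in mind that a later crawler run may still rewrite the table depending on its schema change settings.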
Sources
Configure a crawler to handle schema changes | AWS re:Post
AWS Glue crawler creating new version glue data catalog tables when no change in schema | AWS re:Post
crawler can not find the table | AWS re:Post
Preventing a crawler from changing an existing schema - AWS Glue

answered a year ago
