Partition schema mismatch in Glue Table


Hi Team,

We have a dataset coming in from the source team that changes on a daily basis; for example, one day the dataset has 100 columns and another day it has 92 columns, and the files are in CSV format (tab-separated). We created a Glue crawler to crawl all the files at once. The problem is that we are able to load the dataset into the Glue schema/table, but the data within the table is misaligned, meaning records are not aligned with the column names as they should be.

While creating the Glue crawler, I set the following configuration:

In **Grouping behavior for S3 data (optional)**, I checked **Create a single schema for each S3 path**. In **Configuration options (optional)**: During the crawler run, all schema changes are logged.

When the crawler detects schema changes in the data store, how should AWS Glue handle table updates in the data catalog? -> Add new columns only. -> Update all new and existing partitions with metadata from the table.

How should AWS Glue handle deleted objects in the data store? -> Mark the table as deprecated in the data catalog.
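For reference, my settings roughly correspond to the following boto3 call. This is only a sketch: the crawler name, IAM role, bucket, and database are placeholders, and the mapping of the console options onto configuration keys is my best understanding.

```python
import json
import boto3  # AWS SDK for Python

glue = boto3.client("glue")

# Crawler name, IAM role, database, and S3 path below are placeholders.
glue.create_crawler(
    Name="daily-feed-crawler",
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",
    DatabaseName="source_db",
    Targets={"S3Targets": [{"Path": "s3://my-bucket/daily-feed/"}]},
    # "Create a single schema for each S3 path" plus the column/partition update options
    Configuration=json.dumps({
        "Version": 1.0,
        "Grouping": {"TableGroupingPolicy": "CombineCompatibleSchemas"},
        "CrawlerOutput": {
            "Tables": {"AddOrUpdateBehavior": "MergeNewColumns"},       # add new columns only
            "Partitions": {"AddOrUpdateBehavior": "InheritFromTable"},  # update partitions from table metadata
        },
    }),
    SchemaChangePolicy={
        "UpdateBehavior": "UPDATE_IN_DATABASE",
        "DeleteBehavior": "DEPRECATE_IN_DATABASE",  # mark deleted objects as deprecated
    },
)
```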

Could you please help me figure out how to fix this misalignment problem? Thank you in advance.

Regards, Apurva

Asked 2 years ago · 381 views
1 Answer

To fix the misalignment problem with the dataset where the records are not aligned with the column names in the Glue table, you can follow these steps:

Modify the Glue Crawler Configuration: In the Glue Crawler configuration, you selected the option to create a single schema for each S3 path, which means the crawler merges all the files it crawls under that path into a single schema. Since the dataset changes daily with varying column counts, this configuration is not suitable: a file with fewer columns is still read positionally against the wider merged schema, so its values end up under the wrong column names. To resolve this, modify the Glue Crawler configuration so that each file's schema is treated individually; then changes in column counts from one file to another won't cause misalignment.
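As a rough sketch of that change with boto3 (the crawler name is a placeholder; dropping the `Grouping` policy lets the crawler create separate tables when it finds incompatible schemas instead of merging them):

```python
import json
import boto3

glue = boto3.client("glue")

# Remove the "single schema per S3 path" grouping so incompatible schemas
# are no longer merged into one table ("daily-feed-crawler" is a placeholder).
glue.update_crawler(
    Name="daily-feed-crawler",
    Configuration=json.dumps({
        "Version": 1.0,
        "CrawlerOutput": {
            "Partitions": {"AddOrUpdateBehavior": "InheritFromTable"},
        },
    }),
)
```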

Run the Modified Glue Crawler: After modifying the configuration, run the Glue Crawler again. This time, it will create separate schemas for each file it crawls, ensuring that the data aligns correctly with the column names.
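A minimal sketch of re-running the crawler programmatically and waiting for it to finish (the crawler name is a placeholder):

```python
import time
import boto3

glue = boto3.client("glue")

glue.start_crawler(Name="daily-feed-crawler")  # placeholder crawler name

# Poll until the crawler is back in the READY state before inspecting the catalog.
while glue.get_crawler(Name="daily-feed-crawler")["Crawler"]["State"] != "READY":
    time.sleep(30)
```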

Update the Glue Table: Once the modified Glue Crawler has finished crawling, check the generated Glue table. Each file should have its own schema associated with it. You might see multiple tables in the Glue Data Catalog, one for each file. Review the table schemas and verify that the records align properly with the column names.
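One way to spot-check the resulting tables and their columns, assuming the placeholder database name used above:

```python
import boto3

glue = boto3.client("glue")

# Print each table created by the crawler together with its column names
# so the alignment can be verified ("source_db" is a placeholder database).
paginator = glue.get_paginator("get_tables")
for page in paginator.paginate(DatabaseName="source_db"):
    for table in page["TableList"]:
        columns = [c["Name"] for c in table["StorageDescriptor"]["Columns"]]
        print(table["Name"], len(columns), columns[:5])
```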

Querying the Data: When querying the data, you need to specify the appropriate Glue table that corresponds to the specific file you want to query. By selecting the relevant table, you ensure that the data alignment is correct, and the query results match the expected schema.
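For example, with Athena you would point the query at the specific catalog table that matches the file layout you need. The table, database, and output location below are placeholders:

```python
import boto3

athena = boto3.client("athena")

# Query one specific catalog table; adjust the table name to the file/schema
# you want ("daily_feed_2024_01_15" and the output bucket are placeholders).
athena.start_query_execution(
    QueryString='SELECT * FROM "source_db"."daily_feed_2024_01_15" LIMIT 10',
    QueryExecutionContext={"Database": "source_db"},
    ResultConfiguration={"OutputLocation": "s3://my-bucket/athena-results/"},
)
```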

By creating separate schemas for each file during the Glue Crawler run, you allow for varying column counts in the dataset without causing misalignment issues. Each file will have its own table with its specific schema, making it easier to query and work with the data.

Remember to re-run the Glue Crawler whenever new files with different column counts are added to the S3 path to ensure that the new schemas are generated and the tables stay up to date with the dataset changes.
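If the files arrive on a predictable cadence, you can put the crawler on a schedule instead of re-running it by hand. A rough sketch (crawler name and cron expression are placeholders):

```python
import boto3

glue = boto3.client("glue")

# Run the crawler every day at 02:00 UTC so newly delivered files are crawled
# automatically ("daily-feed-crawler" is a placeholder name).
glue.update_crawler(
    Name="daily-feed-crawler",
    Schedule="cron(0 2 * * ? *)",
)
```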

Answered 10 months ago
