How can I prevent the AWS Glue crawler from creating multiple tables?
The AWS Glue crawler is creating multiple tables from my source data, and I want to prevent this from happening.
During the first AWS Glue crawler run, the crawler reads either the first 1000 records or the first megabyte of every file. It does this to infer the schema. The amount of data that the crawler reads depends on the file format and availability of a valid record.
For example, if the input is a JSON file, then the crawler reads the first 1 MB of the file to infer the schema. If the crawler reads a valid record within the first 1 MB of the file, then the crawler infers the schema. If the crawler can't infer the schema after reading the first 1 MB, then it reads a maximum of 10 MB of the file. It does this in 1-MB increments.
For CSV files, the crawler reads either the first 1000 records or the first 1 MB of data, whichever comes first. For Parquet files, the crawler infers the schema directly from the file.
The crawler compares the schemas that it infers from all the subfolders and files, and then creates one or more tables.
The AWS Glue crawler creates multiple tables when your source data files don't use these same configurations:
- Format (such as CSV, Parquet, or JSON)
- Compression type (such as SNAPPY, gzip, or bzip2)
- Structure of Amazon Simple Storage Service (Amazon S3) partitions
Check the crawler logs
Check the crawler logs to identify the files that are causing the crawler to create multiple tables:
1. Open the AWS Glue console.
2. In the navigation pane, choose Crawlers.
3. Choose the crawler that you want to review the logs for.
4. Choose the Logs link to view the logs on the Amazon CloudWatch console.
5. If AWS Glue created multiple tables during the previous crawler run, then the log includes entries similar to the following entries:
[439d6bb5-ce7b-4fb7-9b4d-805346a37f88] INFO : Created table 2_part_00000_24cab769_750d_4ef0_9663_0cc6228ac858_c000_snappy_parquet in database glue
[439d6bb5-ce7b-4fb7-9b4d-805346a37f88] INFO : Created table 2_part_00000_3518b196_caf5_481f_ba4f_3e968cbbdd67_c000_snappy_parquet in database glue
[439d6bb5-ce7b-4fb7-9b4d-805346a37f88] INFO : Created table 2_part_00000_6d2fffc2_a893_4531_89fa_72c6224bb2d6_c000_snappy_parquet in database glue
The log entries include the names of the files that are causing the crawler to create multiple tables.
Prevent creation of multiple tables
To prevent the creation of multiple tables, take one of the following actions:
Confirm that the all data files use the same schema, format, and compression type
In some cases, files use different schemas. For example, schema A might say that field X is type INT, and schema B says that field X is type BOOL. In this use case, take one of the following actions:
- Use the from_options function to run an AWS Glue extract, transform, and load (ETL) job to read the outlier data. Then, transform the outlier data types to the correct or most common data types in your source.
- Use the existing table DDL to manually create the table in Amazon Athena. Then, run an AWS Glue crawler to update the table metadata. Use the crawler setting to inherit the schema from the table.
Combine compatible schemas when you create the crawler
The crawler infers the schema at folder level and compares the schemas across all folders. The crawler checks if the schemas match and that the partition threshold is higher than 70 percent. If they match, then the schemas are denoted as partitions of a table. If they don’t match, then the crawler creates a table for every folder, resulting in a higher number of tables.
Your data might have different schemas in some input files and similar schemas in other files. When you create the crawler, you can combine compatible schemas. On the Configure the crawler's output page, under Grouping behavior for S3 data (optional), choose Create a single schema for each S3 path. When you activate this setting and the data is compatible, the crawler ignores the similarity of schemas when evaluating Amazon S3 objects in the path.
For more information, see How to create a single schema for each Amazon S3 include path.
Check if your input files have different Amazon S3 paths
When the structure inside the Amazon S3 prefix isn’t consistent, the crawler assumes each individual path as an individual table. The crawler then creates multiple tables. If your input files have different Amazon S3 structures or paths, then the crawler creates multiple tables. This is the crawler’s default behavior.
For example, a crawler crawls an Amazon S3 path s3://doc-example-bucket/doc-example-key/doc-example-table with a partition structure similar to the following structure:
You add the following additional files to the previous three files:
- s3:// doc-example-bucket/doc-example-key/doc-example-table/doc-example-file5.csv
The crawler then creates five individual tables in another crawler run. This happens because the folder partition structure is inconsistent.
To avoid this issue, review these best practices:
- Make sure that the schema is consistent, either manually or programmatically. In the preceding example, either delete the Amazon S3 files without the partition dt=xxxx-xx-xx, or add the partition for files doc-example-file4.csv and doc-example-file5.csv.
- Use an exclude pattern to exclude unwanted files or folders.
Use consistent headers
When using CSV data, use headers consistently. If some of your files have headers and some don't, then the crawler creates multiple tables.