The AWS Glue crawler creates multiple tables from my source data. I want to know how to prevent this from happening.
Short description
During the first AWS Glue crawler run, the crawler reads the first 1000 records or the first 1 MB of every file to infer the schema. The amount of data that the crawler reads depends on the file format and availability of a valid record.
For example, if the input is a JSON file, then the crawler reads the first 1 MB of the file. If the crawler reads a valid record within the first 1 MB of the file, then the crawler infers the schema. If the crawler can't infer the schema, then it reads a maximum of 10 MB of the file in 1 MB increments.
For .csv files, the crawler reads either the first 1000 records or the first 1 MB of data, whichever comes first. For Parquet files, the crawler infers the schema directly from the file.
The crawler compares the schemas that it infers from all the subfolders and files, and then creates tables.
When your source data files don't use the same configurations for the following parameters, the crawler creates multiple tables:
- Format, such as .csv, Parquet, or JSON
- Compression type, such as SNAPPY, gzip, or bzip2
- Schema
- Structure of Amazon Simple Storage Service (Amazon S3) partitions
Resolution
Check the crawler logs
To identify the files that cause the crawler to create multiple tables, complete the following steps:
- Open the AWS Glue console.
- In the navigation pane, choose Crawlers.
- Select the crawler that you want to review.
- Choose the Logs link to view the logs on the Amazon CloudWatch console.
If AWS Glue created multiple tables during the previous crawler run, then the log includes entries that are similar to the following:
[439d6bb5-ce7b-4fb7-9b4d-805346a37f88] INFO : Created table
2_part_00000_24cab769_750d_4ef0_9663_0cc6228ac858_c000_snappy_parquet in
database glue
[439d6bb5-ce7b-4fb7-9b4d-805346a37f88] INFO : Created table
2_part_00000_3518b196_caf5_481f_ba4f_3e968cbbdd67_c000_snappy_parquet in
database glue
[439d6bb5-ce7b-4fb7-9b4d-805346a37f88] INFO : Created table
2_part_00000_6d2fffc2_a893_4531_89fa_72c6224bb2d6_c000_snappy_parquet in
database glue
The log entries include the names of the files that cause the crawler to create multiple tables.
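If you prefer to search the logs programmatically, the following boto3 sketch filters the /aws-glue/crawlers log group for "Created table" entries. It assumes that the crawler writes to a log stream named after the crawler (the default behavior) and uses a placeholder crawler name:
import boto3

logs = boto3.client('logs')

# Search the crawler's log stream for "Created table" entries (first page of results only)
response = logs.filter_log_events(
    logGroupName='/aws-glue/crawlers',
    logStreamNames=['doc-example-crawler'],
    filterPattern='"Created table"'
)

for event in response['events']:
    print(event['message'])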
Prevent the creation of multiple tables
To prevent the creation of multiple tables, take one of the following actions for your use case.
Confirm that the data files use the same schema, format, and compression type
In some cases, files use different schemas. For example, schema A says that field X is type INT, and schema B says that field X is type BOOL.
For this use case, use the from_options function to run an AWS Glue extract, transform, and load (ETL) job to read the outlier data. Then, transform the outlier data types to the correct or most common data types in your source.
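The following is a minimal AWS Glue ETL (PySpark) sketch of that approach. It assumes a hypothetical S3 prefix that contains only the outlier Parquet files and a hypothetical field named field_x that must be cast from BOOL to INT; adjust the paths, format, and cast for your data:
# Minimal AWS Glue ETL sketch (PySpark); S3 paths and field_x are placeholders
from awsglue.context import GlueContext
from pyspark.context import SparkContext

glueContext = GlueContext(SparkContext.getOrCreate())

# Read only the outlier files with from_options
outlier_dyf = glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://doc-example-bucket/doc-example-key/outliers/"]},
    format="parquet"
)

# Cast the mismatched field to the most common type in the source (for example, BOOL to INT)
fixed_dyf = outlier_dyf.resolveChoice(specs=[("field_x", "cast:int")])

# Write the corrected files back so that all files share one schema
glueContext.write_dynamic_frame.from_options(
    frame=fixed_dyf,
    connection_type="s3",
    connection_options={"path": "s3://doc-example-bucket/doc-example-key/doc-example-table/"},
    format="parquet"
)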
You can also use the existing table DDL to manually create the table in Amazon Athena. Then, run an AWS Glue crawler to update the table metadata. Configure the crawler so that it doesn't overwrite the existing schema of the table.
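One way to configure the crawler so that it doesn't overwrite the existing schema is to set the crawler's schema change policy to log changes instead of updating the table. The following boto3 sketch assumes a hypothetical crawler named doc-example-crawler:
import boto3

glue = boto3.client('glue')

# Log schema changes instead of updating or deleting the manually created table
glue.update_crawler(
    Name='doc-example-crawler',
    SchemaChangePolicy={
        'UpdateBehavior': 'LOG',
        'DeleteBehavior': 'LOG'
    }
)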
Combine compatible schemas when you create the crawler
The crawler infers the schema at the folder level and compares the schemas across all folders. The crawler checks whether the schemas match and whether the partition threshold is higher than 70 percent. If they match, then the crawler combines the schemas as partitions of a single table. If they don't match, then the crawler creates a table for every folder, which results in a higher number of tables.
Your data might have different schemas in some input files and similar schemas in other files. When you create the crawler, combine compatible schemas.
On the Configure the crawler's output page of the AWS Glue console, under Grouping behavior for S3 data (optional), choose Create a single schema for each S3 path. If the data is compatible, then the crawler ignores the similarity of schemas when it evaluates Amazon S3 objects in the path.
For more information, see Creating a single schema for each Amazon S3 include path.
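You can also set this option through the API. The following boto3 sketch applies the equivalent crawler configuration to a hypothetical crawler named doc-example-crawler:
import boto3

glue = boto3.client('glue')

# Equivalent of "Create a single schema for each S3 path" in the console
glue.update_crawler(
    Name='doc-example-crawler',
    Configuration='{"Version": 1.0, "Grouping": {"TableGroupingPolicy": "CombineCompatibleSchemas"}}'
)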
Check if your input files have different Amazon S3 paths
If your input files have different Amazon S3 structures or paths, then the crawler treats each path as an individual table and creates multiple tables by default.
For example, you run a crawler on the s3://doc-example-bucket/doc-example-key/doc-example-table Amazon S3 path with a partition structure that's similar to the following:
- s3://doc-example-bucket/doc-example-key/doc-example-table/dt=2020-08-20/doc-example-file1.csv
- s3://doc-example-bucket/doc-example-key/doc-example-table/dt=2020-08-21/doc-example-file2.csv
- s3://doc-example-bucket/doc-example-key/doc-example-table/dt=2020-08-22/doc-example-file3.csv
You then add the following files to the preceding three files:
- s3://doc-example-bucket/doc-example-key/doc-example-table/doc-example-file4.csv
- s3://doc-example-bucket/doc-example-key/doc-example-table/doc-example-file5.csv
On the next crawler run, the crawler creates five individual tables because the folder partition structure is inconsistent.
To avoid this issue, manually or programmatically confirm that the Amazon S3 partition structure is consistent. In the preceding example, either delete the Amazon S3 files that don't use the dt=xxxx-xx-xx partition, or add the partition for the files doc-example-file4.csv and doc-example-file5.csv. To exclude unwanted files or folders, use an exclude pattern.
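For example, the following boto3 sketch adds an exclude pattern so that the crawler skips .csv objects that sit directly under the include path but still crawls the dt= partition folders. It assumes Glue's glob semantics, where * doesn't cross folder boundaries, and uses a placeholder crawler name:
import boto3

glue = boto3.client('glue')

# Exclude .csv objects directly under the include path; objects inside the
# dt=... partition folders are still crawled
glue.update_crawler(
    Name='doc-example-crawler',
    Targets={
        'S3Targets': [
            {
                'Path': 's3://doc-example-bucket/doc-example-key/doc-example-table/',
                'Exclusions': ['*.csv']
            }
        ]
    }
)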
Use consistent headers
When you use data that's in .csv format, consistently use headers. If some of your files have headers and other files don't, then the crawler creates multiple tables.
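If you can't change the files themselves, another option is a custom CSV classifier that tells the crawler explicitly whether a header row is present. The following boto3 sketch uses hypothetical classifier, crawler, and column names:
import boto3

glue = boto3.client('glue')

# Custom CSV classifier that declares the header row explicitly
glue.create_classifier(
    CsvClassifier={
        'Name': 'doc-example-csv-classifier',
        'Delimiter': ',',
        'ContainsHeader': 'PRESENT',
        'Header': ['id', 'name', 'dt']
    }
)

# Attach the classifier to the crawler
glue.update_crawler(
    Name='doc-example-crawler',
    Classifiers=['doc-example-csv-classifier']
)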
Delete multiple tables that a crawler run created
Filter the tables from the crawler run in your CloudWatch logs. Then, call the BatchDeleteTable API to delete the tables.
Complete the following steps:
- Open the AWS Glue console.
- In the navigation pane, choose Crawlers.
- Under Crawler runs, select the target crawler name.
- Choose View run details.
- On the Crawler run details page, note the crawler run's ID to use in the BatchDeleteTable call.
- Choose CloudWatch logs, and then select the /aws-glue/crawlers log group.
- Choose Actions, and then choose Export data to Amazon S3.
- Enter the following information to export the data:
For the time range, use the crawler run's start and end time.
For the stream prefix, add the crawler name.
Select your target S3 bucket and prefix to export the crawler run logs.
- For the /aws-glue/crawlers log group, choose Actions, and then choose View all exports to Amazon S3.
- Select the export task for the /aws-glue/crawlers log group and the target S3 bucket.
- Choose View in Amazon S3.
- On the Amazon S3 console, choose the folder with the target crawler name.
- Download the exported log file in .gz format to your local machine or an Amazon Elastic Compute Cloud (Amazon EC2) instance. Then, decompress the file to a local path. Note the path location to use in the BatchDeleteTable call.
- To filter the tables, scan the decompressed file from your local machine or EC2 instance.
- Call the AWS Glue BatchDeleteTable API to delete the tables.
Example script:
import boto3
import re

# Input parameters: replace these with your values
catalogId = 'CatalogId'
database = 'databaseName'
crawlId = 'crawlRunId'
logFilePath = 'local_decompressed_file_path'

# Extract the tables that the given crawl run ID created in the target database
def extract_table_info(file_path, target_crawl_id, target_database):
    list_tables = []
    with open(file_path, 'r') as file:
        for line in file:
            match = re.search(f'.*\\[{target_crawl_id}\\] INFO : Created table (\\w+(?:_[a-f0-9]+)*) in database {target_database}', line)
            if match:
                table_name = match.group(1)
                list_tables.append(table_name)
    return list(set(list_tables))

createdTables = extract_table_info(logFilePath, crawlId, database)

# Initialize the AWS Glue client and call batch_delete_table in batches of up to 100 tables
client = boto3.client('glue')
for i in range(len(createdTables) // 100 + 1):
    tables_to_delete = createdTables[100 * i : 100 * (i + 1)]
    if not tables_to_delete:
        # Skip the final batch when the table count is an exact multiple of 100
        continue
    response = client.batch_delete_table(
        CatalogId=catalogId,
        DatabaseName=database,
        TablesToDelete=tables_to_delete
    )
    print(i)
    print(response)
Note: Replace the example input parameters with your input parameters. The preceding example script uses the default AWS Region setting on your local machine or EC2 instance. To change the Region in boto3, see Configuration.
Related information
Managing partitions for ETL output in AWS Glue
Customizing crawler behavior
Export log data to Amazon S3 using the console