How does the AWS Glue crawler detect the schema?

When I run an AWS Glue crawler, the crawler creates multiple tables with schemas that look similar. I want to know how the crawler detects the schema.

Resolution

Schema detection in crawler

During the first crawler run, the crawler reads either the first 1,000 records or the first megabyte of each file to infer the schema. The amount of data that the crawler reads depends on the file format and availability of a valid record. For example, if the input file is a JSON file, then the crawler reads the first 1 MB of the file to infer the schema.

If the crawler reads a valid record within the first 1 MB of the file, then the crawler infers the schema. If the crawler can't infer the schema after 1 MB, then it continues to read up to 10 MB of the file in increments of 1 MB.
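The incremental read described above can be pictured with a minimal sketch. This is an illustration only, not the crawler's actual implementation: the function name `first_record_from_prefix` and its parameters are hypothetical, and it assumes a local UTF-8 file that contains JSON objects.

```python
import json

MB = 1024 * 1024
MAX_BYTES = 10 * MB  # upper bound mentioned in the text above


def first_record_from_prefix(path, step=MB, max_bytes=MAX_BYTES):
    """Read `path` in `step`-sized increments until one JSON record parses.

    Returns the first parsed record, or None if no valid record is found
    within `max_bytes`. Hypothetical helper, for illustration only.
    """
    decoder = json.JSONDecoder()
    buffer = b""
    with open(path, "rb") as handle:
        while len(buffer) < max_bytes:
            chunk = handle.read(step)
            if not chunk:
                break  # reached the end of the file
            buffer += chunk
            # Ignore a multi-byte character that was cut at the chunk boundary.
            text = buffer.decode("utf-8", errors="ignore").lstrip()
            try:
                record, _ = decoder.raw_decode(text)
            except json.JSONDecodeError:
                continue  # no complete record yet, so read another increment
            return record
    return None


# Example call (the file name is a placeholder):
# record = first_record_from_prefix("sample.json")
# if record is not None:
#     print(sorted(record))
```

The sketch stops at the first record that parses, which mirrors the idea that a single valid record within the sampled prefix is enough to infer a schema.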

For CSV files, the crawler reads either the first 1,000 records or the first 1 MB of data, whichever comes first. For Parquet files, the crawler infers the schema directly from the file. The crawler then compares the schemas that it inferred from all the subfolders and files, and creates one or more tables.

When a crawler creates a table, the crawler checks the following factors:

  • Whether the data has the same format, compression type, and include path
  • How similar the schemas are, based on the partition threshold and the number of different schemas

For a crawler to consider schemas similar, the following conditions must be true:

  • The partition threshold is higher than 0.7 (70%).
  • The maximum number of different schemas, also referred to as "clusters" in this context, doesn't exceed five.

The crawler infers the schema at the folder level and compares the schemas across all folders. If the compared schemas match with a partition threshold that's higher than 70%, then the crawler treats the folders as partitions of a single table. If they don't match, then the crawler creates a separate table for each folder, which results in a higher number of tables.

Example scenarios

Example 1

In the following example, the folder DOC-EXAMPLE-FOLDER1 has 10 files, eight files with schema SCH_A and two files with SCH_B.

The files are similar to the following examples:

SCH_A:

{"id": 1, "first_name": "John", "last_name": "Doe"}
{"id": 2, "first_name": "Li", "last_name": "Juan"}

SCH_B:

{"city": "Dublin", "country": "Ireland"}
{"city": "Paris", "country": "France"}

When the crawler crawls the Amazon Simple Storage Service (Amazon S3) path s3://DOC-EXAMPLE-FOLDER1, the crawler creates one table. The table comprises the columns of both schema SCH_A and schema SCH_B. This is because 80% of the files in the path belong to the SCH_A schema and 20% of the files belong to the SCH_B schema, so the schemas meet the partition threshold of higher than 70%. Also, the number of different schemas (two) doesn't exceed the limit of five clusters.
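To picture what "columns of both schemas" means for Example 1, the following sketch collects the keys of one sample record per schema. The helper `merged_columns` is hypothetical, and the column order shown is only a property of this sketch, not a statement about how the crawler orders or types columns.

```python
import json

# One sample record from each schema in Example 1.
sch_a = '{"id": 1, "first_name": "John", "last_name": "Doe"}'
sch_b = '{"city": "Dublin", "country": "Ireland"}'


def merged_columns(sample_records):
    """Return the union of top-level keys, in first-seen order."""
    columns = []
    for raw in sample_records:
        for key in json.loads(raw):
            if key not in columns:
                columns.append(key)
    return columns


print(merged_columns([sch_a, sch_b]))
# ['id', 'first_name', 'last_name', 'city', 'country']
```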

Example 2

In the following example, the folder DOC-EXAMPLE-FOLDER2 has 10 files, seven files with the schema SCH_A and three files with the schema SCH_B.

When the crawler crawls the Amazon S3 path s3://DOC-EXAMPLE-FOLDER2, the crawler creates one table for each file. This is because 70% of the files belong to the schema SCH_A and 30% of the files belong to the schema SCH_B. The partition threshold must be higher than 70%, and exactly 70% doesn't qualify, so the schemas don't meet the partition threshold.
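The arithmetic behind both examples can be checked with a short sketch. It assumes a simplified reading of the rule stated earlier: the share of files with the most common schema must be strictly higher than 0.7, and there must be no more than five different schemas. The function name `schemas_look_similar` is hypothetical, and the crawler's internal comparison may differ in detail.

```python
from collections import Counter

PARTITION_THRESHOLD = 0.7  # from the text: must be strictly higher than 70%
MAX_SCHEMA_CLUSTERS = 5    # from the text: at most five different schemas


def schemas_look_similar(file_schemas):
    """Apply the simplified threshold rule to a list of per-file schema labels."""
    counts = Counter(file_schemas)
    dominant_share = max(counts.values()) / len(file_schemas)
    return (
        dominant_share > PARTITION_THRESHOLD
        and len(counts) <= MAX_SCHEMA_CLUSTERS
    )


folder1 = ["SCH_A"] * 8 + ["SCH_B"] * 2  # Example 1: 80% / 20%
folder2 = ["SCH_A"] * 7 + ["SCH_B"] * 3  # Example 2: 70% / 30%

print(schemas_look_similar(folder1))  # True: 0.8 is higher than 0.7
print(schemas_look_similar(folder2))  # False: 0.7 is not higher than 0.7
```

The strict comparison is the point of Example 2: a 70% share sits exactly on the threshold and so doesn't pass it.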

Note: To get information about the tables, check the crawler logs in Amazon CloudWatch.
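As one way to read those logs programmatically, the following sketch uses the CloudWatch Logs `filter_log_events` call through boto3. It assumes that the crawler writes to the `/aws-glue/crawlers` log group with a log stream named after the crawler. Confirm both in your account. The function name and the crawler name `my-crawler` are placeholders.

```python
CRAWLER_LOG_GROUP = "/aws-glue/crawlers"  # assumed default log group


def fetch_crawler_log_messages(logs_client, crawler_name, limit=50):
    """Return up to `limit` log messages for the given crawler.

    `logs_client` is expected to be a CloudWatch Logs client, for example
    the object returned by boto3.client("logs").
    """
    response = logs_client.filter_log_events(
        logGroupName=CRAWLER_LOG_GROUP,
        logStreamNamePrefix=crawler_name,
        limit=limit,
    )
    return [event["message"] for event in response.get("events", [])]


# Example call (requires AWS credentials and the boto3 package):
# import boto3
# for message in fetch_crawler_log_messages(boto3.client("logs"), "my-crawler"):
#     print(message)
```

Passing the client in as an argument keeps the helper easy to try against a stub before you run it against a real account.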

Crawler options

When you customize your crawler behavior, you can choose one of the following options:

Related information

Using crawlers to populate the Data Catalog

Customizing crawler behavior

Defining and managing classifiers

AWS OFFICIAL
Updated 4 months ago