There are four common reasons why a crawler creates separate tables:
The source files are not all the same file format (CSV, Parquet, or JSON). Check whether the source files in your folder are all of the same type (see the sketch after this list).
The source files use different compression types (Snappy, gzip, bzip2). Make sure the compression type is the same across all of your source files.
The source files do not share the same schema. As mentioned in the earlier answer, for the crawler to detect a single schema across both folders, a 70% threshold must be met; that is, the schemas of the two sources must be at least 70% similar.
The Amazon S3 partition structure differs between the two data sets. *I do not think this is the issue in your case, since your S3 partition structures are almost the same, but please do check it.
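To rule out the first two reasons quickly, a minimal boto3 sketch like the one below can tally the file extensions under your crawler's S3 prefix, so mixed formats or compression suffixes (for example `.csv` next to `.parquet` or `.csv.gz`) stand out immediately. The bucket and prefix names here are placeholders, not values from your setup:

```python
import boto3
from collections import Counter

# Placeholders -- replace with your crawler's actual data store path.
BUCKET = "my-data-bucket"
PREFIX = "sales/"

s3 = boto3.client("s3")
extensions = Counter()

# Walk every object under the prefix and tally file extensions.
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=BUCKET, Prefix=PREFIX):
    for obj in page.get("Contents", []):
        key = obj["Key"]
        if key.endswith("/"):  # skip folder placeholder objects
            continue
        # Keep the last two dotted parts to catch compound suffixes like .csv.gz
        parts = key.rsplit("/", 1)[-1].split(".")
        extensions[".".join(parts[-2:]) if len(parts) > 2 else parts[-1]] += 1

print(extensions)  # e.g. Counter({'parquet': 120, 'csv.gz': 3})
```

If the counter shows more than one format or compression suffix, that alone can explain the extra tables.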
You can find the exact reason your crawler created multiple tables by checking its logs: log in to the console, select your crawler, and choose Logs to view them.
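You can also pull the same logs programmatically. Crawler logs land in the `/aws-glue/crawlers` CloudWatch log group, in a stream named after the crawler; the crawler name below is a placeholder:

```python
import boto3

logs = boto3.client("logs")

# Fetch crawler log events mentioning "table" to see which tables were
# created and why. "my-crawler" is a placeholder for your crawler's name.
resp = logs.filter_log_events(
    logGroupName="/aws-glue/crawlers",
    logStreamNames=["my-crawler"],
    filterPattern="table",
)
for event in resp["events"]:
    print(event["message"])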
For more details, please follow this article.
For multiple schemas to be merged into one, they need to be similar enough to meet a threshold. See the examples in the reference link below, which discuss a partition threshold higher than 70%. In your case, I am assuming there are only two schemas, one for each version you have.
The crawler infers a schema at the folder level and compares the schemas across all folders. If the compared schemas match, that is, if the partition threshold is higher than 70%, the folders are treated as partitions of a single table. If they don't match, the crawler creates a table for each folder, resulting in a larger number of tables.
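To make the threshold concrete, here is a toy Python sketch of a column-overlap comparison. This is not Glue's actual internal algorithm (which is not publicly specified and also considers more than column names); it only illustrates why two folders whose schemas overlap less than 70% end up as separate tables:

```python
# Toy illustration of a 70% schema-similarity check -- NOT Glue's real
# comparison, just the intuition behind the threshold.
def similarity(cols_a: set, cols_b: set) -> float:
    """Fraction of columns shared between two inferred folder schemas."""
    return len(cols_a & cols_b) / len(cols_a | cols_b)

folder_v1 = {"id", "name", "price", "created_at"}
folder_v2 = {"id", "name", "price", "updated_at"}

score = similarity(folder_v1, folder_v2)
print(f"{score:.0%}")  # 60%
print("one table" if score >= 0.70 else "separate tables")  # separate tables
```

In this example the two versions share three of five distinct columns (60%), so under a 70% rule they would become two tables rather than two partitions of one.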