Set the correct Table level, Include path and Exclude path

Hello all,

I have an S3 bucket with the following path: s3://a/b/c/products

Inside the products folder I have one folder for each version (each version is a database snapshot of the products table, obtained on a weekly basis by a workflow).

  1. /version_0
    1. _temporary
      1. 0_$folder$
    2. part-00000-c5... ...c000.snappy.parquet
  2. /version_1
    1. _temporary
      1. 0_$folder$
    2. part-00000-29... ...c000.snappy.parquet

I have created a crawler (the Include Path is set to the path mentioned above, s3://a/b/c/products) with the intention of merging all the versions into one table. The schema and the structure of the different partitions are always the same. I have tried different Table Levels (4, 5 and 6) in the "Grouping Behaviour for S3 Data" section of the Crawler Settings, but it always creates multiple tables (one table for each version).

The _temporary folder seems to be generated automatically by the workflow. I don't know whether I have to add it to the exclude path for this to work.

What should be the correct Include path, exclude path and table levels in order for me to create only ONE table merging all versions together?

I have checked all your general documentation links about this issue but could you please provide an actual solution for this issue?

1 Answer
Accepted Answer

The exclude pattern is what helps here: try using version*/_temporary** as the exclude pattern.

This would exclude everything under the _temporary folders, so that only the Parquet files are crawled.
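To see which objects such a pattern would cover, here is a minimal sketch that approximates the glob rules as I understand them from the Glue documentation: `*` matches within one folder level, `**` also crosses folder boundaries, and the pattern is evaluated relative to the include path. The helper names `glob_to_regex` and `is_excluded` and the Parquet file name are hypothetical and only for illustration. This is not the crawler's actual matching code, and it handles only `*` and `**`.

```python
import re


def glob_to_regex(pattern: str) -> re.Pattern:
    """Translate a glob using only '*' and '**' into a regular expression.

    Assumption: '*' stays inside one folder level, '**' may cross '/'.
    """
    parts = []
    i = 0
    while i < len(pattern):
        if pattern.startswith("**", i):
            parts.append(".*")        # any characters, including '/'
            i += 2
        elif pattern[i] == "*":
            parts.append("[^/]*")     # any characters except '/'
            i += 1
        else:
            parts.append(re.escape(pattern[i]))
            i += 1
    return re.compile("".join(parts))


def is_excluded(relative_key: str, pattern: str) -> bool:
    """Return True if the key (relative to the include path) matches the pattern."""
    return glob_to_regex(pattern).fullmatch(relative_key) is not None


if __name__ == "__main__":
    pattern = "version*/_temporary**"
    keys = [
        "version_0/_temporary/0_$folder$",
        "version_0/part-00000-example.snappy.parquet",  # hypothetical file name
    ]
    for key in keys:
        print(key, "excluded" if is_excluded(key, pattern) else "kept")
```

Under these assumptions, the `_temporary` entries are excluded and the Parquet file directly under each version folder is kept.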

For the include path, use s3://a/b/c/products/

You would not need to provide a table level in this case.

Check the "Create a single schema for each S3 path" option.

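This would create one table with the version_* folders as partitions. Because the folder names are not in key=value form, the crawler names the partition column partition_0.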
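If you prefer to set this up through the API instead of the console, the settings above could be sketched with boto3 roughly as follows. The crawler name, role ARN and database name are placeholders I made up. To my knowledge, the `CombineCompatibleSchemas` grouping policy is the API counterpart of the single-schema checkbox. Treat this as a starting point under those assumptions, not a verified configuration for your account.

```python
import json


def build_crawler_request(name: str, role_arn: str, database: str) -> dict:
    """Build keyword arguments for the Glue create_crawler call."""
    return {
        "Name": name,
        "Role": role_arn,
        "DatabaseName": database,
        "Targets": {
            "S3Targets": [
                {
                    # Include path from the question
                    "Path": "s3://a/b/c/products/",
                    # Exclude pattern, relative to the include path
                    "Exclusions": ["version*/_temporary**"],
                }
            ]
        },
        # Assumed API counterpart of the single-schema console option
        "Configuration": json.dumps(
            {
                "Version": 1.0,
                "Grouping": {"TableGroupingPolicy": "CombineCompatibleSchemas"},
            }
        ),
    }


def create_crawler(request: dict) -> None:
    """Send the request to Glue (needs boto3 and AWS credentials)."""
    import boto3  # third-party; imported here so the builder runs without it

    glue = boto3.client("glue")
    glue.create_crawler(**request)


if __name__ == "__main__":
    # All three values below are hypothetical placeholders.
    request = build_crawler_request(
        name="products-crawler",
        role_arn="arn:aws:iam::123456789012:role/ExampleGlueRole",
        database="example_db",
    )
    print(json.dumps(request, indent=2))
    # create_crawler(request)  # uncomment to actually create the crawler
```

No table level is set in the configuration, in line with the advice above.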

Reference: https://docs.aws.amazon.com/glue/latest/dg/crawler-s3-folder-table-partition.html

Answered 2 years ago by AWS
Reviewed 2 years ago by AWS EXPERT