Setting the correct Table Level, Include Path and Exclude Path

Hello all,

I have an S3 bucket with the following path: s3://a/b/c/products

Inside the products folder I have one folder for each version (each version is a database snapshot of the products table, obtained on a weekly basis by a workflow).

  1. /version_0
    1. _temporary
      1. 0_$folder$
    2. part-00000-c5... ...c000.snappy.parquet
  2. /version_1
    1. _temporary
      1. 0_$folder$
    2. part-00000-29... ...c000.snappy.parquet

I have created a crawler (the Include Path is set to the path mentioned above, s3://a/b/c/products) with the intention of merging all the versions into one table. The schemas and the structure of the different partitions are always the same. I have tried different Table Levels (4, 5 and 6) in the "Grouping Behaviour for S3 Data" section of the Crawler Settings, but it always creates multiple tables (one table per version).

The _temporary folder seems to be generated automatically by the workflow. I don't know whether I have to add it to the exclude path for this to work.

What are the correct Include Path, Exclude Path and Table Level for creating only ONE table that merges all versions?

I have checked the general documentation links about this issue, but could you please provide an actual solution?

1 Answer
Accepted Answer

The exclude pattern is the key here: try using version*/_temporary** as the exclude pattern. Exclude patterns are evaluated relative to the include path.

This excludes the unwanted _temporary content, so that only the Parquet files are crawled.
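To see why that pattern catches the _temporary content but not the data files, here is a minimal sketch that approximates only two glob rules: `*` stays within one folder level and `**` crosses folder boundaries. It is a simplified stand-in, not the crawler's actual matcher. The helper names and the file name part-00000.snappy.parquet are hypothetical, because the real file names are truncated in the question.

```python
import re


def glob_to_regex(pattern: str) -> "re.Pattern[str]":
    """Translate a simplified glob into a regex.

    Only two wildcards are handled (an assumption of this sketch):
      **  matches any characters, including '/'
      *   matches any characters except '/'
    """
    parts = []
    i = 0
    while i < len(pattern):
        if pattern.startswith("**", i):
            parts.append(".*")
            i += 2
        elif pattern[i] == "*":
            parts.append("[^/]*")
            i += 1
        else:
            parts.append(re.escape(pattern[i]))
            i += 1
    return re.compile("".join(parts))


def is_excluded(relative_key: str, pattern: str) -> bool:
    """Return True if a key (relative to the include path) matches the pattern."""
    return glob_to_regex(pattern).fullmatch(relative_key) is not None


EXCLUDE = "version*/_temporary**"

# Keys are written relative to s3://a/b/c/products/.
# The parquet file name is a hypothetical placeholder.
sample_keys = [
    "version_0/_temporary/0_$folder$",
    "version_0/part-00000.snappy.parquet",
    "version_1/_temporary/0_$folder$",
    "version_1/part-00000.snappy.parquet",
]

kept = [key for key in sample_keys if not is_excluded(key, EXCLUDE)]
```

With these sample keys, `kept` holds only the two parquet entries.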

For the include path, use s3://a/b/c/products/

You do not need to provide a table level in this case.

Then check "Create a single schema for each S3 path" in the crawler's grouping options.

This creates one table with the version_* folders as partitions.

Reference: https://docs.aws.amazon.com/glue/latest/dg/crawler-s3-folder-table-partition.html
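For anyone who prefers to script these settings instead of using the console, the following is a rough sketch of the equivalent request for boto3's Glue `create_crawler` call. The crawler name, role ARN and database name are hypothetical placeholders. The `CombineCompatibleSchemas` grouping policy is assumed here to be the API counterpart of the "single schema for each S3 path" checkbox, and no table level is set, in line with the answer above.

```python
import json


def build_crawler_request(name: str, role_arn: str, database: str) -> dict:
    """Build keyword arguments for the Glue create_crawler API call.

    name, role_arn and database are placeholders supplied by the caller.
    """
    return {
        "Name": name,
        "Role": role_arn,
        "DatabaseName": database,
        "Targets": {
            "S3Targets": [
                {
                    # Include path from the answer.
                    "Path": "s3://a/b/c/products/",
                    # Exclude pattern, relative to the include path.
                    "Exclusions": ["version*/_temporary**"],
                }
            ]
        },
        # Assumed equivalent of the "single schema for each S3 path" option.
        "Configuration": json.dumps(
            {
                "Version": 1.0,
                "Grouping": {"TableGroupingPolicy": "CombineCompatibleSchemas"},
            }
        ),
    }


def create_crawler(request: dict) -> None:
    """Send the request to AWS Glue (needs boto3 and valid credentials)."""
    import boto3  # imported lazily so building the request needs no AWS setup

    glue = boto3.client("glue")
    glue.create_crawler(**request)


# Hypothetical values for illustration only.
request = build_crawler_request(
    name="products-crawler",
    role_arn="arn:aws:iam::123456789012:role/GlueCrawlerRole",
    database="products_db",
)
```

Calling `create_crawler(request)` would then register the crawler. Building the request separately from sending it makes it easy to inspect the settings first.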

Answered 2 years ago by AWS. Reviewed 2 years ago by an AWS Expert.
