Set the correct table level, include path and exclude path


Hello all,

I have an S3 bucket with the following path: s3://a/b/c/products

Inside the products folder I have one folder for each version (each version is a database snapshot of the products table, obtained on a weekly basis by a workflow).

  1. /version_0
    1. _temporary
      1. 0_$folder$
    2. part-00000-c5... ...c000.snappy.parquet
  2. /version_1
    1. _temporary
      1. 0_$folder$
    2. part-00000-29... ...c000.snappy.parquet

I have created a crawler (the include path is set to the path mentioned above, s3://a/b/c/products) with the intention of merging all the versions into one table. The schema and the structure of the different partitions are always the same. I have tried different table levels (4, 5 and 6) in the "Grouping Behaviour for S3 Data" section of the crawler settings, but it always creates multiple tables (one table per version).

The _temporary folder seems to be generated automatically by the workflow. I don't know whether I have to add it to the exclude path for this to work.

What are the correct include path, exclude path and table level to create only ONE table that merges all versions together?

I have checked all your general documentation links about this issue but could you please provide an actual solution for this issue?

1 Answer

Accepted Answer

The exclude pattern could be especially helpful here: try using version*/_temporary** as the exclude pattern.

This would exclude everything under the _temporary folders, so that only the parquet files are crawled.
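To see which objects that pattern would skip, here is a minimal Python sketch that approximates the glob rules described in the Glue documentation (`*` stays within one folder level, `**` crosses folder boundaries). It is only an illustration, not the matcher Glue itself uses. The function names and the parquet file name in the usage example are hypothetical, and the keys are written relative to the include path.

```python
import re


def glue_glob_to_regex(pattern: str) -> re.Pattern:
    """Approximate a Glue exclude glob as a regex (simplified: only *, ** and ?)."""
    parts = []
    i = 0
    while i < len(pattern):
        if pattern.startswith("**", i):
            parts.append(".*")      # ** may cross folder boundaries
            i += 2
        elif pattern[i] == "*":
            parts.append("[^/]*")   # * stays inside one path component
            i += 1
        elif pattern[i] == "?":
            parts.append("[^/]")    # ? is exactly one character of a component
            i += 1
        else:
            parts.append(re.escape(pattern[i]))
            i += 1
    return re.compile("".join(parts))


def is_excluded(relative_key: str, pattern: str = "version*/_temporary**") -> bool:
    """Return True if the key (relative to the include path) matches the pattern."""
    return glue_glob_to_regex(pattern).fullmatch(relative_key) is not None
```

For example, `is_excluded("version_0/_temporary/0_$folder$")` is true, while a data file such as `is_excluded("version_0/part-00000-example.snappy.parquet")` is false, so the parquet files stay visible to the crawler.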

For the include path, use s3://a/b/c/products/

You would not need to provide a table level in this case.

Check "Create a single schema for each S3 path".

This would create one table, with each "version*" folder as a partition.
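If you create the crawler from code rather than the console, the same settings could be sketched with boto3 roughly as follows. This is a sketch under assumptions: the crawler name, IAM role ARN and database name are placeholders, and the `CombineCompatibleSchemas` grouping policy is, as far as I know, the API counterpart of the "Create a single schema for each S3 path" checkbox.

```python
import json


def build_crawler_request(name: str, role_arn: str, database: str) -> dict:
    """Build keyword arguments for the Glue create_crawler call."""
    return {
        "Name": name,              # placeholder crawler name
        "Role": role_arn,          # placeholder IAM role for the crawler
        "DatabaseName": database,  # placeholder Glue database
        "Targets": {
            "S3Targets": [
                {
                    "Path": "s3://a/b/c/products/",
                    # Exclude patterns are relative to the include path
                    "Exclusions": ["version*/_temporary**"],
                }
            ]
        },
        # Single schema for each S3 path; no table level is set
        "Configuration": json.dumps(
            {
                "Version": 1.0,
                "Grouping": {"TableGroupingPolicy": "CombineCompatibleSchemas"},
            }
        ),
    }


def create_crawler(request: dict) -> None:
    """Send the request to Glue (needs boto3 and AWS credentials)."""
    import boto3  # imported here so the request can be built without boto3

    boto3.client("glue").create_crawler(**request)
```

You would call `create_crawler(build_crawler_request(...))` with your own crawler name, role ARN and database, then run the crawler and check that a single table with the version folders as partitions appears.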

Reference: https://docs.aws.amazon.com/glue/latest/dg/crawler-s3-folder-table-partition.html

Answered 2 years ago by AWS
Reviewed 2 years ago by an AWS Expert
