Set the correct table level, include path, and exclude path


Hello all,

I have an S3 bucket with the following path: s3://a/b/c/products

Inside the products folder I have one folder for each version (each version is a database snapshot of the products table, obtained on a weekly basis by a workflow).

  1. /version_0
    1. _temporary
      1. 0_$folder$
    2. part-00000-c5... ...c000.snappy.parquet
  2. /version_1
    1. _temporary
      1. 0_$folder$
    2. part-00000-29... ...c000.snappy.parquet

I have created a crawler (the Include Path is set to the path mentioned above, s3://a/b/c/products) with the intention of merging all the versions into one table. The schema and the structure of the different partitions are always the same. I have tried different Table Levels (4, 5 and 6) in the "Grouping Behaviour for S3 Data" section of the Crawler Settings, but the crawler always creates multiple tables (one table per version).

The _temporary folder seems to be generated automatically by the workflow. I don't know whether I have to add it to the exclude path for this to work.

What are the correct include path, exclude path and table level to create only ONE table that merges all versions together?

I have checked all the general documentation links about this issue, but could you please provide an actual solution?

1 Answer

Accepted Answer

An exclude pattern should help here: try version*/_temporary** as the exclude pattern. Exclude patterns are evaluated relative to the include path.

This excludes everything under each version's _temporary folder, so the crawler sees only the Parquet files.
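To sanity-check the exclude pattern before running the crawler, the matching can be approximated locally. The sketch below is not Glue's own matcher: it assumes the documented glob rules (`*` stays within one folder level, `**` crosses folder boundaries) and translates the pattern to a regular expression. The helper names and the Parquet file names are hypothetical; the real file names are elided in the question.

```python
import re


def glob_to_regex(pattern):
    """Translate a Glue-style glob into a regex (approximation, assumed rules)."""
    parts = []
    i = 0
    while i < len(pattern):
        if pattern.startswith("**", i):
            parts.append(".*")        # '**' may cross '/' boundaries
            i += 2
        elif pattern[i] == "*":
            parts.append("[^/]*")     # '*' stays within one path segment
            i += 1
        elif pattern[i] == "?":
            parts.append("[^/]")      # '?' matches exactly one non-'/' character
            i += 1
        else:
            parts.append(re.escape(pattern[i]))
            i += 1
    return re.compile("".join(parts))


def is_excluded(relative_key, pattern="version*/_temporary**"):
    """Return True if the key (relative to the include path) matches the pattern."""
    return glob_to_regex(pattern).fullmatch(relative_key) is not None


# Keys relative to s3://a/b/c/products/ ; the Parquet names are placeholders.
keys = [
    "version_0/_temporary/0_$folder$",
    "version_0/part-00000-example.snappy.parquet",
    "version_1/_temporary/0_$folder$",
    "version_1/part-00000-example.snappy.parquet",
]
kept = [k for k in keys if not is_excluded(k)]
```

Under these assumptions, `kept` contains only the two Parquet keys, which is the behaviour the exclude pattern is meant to produce.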

For the include path, use s3://a/b/c/products/

You do not need to provide a table level in this case.

Select the "Create a single schema for each S3 path" option.

This creates one table, with each version* folder as a partition.
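For anyone configuring the crawler through the API rather than the console, the settings above could be expressed roughly as follows. This is a sketch, not a tested deployment: the crawler name, role ARN, and database name are placeholders, and it assumes that the "single schema for each S3 path" option corresponds to the `CombineCompatibleSchemas` table grouping policy in the crawler's `Configuration` JSON.

```python
import json


def build_crawler_request(name, role_arn, database):
    """Assemble keyword arguments for the Glue create_crawler API call."""
    return {
        "Name": name,
        "Role": role_arn,
        "DatabaseName": database,
        "Targets": {
            "S3Targets": [
                {
                    # Include path from the question.
                    "Path": "s3://a/b/c/products/",
                    # Exclude pattern, relative to the include path.
                    "Exclusions": ["version*/_temporary**"],
                }
            ]
        },
        # Assumed equivalent of "Create a single schema for each S3 path".
        "Configuration": json.dumps(
            {
                "Version": 1.0,
                "Grouping": {"TableGroupingPolicy": "CombineCompatibleSchemas"},
            }
        ),
    }


# Placeholder values; replace with your own crawler name, IAM role and database.
request = build_crawler_request(
    "products-crawler",
    "arn:aws:iam::123456789012:role/GlueCrawlerRole",
    "products_db",
)

# To actually create the crawler (requires boto3 and AWS credentials):
#   import boto3
#   boto3.client("glue").create_crawler(**request)
```

The request is built as a plain dictionary so it can be inspected before anything is sent to AWS.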

Reference: https://docs.aws.amazon.com/glue/latest/dg/crawler-s3-folder-table-partition.html

AWS
answered 2 years ago
AWS EXPERT
reviewed 2 years ago
