Hello,
I have an S3 bucket with the following path: "s3://a/b/c"
Inside the 'c' folder I have one folder per table, and inside each table folder I have one folder per version. Each version is a database snapshot taken weekly by a workflow. To clarify, the structure inside 'c' is like this:
- products
  - /version_0
    - _temporary
      - 0_$folder$
    - part-00000-c5... ...c000.snappy.parquet
  - /version_1
    - _temporary
      - 0_$folder$
    - part-00000-c5... ...c000.snappy.parquet
- locations
  - /version_0
    - _temporary
      - 0_$folder$
    - part-00000-c5... ...c000.snappy.parquet
  - /version_1
    - _temporary
      - 0_$folder$
    - part-00000-c5... ...c000.snappy.parquet
I have created a crawler (with the Include Path set to the path mentioned above, "s3://a/b/c") with the intention of merging all versions into one table per table folder (products, locations). The schema and the structure of the different partitions are always the same.
The _temporary folder is generated automatically by the workflow.
What is the correct Exclude pattern to ignore everything in the _temporary folders, and do I also need to set a Table Level, so that the crawler creates only ONE table per table folder (products, locations) with all versions merged?
In summary, I should end up with 2 tables:
- products (containing version_0 and version_1 rows)
- locations (containing version_0 and version_1 rows)
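To make the goal concrete, here is a minimal sketch of the crawler request I have in mind, written as the arguments for boto3's Glue `create_crawler` call. The crawler name, role and database name are hypothetical placeholders. The exclude pattern `**/_temporary/**` is only a candidate I have not verified against a real crawler, and the table level of 4 assumes the bucket counts as level 1 (so a=1, b=2, c=3, products/locations=4), which is my reading of the Glue documentation and may be wrong.

```python
import json


def build_crawler_request(include_path, exclusions, table_level):
    """Build keyword arguments for the Glue create_crawler call."""
    # Crawler configuration is passed as a JSON string; the Grouping section
    # asks the crawler to create tables at a fixed folder depth.
    configuration = {
        "Version": 1.0,
        "Grouping": {"TableLevelConfiguration": table_level},
    }
    return {
        "Name": "snapshot-crawler",    # hypothetical crawler name
        "Role": "MyGlueCrawlerRole",   # hypothetical IAM role
        "DatabaseName": "snapshots",   # hypothetical catalog database
        "Targets": {
            "S3Targets": [
                {"Path": include_path, "Exclusions": list(exclusions)}
            ]
        },
        "Configuration": json.dumps(configuration),
    }


# Candidate (unverified) exclude pattern and assumed table level.
request = build_crawler_request("s3://a/b/c", ["**/_temporary/**"], 4)

# With boto3 installed and credentials configured, this could be submitted as:
#   import boto3
#   boto3.client("glue").create_crawler(**request)
```

The request is built as a plain dictionary so it can be inspected before anything is sent to AWS.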
I have no way of testing the exclude patterns. Is there a sandbox where I can test the glob exclude patterns? I found one online, but it doesn't seem to behave like the one AWS uses. I have tried the following exclude patterns, but neither worked (the crawler still created a table for each table and each version):
- version*/_temporary**
- /**/version*/_temporary**
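In the absence of an official sandbox, the patterns above can at least be checked locally with a rough approximation. The sketch below is not AWS's matcher. It assumes that exclude patterns are evaluated against the object path relative to the Include Path, that `*` and `?` do not cross a `/`, and that `**` does. Brackets and braces are not handled, and the parquet file name is a hypothetical stand-in for the real (truncated) one.

```python
import re
from typing import Iterable


def glob_to_regex(pattern: str) -> str:
    """Translate a simple glob into a regex string.

    Assumed rules: `**` matches across `/`, while `*` and `?` stay
    within one path segment. Everything else is matched literally.
    """
    out = []
    i = 0
    while i < len(pattern):
        if pattern.startswith("**", i):
            out.append(".*")
            i += 2
        elif pattern[i] == "*":
            out.append("[^/]*")
            i += 1
        elif pattern[i] == "?":
            out.append("[^/]")
            i += 1
        else:
            out.append(re.escape(pattern[i]))
            i += 1
    return "".join(out)


def is_excluded(relative_key: str, patterns: Iterable[str]) -> bool:
    """Return True if any pattern matches the whole relative key."""
    return any(re.fullmatch(glob_to_regex(p), relative_key) for p in patterns)


# Keys relative to the include path s3://a/b/c (assumption).
temp_key = "products/version_0/_temporary/0_$folder$"
data_key = "products/version_0/part-00000.snappy.parquet"  # hypothetical name

for pattern in ["version*/_temporary**",
                "/**/version*/_temporary**",
                "**/_temporary/**"]:
    print(pattern, is_excluded(temp_key, [pattern]), is_excluded(data_key, [pattern]))
```

Under these assumptions, neither of the two patterns I tried matches the `_temporary` key: the first does not account for the leading `products/` segment, and the second expects a leading `/`. A pattern such as `**/_temporary/**` matches it while leaving the parquet file alone. Whether the real crawler behaves the same way is exactly what I cannot confirm.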