
Correct way to generate tables with Crawler?


I have CSV files stored in S3, named {region name}{today's date}.csv. There are multiple regions, and the files are saved under a 'log/year/month/date' directory, so that directory contains 5 CSV files if there are 5 regions. I want the crawler to create one table per region, combining all the CSV files for the same region under '/log'.

When I tested with one type of CSV file, the crawler worked fine. However, when I saved multiple CSV files in the same directory, instead of creating one table per region, it created a table for each individual CSV file (like a europe_2024_06_06_csv table), and all the tables were empty. Is there a way to make the crawler create one table per region containing all the data under '/log'?

Edit: I have moved each file into its own directory, so the layout is now 'log/year/month/date/{region_name}/{region name}{today's date}.csv'. The tables are no longer empty, but the crawler still creates one table per CSV file, like this:

north_america, north_america_1bd2406a237179ab84d7698b15f25742, north_america_43a55233fc9076cb292c61de02590a47, north_america_80892ad83736e81faf29db6157afca74, north_america_f71efd6146ad8d1cbf5a385ed99c925f

Instead of creating separate tables, I want the crawler to combine all of that data into a single table named north_america. Should I just create a separate directory for each region and assign one crawler per region? (I would like to use a single crawler if possible.) How can I fix this?

asked 2 years ago · 748 views
1 Answer
Accepted Answer

If you'd like the crawler to create one table per region, you should structure your S3 bucket, folders, and files in the following manner:

  • s3-bucket-name
    • north_america
      • north_america_date.csv
    • south_america
      • south_america_date.csv
    • region_3
      • region_3_date.csv

When you create a crawler, you can specify multiple data sources to crawl. Point the crawler at each top-level folder, not at the individual files. For example, one data source would be the "north_america" folder in the S3 bucket. Make sure "Crawl all sub-folders" is checked. If you also want to partition by year, month, and date, include those as folders in S3 as well and place the file underneath.
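As an illustration, the layout above can be generated programmatically when uploading. This is a minimal sketch in Python; the bucket name, region names, and date format are placeholders inferred from the question, not confirmed values.

```python
from datetime import date

def region_key(region: str, day: date) -> str:
    """Build an S3 key of the form <region>/<region>_<YYYY_MM_DD>.csv,
    so each region's files live under their own top-level folder."""
    stamp = day.strftime("%Y_%m_%d")
    return f"{region}/{region}_{stamp}.csv"

regions = ["north_america", "south_america", "region_3"]

# One crawler data source per region folder (bucket name is a placeholder).
crawler_targets = [f"s3://s3-bucket-name/{r}" for r in regions]

print(region_key("north_america", date(2024, 6, 6)))
# north_america/north_america_2024_06_06.csv
```

Because every file for a region lands under the same top-level prefix, the crawler sees one folder per region and creates one table per folder.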

An example s3 path would look like this:

point crawler to: s3://bucket-name/north_america

The crawler will create a table named "north_america". If you added year, month, and day folders, the crawler will pick those up and add them as partitions as well.

Repeat these steps for each of the 5 regions, still using a single crawler.

Also, make sure "Create a single schema for each S3 path" is not checked; otherwise the crawler may try to create one combined table.

When you run the crawler after defining the 5 data sources, it should create the 5 tables for you, even though you only made one crawler.
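The same setup can also be defined through the API instead of the console. Below is a sketch of a `create_crawler` request payload with one S3 target per region, assuming boto3; the crawler name, IAM role ARN, database name, and bucket are all placeholders. The snippet only builds the payload; you would pass it to `boto3.client("glue").create_crawler(**payload)`.

```python
regions = ["north_america", "south_america", "region_3", "region_4", "region_5"]

# Placeholder names -- substitute your own crawler name, IAM role, database, and bucket.
payload = {
    "Name": "regional-logs-crawler",
    "Role": "arn:aws:iam::123456789012:role/GlueCrawlerRole",
    "DatabaseName": "logs_db",
    "Targets": {
        # One S3 target per region folder; the crawler creates one table per target.
        "S3Targets": [{"Path": f"s3://s3-bucket-name/{r}"} for r in regions]
    },
    # No "Grouping" configuration is set here, which is the API equivalent of
    # leaving "Create a single schema for each S3 path" unchecked.
}

print(len(payload["Targets"]["S3Targets"]))
# 5
```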

single schema information - https://docs.aws.amazon.com/glue/latest/dg/crawler-grouping-policy.html

defining crawler - https://docs.aws.amazon.com/glue/latest/dg/define-crawler.html

AWS
answered 2 years ago
