How to create separate Glue Catalog tables for different file types in a single S3 bucket

Hi,

I need to create separate Glue Catalog tables by crawling a single S3 bucket, one table per file type.

Suppose I have the S3 location s3://gurpreet-source-bucket/connect/reports/ and inside it there are three different types of files. I want to create three different Glue Catalog tables that I can query in Athena.

PowerBiAgentReportwithChannel-2023-07-17T18_00_00Z.csv
PowerBiAgentReportwithChannel-2023-07-18T18_00_00Z.csv
PowerBiAgentReportwithChannel-2023-07-19T18_00_00Z.csv
PowerBiAgentReportwithChannel-2023-07-20T18_00_00Z.csv

PowerBiQueueReportwithChannel-2023-07-17T18_00_00Z.csv
PowerBiQueueReportwithChannel-2023-07-18T18_00_00Z.csv
PowerBiQueueReportwithChannel-2023-07-19T18_00_00Z.csv
PowerBiQueueReportwithChannel-2023-07-20T18_00_00Z.csv

Occupancy-2023-07-17T17_00_00Z.txt
Occupancy-2023-07-18T17_00_00Z.txt
Occupancy-2023-07-19T17_00_00Z.txt
Occupancy-2023-07-20T17_00_00Z.txt

Right now, I have created a crawler with s3://gurpreet-source-bucket/connect/reports/ as its data source, but it creates a single table named "reports" in the Glue database, which mixes the columns of all three file types into one table. That is not what I want.

So my requirement is to split these files into separate tables (say "PowerBiAgentReport", "PowerBiQueueReport", and "Occupancy"), since the three file types have different structures/metadata.

Can anyone give me an idea of how to achieve this?

Thanks, Gurpreet

  • Did you find an answer to this?

asked 8 months ago · 550 views
2 Answers

Hello Gurpreet,

To create separate Glue Catalog tables for different types of files within a single S3 bucket, you can follow these steps:

  1. Create Separate Folders for Each Data Type:

    Organize your files within the S3 bucket into separate folders based on their data types. In your case, the file names already start with prefixes that indicate their type ("PowerBiAgentReport", "PowerBiQueueReport", and "Occupancy"), so you can move the files into one folder per prefix, like this:

    • s3://gurpreet-source-bucket/connect/reports/PowerBiAgentReport/
    • s3://gurpreet-source-bucket/connect/reports/PowerBiQueueReport/
    • s3://gurpreet-source-bucket/connect/reports/Occupancy/
  2. Create a Glue Crawler for Each Folder:

    Create a separate Glue Crawler for each of the folders you've created. Each crawler should be configured to point to one of the folders and generate a separate table in the Glue Data Catalog.

  3. Configure the Crawlers:

    When configuring the crawlers, you should specify the appropriate database, prefix, and options to ensure that each crawler creates a separate table with the desired schema. For example:

    • Crawler 1 (PowerBiAgentReport):

      • Data store: s3://gurpreet-source-bucket/connect/reports/PowerBiAgentReport/
      • Database: Glue database where you want to create the table (e.g., "MyDatabase")
      • Table name prefix: (empty or as needed)
      • Schema and table name: Choose options that match your files.
    • Crawler 2 (PowerBiQueueReport):

      • Data store: s3://gurpreet-source-bucket/connect/reports/PowerBiQueueReport/
      • Database: Same Glue database as above ("MyDatabase")
      • Table name prefix: (empty or as needed)
      • Schema and table name: Choose options that match your files.
    • Crawler 3 (Occupancy):

      • Data store: s3://gurpreet-source-bucket/connect/reports/Occupancy/
      • Database: Same Glue database as above ("MyDatabase")
      • Table name prefix: (empty or as needed)
      • Schema and table name: Choose options that match your files.
  4. Run the Crawlers:

    Execute each crawler to scan the respective folders and create separate Glue Catalog tables for each data type. You can do this from the AWS Glue Console or by using the AWS CLI or SDK.

  5. Verify the Tables:

    Once the crawlers have completed their runs, go to the Glue Data Catalog to verify that you have separate tables for each data type.
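
For step 1, here is a minimal Python sketch of how the files could be sorted into per-type folders. It assumes boto3 is installed and that you have write access to the bucket. The helper names (`target_key`, `copy_into_folders`) are made up for illustration, and it only copies the objects; deleting the originals is left out.

```python
from typing import Optional

BUCKET = "gurpreet-source-bucket"   # bucket name from the question
BASE_PREFIX = "connect/reports/"    # folder from the question

# File-name prefix -> folder name (folder names as suggested in step 1).
FOLDER_BY_FILE_PREFIX = {
    "PowerBiAgentReport": "PowerBiAgentReport",
    "PowerBiQueueReport": "PowerBiQueueReport",
    "Occupancy": "Occupancy",
}


def target_key(key: str) -> Optional[str]:
    """Return the per-type destination key, or None if the key is left alone."""
    if not key.startswith(BASE_PREFIX):
        return None
    name = key[len(BASE_PREFIX):]
    if not name or "/" in name:
        return None  # the folder marker itself, or already inside a sub-folder
    for file_prefix, folder in FOLDER_BY_FILE_PREFIX.items():
        if name.startswith(file_prefix):
            return f"{BASE_PREFIX}{folder}/{name}"
    return None  # unknown file type


def copy_into_folders(dry_run: bool = True) -> list:
    """List objects under BASE_PREFIX and copy each one into its folder.

    With dry_run=True nothing is copied; the planned (source, destination)
    pairs are only returned so they can be reviewed first.
    """
    import boto3  # imported here so target_key() can be used without boto3

    s3 = boto3.client("s3")
    planned = []
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=BUCKET, Prefix=BASE_PREFIX):
        for obj in page.get("Contents", []):
            dst = target_key(obj["Key"])
            if dst is None:
                continue
            planned.append((obj["Key"], dst))
            if not dry_run:
                s3.copy_object(
                    Bucket=BUCKET,
                    Key=dst,
                    CopySource={"Bucket": BUCKET, "Key": obj["Key"]},
                )
    return planned
```

Calling `copy_into_folders()` with the default `dry_run=True` only returns the planned pairs, which is a cheap way to check the mapping before any object is touched.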

By following these steps, you'll have separate Glue Catalog tables for the different types of files in your S3 bucket, and you can query them individually in Athena or other AWS services as needed.
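
If you prefer to script steps 2 to 4 instead of using the console, something like the following could work. Treat it as a sketch under assumptions: boto3 is installed, the role ARN is a placeholder for an IAM role that Glue can assume and that can read the bucket, and the crawler names and the `crawler_definitions` / `create_and_start` helpers are invented for this example.

```python
BASE_PATH = "s3://gurpreet-source-bucket/connect/reports/"
DATA_TYPES = ["PowerBiAgentReport", "PowerBiQueueReport", "Occupancy"]


def crawler_definitions(database: str, role_arn: str) -> list:
    """Build one create_crawler request per folder from step 1."""
    return [
        {
            "Name": f"{data_type}-crawler",  # hypothetical naming scheme
            "Role": role_arn,                # placeholder, see note above
            "DatabaseName": database,
            "Targets": {"S3Targets": [{"Path": f"{BASE_PATH}{data_type}/"}]},
        }
        for data_type in DATA_TYPES
    ]


def create_and_start(database: str, role_arn: str) -> None:
    """Create the three crawlers and start each one (steps 2 to 4)."""
    import boto3  # imported here so crawler_definitions() works without boto3

    glue = boto3.client("glue")
    for definition in crawler_definitions(database, role_arn):
        glue.create_crawler(**definition)
        glue.start_crawler(Name=definition["Name"])
```

For example, `create_and_start("mydatabase", "arn:aws:iam::123456789012:role/ExampleGlueRole")` would set up all three crawlers; both arguments are placeholders. Crawlers run asynchronously, so the tables only show up in the Data Catalog once each run has finished.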

Please give a thumbs up if my suggestion helps

answered 8 months ago
  • Hi Gabriel,

    Thanks for your reply to my question.

    In my case, the source bucket is controlled by another AWS account, and they won't be able to put the files into separate folders by file type.

    Is there any way to crawl these files (of different types) from a single folder in another account and create separate Glue Catalog tables in my account, one per file type?

    Thanks, Gurpreet

Hi Gabriel,

Thanks for your reply to my question.

In my case, the source bucket is controlled by another AWS account, and they won't be able to put the files into separate folders by file type.

Is there any way to crawl these files (of different types) from a single folder in another account and create separate Glue Catalog tables in my account, one per file type?

Thanks, Gurpreet

answered 8 months ago
