AWS Glue Crawler Issue


Hello, I have been experimenting with AWS Glue and created some crawlers to crawl my data, but the behavior wasn't what I expected.

Question 1) I had an S3 bucket with 3 folders. Two of the folders held data with the same schema, and the third held data with a different schema. I expected 2 tables, but the crawler created 3. Can anyone explain this behavior?

Question 2) I had 4 files under one folder: 2 of them shared one schema and the other 2 shared another. The crawler created 4 tables in the Data Catalog when I expected only 2. Please explain this as well.

Naman
asked 9 months ago, 283 views
1 Answer

Check this documentation out.

All the following conditions must be true for AWS Glue to create a partitioned table for an Amazon S3 folder:

  • The schemas of the files are similar, as determined by AWS Glue.
  • The data format of the files is the same.
  • The compression format of the files is the same.

That document gives the following example, which likely explains what's happening.

"...you might own an Amazon S3 bucket named my-app-bucket, where you store both iOS and Android app sales data. The data is partitioned by year, month, and day. The data files for iOS and Android sales have the same schema, data format, and compression format. In the AWS Glue Data Catalog, the AWS Glue crawler creates one table definition with partitioning keys for year, month, and day.

The following Amazon S3 listing of my-app-bucket shows some of the partitions. The = symbol is used to assign partition key values.


   my-app-bucket/Sales/year=2010/month=feb/day=1/iOS.csv
   my-app-bucket/Sales/year=2010/month=feb/day=1/Android.csv
   my-app-bucket/Sales/year=2010/month=feb/day=2/iOS.csv
   my-app-bucket/Sales/year=2010/month=feb/day=2/Android.csv
   ...
   my-app-bucket/Sales/year=2017/month=feb/day=4/iOS.csv
   my-app-bucket/Sales/year=2017/month=feb/day=4/Android.csv
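
To make the key=value convention in the listing concrete, here is a minimal Python sketch that pulls the partition keys out of an object key. The function name `parse_partitions` is made up for illustration, and this is not how the crawler itself is implemented; it only shows what the `=` segments in a path encode.

```python
def parse_partitions(key: str) -> dict:
    """Return the key=value path segments of an S3 object key, in path order."""
    partitions = {}
    # Skip the last segment, which is the file name rather than a folder.
    for segment in key.split("/")[:-1]:
        name, sep, value = segment.partition("=")
        # Only segments shaped like "name=value" are treated as partitions.
        if sep and name and value:
            partitions[name] = value
    return partitions


example = parse_partitions("my-app-bucket/Sales/year=2010/month=feb/day=1/iOS.csv")
print(example)  # {'year': '2010', 'month': 'feb', 'day': '1'}
```

A path without any `=` segments, such as my-app-bucket/data-store-db/emp-data/Employee1.csv, yields an empty dictionary here.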
   

It sounds like the conditions above are what determine how AWS Glue builds your table definitions. You can restructure the data so that it meets those conditions, or create the table manually in the AWS Glue Data Catalog.
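
One plausible reading of those conditions is that a folder becomes a single table only when its files are similar enough to each other, so a folder that mixes two schemas may fall back to one table per file. If so, a practical way to standardize the layout is to find which files share a header and move each group into its own folder. The sketch below does that grouping locally in Python. It assumes CSV files with a header row and uses exact header equality, which is a simplification and not the similarity measure AWS Glue uses; the column names and file contents are invented for illustration.

```python
import csv
import io
from collections import defaultdict


def header_of(csv_text):
    """Return the first row of a CSV document as a tuple of normalized column names."""
    first_row = next(csv.reader(io.StringIO(csv_text)), [])
    return tuple(col.strip().lower() for col in first_row)


def group_by_header(files):
    """Map each distinct header to the sorted list of file names that share it."""
    groups = defaultdict(list)
    for name, text in files.items():
        groups[header_of(text)].append(name)
    return {header: sorted(names) for header, names in groups.items()}


# Hypothetical file contents; only the file names come from the thread.
files = {
    "Employee1.csv": "id,name,dept\n1,Ann,HR\n",
    "Employee2.csv": "id,name,dept\n2,Bob,IT\n",
    "Sales1.csv": "order_id,amount\n10,99.5\n",
    "Sales2.csv": "order_id,amount\n11,12.0\n",
}

for header, names in group_by_header(files).items():
    print(header, "->", names)
```

Each resulting group is a candidate for its own folder, so that every folder the crawler visits holds files of one schema only.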

Here's a great re:Post article called "How can I prevent the AWS Glue Crawler from creating multiple tables?" that dives into the weeds on how AWS Glue decides the schema and what you can do to stop it from creating multiple tables from a single data source.
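
As a rough sketch of what such a setup could look like with boto3, the helper below builds the arguments for the Glue `create_crawler` call. It assumes the Employee and Sales files have been moved into separate `employees/` and `sales/` folders (hypothetical names), each listed as its own include path. The crawler name, role ARN, and database name are placeholders. The optional `CombineCompatibleSchemas` grouping policy is a documented crawler configuration setting, but note that it asks the crawler to ignore schema similarity within an include path, so it would merge Employee and Sales files if they stayed in one folder.

```python
import json


def build_crawler_request(name, role_arn, database, s3_paths, combine_compatible=False):
    """Build keyword arguments for the boto3 Glue client's create_crawler call."""
    request = {
        "Name": name,
        "Role": role_arn,
        "DatabaseName": database,
        # One S3 target per include path; each path is crawled separately.
        "Targets": {"S3Targets": [{"Path": path} for path in s3_paths]},
    }
    if combine_compatible:
        # Crawler configuration is passed as a JSON string.
        request["Configuration"] = json.dumps(
            {"Version": 1.0, "Grouping": {"TableGroupingPolicy": "CombineCompatibleSchemas"}}
        )
    return request


def create_crawler(request):
    """Send the request to AWS Glue (needs boto3 and valid credentials)."""
    import boto3  # imported lazily so building the request needs no AWS SDK

    boto3.client("glue").create_crawler(**request)


request = build_crawler_request(
    name="emp-sales-crawler",  # placeholder name
    role_arn="arn:aws:iam::123456789012:role/GlueCrawlerRole",  # placeholder ARN
    database="data_store_db",  # placeholder database
    s3_paths=[
        "s3://my-app-bucket/data-store-db/emp-data/employees/",  # hypothetical folder
        "s3://my-app-bucket/data-store-db/emp-data/sales/",  # hypothetical folder
    ],
)
print(json.dumps(request, indent=2))
```

Calling `create_crawler(request)` would submit it; the request is built separately so it can be inspected first.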

AWS
AWSJoe
answered 9 months ago
  • I would like to add some context.

    1) I had these files in my S3 bucket:

       my-app-bucket/data-store-db/emp-data/Employee1.csv
       my-app-bucket/data-store-db/emp-data/Employee2.csv

    When I ran the crawler on my-app-bucket/data-store-db/emp-data/, it created one table for both files, since they have the same schema, format, and compression.

    2) When I put the files in different folders within the same directory:

       my-app-bucket/data-store-db/emp-data/empfolder1/Employee1.csv
       my-app-bucket/data-store-db/emp-data/empfolder2/Employee2.csv

    the crawler gave me one table with a partition at the folder level.

    3) When I put the files, along with some more files that have a different schema, into the same folder:

       my-app-bucket/data-store-db/emp-data/empfolder1/Employee1.csv
       my-app-bucket/data-store-db/emp-data/empfolder1/Employee2.csv
       my-app-bucket/data-store-db/emp-data/empfolder1/Sales1.csv
       my-app-bucket/data-store-db/emp-data/empfolder1/Sales2.csv

    The Employee files share one schema and the Sales files share another, so I expected the crawler to create only 2 tables, but it created 4. My question is why it created 4 tables instead of 2: if 2 files have the same schema, why aren't they combined into the same table?
